Within-group quotas (Top N per group)

from the Artful Common Queries page


A table has multiple rows per key value, and you need to retrieve, say, the first or earliest two rows per key. For example:
DROP TABLE IF EXISTS test;
CREATE TABLE test( id INT, entrydate DATE );
INSERT INTO test VALUES
( 1, '2007-5-01' ),( 1, '2007-5-02' ),
( 1, '2007-5-03' ),( 1, '2007-5-04' ),
( 1, '2007-5-05' ),( 1, '2007-5-06' ),
( 2, '2007-6-01' ),( 2, '2007-6-02' ),
( 2, '2007-6-03' ),( 2, '2007-6-04' ),
( 3, '2007-7-01' ),( 3, '2007-7-02' ),
( 3, '2007-7-03' );
In MySQL since version 8.0.2 and MariaDB since version 10.2.2, windowing with Row_Number() allows this straightforward solution:
SELECT id, entrydate, `rank`
FROM ( 
  SELECT 
    id, entrydate, 
    ROW_NUMBER() 
      OVER( PARTITION BY id 
            ORDER BY id,entrydate
          ) AS `rank`
  FROM test 
  ORDER BY id 
) AS tmp 
WHERE tmp.`rank` <= 2 
ORDER BY id, entrydate; 
+------+------------+------+
| id   | entrydate  | rank |
+------+------------+------+
|    1 | 2007-05-01 |    1 |
|    1 | 2007-05-02 |    2 |
|    2 | 2007-06-01 |    1 |
|    2 | 2007-06-02 |    2 |
|    3 | 2007-07-01 |    1 |
|    3 | 2007-07-02 |    2 |
+------+------------+------+
Note that since there's now a Rank() function, rank is a reserved word needing backticks round it. What if you can't yet get to MySQL 8 or MariaDB 10? One approach is to rank rows with user variables and pick off the top two for each key in the WHERE clause:
SELECT id, entrydate, `rank`
FROM (
  SELECT
    id, entrydate,
    IF( @prev <> id, 
        @rownum := 1, 
        @rownum := @rownum+1 
      ) AS `rank`,
    @prev := id
  FROM test.test
  JOIN (SELECT @rownum:=NULL,@prev:=0) AS r
  ORDER BY id, entrydate
) AS tmp
WHERE tmp.`rank` <= 2
ORDER BY id, entrydate;
This is pretty much the same query pattern as the user variable Row_Number() emulation method described in the "Row_Number()" entry. The join in the subquery is just a device for resetting the variables after reading a row. How do Row_Number() and the user variable method compare in performance? They both return correct results for 1,000 rows in less than a hundredth of a second on a modest machine; Row_Number() is slightly faster but you'll need to be reading millions of rows to notice the difference.

If the groups are fairly small, another feasible approach is to self-join and count. With appropriate ordering, the first two rows per ID are the rows which, for a given ID, have two or fewer rows with earlier dates. If we use an inequality join with the COUNT(*) function to find the earlier rows per ID ...

SELECT t1.id, t1.entrydate, COUNT(*) AS earlier
FROM test AS t1
JOIN test AS t2 
  ON t1.id=t2.id 
  AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
+------+------------+---------+
| id   | entrydate  | earlier |
+------+------------+---------+
|    1 | 2007-05-01 |       1 |
|    1 | 2007-05-02 |       2 |
|    1 | 2007-05-03 |       3 |
|    1 | 2007-05-04 |       4 |
|    1 | 2007-05-05 |       5 |
|    1 | 2007-05-06 |       6 |
|    2 | 2007-06-01 |       1 |
|    2 | 2007-06-02 |       2 |
|    2 | 2007-06-03 |       3 |
|    2 | 2007-06-04 |       4 |
|    3 | 2007-07-01 |       1 |
|    3 | 2007-07-02 |       2 |
|    3 | 2007-07-03 |       3 |
+------+------------+---------+
... then we get our result by removing rows where the 'earlier' count exceeds 2:
SELECT t1.id, t1.entrydate, count(*) AS earlier
FROM test AS t1
JOIN test AS t2 
  ON t1.id=t2.id 
  AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
HAVING earlier <= 2;
+------+------------+---------+
| id   | entrydate  | earlier |
+------+------------+---------+
|    1 | 2007-05-01 |       1 |
|    1 | 2007-05-02 |       2 |
|    2 | 2007-06-01 |       1 |
|    2 | 2007-06-02 |       2 |
|    3 | 2007-07-01 |       1 |
|    3 | 2007-07-02 |       2 |
+------+------------+---------+
This is about as efficient as the first method with a small table, but it compares every within-group row to every other within-group row. As the size N of a group increases, execution time increases by N2. If the query takes one minute for groups of 1,000, it will take 16 minutes for groups of 4,000, and more than four hours for groups for 16,000. The solution does not scale.

What to do? Forget GROUP BY! Manually assemble the desired query results in a temporary table from simple indexed queries, in this case, two rows per ID:

DROP TEMPORARY TABLE IF EXISTS earliers;
CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE);
INSERT INTO earliers 
  SELECT id,entrydate 
  FROM test 
  WHERE id=1 
  ORDER BY entrydate LIMIT 2;
INSERT INTO earliers 
  SELECT id,entrydate 
  FROM test 
  WHERE id=2 
  ORDER BY entrydate LIMIT 2;
INSERT INTO earliers 
  SELECT id,entrydate 
  FROM test 
  WHERE id=3 
  ORDER BY entrydate LIMIT 2;
You need one INSERT statement per grouping value. To print the result, just query the earliers table:
SELECT * FROM earliers
ORDER BY id, entrydate;
+------+------------+
| id   | entrydate  |
+------+------------+
|    1 | 2007-05-01 |
|    1 | 2007-05-02 |
|    2 | 2007-06-01 |
|    2 | 2007-06-02 |
|    3 | 2007-07-01 |
|    3 | 2007-07-02 |
+------+------------+
DROP TEMPORARY TABLE earliers;
Most useful reports run again and again. If that's the case for yours, automate it in a stored procedure: using a cursor and a prepared statement, auto-generate an INSERT statement for every grouping value, and return the result:
DROP PROCEDURE IF EXISTS listearliers;
DELIMITER go
CREATE PROCEDURE listearliers()
BEGIN
  DECLARE curdone, vid INT DEFAULT 0;
  DECLARE idcur CURSOR FOR 
    SELECT DISTINCT id FROM test;
  DECLARE CONTINUE HANDLER 
    FOR SQLSTATE '02000' SET curdone = 1;
  DROP TEMPORARY TABLE IF EXISTS earliers;
  CREATE TEMPORARY TABLE earliers( 
    id INT, entrydate DATE);
  SET @sql = 
  'INSERT INTO earliers 
    SELECT id,entrydate 
    FROM test 
    WHERE id=? ORDER BY  entrydate LIMIT 2';
  OPEN idcur;
  REPEAT
    FETCH idcur INTO vid;
    IF NOT curdone THEN
      BEGIN
        SET @vid = vid;
        PREPARE stmt FROM @sql;
        EXECUTE stmt USING @vid;
        DROP PREPARE stmt;
      END;
    END IF;
  UNTIL curdone END REPEAT;
  CLOSE idcur;
  SELECT * FROM earliers 
  ORDER BY id,entrydate;
  DROP TEMPORARY TABLE earliers;
END;
go
DELIMITER ;
CALL listearliers();

Last updated 16 Aug 2019