Within-group quotas (Top N per group)
A table has multiple rows per key value, and you need to retrieve, say, the first or earliest two rows per key. For example:
DROP TABLE IF EXISTS test;
CREATE TABLE test( id INT, entrydate DATE );
INSERT INTO test VALUES
( 1, '2024-5-01' ),( 1, '2024-5-02' ),( 1, '2024-5-03' ),( 1, '2024-5-04' ),
( 1, '2024-5-05' ),( 1, '2024-5-06' ),( 2, '2024-6-01' ),( 2, '2024-6-02' ),
( 2, '2024-6-03' ),( 2, '2024-6-04' ),( 3, '2024-7-01' ),( 3, '2024-7-02' ),
( 3, '2024-7-03' );
One approach is to rank rows with user variables and pick off the top two for each key in the WHERE
clause:
SELECT tmp.ID, tmp.entrydate
FROM (
SELECT
ID, entrydate,
IF( @prev <> ID, @rownum := 1, @rownum := @rownum+1 ) AS rank,
@prev := ID
FROM test t
JOIN (SELECT @rownum := NULL, @prev := 0) AS r
ORDER BY t.ID
) AS tmp
WHERE tmp.rank <= 2
ORDER BY ID, entrydate;
+------+------------+
| ID | entrydate |
+------+------------+
| 1 | 2007-05-01 |
| 1 | 2007-05-02 |
| 2 | 2007-06-03 |
| 2 | 2007-06-04 |
| 3 | 2007-07-01 |
| 3 | 2007-07-02 |
+------+------------+
This is pretty much the same query pattern as the user variable method of emulating Row_Number(). The join in the subquery is just a device for resetting the variables after reading a row.
If the groups are fairly small, another feasible approach is to self-join and count. With appropriate ordering, the first two rows per ID
are the rows which, for a given ID
, have two or fewer rows with earlier dates. If we use an inequality join with the COUNT(*)
function to find the earlier rows per ID
...
SELECT t1.id, t1.entrydate, COUNT(*) AS earlier
FROM test AS t1
JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
+------+------------+---------+
| id | entrydate | earlier |
+------+------------+---------+
| 1 | 2007-05-01 | 1 |
| 1 | 2007-05-02 | 2 |
| 1 | 2007-05-03 | 3 |
| 1 | 2007-05-04 | 4 |
| 1 | 2007-05-05 | 5 |
| 1 | 2007-05-06 | 6 |
| 2 | 2007-06-01 | 1 |
| 2 | 2007-06-02 | 2 |
| 2 | 2007-06-03 | 3 |
| 2 | 2007-06-04 | 4 |
| 3 | 2007-07-01 | 1 |
| 3 | 2007-07-02 | 2 |
| 3 | 2007-07-03 | 3 |
+------+------------+---------+
... then we get our result by removing rows where the 'earlier' count exceeds 2:
SELECT t1.id, t1.entrydate, count(*) AS earlier
FROM test AS t1
JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
HAVING earlier <= 2;
+------+------------+---------+
| id | entrydate | earlier |
+------+------------+---------+
| 1 | 2007-05-01 | 1 |
| 1 | 2007-05-02 | 2 |
| 2 | 2007-06-01 | 1 |
| 2 | 2007-06-02 | 2 |
| 3 | 2007-07-01 | 1 |
| 3 | 2007-07-02 | 2 |
+------+------------+---------+
This is about as efficient as the first method with a small table, but it compares every within-group row to every other within-group row. As the size N of a group increases, execution time increases by N*N. If the query takes one minute for groups of 1,000, it will take 16 minutes for groups of 4,000, and more than four hours for groups for 16,000. The solution does not scale.
What to do? Forget GROUP BY
! Manually assemble the desired query results in a temporary table from simple indexed queries, in this case, two rows per ID
:
DROP TEMPORARY TABLE IF EXISTS earliers;
CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE);
INSERT INTO earliers
SELECT id,entrydate FROM test WHERE id=1 ORDER BY entrydate LIMIT 2;
INSERT INTO earliers
SELECT id,entrydate FROM test WHERE id=2 ORDER BY entrydate LIMIT 2;
INSERT INTO earliers
SELECT id,entrydate FROM test WHERE id=3 ORDER BY entrydate LIMIT 2;
You need one INSERT
statement per grouping value. To print the result, just query the earliers
table:
SELECT * FROM earliers
ORDER BY id, entrydate;
+------+------------+
| id | entrydate |
+------+------------+
| 1 | 2007-05-01 |
| 1 | 2007-05-02 |
| 2 | 2007-06-01 |
| 2 | 2007-06-02 |
| 3 | 2007-07-01 |
| 3 | 2007-07-02 |
+------+------------+
DROP TEMPORARY TABLE earliers;
Most useful reports run again and again. If that's the case for yours, automate it in a stored procedure: using a cursor and a prepared statement, auto-generate an INSERT
statement for every grouping value, and return the result:
DROP PROCEDURE IF EXISTS listearliers;
DELIMITER |
CREATE PROCEDURE listearliers()
BEGIN
DECLARE curdone, vid INT DEFAULT 0;
DECLARE idcur CURSOR FOR SELECT DISTINCT id FROM test;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET curdone = 1;
DROP TEMPORARY TABLE IF EXISTS earliers;
CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE);
SET @sql = 'INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=? ORDER BY entrydate LIMIT 2';
OPEN idcur;
REPEAT
FETCH idcur INTO vid;
IF NOT curdone THEN
BEGIN
SET @vid = vid;
PREPARE stmt FROM @sql;
EXECUTE stmt USING @vid;
DROP PREPARE stmt;
END;
END IF;
UNTIL curdone END REPEAT;
CLOSE idcur;
SELECT * FROM earliers ORDER BY id,entrydate;
DROP TEMPORARY TABLE earliers;
END;
|
DELIMITER ;
CALL listearliers();