Within-group aggregates with a wrinkle

from the Artful Common Queries page


We have a wages table for wage rates by waiter and startdate, and a tips table to track daily waiters' hours worked and tips received. The requirement is to report wages and concurrent tips per waiter per day.
DROP TABLE IF EXISTS wages,tips;
CREATE TABLE wages( 
  id int, waiter int, start date, 
  rate decimal(6,2)
);
INSERT INTO wages VALUES
( 1, 4, '2005-01-01', 5.00 ),
( 2, 4, '2005-03-01', 6.00 ),
( 3, 5, '2007-01-05', 7.00 ),
( 4, 5, '2008-03-20', 8.00 ),
( 5, 5, '2008-04-01', 9.00 );
CREATE TABLE tips(
  id int, 
  date date, 
  waiter int, 
  hours_worked smallint, 
  tabs smallint, 
  tips decimal(6,2)
);
INSERT INTO tips VALUES
  ( 1, '2008-02-29', 4, 6.50, 21, 65.25 ),
  ( 2, '2008-03-06', 5, 6.00, 15, 51.75 ),
  ( 3, '2008-03-21', 4, 2.50, 5, 17.85 ),
  ( 4, '2008-03-22', 5, 5.25, 10, 39.00 );
SELECT * FROM wages;
+------+--------+------------+------+
| id   | waiter | start      | rate |
+------+--------+------------+------+
|    1 |      4 | 2005-01-01 | 5.00 |
|    2 |      4 | 2005-03-01 | 6.00 |
|    3 |      5 | 2007-01-05 | 7.00 |
|    4 |      5 | 2008-03-20 | 8.00 |
|    5 |      5 | 2008-04-01 | 9.00 |
+------+--------+------------+------+
SELECT * FROM tips;
+------+------------+--------+--------------+------+-------+
| id   | date       | waiter | hours_worked | tabs | tips  |
+------+------------+--------+--------------+------+-------+
|    1 | 2008-02-29 |      4 |            7 |   21 | 65.25 |
|    2 | 2008-03-06 |      5 |            6 |   15 | 51.75 |
|    3 | 2008-03-21 |      4 |            3 |    5 | 17.85 |
|    4 | 2008-03-22 |      5 |            5 |   10 | 39.00 |
+------+------------+--------+--------------+------+-------+
For the above dataset, the result which correctly matches wages and tips would be:
+------+------------+------+------+-------+------+--------+------+------------+
| tid  | Date       | Hrs  | tabs | tips  | wid  | waiter | rate | start      |
+------+------------+------+------+-------+------+--------+------+------------+
|    1 | 2008-02-29 |    7 |   21 | 65.25 |    2 |      4 | 6.00 | 2005-03-01 |
|    2 | 2008-03-06 |    6 |   15 | 51.75 |    3 |      5 | 7.00 | 2007-01-05 |
|    3 | 2008-03-21 |    3 |    5 | 17.85 |    2 |      4 | 6.00 | 2005-03-01 |
|    4 | 2008-03-22 |    5 |   10 | 39.00 |    4 |      5 | 8.00 | 2008-03-20 |
+------+------------+------+------+-------+------+--------+------+------------+
Why is this different from an ordinary within-groups aggregate? The correct wage rate for a tips row is not the wages row for that waiter with the latest date; it is the wages row having the latest date before the date in the given tips row.

One way to proceed is to build a temporary table from a join of wages to tips on waiter and wages.start < tips.date, then exclusion-join that result to itself to remove all rows except those with the latest wage rate per tips row. A two-step ...

-- wages-tips join removing later wage changes:
DROP TABLE IF EXISTS tmp;
CREATE TABLE tmp
SELECT 
  t.id AS tid, t.date AS Date, 
  t.hours_worked AS Hrs,t.tabs,t.tips,
  w.id AS wid, w.waiter, w.rate, w.start
FROM tips t
JOIN wages w ON w.waiter=t.waiter AND w.start <= t.date;
-- self-exclusion join to remove obsolete wage rows:
SELECT t1.*
FROM tmp t1
LEFT JOIN tmp t2 
  ON t1.tid=t2.tid and t1.start < t2.start
WHERE t2.waiter is null
ORDER BY t1.Date;
+------+------------+------+------+-------+------+--------+------+------------+
| tid  | Date       | Hrs  | tabs | tips  | wid  | waiter | rate | start      |
+------+------------+------+------+-------+------+--------+------+------------+
|    1 | 2008-02-29 |    7 |   21 | 65.25 |    2 |      4 | 6.00 | 2005-03-01 |
|    2 | 2008-03-06 |    6 |   15 | 51.75 |    3 |      5 | 7.00 | 2007-01-05 |
|    3 | 2008-03-21 |    3 |    5 | 17.85 |    2 |      4 | 6.00 | 2005-03-01 |
|    4 | 2008-03-22 |    5 |   10 | 39.00 |    4 |      5 | 8.00 | 2008-03-20 |
+------+------------+------+------+-------+------+--------+------+------------+
DROP TABLE tmp;
As of MySQL 8 and MariaDB 10.2, the intermediary table code can be a Common Table Expression ...
WITH tmp AS (
  SELECT 
    t.id AS tid, t.date AS Date, 
    t.hours_worked AS Hrs,t.tabs,t.tips,
    w.id AS wid, w.waiter, w.rate, w.start
  FROM tips t
  JOIN wages w ON w.waiter=t.waiter AND w.start <= t.date
)
SELECT t1.*
FROM tmp t1
LEFT JOIN tmp t2 ON t1.tid=t2.tid and t1.start < t2.start
WHERE t2.waiter is null
ORDER BY t1.Date;
Can we skip the temp table, event the virtual one? Yes---by adding the condition wages.start <= tips.date to each side of the exclusion join ...
SELECT 
  t.id AS tid, t.date, t.hours_worked AS Hrs,
  t.tabs,t.tips,w.id AS wid, w.waiter, w.rate, 
  w.start
FROM tips t
JOIN wages w ON w.waiter=t.waiter AND w.start <= t.date
LEFT JOIN wages w2 
  ON w.waiter=w2.waiter 
  AND w2.start <= t.date 
  AND w.start < w2.start
WHERE w2.id IS NULL
ORDER BY t.date;
Simpler, gives the same result as the CTE, and slightly faster.

Last updated 7 Apr 2020




Return to the Artful Common Queries page