Within-group aggregates

from the Artful Common Queries page


You have a products table with columns item, supplier, price. Multiple suppliers offer various prices for the same item. You need to find the supplier with the lowest price for each item.
DROP TABLE IF EXISTS products;
CREATE TABLE products(
  item int,supplier int,price decimal(6,2)
);
INSERT INTO products VALUES
  (1,1,10),(1,2,15),(2,2,20),
  (2,1,21),(2,2,18);
SELECT * FROM products;
+------+----------+-------+
| item | supplier | price |
+------+----------+-------+
|    1 |        1 | 10.00 |
|    1 |        2 | 15.00 |
|    2 |        2 | 20.00 |
|    2 |        1 | 21.00 |
|    2 |        2 | 18.00 |
+------+----------+-------+
Your first thought may be to GROUP BY item, but that is not guaranteed to return the correct supplier value for each minimum item price. Grouping by both item and supplier will return more information than you want. Nor can you write WHERE price=MIN(...) because the query engine will evaluate the WHERE clause before it knows the MIN value.

This is the problem of aggregating within aggregates. It is sometimes called the 'groupwise aggregates' problem, but the term 'groupwise' is ambiguous. We think better names for it are subaggregates, inner aggregates, or within-group aggregates.

It's easy to show that the within-group aggregates problem is a form of the problem of returning values from non-grouping columns in a GROUP BY query. Suppose you write ...

SELECT item, supplier, Min(price)
FROM products
GROUP BY item;
Will this reliably return the correct supplier per item? No. Unless there is exactly one supplier per item, the supplier value returned will be arbitrary. To retrieve the correct supplier for each item, you need more query logic.

Here are six solutions. As you'll see, some are very slow, a couple are quite fast, and some work for only some versions of the problem:

1. Self-exclusion join solution: One way to model within-group minima or maxima is via an left self exclusion join...

SELECT p1.item, p1.supplier, p1.price
FROM products AS p1
LEFT JOIN products AS p2 
  ON p1.item=p2.item AND p1.price>p2.price
WHERE p2.item IS NULL;
...because in the resultset built by joining on left item = right item and left price > right price, the left-sided rows for which no smaller right-sided price exists are precisely the per-item rows with the smallest prices. The query runs more than an order of magnitude faster with an index on (item, supplier).

2. Intermediate table/CTE solution: Another solution is to derive an intermediate table of aggregated minimum prices, then query that table. Before MySQL 8 and MariaDB 10.2, this requires explicit code for the intermediary table:

CREATE TEMPORARY TABLE tmp (
  item INT,
  minprice DECIMAL DEFAULT 0.0
);
LOCK TABLES products READ;
INSERT INTO tmp 
  SELECT item, MIN(price) 
  FROM products 
  GROUP BY item;
to which you then join the products table to find the matching suppliers:
SELECT products.item, supplier, products.price 
FROM products 
JOIN tmp ON products.item = tmp.item
WHERE products.price=tmp.minprice;
UNLOCK TABLES;
DROP TABLE tmp;
Since MySQL 8 and MariaDB 10.2, this can be written more concisely with a Common Table Expression (CTE):
WITH tmp AS ( 
  SELECT item, MIN(price) as minprice
  FROM products 
  GROUP BY item
)
SELECT products.item, supplier, products.price 
FROM products 
JOIN tmp ON products.item = tmp.item
WHERE products.price=tmp.minprice;

3. Correlated subquery solution: From MySQL 4.1 on, the temporary table can be a correlated subquery. This is the most intuitively obvious syntax for the problem. It's also the slowest—up to a hundred times slower than the exclusion join, whether the queries are compared with or without indexes:

SELECT item, supplier, price
FROM products AS p1
WHERE price = (
  SELECT MIN(p2.price)
  FROM products AS p2
  WHERE p1.item = p2.item
);

4. FROM clause subquery solution: If we move the aggregating subquery from the WHERE clause to the FROM clause, the query improves from a hundred times slower than the self-exclusion join to twenty times faster:

SELECT p.item, p.supplier, p.price
FROM products AS p
JOIN (
  SELECT item, MIN(price) AS minprice
  FROM products
  GROUP BY item
) AS pm ON p.item=pm.item AND p.price=pm.minprice;
Some users have trouble mapping elements of this model to their instance of the problem. The model has five elements (or sets of them):

(i) a table, which might be a view, a single physical table, or a table derived from joins

(ii) one or more grouping columns,

(iii) one or more columns to aggregate,

(iv) one or more columns not mentioned in the GROUP BY clause,

(v) an aggregating job to do, typically MIN() or MAX().

In the product/minimum price solution above:
(i) table tbl = product
(ii) grouping column grouping_col = item
(iii) column to aggregate = col_to_aggregate = price
(iv) non-aggregated columns other_detail, ...etc... = supplier
(v) aggregating function = MIN().

Here's an interesting variation on the problem. A simple table tracks company branches and the kinds of stock they carry:

DROP TABLE IF EXISTS branch_stock;
CREATE TABLE branch_stock( 
  stock_id INT PRIMARY KEY AUTO_INCREMENT KEY,
  branch_id INT NOT NULL,
  stock_item VARCHAR(12) NOT NULL
);
INSERT INTO branch_stock (branch_id,stock_item) 
  VALUES (1,'trainers'),(1,'trainers'),
  (1,'jumper'),(2,'tie'),(2,'shoes');
How do we find the most frequent product for each branch, including ties? Join the intermediate table of grouped counts to itself on matching branches, non-matching stock and not-greater counts in one than in the other:
SELECT DISTINCT a.* 
FROM ( 
  SELECT branch_id, stock_item, COUNT(*) qty
  FROM branch_stock 
  GROUP BY branch_id, stock_item
) a 
JOIN ( 
  SELECT branch_id, stock_item, COUNT(*) qty
  FROM branch_stock
  GROUP BY branch_id, stock_item
) b ON b.branch_id=a.branch_id 
    AND b.stock_item<>a.stock_item 
    AND b.qty<=a.qty;
+-----------+------------+-----+
| branch_id | stock_item | qty |
+-----------+------------+-----+
|         1 | trainers   |   2 |
|         2 | shoes      |   1 |
|         2 | tie        |   1 |
+-----------+------------+-----+
5. Ordered subquery solution: A "trick" solution, made possible by MySQL's non-standard tolerance of GROUP BY when there is no aggregating SELECT expression, uses ORDER BY in a subquery to find the lowest prices, and GROUP BY in the outer query to pick them off:
SELECT *
FROM (
  SELECT *
  FROM products
  ORDER BY price ASC
) AS s
GROUP BY item;
It's succinct, it's fast if the query engine can find an index to order on, and it retrieves the prices and associated values in one step, but this query is seldom much faster than the FROM clause subquery described above.

The self-exclusion join and WHERE clause subquery methods scale badly because they're O(N2). If ordering and grouping columns are indexed, they become O(N * log N), but are still substantially slower than the FROM clause subquery method.

6. Min-Concat-trick solution: Finally, here is a radically different model of the problem. It can find both within-group minima and within-group maxima in a single query. The model aggregates the concatenated within-group grouped column value and the within-group grouping column name in a single string, then uses SUBSTR() to break them apart in the result:

SELECT 
  item,
  SUBSTR( MIN( CONCAT( LPAD(price,6,0),supplier) ), 7)   AS MinSupplier,
    LEFT( MIN( CONCAT( LPAD(price,6,0),supplier) ), 6)+0 AS MinPrice,
  SUBSTR( MAX( CONCAT( LPAD(price,6,0),supplier) ), 7)   AS MaxSupplier,
    LEFT( MAX( CONCAT( LPAD(price,6,0),supplier) ), 6)+0 AS MaxPrice
FROM  products
GROUP BY item;
+------+-------------+----------+-------------+----------+
| item | MinSupplier | MinPrice | MaxSupplier | MaxPrice |
+------+-------------+----------+-------------+----------+
|    1 | 1           |       10 | 2           |       15 |
|    2 | 2           |       18 | 1           |       21 |
+------+-------------+----------+-------------+----------+
Try all solutions to find which is fastest for your version of the problem.

To find the top or bottom N per group, you might think the LIMIT clause would work, but LIMIT is limited in subqueries. See Within-group quotas.

Last updated 16 Aug 2024