Within-group aggregates
You have a products
table with columns item, supplier, price
. Multiple suppliers offer various prices for the same item. You need to find the supplier with the lowest price for each item.
DROP TABLE IF EXISTS products;
CREATE TABLE products(item int,supplier int,price decimal(6,2));
INSERT INTO products VALUES(1,1,10),(1,2,15),(2,2,20),(2,1,21),(2,2,18);
SELECT * FROM products;
+------+----------+-------+
| item | supplier | price |
+------+----------+-------+
| 1 | 1 | 10.00 |
| 1 | 2 | 15.00 |
| 2 | 2 | 20.00 |
| 2 | 1 | 21.00 |
| 2 | 2 | 18.00 |
+------+----------+-------+
Your first thought may be to GROUP BY item
, but that is not guaranteed to return the correct supplier
value for each minimum item price. Grouping by both item
and supplier
will return more information than you want. Nor can you write WHERE price=MIN(...)
because the query engine will evaluate the WHERE
clause before it knows the MIN
value.
This is the problem of aggregating within aggregates. It is sometimes called the 'groupwise aggregates' problem, but the term 'groupwise' is ambiguous. We think better names for it are subaggregates, inner aggregates, or within-group aggregates.
It's easy to show that the within-group aggregates problem is a form of the problem of returning values from non-grouping columns in an aggregate query. Suppose you write ...
SELECT item,supplier,MIN(price)
FROM products
GROUP BY item;
Will this reliably return the correct supplier per item? No. Unless there is exactly one supplier per item, the supplier value returned will be arbitrary. To retrieve the correct supplier for each item, you need more logic.
One way to model the within-aggregates problem is via an left self exclusion join...
SELECT p1.item,p1.supplier,p1.price
FROM products AS p1
LEFT JOIN products AS p2 ON p1.item = p2.item AND p1.price > p2.price
WHERE p2.item IS NULL;
...because in the resultset built by joining on left item = right item and left price > right price, the left-sided rows for which there is no greater right-sided price are precisely the per-item rows with the smallest prices.
Another solution is to derive an intermediate table of aggregated minimum prices. Before MySQL 4.1, it has to be a temporary table:
CREATE TEMPORARY TABLE tmp (
item INT,
minprice DECIMAL DEFAULT 0.0
);
LOCK TABLES products READ;
INSERT INTO tmp
SELECT item, MIN(price)
FROM products
GROUP BY item;
to which you then join the products table to find the matching suppliers:
SELECT products.item, supplier, products.price
FROM products
JOIN tmp ON products.item = tmp.item
WHERE products.price=tmp.minprice;
UNLOCK TABLES;
DROP TABLE tmp;
From MySQL 4.1 on, the temporary table can be a correlated subquery. This is the most intuitively obvious syntax for the problem. Often it's also the slowest solution:
SELECT item, supplier, price
FROM products AS p1
WHERE price = (
SELECT MIN(p2.price)
FROM products AS p2
WHERE p1.item = p2.item
);
It is usually faster to move the aggregating subquery from the WHERE
clause to the FROM
clause:
SELECT p.item, p.supplier, p.price
FROM products AS p
JOIN (
SELECT item, MIN(price) AS minprice
FROM products
GROUP BY item
) AS pm ON p.item = pm.item AND p.price = pm.minprice;
Some users have trouble mapping elements of this model to their instance of the problem. There are five elements (or sets of them):
(i) a table, which might be a view, a single physical table, or a table derived from joins
(ii) one or more grouping columns,
(iii) one or more columns to aggregate,
(iv) one or more columns not mentioned in the GROUP BY
clause,
(v) an aggregating job to do, typically MIN() or MAX().
In the product/minimum price solution above:
(i) table tbl = product
(ii) grouping column grouping_col = item
(iii) column to aggregate = col_to_aggregate = price
(iv) non-aggregated columns other_detail, ...etc... = supplier
(v) aggregating function = MIN()
.
A "tricky" solution is to use ORDER BY
in a subquery to find lowest prices, and write an outer GROUP BY
query to pick each item's lowest price from the subquery:
SELECT *
FROM (
SELECT *
FROM products
ORDER BY price ASC
) AS s
GROUP BY item;
Other methods we've cited for this problem scale poorly because they're O(N2). If ordering and grouping columns are indexed, this solution is O(N * log N), so it may be your best bet for speed. It's an instance of a useful rule of thumb: simplest is often fastest.
Finally, here is a radically different model of the problem. It can find both within-group minima and within-group maxima in a single query. This model aggregates the concatenated within-group grouped column value and the within-group grouping column name in a single string, then uses SUBSTR()
to break them apart in the result:
SELECT
item,
SUBSTR( MIN( CONCAT( LPAD(price,6,0),supplier) ), 7) AS MinSupplier,
LEFT( MIN( CONCAT( LPAD(price,6,0),supplier) ), 6)+0 AS MinPrice,
SUBSTR( MAX( CONCAT( LPAD(price,6,0),supplier) ), 7) AS MaxSupplier,
LEFT( MAX( CONCAT( LPAD(price,6,0),supplier) ), 6)+0 AS MaxPrice
FROM products
GROUP BY item;
+------+-------------+----------+-------------+----------+
| item | MinSupplier | MinPrice | MaxSupplier | MaxPrice |
+------+-------------+----------+-------------+----------+
| 1 | 1 | 10 | 2 | 15 |
| 2 | 2 | 18 | 1 | 21 |
+------+-------------+----------+-------------+----------+
Try all solutions to find which is fastest for your version of the problem.
To find the top or bottom N per group, you might think the LIMIT
clause would work, but LIMIT
is limited in subqueries. See Within-group quotas.