The unbearable slowness of IN()

from the Artful Common Queries page


You track orders and their items in orders and orderdetails tables, as in the NorthWind database. How many of your orders have been for multiple items? We can use the standard SQL IN() operator to answer the question:
SELECT orderID
FROM orders
WHERE orderID IN (
  SELECT orderID 
  FROM orderdetails 
  GROUP BY orderID 
  HAVING COUNT(orderID) > 1
);
Trouble is, in MySQL 5.0 this query takes 130 times longer to execute than a JOIN version of the same logic:
SELECT orderID
FROM orders o
JOIN orderdetails od USING (orderID)
GROUP BY orderID 
HAVING COUNT(od.orderID) > 1
Why? The query optimiser decides to execute the IN() query outside-in: it runs the subquery once for each outer query row it finds, so if there are 800 orders and 2000 items, it does 1,600,000 reads. On the other hand, it optimises the JOIN very well.

But here's a surprise. If a query has a FROM clause subquery, MySQL executes that subquery first. What if we were to add a redundant FROM clause subquery into the IN() subquery?

EXPLAIN EXTENDED
SELECT SQL_NO_CACHE orderID
FROM orders
WHERE orderID IN (
  SELECT orderID FROM (    -- redundant FROM subquery
    SELECT orderID 
    FROM orderdetails 
    GROUP BY orderID 
    HAVING COUNT(orderID) > 1
  ) AS tmp
);
Sure enough, this version of IN() executes as fast as the JOIN version because the total number of reads is 800+2000=2,800, not 800*2000=1,600,000.

Optimisations in MySQL 5.5 reduce the speed difference from 130:1 to 5:1. With large tables, that will still be significant. You wonder if EXISTS() might be better than IN()?

SELECT orderID
FROM orders o
WHERE EXISTS (
  SELECT orderID FROM orderdetails
  WHERE orderID = o.orderID
  GROUP BY orderID 
  HAVING COUNT(orderID)>1
);
In 5.0, EXISTS() is much faster than IN(), but slower than JOIN. In 5.5, EXISTS() performs about as well as JOIN.

For NOT IN(...), consider a straight exclusion join. The basic pattern is that ...

SELECT ...
FROM a
WHERE a.x NOT IN (
  SELECT y FROM b
);
becomes ...
SELECT ...
FROM a 
LEFT JOIN b ON a.x=b.y
WHERE b.y IS NULL;
One more surprise. Running EXPLAIN EXTENDED on these queries does not predict all the performance differences. You have to benchmark the queries. So for problems like this, experiment with the MySQL 5.5/5.6 optimizer_switch variable to see whether your version of this problem is best solved by IN(), IN() with the extra subquery, EXISTS(), or JOIN.

For comparative analysis of NOT IN(SELECT..), NOT EXISTS(SELECT..) and exclusion joins, see http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/. The short answer is: since 5.6, exclusion joins and Not In(Select...) are about equivalent, Not Exists(Select ...) is slower.

Last updated 5 Nov 2024