Simplify a query to improve performance

from the Artful MySQL Tips List


This query...

SELECT COUNT(*) AS cnt 
FROM 
  accesslog an,
  (SELECT DISTINCT sid, title, path, url, hostname 
   FROM accesslog 
   WHERE url NOT LIKE '%SOMETHIHNG%' 
     AND url LIKE '%SOMETHING ELSE%' 
     AND url LIKE '%SOME OTHER THING%'
  ) AS af 
WHERE af.sid = an.sid AND an.path = 'SOMEWHERE'

against this table...

CREATE TABLE `accesslog` (
  `aid` int(10) NOT NULL auto_increment,
  `sid` varchar(32) NOT NULL default '',
  `title` varchar(255) default NULL,
  `path` varchar(255) default NULL,
  `url` varchar(255) default NULL,
  `hostname` varchar(128) default NULL,
  `uid` int(10) unsigned default '0',
  `timer` int(10) unsigned NOT NULL default '0',
  `timestamp` int(11) unsigned NOT NULL default '0',
  `user_agent` varchar(255) NOT NULL,
  PRIMARY KEY (`aid`),
  KEY `accesslog_timestamp` (`timestamp`)
);

with 50,000 rows, was running once a minute and bringing down the hosting server. The hosting company did not appreciate this. The question was, how to optimise it:

The query is more transparent if we write the joins explicitly:

SELECT COUNT(*) AS cnt 
FROM accesslog AS an
JOIN (
  SELECT DISTINCT sid, title, path, url, hostname 
  FROM accesslog 
  WHERE url NOT LIKE '%SOMETHING%' 
    AND url LIKE '%SOMETHING ELSE%' 
    AND url LIKE '%SOME OTHER THING%'
) AS af ON af.sid = an.sid 
WHERE an.path = 'SOMEWHERE';

The DISTINCT clause in the subquery is logically equivalent to GROUP BY. Performance will be better if the table has an index on `sid`, `title`, `path`, `url`, `hostname`. For reasons that will become clear in a moment, make that `path`, `sid`, `title`, `url`, `hostname`.

The next thing to notice is that this is a self-join, so the `path` condition could be moved to the subquery, thereby reducing the number of row accesses the subquery requires; since the LIKE comparisons probably contribute the biggest performance drag, this might improve performance greatly:

SELECT COUNT(*) AS cnt 
FROM accesslog AS an
JOIN (
  SELECT DISTINCT sid, title, path, url, hostname 
  FROM accesslog 
  WHERE path = 'SOMEWHERE'
    AND url NOT LIKE '%SOMETHIHNG%' 
    AND url LIKE '%SOMETHING1%' 
    AND url LIKE '%SOMETHING2%'
) AS af ON af.sid = an.sid;

You can satisfy yourself that this is the case by running EXPLAIN against each query (just type EXPLAIN followed by the query). The first row of that result will show that the engine needs to look at fewer rows when the `path` condition is moved inside the subquery. The LIKE conditions are probably consuming most resources, so the improvement you see will be close to (total rows)/(rows matching `path`).

And now you see why the first column in the index should be `path`: MySQL can use that index to discard non-matching rows.

We've got rid of the first element of the self-join, so we can lose it:

SELECT COUNT(*) AS cnt 
FROM (
  SELECT DISTINCT sid, title, path, url, hostname 
  FROM accesslog 
  WHERE path = 'SOMEWHERE'
    AND url NOT LIKE '%SOMETHIHNG%' 
    AND url LIKE '%SOMETHING1%' 
    AND url LIKE '%SOMETHING2%'
) AS af;

which no longer brings down the server.

Return to the Artful MySQL Tips page