Track state changes

from the Artful Common Queries page


You have a table containing a sequence of IDs, states and dates. The query requirement is to list state changes and their dates ordered by date, and count instances of each sequenced state. This data from a blog entry by a Russian MySQLer calling himself Quassnoi ...
+----+-------+------------------+
|ID  | State |  Datetime        |
+----+-------+------------------+
| 12 |   1   | 2009-07-16 10:00 |
| 45 |   2   | 2009-07-16 13:00 |
| 67 |   2   | 2009-07-16 14:40 |
| 77 |   1   | 2009-07-16 15:00 |
| 89 |   1   | 2009-07-16 15:30 |
| 99 |   1   | 2009-07-16 16:00 |
+----+-------+------------------+
should produce this result ...
+----+-------+------------------+-------+
|ID  | State |  Datetime        | Count |
+----+-------+------------------+-------+
| 12 |   1   | 2009-07-16 10:00 |   1   |
| 45 |   2   | 2009-07-16 13:00 |   2   |
| 77 |   1   | 2009-07-16 15:00 |   3   |
+----+-------+------------------+-------+
To test query efficiency, we need more than six rows. To generate sequential test data, a utility table of ints from 0 through 9 is helpful. We keep general-purpose database objects in a sys database available to all developers on the server:
create database if not exists sys;
use sys;
drop table if exists ints;
-- (See "Make a table of sequential ints")
create table ints(i int);  
insert into ints values
  (0),(1),(2),(3),(4),
  (5),(6),(7),(8),(9); 
Quassnoi's test data was 10,000 rows tracking ID, state, date and value, with state oscillating between 1 and 2 every one thousand rows. The primary key is ID. There is a covering (date, state) index. Dates are in descending order. To simplify comparative testing, we parameterise the state-change interval:
set @state_interval=1000;
drop table if exists history;
create table history (
  id int primary key,
  state int not null,
  date datetime not null,
  value varchar(16) not null,
  key(date, state)
) engine=innodb;
insert into history
 select
   id,
   1 + floor(((id-1)/@state_interval) mod 2),
   '2009-07-24' - interval id minute,
   concat('value ', id)
from (
  select 
    1 + ( t.i*1000 + u.i*100 + v.i*10 + w.i ) 
    as id
  from sys.ints t
  join sys.ints u
  join sys.ints v 
  join sys.ints w
) tmp;
To the original requirement we add reporting the value associated with each state change. Quassnoi found a set-based solution, but it is painfully slow:
SELECT 
  MIN(id) AS id, MIN(date) AS date, 
  MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT  
    id, date, state,value,
    ( SELECT id
      FROM history a
      WHERE (
        a.date,a.state,a.id)<(b.date,b.state,b.id) 
      AND a.state <> b.state
      ORDER BY date DESC, state DESC
      LIMIT 1
    ) AS prev
  FROM history b
) q
GROUP BY prev
ORDER BY date;
+------+---------------------+-------+-------------+------+
| id   | date                | state | value       | cnt  |
+------+---------------------+-------+-------------+------+
| 9001 | 2009-07-17 01:20:00 |     2 | value 10000 | 1000 |
| 8001 | 2009-07-17 18:00:00 |     1 | value 9000  | 1000 |
| 7001 | 2009-07-18 10:40:00 |     2 | value 8000  | 1000 |
| 6001 | 2009-07-19 03:20:00 |     1 | value 7000  | 1000 |
| 5001 | 2009-07-19 20:00:00 |     2 | value 6000  | 1000 |
| 4001 | 2009-07-20 12:40:00 |     1 | value 5000  | 1000 |
| 3001 | 2009-07-21 05:20:00 |     2 | value 4000  | 1000 |
| 2001 | 2009-07-21 22:00:00 |     1 | value 3000  | 1000 |
| 1001 | 2009-07-22 14:40:00 |     2 | value 2000  | 1000 |
|    1 | 2009-07-23 07:20:00 |     1 | value 1000  | 1000 |
+------+---------------------+-------+-------------+------+
The correlated SELECT ... ORDER BY ... LIMIT 1 subquery may look dodgy, but a conventional call to MIN() is even slower:
SELECT 
  MIN(id) AS id, MIN(date) AS date, 
  MIN(state) AS state, COUNT(*) cnt
FROM (
  SELECT  
    id, date, state,
    ( SELECT min(id)
      FROM history a
      WHERE (
        a.date, a.state, a.id) < (b.date, b.state, b.id) 
      AND a.state <> b.state
    ) AS prev
  FROM history b
) tmp
GROUP BY prev
ORDER BY date;
The solution is to build a derived table ordered by date and state with user variables to track state changes. The derived column state_change starts at 0 and increments whenever state changes. Then selecting minimum date and state values from the derived table, and grouping them by state_change, picks out the rows where state changes. This is hundreds of times faster than either of the set-based solutions:
SELECT 
  MIN(id) AS id, MIN(date) AS date, 
  MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT 
    @r := @r + 
          (@state != state OR @state IS NULL) 
      AS state_change,
    @state := state AS current,
    h.id, h.date, h.state, h.value
  FROM (
    -- one-row virtual table
    -- cross-joined with history
    -- ordered to track state changes
    SELECT @r:=0, @state:=NULL ) as vars  
    CROSS JOIN history as h                   
    ORDER BY date, state
  ) q
GROUP BY state_change
ORDER BY date;
If how this works puzzles you, run the query with state_change added to the query's SELECT list. In other SQL dialects, in fact, you would have to include the state_change column, since standard SQL requires that GROUP BY columns be SELECTed.

Return to the Artful Common Queries page