Track state changes

from the Artful Common Queries page


You have a table containing a sequence of IDs, states and dates. The query requirement is to list state changes and their dates ordered by date, and count instances of each sequenced state. This data from a blog entry by a Russian MySQLer calling himself Quassnoi ...
+----+-------+------------------+
|ID  | State |  Datetime        |
+----+-------+------------------+
| 12 |   1   | 2024-07-16 10:00 |
| 45 |   2   | 2024-07-16 13:00 |
| 67 |   2   | 2024-07-16 14:40 |
| 77 |   1   | 2024-07-16 15:00 |
| 89 |   1   | 2024-07-16 15:30 |
| 99 |   1   | 2024-07-16 16:00 |
+----+-------+------------------+
should produce this result ...
+----+-------+------------------+-------+
|ID  | State |  Datetime        | Count |
+----+-------+------------------+-------+
| 12 |   1   | 2024-07-16 10:00 |   1   |
| 45 |   2   | 2024-07-16 13:00 |   2   |
| 77 |   1   | 2024-07-16 15:00 |   3   |
+----+-------+------------------+-------+
To assess query efficiency we need more than six rows. To generate sequential test data, a utility table of ints from 0 through 9 is helpful. We keep general-purpose database objects in a system database available to all developers on the server:
create database if not exists system;
use system;
drop table if exists ints;
-- (See "Make a table of sequential ints")
create table ints(i int);  
insert into ints values
  (0),(1),(2),(3),(4),
  (5),(6),(7),(8),(9); 
Quassnoi's test data was 10k rows tracking ID, state, date and value, with state oscillating between 1 and 2 every one thousand rows. The primary key is ID. There is a covering (date, state) index. Dates are in descending order. To simplify comparative testing, parameterise the state-change interval:
set @state_interval=1000;
drop table if exists history;
create table history (
  id int primary key,
  state int not null,
  date datetime not null,
  value varchar(16) not null,
  key(date, state)
) engine=innodb;
insert into history
 select
   id,
   1 + floor(((id-1)/@state_interval) mod 2),
   '2024-07-24' - interval id minute,
   concat('value ', id)
from (
  select 
    1 + ( t.i*1000 + u.i*100 + v.i*10 + w.i ) 
    as id
  from system.ints t
  join system.ints u
  join system.ints v 
  join system.ints w
) tmp;
alter table history add key(date,state);
For clarity you might want to add a column reporting the value associated with each state change.

As you'll see, this story begins with aggregating queries, ends with non-aggregating queries. Quassnoi found a set-based aggregating solution, but it's painfully slow ...

SELECT 
  MIN(id) AS id, MIN(date) AS date, 
  MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT  
    id, date, state,value,
    ( SELECT id
      FROM history a
      WHERE 
        (a.date,a.state,a.id) < (b.date,b.state,b.id) 
        AND a.state != b.state
      ORDER BY date DESC, state DESC
      LIMIT 1
    ) AS prev
  FROM history b
) q
GROUP BY prev
ORDER BY date;
+------+---------------------+-------+-------------+------+
| id   | date                | state | value       | cnt  |
+------+---------------------+-------+-------------+------+
| 9001 | 2024-07-17 01:20:00 |     2 | value 10000 | 1000 |
| 8001 | 2024-07-17 18:00:00 |     1 | value 9000  | 1000 |
| 7001 | 2024-07-18 10:40:00 |     2 | value 8000  | 1000 |
| 6001 | 2024-07-19 03:20:00 |     1 | value 7000  | 1000 |
| 5001 | 2024-07-19 20:00:00 |     2 | value 6000  | 1000 |
| 4001 | 2024-07-20 12:40:00 |     1 | value 5000  | 1000 |
| 3001 | 2024-07-21 05:20:00 |     2 | value 4000  | 1000 |
| 2001 | 2024-07-21 22:00:00 |     1 | value 3000  | 1000 |
| 1001 | 2024-07-22 14:40:00 |     2 | value 2000  | 1000 |
|    1 | 2024-07-23 07:20:00 |     1 | value 1000  | 1000 |
+------+---------------------+-------+-------------+------+
30.8 seconds to walk 10k rows on a reasonable machine in MySQL 5.6, 18.6 secs in MySQL 8.0. The obvious culprit is the row-by-row correlated subquery in the subquey Select clause. We can cut execution time by a third or more by moving that subquery to the From clause ...
SELECT a.*
FROM history AS a
WHERE a.state <>
      ( SELECT b.state
        FROM history AS b
        WHERE (a.date,a.state) < (b.date,b.state)
        ORDER BY b.date 
        LIMIT 1
      ); 
That takes 22.7 seconds in 5.6, 12.7 secs in 8.0. Better, still slow. Removing columns not covered by the index from the Select list makes no difference.

A non-aggregating row-based solution makes a radical difference: track state changes with user variables in a derived table ordered by date and state The derived column state_change starts at 0 and increments whenever state changes. Then selecting minimum date and state values from the derived table, and grouping them by state_change, picks out the rows where state changes. This is about 400 times faster than the query we started with, still about 200 times faster than our first optimisation ...

SELECT 
  MIN(id) AS id, MIN(date) AS date, 
  MIN(state) AS state, value, COUNT(*) cnt
FROM (
  SELECT 
    @r := @r + 
          (@state!=state OR @state IS NULL) 
      AS state_change,
    @state := state AS current,
    h.id, h.date, h.state, h.value
  FROM (
    -- one-row virtual table
    SELECT @r:=0, @state:=NULL ) as vars  
    CROSS JOIN history as h                   
    -- order to track state changes
    ORDER BY date, state
  ) q
GROUP BY state_change
ORDER BY date;
0.05 seconds from MySQL 5.6 through 8.0.

If how this works puzzles you, run the query with state_change added to the query SELECT list. In other SQL dialects, in fact, you would have to include the state_change column, since standard SQL requires that GROUP BY columns be SELECTed.

MySQL has announced that support for such use of user variables will be removed in a future release. Read on...

Solution for MySQL 8.0

8.0 introduced Common Table Expressions (CTEs) and Windowing functions. The basic algorithm of the user variable solution was ...

1 Create a derived table that identifies rows with state values that differ from the previous row.

2 Pick out those rows for display.

Dead simple with a CTE and LEAD() ...

WITH cte AS (
  SELECT
    id, state, date,
    CASE 
      WHEN LEAD(state) 
      OVER(ORDER BY date) != state 
      THEN 1 
    END As Filter
  FROM history
)
SELECT  id, state, date
FROM cte
WHERE Filter=1
ORDER BY id;
+------+-------+---------------------+
| id   | state | date                |
+------+-------+---------------------+
| 1001 |     2 | 2024-07-23 07:19:00 |
| 2001 |     1 | 2024-07-22 14:39:00 |
| 3001 |     2 | 2024-07-21 21:59:00 |
| 4001 |     1 | 2024-07-21 05:19:00 |
| 5001 |     2 | 2024-07-20 12:39:00 |
| 6001 |     1 | 2024-07-19 19:59:00 |
| 7001 |     2 | 2024-07-19 03:19:00 |
| 8001 |     1 | 2024-07-18 10:39:00 |
| 9001 |     2 | 2024-07-17 17:59:00 |
+------+-------+---------------------+
The user variable solution with 10k rows took 0.05 secs; this takes 0.07 secs. User variable value assignments in MySQL queries are going away? No worries.

Last updated 11 May 2024