Treewalks with CTEs

from the Artful Common Queries page


(If you're unfamiliar with SQL processing of trees, consider spending a little time with our chapter on trees and other hierarchies in MySQL (it's free), before tackling this piece. If you're new to CTEs, you might want to read the MySQL manual page on them.)

Given a table infotree( id, parentid, name ) containing one or more trees—i.e., one or more connected acyclic graphs—how do we use CTEs, available in MySQL since version 8.0 and MariaDB since version 10.2.2, to walk the subtree of a given node?

It's a standard CTE WITH loop. There are just two tricky bits.

First, use an accumulating tree path result column to show top-down tree layout.

In turn, that exposes a resultset column size issue: MySQL is fussy about CTE string columns overflowing with data; the first occurrence of such a column needs to declare the string long enough to accommodate the longest value the tree can fill it with.

set @root=2;                           -- subtree root value
with recursive treewalk as (
  select 
    id, 
    0 as level, 
    cast( id as char(64) ) as path, 
    name
  from infotree
  where id=@root                       -- query for subtree root 
  union
  select                               -- retrieve child nodes
    t.id, tw.level+1 as level,
    concat( path, ',', t.id ) as path, -- path down to this node
    t.name
  from infotree t                      -- the tree table
  join treewalk tw on t.parentid=tw.id -- recursive join
)
select * from treewalk order by path;
Often you'll want just a user-friendly list of node names. The text of the page you're reading, including the code, lives in a pair of MySQL tables, one of which is just like the infotree table described above. A simple way to make the node listing transparent & intuitive is to indent node names proportionately to their tree depth. Here's such a look at our Common Queries tree at time of writing ...
set @root = ( select id 
              from infotree 
              where name="Common Queries"
            ); -- subtree root value
with recursive treewalk as (
  select                                 -- query for subtree root 
    id,
    name, 
    0 as level, 
    cast( left(name,32) as char(255) ) as path 
  from infotree
  where id=@root
  union
  select                                 -- recursive query for nodes
    t.id,
    t.name, 
    tw.level+1 as level,
    concat( path, ',', left(t.name,32) ) as path  
  from infotree t                        -- tree table
  join treewalk tw on t.parentid=tw.id   -- recursive join
)
select node as "Common Queries nodes" -- just show indented node names
from (
  select 
    concat(repeat('-',level),name) as node,
    path
  from treewalk 
  order by path
) x;
+---------------------------------------------------------+
| Common Queries nodes                                    |
+---------------------------------------------------------+
| Common queries                                          |
| -Aggregates                                             |
| --Aggregate across columns                              |
| --Aggregates across multiple joins                      |
| --Aggregates and Statistics                             |
| ---Aggregates and bands                                 |
| ----Aggregates from bands of values                     |
| ----History of world records                            |
| ----Nested, banded aggregation                          |
| ---Average top 50% values per group                     |
| ---Correlation                                          |
| ---Count unique values of one column                    |
| ---Median                                               |
| ---Mode                                                 |
| ---Moving average                                       |
| ---Multiple sums across a join                          |
| ---Percentiles                                          |
| ---Rank order                                           |
| ---Running sums with & without CTEs                     |
| ---Sum across categories                                |
| --Aggregates excluding leaders                          |
| --Aggregates of specified size                          |
| --All X for which all Y are Z                           |
| --All X meeting multi-row conditions                    |
| --Avoid repeat aggregation                              |
| --Basic aggregation                                     |
| --Break out sum on condition                            |
| --Cascading aggregates                                  |
| --Combine Group_Concat() with counts                    |
| --Correlated aggregating subqueries                     |
| --Cross-aggregates                                      |
| --Delete all except the latest                          |
| --Every paying customer wins a prize                    |
| --Group data by datetime periods                        |
| --Grouping()                                            |
| --How long has the machine run?                         |
| --League table                                          |
| --List equivalences                                     |
| --Pairwise matchmaking                                  |
| --Pivot tables (crosstabs)                              |
| ---Automate pivot table queries                         |
| ---Group column statistics in rows                      |
| ---Monthly expenses                                     |
| ---Monthly sales                                        |
| ---Pivot table basics: rows to columns                  |
| ---Pivot table using math tricks                        |
| ---Pivot table with CONCAT                              |
| ---Pivot table without GROUP_CONCAT                     |
| ---Unpivot a table                                      |
| --Rollup on just some columns                           |
| --Show only one child row per parent row                |
| --Skip repeating values                                 |
| --Totals and subtotals--simply                          |
| --Track state changes                                   |
| --Values linked with all values of another column       |
| --Windowing: introduction                               |
| --Within-group aggregates                               |
| --Within-group aggregates with a wrinkle                |
| --Within-group quotas (Top N per group)                 |
| -Data comparison                                        |
| --Backslashes in data                                   |
| --Compare data in two tables                            |
| --Show rows where column value changed                  |
| --Using bit operators                                   |
| -Database metadata                                      |
| --Compare structures of two tables                      |
| --Compare two databases                                 |
| --Database size                                         |
| --Find metadata locks & waits                           |
| --Find size of all server databases                     |
| --Foreign keys                                          |
| ---Change or drop a foreign key                         |
| ---Find child tables                                    |
| ---Find cross-database foreign keys                     |
| ---Find parent tables                                   |
| ---Foreign key basics                                   |
| ---Foreign yey primer                                   |
| --Generate a query from information_schema              |
| --List databases, tables, columns                       |
| --List structural database differences                  |
| --List users of a database                              |
| --Primary keys                                          |
| ---Add auto-incrementing primary key to a table         |
| ---Auto-increment: reset next value                     |
| ---Find primary key of a table                          |
| --Rename Database                                       |
| --Show                                                  |
| ---Show Create Trigger                                  |
| ---Show Table Status from info_schema                   |
| ---Show Tables                                          |
| --Table size                                            |
| -Date and time                                          |
| --Age in years                                          |
| --Compute date from year, week number and weekday       |
| --Comvert ISO 8601 timestamp                            |
| --Count business days between two dates                 |
| --Count Tuesdays between two dates                      |
| --Data uniqueness in a period                           |
| --Date of 1st day of last, this, next month             |
| --Date of Easter                                        |
| --Date of first Friday of next month                    |
| --Date of last or next Thursday                         |
| --Date of Monday in a given week of the year            |
| --Date of Monday of this week                           |
| --Date of Nth weekday of a month                        |
| --Datetime difference                                   |
| --Duration in years, months, days, time                 |
| --First & last days of last month                       |
| --Julian date                                           |
| --Last business day before a reference date             |
| --Make a calendar table                                 |
| --Period arithmetic                                     |
| ---Audit trails and point-in-time architecture          |
| ---Find overlapping periods                             |
| ---Find sequenced duplicates                            |
| ---In/out at given date and time                        |
| ---Peak visit counts by datetime period                 |
| ---Sum for time periods                                 |
| --Scheduling                                            |
| ---Appointments available                               |
| ---Find available reservation periods                   |
| ---Game schedule                                        |
| ---Is a given booking period available?                 |
| ---Pivot table schedule                                 |
| --Scope to the week of a given date                     |
| --Sum accumulated time by date                          |
| --Sum booking days for a given month                    |
| --Sum time values                                       |
| --Track when a value changed                            |
| --US holiday dates                                      |
| --What month does a week fall in?                       |
| --YearMonth()                                           |
| -Frequencies                                            |
| --Display column values which occur N times             |
| --Display every Nth row                                 |
| -Graphs and Hierarchies                                 |
| --Dijkstra's shortest path algorithm                    |
| --Edge node path                                        |
| --Multiple trees in one table                           |
| --Tree query performance                                |
| --Trees and other hierarchies                           |
| --Trees of known depth                                  |
| --Trees, networks and parts explosions in MySQL         |
| --Treewalks with CTEs                                   |
| -JOIN                                                   |
| --Approximate joins                                     |
| --Cascading JOINs                                       |
| --Classroom scheduling                                  |
| --Data-driven joins                                     |
| --Full Outer Join                                       |
| --Intersection and difference                           |
| --JOIN intro                                            |
| --JOIN order                                            |
| --Joins vs subqueries                                   |
| ---Join or subquery?                                    |
| ---Parents with no kids                                 |
| ---Parties who have contracts with one another          |
| ---The [Not] Exists query pattern                       |
| ---The unbearable slowness of IN()                      |
| ---What exams did a student not register for?           |
| --Many-to-many joins                                    |
| --What else did buyers of X buy?                        |
| -Miscellania                                            |
| --Query execution order                                 |
| -NULLs                                                  |
| --List NULLs at end of query output                     |
| --Parents with and without children                     |
| -Ordering resultsets                                    |
| --Next row                                              |
| --Order by leading digits                               |
| --Order by month name                                   |
| --Order by numerics then alphas                         |
| --Pagination                                            |
| --Row_Number()                                          |
| --Suppress repeating ordering values                    |
| -Relational division                                    |
| --All possible recipes with given ingredients           |
| --All X for which all Y are Z (relational division)     |
| --Who makes all the parts for a given assembly?         |
| -Sequences                                              |
| --Find adjacent unbooked theatre seats                  |
| --Find average row-to-row time interval                 |
| --Find blocks of unused numbers                         |
| --Find missing numbers in a sequence                    |
| --Find missing values in a range                        |
| --Find previous and next values in a sequence           |
| --Find row with next value of specified column          |
| --Find sequence starts and ends                         |
| --Find shortest & longest per-user event time intervals |
| --Find specific sequences                               |
| --Find the next value after a sequence                  |
| --Gaps in a time series                                 |
| --Make a table of sequential ints                       |
| --Make values of a column sequential                    |
| --Sequences with CTEs                                   |
| --Track stepwise project completion                     |
| --Winning streaks                                       |
| -Spherical geometry                                     |
| --Great circle distance                                 |
| -Statistics without aggregates                          |
| --Bayesian probability                                  |
| --Top ten                                               |
| -Stored procedures                                      |
| --A cursor if necessary, but not necessarily a cursor   |
| --Emulate sp_exec                                       |
| --Error handling pre-5.7                                |
| --Variable-length argument for query IN() clause        |
| --Why minimise their use?                               |
| -Strings                                                |
| --Count delimited substrings                            |
| --Count substrings                                      |
| --Levenshtein distance                                  |
| --Proper case                                           |
| --Regex and case sensitivity                            |
| --Retrieve octets from IP addresses                     |
| --Return digits or alphas from a string                 |
| --Split a string                                        |
| --Strip HTML tags                                       |
| -User admin                                             |
| --Password expiry dates                                 |
+---------------------------------------------------------+
Do CTEs also make sums down subtrees easier? Yew bet they do. Consider this toy table ...
drop table if exists t;
create table t (
  id int primary key,
  parentid int,
  name varchar(10),
  sales int
);
insert into t values
(1, null, 'Vehicles', 0),
(2, 1,    'Cars',     0),
(3, 1,    'Bikes',    0),
(4, 2,    'Ford',    10),
(5, 4,    'Mustang',  7),
(6, 4,    'Focus',    4),
(7, 3,    'Electra',  5),
(8, 3,    'Trek',     8);
To get our result, walk subtrees in the tree table and join that result with GROUP BY sums ...
with recursive cte as (
  select t.id, t.name, t.sales, t.id as rootid -- seed row
  from t 
  union all
  select t.id, t.name, t.sales, cte.rootid     -- repeat to every leaf edge
  from t
  join cte on t.parentid = cte.id          
)
select 
  rootid, name, sales, 
  sum(sales) as subtree_sales                  -- aggregate
from cte
group by rootid
order by rootid;
+--------+----------+-------+---------------+
| rootid | name     | sales | subtree_sales |
+--------+----------+-------+---------------+
|      1 | Vehicles |     0 |            34 |
|      2 | Cars     |     0 |            21 |
|      3 | Bikes    |     0 |            13 |
|      4 | Ford     |    10 |            21 |
|      5 | Mustang  |     7 |             7 |
|      6 | Focus    |     4 |             4 |
|      7 | Electra  |     5 |             5 |
|      8 | Trek     |     8 |             8 |
+--------+----------+-------+---------------+
Yes these are simple examples with small rowcounts. But with the basic concepts in hand, more demanding queries are possible. Suppose reads of items listed in the infotree table used in the first example above are tracked in visitlog( id int unsigned primary key, infoid int unsigned, dt timestamp ), and you wish to see read counts summed down all subtrees starting from a specified root. We'll need two treewalks---one to populate a tree of all item descendants of the specified root item, a second to walk down all subtrees of the first CTE result, summing as we go.

Our aim is one query incorporating all required CTEs because that'll perform better, but if you're new to CTEs it can be a challenge to follow the cascading references, so first we just save the first CTE query result to an intermediate table that the second second CTE can run against.

Suppose we want the subtree starting at infotree node id=5. First build the tree, saving the result in a table ...

set @root=5;
create table tmp as
  with recursive treewalk as ( 
    select 
      id, parentid, 0 as level, cast( id as char(64) ) as path, 
      upper(name) as name
    from infotree 
    where id=@root                       -- row for subtree root  
    union 
    select                               -- query the nodes 
      t.id, t.parentid, tw.level+1 as level, 
      concat( path, ',', t.id ) as path, -- save path down to this node 
      concat( repeat( '.', level+1 ),    -- indent by path depth
              if( c.id is not null, upper(t.name), t.name )
            ) as name
    from infotree t 
    left join infotree c on t.id=c.parentid
    join treewalk tw on t.parentid=tw.id 
  ) 
  select                                 -- fetch rows
    t.id, t.parentid, t.level, t.path, t.name, v.N as Total
  from treewalk t
  join(                                  -- join to aggregates
    select infoid, count(*) as N
    from visitlog_all
    group by infoid
  ) v on t.id=v.infoid
  order by t.path;                       -- order list by item path
Now to this result apply the logic of subtree sums we used above in the toy sales table example:
with recursive cte as (
  select tmp.id, tmp.id as rootid, tmp.Total
  from tmp                               -- seed rows
  union all
  select tmp.id, cte.rootid, tmp.Total   -- walk all subtrees
  from tmp
  join cte on tmp.parentid = cte.id          
)
select                                   -- fetch subtree walk rows
  tmp.id, tmp.name, 
  tmp.Total, sums.SubTreeTotal
from tmp
join (                                   -- join them to the sums
  select rootid, sum(Total) as SubTreeTotal
  from cte
  group by rootid
) sums on sums.rootid=tmp.id
order by tmp.path ;
Does it perform? 3 secs on an old server to walk and sum 2K tree items against 2M visit rows, even with the write to a tmp table.

Now put it all into one query with three CTEs named treewalk, rowsums and treesums. Remember, with MySQL CTEs the RECURSIVE keyword is required if any of the WITH subqueries is recursive:

set @root=5;
with recursive 
  treewalk as (  
    select  
      id, parentid, 0 as level, cast( id as char(64) ) as path,  
      upper(name) as name 
    from infotree  
    where id=@root                          -- row for subtree root   
    union  
    select                                  -- query the nodes  
      t.id, t.parentid, tw.level+1 as level,  
      concat( path, ',', t.id ) as path,    -- save path down to this node  
      concat( repeat( '.', level+1 ),       -- indent by path depth 
              if( c.id is not null, upper(t.name), t.name ) 
            ) as name 
    from infotree t                         -- the tree table
    left join infotree c on t.id=c.parentid -- find child in tree
    join treewalk tw on t.parentid=tw.id    -- recurse
  ),
  rowsums as (
    select                               -- fetch rows 
      t.id, t.parentid, t.level, t.path, t.name, v.N as Total 
    from treewalk t 
    join(                                -- join to aggregates 
      select infoid, count(*) as N 
      from visitlog_all 
      group by infoid 
    ) v on t.id=v.infoid 
    order by t.path                      -- order list by item path 
  ),
  treesums as ( 
    select                               -- seed rows
      rowsums.id, rowsums.id as rootid, 
      rowsums.Total 
    from rowsums  
    union all 
    select 
      rowsums.id, treesums.rootid, 
      rowsums.Total                      -- walk all subtrees 
    from rowsums 
    join treesums on rowsums.parentid = treesums.id           
  ) 
select                                   -- fetch rowsums 
  rowsums.id, rowsums.name,  
  rowsums.Total, sums.SubTreeTotal 
from rowsums 
join (                                   -- join them to subtree sums 
  select rootid, sum(Total) as SubTreeTotal 
  from treesums 
  group by rootid 
) sums on sums.rootid=rowsums.id 
order by rowsums.path ;                  -- tree path order
It returns its result in 1.4 secs against 2K infotree rows and 2M visit rows.

It would be useful as a parameterised View, but unfortunately MySQL doesn't (yet?) support parameterised Views. Then make it an sproc with a param for the subtree root ...

drop procedure if exists treereads;
delimiter go
create procedure treereads( root int unsigned )
begin
  with recursive  
    treewalk as (   
      select   
        id, parentid, 0 as level, cast( id as char(64) ) as path,   
        upper(name) as name  
      from infotree   
      where id=root                        -- row for subtree root    
      union   
      select                               -- query the nodes   
        t.id, t.parentid, tw.level+1 as level,   
        concat( path, ',', t.id ) as path, -- save path down to this node   
        concat( repeat( '.', level+1 ),    -- indent by path depth  
                if( c.id is not null, upper(t.name), t.name )  
              ) as name  
      from infotree t   
      left join infotree c on t.id=c.parentid  
      join treewalk tw on t.parentid=tw.id   
    ), 
    rowsums as ( 
      select                             -- fetch rows  
        t.id, t.parentid, t.level, t.path, t.name, v.N as Total  
      from treewalk t  
      join(                              -- join to aggregates  
        select infoid, count(*) as N  
        from visitlog_all  
        group by infoid  
      ) v on t.id=v.infoid  
      order by t.path                    -- order list by item path  
    ), 
    treesums as (  
      select rowsums.id, rowsums.id as rootid, rowsums.Total  
      from rowsums                       -- seed rows  
      union all  
      select  
        rowsums.id, treesums.rootid,  
        rowsums.Total                    -- walk all subtrees  
      from rowsums  
      join treesums on rowsums.parentid = treesums.id            
    )  
  select                                 -- fetch rowsums  
    rowsums.id, rowsums.name,   
    rowsums.Total, sums.SubTreeTotal  
  from rowsums  
  join (                                 -- join them to subtree sums  
    select rootid, sum(Total) as SubTreeTotal  
    from treesums  
    group by rootid  
  ) sums on sums.rootid=rowsums.id  
  order by rowsums.path ;                -- tree path order 
end;
go
delimiter ;

Last updated 24 Jan 2021