Gap locks

from the Artful MySQL Tips List


On the MySQL InnoDB forum, a user asked this question ...
CREATE TABLE foo (
  a INT NOT NULL,
  b INT NOT NULL,
  c CHAR(1),
  PRIMARY KEY (a),
  KEY (b)
) ENGINE=InnoDB;

INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E');

Test case 1

Session 1:
START TRANSACTION;
SELECT * FROM foo WHERE a < 2 FOR UPDATE;

Session 2:
DELETE FROM foo WHERE a = 3;  -- Success

Test case 2

This uses the original rows of the table with the deleted record returned.

Session 1:
START TRANSACTION;
SELECT * FROM foo WHERE b < 2 FOR UPDATE;

Session 2:
DELETE FROM foo WHERE b = 3;  -- Blocks

Locking the index entry with b = 3 depends only on the uniqueness of the index used. The transaction isolation level does not matter.

Why does InnoDB block the next index entry to the right of the
 scanned range in case of a non-unique index? Is there any 
 practical reason for this? Can someone give an example of a 
 problem that could happen if the record with b = 3 is not 
 blocked in the second test case?
Oracle staffer Jakub Lopuszanski offered this response ...
MySQL has a handy tool for analysing scenarios involving more than 
one client, called mysql-test-run.pl, or mtr for short - you can 
find it in mysql-test subdirectory.
So, I've adapted your scenarios to mtr scripts, so I can 
reproduce them easily.

--echo "#"
--echo "# SCENARIO WITH PRIMARY KEY (a)"
--echo "#"

CREATE TABLE foo (
  a INT NOT NULL,
  b INT NOT NULL,
  c CHAR(1),
  PRIMARY KEY (a),
  KEY (b)
) ENGINE=InnoDB;

INSERT INTO foo VALUES 
  (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E');

--connect (c1, localhost, root,,)
	START TRANSACTION;
	SELECT * FROM foo WHERE a < 2 FOR UPDATE;
--echo "# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 
  and gap before it, and gap before row a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connect (c2, localhost, root,,)
	START TRANSACTION;
	DELETE FROM foo WHERE a = 3;

--echo "# I'd expect additional exclusive lock by c2 on PRIMARY KEY 
row a=3 (which is delete marked and not yet purged, so read views 
can see state of DB before uncommitted c2)."

--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connection c1
	COMMIT;

--echo "# I'd expect only the exclusive lock by c2 on PRIMARY row a=3."
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connection c2
COMMIT;

--connection default
--disconnect c2
--disconnect c1
DROP TABLE foo;

As you can see I'm querying the performance_schema.data_locks to 
see what data locks are being held. This table is documented at 
https://dev.mysql.com/doc/mysql-perfschema-excerpt/8.0/en/data-locks-table.html.

It's important to understand several things when interpreting 
content of data_locks table:
- the foo table has two indexes (PRIMARY and KEY(b)), which can 
be thought of as two separate collections of rows
- the PRIMARY KEY can be imagined as a sorted list of (a,b,c) tuples: 
  (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E')
- the secondary KEY(b) can be imagined as a sorted list of (b,a) 
  tuples: (1,1), (3,3), (5,5), (7,7), (9,9)
- if we imagine these sorted tuples as laying on an axis, then a 
  record lock can have several types:
  - "REC_NOT_GAP" - such lock protects just the tuple itself
  - "GAP" - such lock protects just the gap before (to the left of) 
    the tuples
  - "", the default - such lock protects both: the gap before and 
    the tuple itself
- locks can be e"X"clusive or "S"shared. (Currently this doesn't 
  matter for the gap before record - even "X" is treated as "S" 
  for the gap part of a lock)

If you place the above test in mysql-test/suites/innodb/t/ and 
run it with ./mtr --record forum.test; it will generate 
mysql-test/suites/innodb/r/forum.result file, in which you 
should see following fragments:

"# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 and 
gap before it, and gap before row a=3"
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id	index_name	lock_mode	lock_status	lock_data
1816	                PRIMARY	        X	        GRANTED	        1
1816	                PRIMARY	        X,GAP	        GRANTED	        3
START TRANSACTION;

Indeed the observation matches the expectation: X,GAP on 3 means 
that the row itself is not protected, just the gap spanning range
between 1 and 3 (non-inclusive). And as expected, this allows 
the c2's DELETE to succeed:

"# I'd expect additional exclusive lock by c2 on PRIMARY KEY row a=3
(which is delete marked and not yet purged, so read views can see 
state of DB before uncommitted c2)."
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id	index_name	lock_mode	lock_status	        lock_data
1817	                PRIMARY	    X,        REC_NOT_GAP	GRANTED	   3
1816	                PRIMARY	    X	        GRANTED	               1
1816	                PRIMARY	    X,GAP	    GRANTED	               3  

Here c1=1816 and c2=1817 (actual transaction ids may vary from run 
to run, obviously).
And we see that c2 successfully got GRANTED the X,REC_NOT_GAP lock on
the row a=3, despite c1 having X,GAP lock on a=3. Cool.

Now, let's move on to the next scenario. I've modified it a bit to
clarify one potential confusion. You wrote:
> Locking the index entry with b = 3 depends only on the uniqueness 
  of the index used. 
I'll demonstrate that locking happens even if the index on 'b' 
is UNIQUE. What matters is rather that it is a secondary (=not 
PRIMARY) index.

--echo "#"
--echo "# SCENARIO WITH UNIQUE KEY (b)"
--echo "#"

CREATE TABLE foo (
  a INT NOT NULL,
  b INT NOT NULL,
  c CHAR(1),
  PRIMARY KEY (a),
  UNIQUE KEY (b)
) ENGINE=InnoDB;

INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E');

--connect (c1, localhost, root,,)
START TRANSACTION;
SELECT * FROM foo WHERE b < 2 FOR UPDATE;
--echo "# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 
  and gap before it, and gap before row b=3,a=3, and exclusive lock 
  on PRIMARY KEY row a=1,"
--echo "# instead the lock on b=3,a=3 covers not only gap, but also
  record"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connect (c2, localhost, root,,)
START TRANSACTION;
--echo "# because c1 latched the INDEX 'b' row b=3,a=3, this has to wait:"
--send
DELETE FROM foo WHERE b = 3;

# one could use SET DEBUG_SYNC='lock_wait_will_wait SIGNAL c2_will_wait'
before DELETE in c2, and SET DEBUG_SYNC='now WAIT_FOR c2_will_wait' 
here instead of sleep,
# but this wouldn't work on release build of mysqld
--sleep 1
--echo "# I'd expect additional exclusive WAITING lock by c2 
  on KEY 'b' row b=3,a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connection c1
COMMIT;

--echo "# I'd expect exclusive lock by c2 on KEY 'b' row b=3,a=3, 
  and exclusive lock on PRIMARY KEY row a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';

--connection c2
	--reap
COMMIT;

--connection default
--disconnect c2
--disconnect c1
DROP TABLE foo;

Let's focus on the most intriguing part of the result:

"# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 and 
gap before it, and gap before row b=3,a=3, and exclusive lock 
on PRIMARY KEY row a=1,"
"# instead the lock on b=3,a=3 covers not only gap, but also record"

SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data 
FROM performance_schema.data_locks 
WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id	index_name	lock_mode	   lock_status	lock_data
1841	                b	          X	            GRANTED	        1, 1
1841	                b	          X	            GRANTED	        3, 3
1841	                PRIMARY	    X,REC_NOT_GAP	GRANTED	        1

Here, the (b=3,a=3) tuple is latched together with the gap before 
it - this is the default in our code.
It takes extra development effort to implement (and prove 
correctness of) locking in REC_NOT_GAP or GAP mode.
Note that this locking happens even though the KEY (b) is UNIQUE.

The logic of choosing correct lock mode was recently (8.0.18) 
fixed for PRIMARY KEY, and this is why you don't see problems 
with it for PRIMARY KEY.
The logic is now contained in 

You can see an explicit `if (index == clust_index` condition  
in it,
row_compare_row_to_range(). (clust_index is short for clustered index, 
which is a different way of saying "PRIMARY KEY").

If I remove this `index == clust_index && ` from the condition, 
recompile, and run the test again, I see a change in the locks taken:
 engine_transaction_id  index_name      lock_mode       lock_status     lock_data
 1841                   b               X               GRANTED         1, 1
-1841                   b               X               GRANTED         3, 3
 1841                   PRIMARY X,      REC_NOT_GAP     GRANTED         1
+1841                   b               X,GAP           GRANTED         3, 3

This is probably a change you would welcome: only the gap before 
(b=3,a=3) - but not the tuple itself - is locked. As expected, 
it follows that c2 can successfully lock and remove the row 
with (b=3,a=3).

What's wrong with such a change?
It took more than a month to implement, verify and test 
the previous change for PRIMARY KEY. The `if` in question is 
"guarded" by this comment:

  /* Following heuristics are meant to avoid locking the row itself, or even
  the gap before it, in case when the row is "after the end of range". The
  difficulty here is in that the index itself can be declared as either
  ascending or descending, separately for each column, and cursor can be
  PAGE_CUR_G(E) or PAGE_CUR_L(E) etc., and direction of scan can be 0,
  ROW_SEL_NEXT or ROW_SEL_PREV, and this might be a secondary index (with
  duplicates). So we limit ourselves just to the cases, which are at the
  same common, tested, actionable and easy to reason about.
  In particular we only handle cases where we iterate the index in its
  natural order. */

What is it all about? If you declare the index as DESC then it has 
tuples sorted in the descending order: 
(9,9), (7,7), (5,5), (3,3), (1,1)
This changes what "gap before record" mean: now it means values 
larger than the record. So, for example, now X,GAP on (3,3) would 
protect (4,4), but not (2,2). This clearly changes some decisions 
to be made, as now we would probably prefer to only X lock (1,1) 
in default mode. But wait, there is more. The scan itself can be 
ascending or descending depending on the ORDER BY clause. Moreover, 
the sign of inequality can be either "sharp" as in b<3 or "soft" 
as in b<=3. Exponential explosion of cases escalates quickly...
But wait, there is more, if the index is non-unique, we have 
additional fun, of having, say (b=3,a=5),(b=3,a=3),(b=3,a=2) 
in it, and having to figure out which (if any) of the "siblings" 
to lock, which gaps, etc. Also, the index can be defined on 
multiple columns (KEY (b,c)) while the query can reference just 
some prefix of it (WHERE b<3), or even a prefix of a column if 
it is a string. I'm sure how NULL values (in particular, in 
"tiebreaker" columns) should be handled for various sort orders.
Part (perhaps large) of the problem is that 
row_compare_row_to_range() is used inside row_search_mvcc() 
function, which spans 1500+ lines of code starting from 
here
- not a toy puzzle to analyse.
You may want to look into 
./mysql-test/suite/innodb/t/lock_end_of_range.test 
which demonstrates some (~400) cases we've tested for 
PRIMARY KEY to make "sure" it works as it should.

So, there would be some non-trivial development effort, to 
"just" remove this one conjunct from the `if`. As I wrote 
above: I wouldn't use the word "bug", because there is no data 
loss, wrong answer, etc. - just less concurrency than one could 
imagine in a different implementation (which might have other 
issues in other scenarios).

And I'm reluctant to even use the word "inefficiency", given 
that "number of resources locked" is just one axis, some of 
others being "simplicity of design" and "speed of development".

I'd love to see a well tested simple contribution for this 
"issue", though.

As an encouragement, I've run the full ./mtr test suite with 
the "index==clust_index&&" removed, and it seems to pass most 
tests, except for:

Failing test(s): sys_vars.innodb_monitor_reset_all_basic
 sys_vars.innodb_monitor_enable_basic innodb.monitor i_main.update 
sys_vars.innodb_monitor_reset_basic sys_vars.innodb_monitor_disable_basic

So, now it "only" remains to prove somehow the optimization is 
really the correct thing to do (or find and handle counterexamples) 
and fix those test cases and/or code logic.
The discussion yielded bug report https://bugs.mysql.com/bug.php?id=98639.

Return to the Artful MySQL Tips page