On the MySQL InnoDB forum, a user asked this question ...
CREATE TABLE foo ( a INT NOT NULL, b INT NOT NULL, c CHAR(1), PRIMARY KEY (a), KEY (b) ) ENGINE=InnoDB; INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E'); Test case 1 Session 1: START TRANSACTION; SELECT * FROM foo WHERE a < 2 FOR UPDATE; Session 2: DELETE FROM foo WHERE a = 3; -- Success Test case 2 This uses the original rows of the table with the deleted record returned. Session 1: START TRANSACTION; SELECT * FROM foo WHERE b < 2 FOR UPDATE; Session 2: DELETE FROM foo WHERE b = 3; -- Blocks Locking the index entry with b = 3 depends only on the uniqueness of the index used. The transaction isolation level does not matter. Why does InnoDB block the next index entry to the right of the scanned range in case of a non-unique index? Is there any practical reason for this? Can someone give an example of a problem that could happen if the record with b = 3 is not blocked in the second test case?Oracle staffer Jakub Lopuszanski offered this response ... MySQL has a handy tool for analysing scenarios involving more than one client, called mysql-test-run.pl, or mtr for short - you can find it in mysql-test subdirectory. So, I've adapted your scenarios to mtr scripts, so I can reproduce them easily. --echo "#" --echo "# SCENARIO WITH PRIMARY KEY (a)" --echo "#" CREATE TABLE foo ( a INT NOT NULL, b INT NOT NULL, c CHAR(1), PRIMARY KEY (a), KEY (b) ) ENGINE=InnoDB; INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E'); --connect (c1, localhost, root,,) START TRANSACTION; SELECT * FROM foo WHERE a < 2 FOR UPDATE; --echo "# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 and gap before it, and gap before row a=3" --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connect (c2, localhost, root,,) START TRANSACTION; DELETE FROM foo WHERE a = 3; --echo "# I'd expect additional exclusive lock by c2 on PRIMARY KEY row a=3 (which is delete marked and not yet purged, so read views can see state of DB before uncommitted c2)." --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connection c1 COMMIT; --echo "# I'd expect only the exclusive lock by c2 on PRIMARY row a=3." --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connection c2 COMMIT; --connection default --disconnect c2 --disconnect c1 DROP TABLE foo; As you can see I'm querying the performance_schema.data_locks to see what data locks are being held. This table is documented at https://dev.mysql.com/doc/mysql-perfschema-excerpt/8.0/en/data-locks-table.html. It's important to understand several things when interpreting content of data_locks table: - the foo table has two indexes (PRIMARY and KEY(b)), which can be thought of as two separate collections of rows - the PRIMARY KEY can be imagined as a sorted list of (a,b,c) tuples: (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E') - the secondary KEY(b) can be imagined as a sorted list of (b,a) tuples: (1,1), (3,3), (5,5), (7,7), (9,9) - if we imagine these sorted tuples as laying on an axis, then a record lock can have several types: - "REC_NOT_GAP" - such lock protects just the tuple itself - "GAP" - such lock protects just the gap before (to the left of) the tuples - "", the default - such lock protects both: the gap before and the tuple itself - locks can be e"X"clusive or "S"shared. (Currently this doesn't matter for the gap before record - even "X" is treated as "S" for the gap part of a lock) If you place the above test in mysql-test/suites/innodb/t/ and run it with ./mtr --record forum.test; it will generate mysql-test/suites/innodb/r/forum.result file, in which you should see following fragments: "# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 and gap before it, and gap before row a=3" SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; engine_transaction_id index_name lock_mode lock_status lock_data 1816 PRIMARY X GRANTED 1 1816 PRIMARY X,GAP GRANTED 3 START TRANSACTION; Indeed the observation matches the expectation: X,GAP on 3 means that the row itself is not protected, just the gap spanning range between 1 and 3 (non-inclusive). And as expected, this allows the c2's DELETE to succeed: "# I'd expect additional exclusive lock by c2 on PRIMARY KEY row a=3 (which is delete marked and not yet purged, so read views can see state of DB before uncommitted c2)." SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; engine_transaction_id index_name lock_mode lock_status lock_data 1817 PRIMARY X, REC_NOT_GAP GRANTED 3 1816 PRIMARY X GRANTED 1 1816 PRIMARY X,GAP GRANTED 3 Here c1=1816 and c2=1817 (actual transaction ids may vary from run to run, obviously). And we see that c2 successfully got GRANTED the X,REC_NOT_GAP lock on the row a=3, despite c1 having X,GAP lock on a=3. Cool. Now, let's move on to the next scenario. I've modified it a bit to clarify one potential confusion. You wrote: > Locking the index entry with b = 3 depends only on the uniqueness of the index used. I'll demonstrate that locking happens even if the index on 'b' is UNIQUE. What matters is rather that it is a secondary (=not PRIMARY) index. --echo "#" --echo "# SCENARIO WITH UNIQUE KEY (b)" --echo "#" CREATE TABLE foo ( a INT NOT NULL, b INT NOT NULL, c CHAR(1), PRIMARY KEY (a), UNIQUE KEY (b) ) ENGINE=InnoDB; INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E'); --connect (c1, localhost, root,,) START TRANSACTION; SELECT * FROM foo WHERE b < 2 FOR UPDATE; --echo "# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 and gap before it, and gap before row b=3,a=3, and exclusive lock on PRIMARY KEY row a=1," --echo "# instead the lock on b=3,a=3 covers not only gap, but also record" --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connect (c2, localhost, root,,) START TRANSACTION; --echo "# because c1 latched the INDEX 'b' row b=3,a=3, this has to wait:" --send DELETE FROM foo WHERE b = 3; # one could use SET DEBUG_SYNC='lock_wait_will_wait SIGNAL c2_will_wait' before DELETE in c2, and SET DEBUG_SYNC='now WAIT_FOR c2_will_wait' here instead of sleep, # but this wouldn't work on release build of mysqld --sleep 1 --echo "# I'd expect additional exclusive WAITING lock by c2 on KEY 'b' row b=3,a=3" --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connection c1 COMMIT; --echo "# I'd expect exclusive lock by c2 on KEY 'b' row b=3,a=3, and exclusive lock on PRIMARY KEY row a=3" --connection default SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; --connection c2 --reap COMMIT; --connection default --disconnect c2 --disconnect c1 DROP TABLE foo; Let's focus on the most intriguing part of the result: "# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 and gap before it, and gap before row b=3,a=3, and exclusive lock on PRIMARY KEY row a=1," "# instead the lock on b=3,a=3 covers not only gap, but also record" SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo'; engine_transaction_id index_name lock_mode lock_status lock_data 1841 b X GRANTED 1, 1 1841 b X GRANTED 3, 3 1841 PRIMARY X,REC_NOT_GAP GRANTED 1 Here, the (b=3,a=3) tuple is latched together with the gap before it - this is the default in our code. It takes extra development effort to implement (and prove correctness of) locking in REC_NOT_GAP or GAP mode. Note that this locking happens even though the KEY (b) is UNIQUE. The logic of choosing correct lock mode was recently (8.0.18) fixed for PRIMARY KEY, and this is why you don't see problems with it for PRIMARY KEY. The logic is now contained in You can see an explicit `if (index == clust_index` condition in it, row_compare_row_to_range(). (clust_index is short for clustered index, which is a different way of saying "PRIMARY KEY"). If I remove this `index == clust_index && ` from the condition, recompile, and run the test again, I see a change in the locks taken: engine_transaction_id index_name lock_mode lock_status lock_data 1841 b X GRANTED 1, 1 -1841 b X GRANTED 3, 3 1841 PRIMARY X, REC_NOT_GAP GRANTED 1 +1841 b X,GAP GRANTED 3, 3 This is probably a change you would welcome: only the gap before (b=3,a=3) - but not the tuple itself - is locked. As expected, it follows that c2 can successfully lock and remove the row with (b=3,a=3). What's wrong with such a change? It took more than a month to implement, verify and test the previous change for PRIMARY KEY. The `if` in question is "guarded" by this comment: /* Following heuristics are meant to avoid locking the row itself, or even the gap before it, in case when the row is "after the end of range". The difficulty here is in that the index itself can be declared as either ascending or descending, separately for each column, and cursor can be PAGE_CUR_G(E) or PAGE_CUR_L(E) etc., and direction of scan can be 0, ROW_SEL_NEXT or ROW_SEL_PREV, and this might be a secondary index (with duplicates). So we limit ourselves just to the cases, which are at the same common, tested, actionable and easy to reason about. In particular we only handle cases where we iterate the index in its natural order. */ What is it all about? If you declare the index as DESC then it has tuples sorted in the descending order: (9,9), (7,7), (5,5), (3,3), (1,1) This changes what "gap before record" mean: now it means values larger than the record. So, for example, now X,GAP on (3,3) would protect (4,4), but not (2,2). This clearly changes some decisions to be made, as now we would probably prefer to only X lock (1,1) in default mode. But wait, there is more. The scan itself can be ascending or descending depending on the ORDER BY clause. Moreover, the sign of inequality can be either "sharp" as in b<3 or "soft" as in b<=3. Exponential explosion of cases escalates quickly... But wait, there is more, if the index is non-unique, we have additional fun, of having, say (b=3,a=5),(b=3,a=3),(b=3,a=2) in it, and having to figure out which (if any) of the "siblings" to lock, which gaps, etc. Also, the index can be defined on multiple columns (KEY (b,c)) while the query can reference just some prefix of it (WHERE b<3), or even a prefix of a column if it is a string. I'm sure how NULL values (in particular, in "tiebreaker" columns) should be handled for various sort orders. Part (perhaps large) of the problem is that row_compare_row_to_range() is used inside row_search_mvcc() function, which spans 1500+ lines of code starting from here - not a toy puzzle to analyse. You may want to look into ./mysql-test/suite/innodb/t/lock_end_of_range.test which demonstrates some (~400) cases we've tested for PRIMARY KEY to make "sure" it works as it should. So, there would be some non-trivial development effort, to "just" remove this one conjunct from the `if`. As I wrote above: I wouldn't use the word "bug", because there is no data loss, wrong answer, etc. - just less concurrency than one could imagine in a different implementation (which might have other issues in other scenarios). And I'm reluctant to even use the word "inefficiency", given that "number of resources locked" is just one axis, some of others being "simplicity of design" and "speed of development". I'd love to see a well tested simple contribution for this "issue", though. As an encouragement, I've run the full ./mtr test suite with the "index==clust_index&&" removed, and it seems to pass most tests, except for: Failing test(s): sys_vars.innodb_monitor_reset_all_basic sys_vars.innodb_monitor_enable_basic innodb.monitor i_main.update sys_vars.innodb_monitor_reset_basic sys_vars.innodb_monitor_disable_basic So, now it "only" remains to prove somehow the optimization is really the correct thing to do (or find and handle counterexamples) and fix those test cases and/or code logic.The discussion yielded bug report https://bugs.mysql.com/bug.php?id=98639. Last updated 17 Feb 2020 |
![]() |