query optimization - Why does adding OR to MySQL statement with WHERE IN causes a full table scan? - Stack Overflow

We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE I

We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery) statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR to a query that has WHERE IN significantly decreases performance, even if it is constant time.

Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
);

/*
-> Nested loop inner join  (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
    -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
    -> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id)  (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/

Now take a look what happens when I add an OR 1=0:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change

/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2))  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
    -> Table scan on sessions  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id))  (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
            -> Limit: 1 row(s)  (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id)  (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
                    -> Materialize with deduplication  (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
                        -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/

Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR combined with WHERE IN.

Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?

We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery) statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR to a query that has WHERE IN significantly decreases performance, even if it is constant time.

Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
);

/*
-> Nested loop inner join  (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
    -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
    -> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id)  (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/

Now take a look what happens when I add an OR 1=0:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change

/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2))  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
    -> Table scan on sessions  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id))  (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
            -> Limit: 1 row(s)  (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id)  (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
                    -> Materialize with deduplication  (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
                        -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/

Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR combined with WHERE IN.

Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?

Share Improve this question edited Nov 18, 2024 at 17:17 jgawrych asked Nov 17, 2024 at 5:30 jgawrychjgawrych 3,5411 gold badge31 silver badges41 bronze badges 7
  • If you analyze how an index (usually a B-tree) is used, you will see that WHERE IN (...) renders the index not very usable or beneficial. – Tim Biegeleisen Commented Nov 17, 2024 at 5:32
  • 1 As far as I know the optimizer translates exists and in subqueries into joins when suited(see the first explain). Finding an or seems to rule this out(combining a subquery and an or to a join is not straightforward...) and the two queries strategy is taken. – Turo Commented Nov 17, 2024 at 8:24
  • Curiosity killed the cat: Did you try what happens when you put the or in the subquery? – Turo Commented Nov 17, 2024 at 8:38
  • @Turo - EXISTS is a "semijoin"; it works somewhat like a JOIN, but can stop when one row matches. IN is often optimized more poorly than the equivalent. – Rick James Commented Nov 17, 2024 at 16:11
  • @Turo for curiosity's sake, moving the OR 1=0 into the subquery did not cause the optimizer to change plans. Interestingly, I then tried converting the query into an EXISTS and it didn't change either. However, adding back my real condition, caused a completely different plan, but it's worse than ever, joining two full table index scans – jgawrych Commented Nov 18, 2024 at 17:32
 |  Show 2 more comments

1 Answer 1

Reset to default 2
  • When building a WHERE clause, avoid unnecessary things. (I, too, am surprised that the Optimizer failed to see that as OR FALSE and remove it.)

  • Write the query as a simple JOIN; the Optimizer is better at optimizing JOIN than IN ( SELECT ... ):

    SELECT s.*
        FROM presenters AS p
        JOIN sessions AS s  USING (session_id)
        WHERE p.user_id = 71
    
  • Have INDEX(session_id) on `sessions.

  • File a bug with bugs.mysql if you feel strongly enough about this issue.

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745637341a4637474.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信