query optimization - MySQL filtered gaps and islands: avoiding temporaries and filesorts? - Stack Overflow

CREATE TABLE `messages` (`ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,`Arrival` TIMESTAMP NOT NULL,`Sen

CREATE TABLE `messages` (
    `ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Arrival` TIMESTAMP NOT NULL,
    `SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
    PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
    INDEX `ID` (`ID`) USING BTREE,
    INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)

The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`)) by month, calendar-aligned.

Normally each SenderID sends at least one message in five minutes. Now the task is to report gaps within some that are larger than some specified amount of time. Currently the query (@aFrom and @aTo are set to December 2024, @aDays equals 5) is this:

    WITH t AS (
        SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
            LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
            LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
        FROM `Messages`
            INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
        WHERE `Arrival` BETWEEN @aFrom AND @aTo
    )
    SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
        UNION
    SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
    ORDER BY `SenderID`, `From`;

But such a query has this EXPLAIN output:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> (NULL) ALL (NULL) (NULL) (NULL) (NULL) 2700140 100,00 Using where
2 UNCACHEABLE DERIVED Senders (NULL) ALL SenderID (NULL) (NULL) (NULL) 14430 100,00 Using where; Using temporary; Using filesort
2 UNCACHEABLE DERIVED Messages p202412 ref PRIMARY,Arrival_SenderID PRIMARY 4 testdb.Senders.SenderID 374 50,00 Using where; Using index
3 UNION <derived2> (NULL) ALL (NULL) (NULL) (NULL) (NULL) 2700140 100,00 Using where
(NULL) UNION RESULT <union1,3> (NULL) ALL (NULL) (NULL) (NULL) (NULL) (NULL) (NULL) Using temporary; Using filesort
CREATE TABLE `messages` (
    `ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Arrival` TIMESTAMP NOT NULL,
    `SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
    PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
    INDEX `ID` (`ID`) USING BTREE,
    INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)

The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`)) by month, calendar-aligned.

Normally each SenderID sends at least one message in five minutes. Now the task is to report gaps within some that are larger than some specified amount of time. Currently the query (@aFrom and @aTo are set to December 2024, @aDays equals 5) is this:

    WITH t AS (
        SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
            LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
            LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
        FROM `Messages`
            INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
        WHERE `Arrival` BETWEEN @aFrom AND @aTo
    )
    SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
        UNION
    SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
    ORDER BY `SenderID`, `From`;

But such a query has this EXPLAIN output:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> (NULL) ALL (NULL) (NULL) (NULL) (NULL) 2700140 100,00 Using where
2 UNCACHEABLE DERIVED Senders (NULL) ALL SenderID (NULL) (NULL) (NULL) 14430 100,00 Using where; Using temporary; Using filesort
2 UNCACHEABLE DERIVED Messages p202412 ref PRIMARY,Arrival_SenderID PRIMARY 4 testdb.Senders.SenderID 374 50,00 Using where; Using index
3 UNION <derived2> (NULL) ALL (NULL) (NULL) (NULL) (NULL) 2700140 100,00 Using where
(NULL) UNION RESULT <union1,3> (NULL) ALL (NULL) (NULL) (NULL) (NULL) (NULL) (NULL) Using temporary; Using filesort

`Senders`.`SenderID` is UNIQUE-indexed. Total number of active senders is around 12000 by now (a few thousands are inactive and normally don’t send messages), total number of messages per month is roughly 100 millions.

When run, this query gets processed for an hour and then MySQL might even just crash (possibly because of the large temporary dataset it produces?). MySQL version for this particular test server is 8.0.20, other target servers might have other 8.0.x versions.

I tried to start building the query from scratch but whenever I add LEAD() or LAG() with WITH (to filter on the gap size), temporary and filesort occur in the Extra column as well as UNCACHEABLE DERIVED in select_type. Are there any ways to improve the query?

Just in case it matters (which I personally doubt), imaginary sample data (two main columns only, PRIMARY KEY order):

SenderID Arrival
42 2024-12-05 01:18:23
42 2024-12-05 03:21:47
42 2024-12-07 08:19:16
42 2024-12-11 03:27:59
42 2024-12-14 22:43:08
42 2024-12-15 01:18:23
85 2024-12-01 02:03:39
85 2024-12-04 02:03:39
85 2024-12-07 02:03:39
85 2024-12-10 02:03:39
85 2024-12-13 02:03:39
85 2024-12-16 02:03:39
85 2024-12-22 02:03:39
85 2024-12-25 02:03:39
85 2024-12-28 02:03:39
85 2024-12-31 02:03:39
158 2024-12-06 14:52:17
158 2024-12-08 14:52:17
158 2024-12-11 14:52:17
158 2024-12-14 14:52:17
158 2024-12-17 14:52:17
158 2024-12-20 14:52:17
158 2024-12-23 14:52:17
158 2024-12-26 14:52:17

For interval from 2024-12-01 00:00:00 to 2024-12-31 23:59:59 and @aDays = 5 the following gaps should be found:

SenderID From To Comment
42 2024-12-15 01:18:23 2024-12-31 23:59:59 <- Lost at the end of month
85 2024-12-16 02:03:39 2024-12-22 02:03:39 <- Got lost temporarily
158 2024-12-01 00:00:00 2024-12-06 14:52:17 <- Was lost at the beginning of month
158 2024-12-26 14:52:17 2024-12-31 23:59:59 <- and lost at the end

The query shown above works as expected for small datasets but is extremely slow and even causes MySQL crashes for real data.

Share edited Mar 10 at 11:44 Akina 42.9k6 gold badges16 silver badges29 bronze badges asked Mar 10 at 8:37 Dmitry VasilievDmitry Vasiliev 236 bronze badges 7
  • Why are you checking both LEAD() and LAG()? It may be enough to check one way. Also, UNION instead of UNION ALL will cause sorting to exclude duplicates. And you will have them almost all duplicates. – ValNik Commented Mar 10 at 9:23
  • Add @from, MIN(arrival) and MAX(arrival), @to for each sender in CTE, this will avoid IFNULL(). get not LAG/LEAD but those +/- according INTERVAL, which allows immediate compare in outer query. Test alternative query with two WHERE NOT EXISTS. – Akina Commented Mar 10 at 9:24
  • @ValNik Well, in fact, I just found a similar query on StackOverflow and edited it accordingly. The main reason to check both is that only LEAD() or only LAG() would skip the cases when sender disappeared in the middle of the month or haven’t sent anything until the middle of the month. – Dmitry Vasiliev Commented Mar 10 at 9:44
  • @Akina I’m not quite sure I understand your suggestions. Where should I try to put WHERE NOT EXISTS? And what way can I use «+/– according INTERVAL» to achieve which immediate compare? Instead of IFNULLs, I guess, using third parameter to LEAD/LAG could help, although, AFAIR, MariaDB does not support it (which I’d like to keep in mind as well). – Dmitry Vasiliev Commented Mar 10 at 9:48
  • See Tips for asking a good Structured Query Language (SQL) question. Create small sample data (2-3 senders, 5-10 rows per sender) and provide needed output for this data and 1-2 month desired dates period. – Akina Commented Mar 10 at 10:09
 |  Show 2 more comments

1 Answer 1

Reset to default 1

Test this query:

WITH 
add_date_from (SenderID, Arrival) AS (
  SELECT SenderID, Arrival
  FROM messages
  UNION ALL
  SELECT DISTINCT SenderID, @date_from
  FROM messages
  ),
get_gaps (SenderID, Arrival, next_arrival, gap) AS (
  SELECT SenderID, 
         Arrival,
         COALESCE(LEAD(Arrival) OVER win, @date_to) next_arrival,
         TIMESTAMPDIFF(DAY, Arrival, COALESCE(LEAD(Arrival) OVER win, @date_to)) gap
  FROM add_date_from
  WINDOW win AS (PARTITION BY SenderID ORDER BY Arrival)
  )
SELECT SenderID, Arrival `From`, next_arrival `To`
FROM get_gaps
WHERE gap >= @Days
ORDER BY SenderID, Arrival
SenderID From To
42 2024-12-15 01:18:23 2024-12-31 23:59:59
85 2024-12-16 02:03:39 2024-12-22 02:03:39
158 2024-12-01 00:00:00 2024-12-06 14:52:17
158 2024-12-26 14:52:17 2024-12-31 23:59:59

fiddle

PS. If there exists some rows whose Arrival value is equal to @date_from value strictly then you may receive duplicates. You'd avoid them - either add according WHERE to the 2nd subquery of the 1st CTE or add DISTINCT to the outer query.

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744856350a4597430.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信