query optimization - MySQL filtered gaps and islands: avoiding temporaries and filesorts?

CREATE TABLE `messages` (
    `ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Arrival` TIMESTAMP NOT NULL,
    `SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
    PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
    INDEX `ID` (`ID`) USING BTREE,
    INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)

The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`)) by month, calendar-aligned.

Normally each SenderID sends at least one message in five minutes. Now the task is to report gaps within some that are larger than some specified amount of time. Currently the query (@aFrom and @aTo are set to December 2024, @aDays equals 5) is this:

    WITH t AS (
        SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
            LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
            LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
        FROM `Messages`
            INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
        WHERE `Arrival` BETWEEN @aFrom AND @aTo
    )
    SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
        UNION
    SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
    ORDER BY `SenderID`, `From`;

But such a query has this EXPLAIN output:

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	2700140	100,00	Using where
2	UNCACHEABLE DERIVED	Senders	(NULL)	ALL	SenderID	(NULL)	(NULL)	(NULL)	14430	100,00	Using where; Using temporary; Using filesort
2	UNCACHEABLE DERIVED	Messages	p202412	ref	PRIMARY,Arrival_SenderID	PRIMARY	4	testdb.Senders.SenderID	374	50,00	Using where; Using index
3	UNION	<derived2>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	2700140	100,00	Using where
(NULL)	UNION RESULT	<union1,3>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	(NULL)	(NULL)	Using temporary; Using filesort

CREATE TABLE `messages` (
    `ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Arrival` TIMESTAMP NOT NULL,
    `SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
    PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
    INDEX `ID` (`ID`) USING BTREE,
    INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)

The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`)) by month, calendar-aligned.

    WITH t AS (
        SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
            LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
            LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
        FROM `Messages`
            INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
        WHERE `Arrival` BETWEEN @aFrom AND @aTo
    )
    SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
        UNION
    SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
    WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
    ORDER BY `SenderID`, `From`;

But such a query has this EXPLAIN output:

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	<derived2>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	2700140	100,00	Using where
2	UNCACHEABLE DERIVED	Senders	(NULL)	ALL	SenderID	(NULL)	(NULL)	(NULL)	14430	100,00	Using where; Using temporary; Using filesort
2	UNCACHEABLE DERIVED	Messages	p202412	ref	PRIMARY,Arrival_SenderID	PRIMARY	4	testdb.Senders.SenderID	374	50,00	Using where; Using index
3	UNION	<derived2>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	2700140	100,00	Using where
(NULL)	UNION RESULT	<union1,3>	(NULL)	ALL	(NULL)	(NULL)	(NULL)	(NULL)	(NULL)	(NULL)	Using temporary; Using filesort

`Senders`.`SenderID` is UNIQUE-indexed. Total number of active senders is around 12000 by now (a few thousands are inactive and normally don’t send messages), total number of messages per month is roughly 100 millions.

When run, this query gets processed for an hour and then MySQL might even just crash (possibly because of the large temporary dataset it produces?). MySQL version for this particular test server is 8.0.20, other target servers might have other 8.0.x versions.

I tried to start building the query from scratch but whenever I add LEAD() or LAG() with WITH (to filter on the gap size), temporary and filesort occur in the Extra column as well as UNCACHEABLE DERIVED in select_type. Are there any ways to improve the query?

Just in case it matters (which I personally doubt), imaginary sample data (two main columns only, PRIMARY KEY order):

SenderID	Arrival
42	2024-12-05 01:18:23
42	2024-12-05 03:21:47
42	2024-12-07 08:19:16
42	2024-12-11 03:27:59
42	2024-12-14 22:43:08
42	2024-12-15 01:18:23
85	2024-12-01 02:03:39
85	2024-12-04 02:03:39
85	2024-12-07 02:03:39
85	2024-12-10 02:03:39
85	2024-12-13 02:03:39
85	2024-12-16 02:03:39
85	2024-12-22 02:03:39
85	2024-12-25 02:03:39
85	2024-12-28 02:03:39
85	2024-12-31 02:03:39
158	2024-12-06 14:52:17
158	2024-12-08 14:52:17
158	2024-12-11 14:52:17
158	2024-12-14 14:52:17
158	2024-12-17 14:52:17
158	2024-12-20 14:52:17
158	2024-12-23 14:52:17
158	2024-12-26 14:52:17

For interval from 2024-12-01 00:00:00 to 2024-12-31 23:59:59 and @aDays = 5 the following gaps should be found:

SenderID	From	To	Comment
42	2024-12-15 01:18:23	2024-12-31 23:59:59	<- Lost at the end of month
85	2024-12-16 02:03:39	2024-12-22 02:03:39	<- Got lost temporarily
158	2024-12-01 00:00:00	2024-12-06 14:52:17	<- Was lost at the beginning of month
158	2024-12-26 14:52:17	2024-12-31 23:59:59	<- and lost at the end

The query shown above works as expected for small datasets but is extremely slow and even causes MySQL crashes for real data.

Share edited Mar 10 at 11:44 Akina 42.9k6 gold badges16 silver badges29 bronze badges asked Mar 10 at 8:37 Dmitry Vasiliev 236 bronze badges

Why are you checking both LEAD() and LAG()? It may be enough to check one way. Also, UNION instead of UNION ALL will cause sorting to exclude duplicates. And you will have them almost all duplicates. – ValNik Commented Mar 10 at 9:23
Add @from, MIN(arrival) and MAX(arrival), @to for each sender in CTE, this will avoid IFNULL(). get not LAG/LEAD but those +/- according INTERVAL, which allows immediate compare in outer query. Test alternative query with two WHERE NOT EXISTS. – Akina Commented Mar 10 at 9:24
@ValNik Well, in fact, I just found a similar query on StackOverflow and edited it accordingly. The main reason to check both is that only LEAD() or only LAG() would skip the cases when sender disappeared in the middle of the month or haven’t sent anything until the middle of the month. – Dmitry Vasiliev Commented Mar 10 at 9:44
@Akina I’m not quite sure I understand your suggestions. Where should I try to put WHERE NOT EXISTS? And what way can I use «+/– according INTERVAL» to achieve which immediate compare? Instead of IFNULLs, I guess, using third parameter to LEAD/LAG could help, although, AFAIR, MariaDB does not support it (which I’d like to keep in mind as well). – Dmitry Vasiliev Commented Mar 10 at 9:48
See Tips for asking a good Structured Query Language (SQL) question. Create small sample data (2-3 senders, 5-10 rows per sender) and provide needed output for this data and 1-2 month desired dates period. – Akina Commented Mar 10 at 10:09

| Show 2 more comments

1 Answer 1

Sorted by: Reset to default 1

Test this query:

WITH 
add_date_from (SenderID, Arrival) AS (
  SELECT SenderID, Arrival
  FROM messages
  UNION ALL
  SELECT DISTINCT SenderID, @date_from
  FROM messages
  ),
get_gaps (SenderID, Arrival, next_arrival, gap) AS (
  SELECT SenderID, 
         Arrival,
         COALESCE(LEAD(Arrival) OVER win, @date_to) next_arrival,
         TIMESTAMPDIFF(DAY, Arrival, COALESCE(LEAD(Arrival) OVER win, @date_to)) gap
  FROM add_date_from
  WINDOW win AS (PARTITION BY SenderID ORDER BY Arrival)
  )
SELECT SenderID, Arrival `From`, next_arrival `To`
FROM get_gaps
WHERE gap >= @Days
ORDER BY SenderID, Arrival

SenderID	From	To
42	2024-12-15 01:18:23	2024-12-31 23:59:59
85	2024-12-16 02:03:39	2024-12-22 02:03:39
158	2024-12-01 00:00:00	2024-12-06 14:52:17
158	2024-12-26 14:52:17	2024-12-31 23:59:59

fiddle

PS. If there exists some rows whose Arrival value is equal to @date_from value strictly then you may receive duplicates. You'd avoid them - either add according WHERE to the 2nd subquery of the 1st CTE or add DISTINCT to the outer query.

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1744856350a4597430.html

query optimization - MySQL filtered gaps and islands: avoiding temporaries and filesorts? - Stack Overflow

1 Answer 1

发表回复

评论列表（0条）

联系我们

400-800-8888

query optimization - MySQL filtered gaps and islands: avoiding temporaries and filesorts? - Stack Overflow

1 Answer 1

相关推荐