CREATE TABLE `messages` (
`ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`Arrival` TIMESTAMP NOT NULL,
`SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
INDEX `ID` (`ID`) USING BTREE,
INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)
The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`))
by month, calendar-aligned.
Normally each SenderID sends at least one message in five minutes. Now the task is to report gaps within some that are larger than some specified amount of time. Currently the query (@aFrom and @aTo are set to December 2024, @aDays equals 5) is this:
WITH t AS (
SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
FROM `Messages`
INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
WHERE `Arrival` BETWEEN @aFrom AND @aTo
)
SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
UNION
SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
ORDER BY `SenderID`, `From`;
But such a query has this EXPLAIN output:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 2700140 | 100,00 | Using where |
2 | UNCACHEABLE DERIVED | Senders | (NULL) | ALL | SenderID | (NULL) | (NULL) | (NULL) | 14430 | 100,00 | Using where; Using temporary; Using filesort |
2 | UNCACHEABLE DERIVED | Messages | p202412 | ref | PRIMARY,Arrival_SenderID | PRIMARY | 4 | testdb.Senders.SenderID | 374 | 50,00 | Using where; Using index |
3 | UNION | <derived2> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 2700140 | 100,00 | Using where |
(NULL) | UNION RESULT | <union1,3> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | Using temporary; Using filesort |
CREATE TABLE `messages` (
`ID` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`Arrival` TIMESTAMP NOT NULL,
`SenderID` INT UNSIGNED NOT NULL,
-- Fields describing messages skipped
PRIMARY KEY (`SenderID`, `Arrival`, `ID`) USING BTREE,
INDEX `ID` (`ID`) USING BTREE,
INDEX `Arrival_SenderID` (`Arrival`, `SenderID`) USING BTREE
)
The table is also PARTITION BY RANGE(UNIX_TIMESTAMP(`Arrival`))
by month, calendar-aligned.
Normally each SenderID sends at least one message in five minutes. Now the task is to report gaps within some that are larger than some specified amount of time. Currently the query (@aFrom and @aTo are set to December 2024, @aDays equals 5) is this:
WITH t AS (
SELECT `Messages`.`SenderID`, `Messages`.`Arrival`,
LAG(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Prev`,
LEAD(`Arrival`) OVER(PARTITION BY `Messages`.`SenderID` ORDER BY `Messages`.`Arrival`) AS `Next`
FROM `Messages`
INNER JOIN `Senders` ON `Senders`.`SenderID` = `Messages`.`SenderID` AND `Senders`.`Active` IS TRUE
WHERE `Arrival` BETWEEN @aFrom AND @aTo
)
SELECT `SenderID`, IFNULL(`Prev`, @aFrom) AS `From`, `Arrival` AS `To` FROM t
WHERE TIMESTAMPDIFF(SECOND, IFNULL(`Prev`, @aFrom), `Arrival`) > @aDays * 24 * 3600
UNION
SELECT `SenderID`, `Arrival` AS `From`, IFNULL(`Next`, @aTo) AS `To` FROM t
WHERE TIMESTAMPDIFF(SECOND, `Arrival`, IFNULL(`Next`, @aTo)) > @aDays * 24 * 3600
ORDER BY `SenderID`, `From`;
But such a query has this EXPLAIN output:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 2700140 | 100,00 | Using where |
2 | UNCACHEABLE DERIVED | Senders | (NULL) | ALL | SenderID | (NULL) | (NULL) | (NULL) | 14430 | 100,00 | Using where; Using temporary; Using filesort |
2 | UNCACHEABLE DERIVED | Messages | p202412 | ref | PRIMARY,Arrival_SenderID | PRIMARY | 4 | testdb.Senders.SenderID | 374 | 50,00 | Using where; Using index |
3 | UNION | <derived2> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 2700140 | 100,00 | Using where |
(NULL) | UNION RESULT | <union1,3> | (NULL) | ALL | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | Using temporary; Using filesort |
`Senders`.`SenderID`
is UNIQUE-indexed. Total number of active senders is around 12000 by now (a few thousands are inactive and normally don’t send messages), total number of messages per month is roughly 100 millions.
When run, this query gets processed for an hour and then MySQL might even just crash (possibly because of the large temporary dataset it produces?). MySQL version for this particular test server is 8.0.20, other target servers might have other 8.0.x versions.
I tried to start building the query from scratch but whenever I add LEAD() or LAG() with WITH (to filter on the gap size), temporary and filesort occur in the Extra column as well as UNCACHEABLE DERIVED in select_type. Are there any ways to improve the query?
Just in case it matters (which I personally doubt), imaginary sample data (two main columns only, PRIMARY KEY order):
SenderID | Arrival |
---|---|
42 | 2024-12-05 01:18:23 |
42 | 2024-12-05 03:21:47 |
42 | 2024-12-07 08:19:16 |
42 | 2024-12-11 03:27:59 |
42 | 2024-12-14 22:43:08 |
42 | 2024-12-15 01:18:23 |
85 | 2024-12-01 02:03:39 |
85 | 2024-12-04 02:03:39 |
85 | 2024-12-07 02:03:39 |
85 | 2024-12-10 02:03:39 |
85 | 2024-12-13 02:03:39 |
85 | 2024-12-16 02:03:39 |
85 | 2024-12-22 02:03:39 |
85 | 2024-12-25 02:03:39 |
85 | 2024-12-28 02:03:39 |
85 | 2024-12-31 02:03:39 |
158 | 2024-12-06 14:52:17 |
158 | 2024-12-08 14:52:17 |
158 | 2024-12-11 14:52:17 |
158 | 2024-12-14 14:52:17 |
158 | 2024-12-17 14:52:17 |
158 | 2024-12-20 14:52:17 |
158 | 2024-12-23 14:52:17 |
158 | 2024-12-26 14:52:17 |
For interval from 2024-12-01 00:00:00 to 2024-12-31 23:59:59 and @aDays = 5 the following gaps should be found:
SenderID | From | To | Comment |
---|---|---|---|
42 | 2024-12-15 01:18:23 | 2024-12-31 23:59:59 | <- Lost at the end of month |
85 | 2024-12-16 02:03:39 | 2024-12-22 02:03:39 | <- Got lost temporarily |
158 | 2024-12-01 00:00:00 | 2024-12-06 14:52:17 | <- Was lost at the beginning of month |
158 | 2024-12-26 14:52:17 | 2024-12-31 23:59:59 | <- and lost at the end |
The query shown above works as expected for small datasets but is extremely slow and even causes MySQL crashes for real data.
Share edited Mar 10 at 11:44 Akina 42.9k6 gold badges16 silver badges29 bronze badges asked Mar 10 at 8:37 Dmitry VasilievDmitry Vasiliev 236 bronze badges 7 | Show 2 more comments1 Answer
Reset to default 1Test this query:
WITH
add_date_from (SenderID, Arrival) AS (
SELECT SenderID, Arrival
FROM messages
UNION ALL
SELECT DISTINCT SenderID, @date_from
FROM messages
),
get_gaps (SenderID, Arrival, next_arrival, gap) AS (
SELECT SenderID,
Arrival,
COALESCE(LEAD(Arrival) OVER win, @date_to) next_arrival,
TIMESTAMPDIFF(DAY, Arrival, COALESCE(LEAD(Arrival) OVER win, @date_to)) gap
FROM add_date_from
WINDOW win AS (PARTITION BY SenderID ORDER BY Arrival)
)
SELECT SenderID, Arrival `From`, next_arrival `To`
FROM get_gaps
WHERE gap >= @Days
ORDER BY SenderID, Arrival
SenderID | From | To |
---|---|---|
42 | 2024-12-15 01:18:23 | 2024-12-31 23:59:59 |
85 | 2024-12-16 02:03:39 | 2024-12-22 02:03:39 |
158 | 2024-12-01 00:00:00 | 2024-12-06 14:52:17 |
158 | 2024-12-26 14:52:17 | 2024-12-31 23:59:59 |
fiddle
PS. If there exists some rows whose Arrival
value is equal to @date_from
value strictly then you may receive duplicates. You'd avoid them - either add according WHERE to the 2nd subquery of the 1st CTE or add DISTINCT to the outer query.
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744856350a4597430.html
@from, MIN(arrival)
andMAX(arrival), @to
for each sender in CTE, this will avoid IFNULL(). get not LAG/LEAD but those +/- according INTERVAL, which allows immediate compare in outer query. Test alternative query with two WHERE NOT EXISTS. – Akina Commented Mar 10 at 9:24