Working on MariaDB 10.6.20. I am trying to extract all unique non-digit sequences from strings in the chars table phone column.
Sample data:
(555) 444 3322
+(1.400) - 200-4044 ext200
Expected output:
(
)
+(
.
) -
-
ext
I have this query:
WITH RECURSIVE phone_parts AS (
SELECT phone,
REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS remaining_phone
FROM chars
WHERE phone REGEXP '[^0-9]'
UNION ALL
SELECT phone,
REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
SUBSTRING(remaining_phone, LENGTH(REGEXP_SUBSTR(remaining_phone, '[^0-9]+')) + 1)
FROM phone_parts
WHERE remaining_phone != ''
)
SELECT DISTINCT non_digit_sequence
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY non_digit_sequence;
But it returns ") - " and " - " separately, even though they are part of the same sequence.
Any ideas how to fix it?
Thanks!
Working on MariaDB 10.6.20. I am trying to extract all unique non-digit sequences from strings in the chars table phone column.
Sample data:
(555) 444 3322
+(1.400) - 200-4044 ext200
Expected output:
(
)
+(
.
) -
-
ext
I have this query:
WITH RECURSIVE phone_parts AS (
SELECT phone,
REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS remaining_phone
FROM chars
WHERE phone REGEXP '[^0-9]'
UNION ALL
SELECT phone,
REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
SUBSTRING(remaining_phone, LENGTH(REGEXP_SUBSTR(remaining_phone, '[^0-9]+')) + 1)
FROM phone_parts
WHERE remaining_phone != ''
)
SELECT DISTINCT non_digit_sequence
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY non_digit_sequence;
But it returns ") - " and " - " separately, even though they are part of the same sequence.
Any ideas how to fix it?
Thanks!
Share Improve this question asked Mar 6 at 10:19 yenrenyenren 52210 silver badges22 bronze badges 03 Answers
Reset to default 1CREATE TABLE test
SELECT '(555) 444 3322' phone UNION ALL
SELECT '+(1.400) - 200-4044 ext200';
SELECT phone, REGEXP_REPLACE(phone, '\\d+', '\n') FROM test;
phone | REGEXP_REPLACE(phone, '\d+', '\n') |
---|---|
(555) 444 3322 | ( ) |
+(1.400) - 200-4044 ext200 | +( . ) - - ext |
fiddle
With recursive CTEs it's often a good idea to toss in some extra metadata columns that can be surfaced in your final SQL for troubleshooting, since it's often confusing to track what is happening during iteration without them.
In this case, tracking the number of iterations and also surfacing the remaining_phone
that was passed into the iteration would be enlightening.
Consider the following where I did some renaming of columns as well:
WITH RECURSIVE phone_parts AS (
SELECT phone,
phone as phone_in,
REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS phone_out,
LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) as substring_removal_len,
1 as iteration
FROM chars
WHERE phone REGEXP '[^0-9]'
UNION ALL
SELECT phone,
phone_parts.phone_out as phone_in,
REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+') as non_digit_sequence,
SUBSTRING( phone_parts.phone_out, LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+')) + 1) as phone_out,
LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+')) as substring_removal_len,
iteration+1 as iteration
FROM phone_parts
WHERE iteration < 20
)
SELECT DISTINCT phone, phone_in, non_digit_sequence, substring_removal_len, phone_out, iteration
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY iteration;
phone | phone_in | non_digit_sequence | substring_removal_len | phone_out | iteration |
---|---|---|---|---|---|
+(1.400) - 200-4044 ext200 | +(1.400) - 200-4044 ext200 | +( | 2 | 1.400) - 200-4044 ext200 | 1 |
+(1.400) - 200-4044 ext200 | 1.400) - 200-4044 ext200 | . | 1 | .400) - 200-4044 ext200 | 2 |
+(1.400) - 200-4044 ext200 | .400) - 200-4044 ext200 | . | 1 | 400) - 200-4044 ext200 | 3 |
+(1.400) - 200-4044 ext200 | 400) - 200-4044 ext200 | ) - | 4 | - 200-4044 ext200 | 4 |
+(1.400) - 200-4044 ext200 | - 200-4044 ext200 | - | 3 | 200-4044 ext200 | 5 |
+(1.400) - 200-4044 ext200 | 200-4044 ext200 | - | 1 | 00-4044 ext200 | 6 |
+(1.400) - 200-4044 ext200 | 00-4044 ext200 | - | 1 | 0-4044 ext200 | 7 |
+(1.400) - 200-4044 ext200 | 0-4044 ext200 | - | 1 | -4044 ext200 | 8 |
+(1.400) - 200-4044 ext200 | -4044 ext200 | - | 1 | 4044 ext200 | 9 |
+(1.400) - 200-4044 ext200 | 4044 ext200 | ext | 4 | ext200 | 10 |
+(1.400) - 200-4044 ext200 | ext200 | ext | 4 | 200 | 11 |
Looking just at the first iteration, we identify the +(
and we see that the substring_removal_len
is 2
. When we remove those two characters for the next iteration we are left with:
1.400) - 200-4044 ext200
We then identify the .
character in the second iteration, and because that .
character is of length 1
we only remove the 1
character (not the .
). So for the 3rd iteration we are left with:
.400) - 200-4044 ext200
Where we identify the .
again... Totally superfluous as we already identified that character. This is the crux of the issue you are facing and the it compounds when the non-numeric substring you identify are preceded by numerics where you find yourself identifying substrings of substrings.
Instead, I propose you change your regex for the phone_out
or remaining_phone
column to also include any numeric characters in the substring being removed like:
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) + 1)
WITH RECURSIVE phone_parts AS (
SELECT phone,
phone as phone_in,
REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) + 1) AS phone_out,
LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) as substring_removal_len,
1 as iteration
FROM chars
WHERE phone REGEXP '[^0-9]'
UNION ALL
SELECT phone,
phone_parts.phone_out as phone_in,
REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+') as non_digit_sequence,
SUBSTRING( phone_parts.phone_out, LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+[0-9]+')) + 1) as phone_out,
LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+[0-9]+')) as substring_removal_len,
iteration+1 as iteration
FROM phone_parts
WHERE iteration < 20
)
SELECT DISTINCT phone, phone_in, non_digit_sequence, substring_removal_len, phone_out, iteration
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY iteration;
phone | phone_in | non_digit_sequence | substring_removal_len | phone_out | iteration |
---|---|---|---|---|---|
+(1.400) - 200-4044 ext200 | +(1.400) - 200-4044 ext200 | +( | 3 | .400) - 200-4044 ext200 | 1 |
+(1.400) - 200-4044 ext200 | .400) - 200-4044 ext200 | . | 4 | ) - 200-4044 ext200 | 2 |
+(1.400) - 200-4044 ext200 | ) - 200-4044 ext200 | ) - | 7 | -4044 ext200 | 3 |
+(1.400) - 200-4044 ext200 | -4044 ext200 | - | 5 | ext200 | 4 |
+(1.400) - 200-4044 ext200 | ext200 | ext | 7 | 5 |
With this change in place you can see that we not only remove the identified non-numeric substring, but also any following numeric substring. This way the next iteration is starting fresh with the next non-numeric substring to identify and remove.
You almost have it, just a couple of problems:
FIrst, your SUBSTRING is incorrect; you are passing the length of the sequence found as the offset, when you want the string beginning after the match. Second, unless this is a non-multibyte column, you want to use CHAR_LENGTH almost always instead of LENGTH (which gives a length in bytes, not characters).
So change:
SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+')) + 1)
to
SUBSTRING(phone, INSTR(phone, REGEXP_SUBSTR(phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+')))
for both calls. So:
WITH RECURSIVE phone_parts AS (
SELECT
REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
SUBSTRING(phone, INSTR(phone, REGEXP_SUBSTR(phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+'))) remaining_phone
FROM chars
WHERE phone REGEXP '[^0-9]'
UNION ALL
SELECT
REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
SUBSTRING(remaining_phone, INSTR(remaining_phone, REGEXP_SUBSTR(remaining_phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(remaining_phone, '\[^0-9\]+')))
FROM phone_parts
WHERE remaining_phone REGEXP '[^0-9]'
)
SELECT DISTINCT non_digit_sequence, CHAR_LENGTH(non_digit_sequence) len
FROM phone_parts
ORDER BY non_digit_sequence, len
fiddle
(I added a WHERE...REGEXP to the second union query, so only actual result rows are returned, and removed the unneeded phone column. Also changed to consider trailing spaces different.)
You also should consider if you want all variants that are case-insensitively equal.
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1744983550a4604480.html
评论列表(0条)