mariadb - SQL query to extract all non-digit sequences

Working on MariaDB 10.6.20. I am trying to extract all unique non-digit sequences from strings in the chars table phone column.

Sample data:

(555) 444 3322
+(1.400) - 200-4044 ext200

Expected output:

(
) 
 
+(
.
) - 
-
 ext

I have this query:

WITH RECURSIVE phone_parts AS (
    SELECT phone,
           REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
           SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS remaining_phone
    FROM chars
    WHERE phone REGEXP '[^0-9]'
    UNION ALL
    SELECT phone,
           REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
           SUBSTRING(remaining_phone, LENGTH(REGEXP_SUBSTR(remaining_phone, '[^0-9]+')) + 1)
    FROM phone_parts
    WHERE remaining_phone != ''
)
SELECT DISTINCT non_digit_sequence
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY non_digit_sequence;

But it returns ") - " and " - " separately, even though they are part of the same sequence.

Any ideas how to fix it?

Thanks!

Working on MariaDB 10.6.20. I am trying to extract all unique non-digit sequences from strings in the chars table phone column.

Sample data:

(555) 444 3322
+(1.400) - 200-4044 ext200

Expected output:

(
) 
 
+(
.
) - 
-
 ext

I have this query:

WITH RECURSIVE phone_parts AS (
    SELECT phone,
           REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
           SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS remaining_phone
    FROM chars
    WHERE phone REGEXP '[^0-9]'
    UNION ALL
    SELECT phone,
           REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
           SUBSTRING(remaining_phone, LENGTH(REGEXP_SUBSTR(remaining_phone, '[^0-9]+')) + 1)
    FROM phone_parts
    WHERE remaining_phone != ''
)
SELECT DISTINCT non_digit_sequence
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY non_digit_sequence;

But it returns ") - " and " - " separately, even though they are part of the same sequence.

Any ideas how to fix it?

Thanks!

Share Improve this question asked Mar 6 at 10:19 yenren 52210 silver badges22 bronze badges

Add a comment |

3 Answers 3

Sorted by: Reset to default 1

CREATE TABLE test 
SELECT '(555) 444 3322' phone UNION ALL
SELECT '+(1.400) - 200-4044 ext200';
SELECT phone, REGEXP_REPLACE(phone, '\\d+', '\n') FROM test;

phone	REGEXP_REPLACE(phone, '\d+', '\n')
(555) 444 3322	( )
+(1.400) - 200-4044 ext200	+( . ) - - ext

fiddle

With recursive CTEs it's often a good idea to toss in some extra metadata columns that can be surfaced in your final SQL for troubleshooting, since it's often confusing to track what is happening during iteration without them.

In this case, tracking the number of iterations and also surfacing the remaining_phone that was passed into the iteration would be enlightening.

Consider the following where I did some renaming of columns as well:

WITH RECURSIVE phone_parts AS (
    SELECT phone,
           phone as phone_in,
           REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
           SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) + 1) AS phone_out,
           LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+')) as substring_removal_len,
           1 as iteration
    FROM chars
    WHERE phone REGEXP '[^0-9]'
    UNION ALL
    SELECT phone,
           phone_parts.phone_out as phone_in,
           REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+') as non_digit_sequence,
           SUBSTRING( phone_parts.phone_out, LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+')) + 1) as phone_out,
           LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+')) as substring_removal_len,
           iteration+1 as iteration
    FROM phone_parts
    WHERE iteration < 20
)
SELECT DISTINCT phone, phone_in, non_digit_sequence, substring_removal_len, phone_out, iteration
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY iteration;

phone	phone_in	non_digit_sequence	substring_removal_len	phone_out	iteration
+(1.400) - 200-4044 ext200	+(1.400) - 200-4044 ext200	+(	2	1.400) - 200-4044 ext200	1
+(1.400) - 200-4044 ext200	1.400) - 200-4044 ext200	.	1	.400) - 200-4044 ext200	2
+(1.400) - 200-4044 ext200	.400) - 200-4044 ext200	.	1	400) - 200-4044 ext200	3
+(1.400) - 200-4044 ext200	400) - 200-4044 ext200	) -	4	- 200-4044 ext200	4
+(1.400) - 200-4044 ext200	- 200-4044 ext200	-	3	200-4044 ext200	5
+(1.400) - 200-4044 ext200	200-4044 ext200	-	1	00-4044 ext200	6
+(1.400) - 200-4044 ext200	00-4044 ext200	-	1	0-4044 ext200	7
+(1.400) - 200-4044 ext200	0-4044 ext200	-	1	-4044 ext200	8
+(1.400) - 200-4044 ext200	-4044 ext200	-	1	4044 ext200	9
+(1.400) - 200-4044 ext200	4044 ext200	ext	4	ext200	10
+(1.400) - 200-4044 ext200	ext200	ext	4	200	11

Looking just at the first iteration, we identify the +( and we see that the substring_removal_len is 2. When we remove those two characters for the next iteration we are left with:

1.400) - 200-4044 ext200

We then identify the . character in the second iteration, and because that . character is of length 1 we only remove the 1 character (not the .). So for the 3rd iteration we are left with:

.400) - 200-4044 ext200

Where we identify the . again... Totally superfluous as we already identified that character. This is the crux of the issue you are facing and the it compounds when the non-numeric substring you identify are preceded by numerics where you find yourself identifying substrings of substrings.

Instead, I propose you change your regex for the phone_out or remaining_phone column to also include any numeric characters in the substring being removed like:

SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) + 1)

WITH RECURSIVE phone_parts AS (
    SELECT phone,
           phone as phone_in,
           REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
           SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) + 1) AS phone_out,
           LENGTH(REGEXP_SUBSTR(phone, '[^0-9]+[0-9]+')) as substring_removal_len,
           1 as iteration
    FROM chars
    WHERE phone REGEXP '[^0-9]'
    UNION ALL
    SELECT phone,
           phone_parts.phone_out as phone_in,
           REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+') as non_digit_sequence,
           SUBSTRING( phone_parts.phone_out, LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+[0-9]+')) + 1) as phone_out,
           LENGTH(REGEXP_SUBSTR( phone_parts.phone_out, '[^0-9]+[0-9]+')) as substring_removal_len,
           iteration+1 as iteration
    FROM phone_parts
    WHERE iteration < 20
)
SELECT DISTINCT phone, phone_in, non_digit_sequence, substring_removal_len, phone_out, iteration
FROM phone_parts
WHERE non_digit_sequence IS NOT NULL
ORDER BY iteration;

phone	phone_in	non_digit_sequence	substring_removal_len	phone_out	iteration
+(1.400) - 200-4044 ext200	+(1.400) - 200-4044 ext200	+(	3	.400) - 200-4044 ext200	1
+(1.400) - 200-4044 ext200	.400) - 200-4044 ext200	.	4	) - 200-4044 ext200	2
+(1.400) - 200-4044 ext200	) - 200-4044 ext200	) -	7	-4044 ext200	3
+(1.400) - 200-4044 ext200	-4044 ext200	-	5	ext200	4
+(1.400) - 200-4044 ext200	ext200	ext	7		5

With this change in place you can see that we not only remove the identified non-numeric substring, but also any following numeric substring. This way the next iteration is starting fresh with the next non-numeric substring to identify and remove.

You almost have it, just a couple of problems:

FIrst, your SUBSTRING is incorrect; you are passing the length of the sequence found as the offset, when you want the string beginning after the match. Second, unless this is a non-multibyte column, you want to use CHAR_LENGTH almost always instead of LENGTH (which gives a length in bytes, not characters).

So change:

SUBSTRING(phone, LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+')) + 1)

SUBSTRING(phone, INSTR(phone, REGEXP_SUBSTR(phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+')))

for both calls. So:

WITH RECURSIVE phone_parts AS (
    SELECT
        REGEXP_SUBSTR(phone, '[^0-9]+') AS non_digit_sequence,
        SUBSTRING(phone, INSTR(phone, REGEXP_SUBSTR(phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(phone, '\[^0-9\]+'))) remaining_phone
    FROM chars
    WHERE phone REGEXP '[^0-9]'
    UNION ALL
    SELECT
        REGEXP_SUBSTR(remaining_phone, '[^0-9]+'),
        SUBSTRING(remaining_phone, INSTR(remaining_phone, REGEXP_SUBSTR(remaining_phone, '\[^0-9\]+'))+CHAR_LENGTH(REGEXP_SUBSTR(remaining_phone, '\[^0-9\]+')))
    FROM phone_parts
    WHERE remaining_phone REGEXP '[^0-9]'
)
SELECT DISTINCT non_digit_sequence, CHAR_LENGTH(non_digit_sequence) len
FROM phone_parts
ORDER BY non_digit_sequence, len

fiddle

(I added a WHERE...REGEXP to the second union query, so only actual result rows are returned, and removed the unneeded phone column. Also changed to consider trailing spaces different.)

You also should consider if you want all variants that are case-insensitively equal.

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1744983550a4604480.html

mariadb - SQL query to extract all non-digit sequences - Stack Overflow

3 Answers 3

发表回复

评论列表（0条）

联系我们

400-800-8888

mariadb - SQL query to extract all non-digit sequences - Stack Overflow

3 Answers 3

相关推荐