regex - Limit repetitions of character to multiple fixed lengths (and not ranges) - Stack Overflow

I have some identifiers that will appear at the end of some file names and can vary in length. It will only be 8 or 12 characters long separated by some delimiter. It would be invalid if it were any other length.

I would like to keep the pattern as simple as possible but I don't think there's a mechanism (in standard regular expression syntax) to do multiple lengths without repeating myself.

This will not work for me since it allows lengths of 9-11 which are invalid:

-[A-Za-z0-9]{8,12}$

I could do this but I don't like that I have to repeat the character groups:

-(?:[A-Za-z0-9]{8}|[A-Za-z0-9]{12})$
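To make the difference concrete, here is a small sketch in Python. The tool in the question uses a different engine, but these two patterns only use syntax that Python's re module also supports. The sample file names are made up for illustration.

```python
import re

# Patterns copied from the question.
range_pat = re.compile(r"-[A-Za-z0-9]{8,12}$")
exact_pat = re.compile(r"-(?:[A-Za-z0-9]{8}|[A-Za-z0-9]{12})$")

# Hypothetical file names with identifiers of length 8, 10, and 12.
names = [
    "some file name-ASDFghjk",
    "some file name-ASDFghjk12",
    "some file name-ASDFghjk1234",
]

range_hits = [bool(range_pat.search(n)) for n in names]
exact_hits = [bool(exact_pat.search(n)) for n in names]
print(range_hits)  # the 10-character identifier is accepted here
print(exact_hits)  # the 10-character identifier is rejected here
```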

It gets a little unruly when there are more lengths I need to support:

-(?:[A-Za-z0-9]{8}|[A-Za-z0-9]{12}|[A-Za-z0-9]{16}|[A-Za-z0-9]{20}|[A-Za-z0-9]{24}|[A-Za-z0-9]{28}|[A-Za-z0-9]{32})$

Are there any other more concise ways to do this or is this the best I can do?

I will accept anything that works for my case, but would be great if there was an option that would work for any arbitrary lengths.

asked Feb 21 at 0:37 by Jeff Mercado, edited Feb 21 at 11:13 by DuesserBaest
  • Any sample inputs? What language? – aaa Commented Feb 21 at 0:40
  • These are all file names, e.g. some file name-ASDFghjk1234 (extension omitted). I'm renaming them with PowerRename (part of PowerToys) and have the Boost extensions available, though I'm not sure if that is really relevant here. – Jeff Mercado Commented Feb 21 at 0:50
  • If the regex engine supports subroutines, it would make your life a bit easier. – Hao Wu Commented Feb 21 at 2:47
  • @HaoWu I think that would be a perfectly solid option to suggest. I tried something like that with \1 instead of ?1 but it didn't seem to work, and I was unaware of that option. It definitely works for the engine I'm using, so I'd consider putting that up as an answer. – Jeff Mercado Commented Feb 21 at 6:15
  • @JeffMercado Subroutines ((?R)) are a PCRE-like regex feature, but I checked the link you attached and it says it uses the ECMAScript regex engine, which should not support them, so I didn't add it as an answer. Also, (?1) works but \1 does not because \1 is a back-reference (the captured substring), whereas (?1) is a subroutine (the pattern itself). – Hao Wu Commented Feb 21 at 7:05
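
The back-reference half of that distinction can be seen in any engine. The sketch below uses Python's built-in re module, which has back-references but no (?1) subroutine calls, so it only shows why \1 did not help: \1 must match the same text that group 1 captured, not merely the same pattern. The sample strings are made up.

```python
import re

# \1 re-matches the exact text captured by group 1, not the group's pattern.
backref = re.compile(r"-([A-Za-z0-9]{4})\1$")

same_text = bool(backref.search("name-abcdabcd"))   # second half repeats the first
other_text = bool(backref.search("name-abcdefgh"))  # same shape, different text
print(same_text, other_text)
```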

5 Answers

My idea is similar to that of blhsing in that I would suggest checking the length up front. However, I would define the allowed lengths positively instead of excluding the invalid ones. For illustration I use the lengths 8, 12, and 14, so that not all of them are multiples of 4.

My regex attempt would be:

-(?=(?:.{8}|.{12}|.{14})$)[A-Za-z0-9]+$

See a demo on regex101. Input was taken from Hao Wu's demo.

Explanation:

  • -: Anchor pattern to literal -.
  • (?=(?: ... )$): Look ahead and check for different configurations of string length between - and end of line.
    • .{8}|.{12}|.{14}: In this case 8,12,14.
  • [A-Za-z0-9]+$: Finally, assert the string's composition up to the end of the line.

The reason I bothered to add another answer is that in a programming language like Python you can now generate the pattern from a list of allowed lengths, like so:

import re

strings=[
    "some file name-ASDFghjk",
    "some file name-ASDFghjk12",
    "some file name-ASDFghjk1234",
    "some file name-ASDFghjk123456",
    "some file name-ASDFghjk12345678"
]

allowed_len=[8,12,14]

# Concatenate the possible lengths into ".{a}|.{b}|...".
joined_len = "|".join(".{" + str(n) + "}" for n in allowed_len)

# Use the concatenation in the regex pattern to "outsource" this step.
# The remaining pattern can now easily be maintained here.
pat = re.compile(rf"-(?=(?:{joined_len})$)[A-Za-z0-9]+$")


# Validate output.
[re.search(pat,s) for s in strings]

In general you can avoid spelling out the same character set multiple times by including the full range of repetition numbers with the quantifier {8,12} but excluding the invalid range of {9,11} with a negative lookahead pattern like this:

-(?!.{9,11}$)[A-Za-z0-9]{8,12}$

Obviously, if you have multiple valid repetition counts, you'll have to exclude each invalid range in between with its own negative lookahead, but at least you still avoid repeatedly spelling out the same character set.
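
If the pattern is built in code, the negative lookaheads can be derived from the list of allowed lengths. The following is only a sketch in Python under that assumption. The helper name build_pattern and its parameters are hypothetical, and the lengths 8, 12, and 14 are chosen just for illustration.

```python
import re

def build_pattern(allowed, charset="[A-Za-z0-9]"):
    """Build -(?!gap$)...charset{min,max}$ from a list of allowed lengths."""
    lengths = sorted(set(allowed))
    lookaheads = []
    for lo, hi in zip(lengths, lengths[1:]):
        if hi - lo > 1:
            # Lengths strictly between two allowed values are invalid.
            first, last = lo + 1, hi - 1
            quant = f"{{{first}}}" if first == last else f"{{{first},{last}}}"
            lookaheads.append(f"(?!.{quant}$)")
    return f"-{''.join(lookaheads)}{charset}{{{lengths[0]},{lengths[-1]}}}$"

pattern = build_pattern([8, 12, 14])
print(pattern)

# Identifier lengths from 1 to 19 that the generated pattern accepts.
matched = {n for n in range(1, 20) if re.search(pattern, "name-" + "a" * n)}
print(sorted(matched))
```

The character set still appears only once. Only the list of excluded gaps grows with the number of allowed lengths.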

@HaoWu's suggestion of using a subroutine would otherwise be the best option if your regex engine supports it.

Thought I'd add an answer to complement the working answers you already have. PCRE(2) supports a container (not sure what else to call it) written (?(DEFINE)...) to pre-define patterns that can be reused throughout the rest of your regular expression. This way you create a somewhat modular pattern. In your case it may be over-engineering, but I thought I'd chuck in the option:

(?(DEFINE)(?<PW>[a-zA-Z0-9]))^.*-(?:(?&PW){8}|(?&PW){12})$

See an online demo

  • (?(DEFINE)(?<PW>[a-zA-Z0-9])) - The construct at the start of the pattern that holds the named sub-pattern for later use. I have called it 'PW' for now;
  • ^.*-(?:(?&PW){8}|(?&PW){12})$ - Rather self-explanatory. You can see the previously defined sub-pattern 'PW' being called via (?&PW).

Why use this? When a pattern becomes long and tedious, this is a nice way to improve readability and maintainability. Btw, the DEFINE construct can hold multiple subroutines like so: (?(DEFINE)(?<x>123)(?<y>456)); could be handy :)
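
In an engine without DEFINE support (Python's built-in re module is one), a similar kind of reuse can be approximated by composing the pattern from a named string. This is not the PCRE feature itself, only a sketch of the same idea done in the host language. The variable name PW mirrors the group name above and the sample file names are made up.

```python
import re

# Named fragment, playing the role of the (?<PW>...) definition.
PW = "[a-zA-Z0-9]"

# Reuse the fragment wherever (?&PW) would appear in the PCRE pattern.
pattern = re.compile(rf"^.*-(?:{PW}{{8}}|{PW}{{12}})$")

ok = bool(pattern.search("some file name-ASDFghjk1234"))  # 12 characters
bad = bool(pattern.search("some file name-ASDFghjk12"))   # 10 characters
print(pattern.pattern, ok, bad)
```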

What about ^(([a-z0-9]{4}){2,8})$, since your last example supports only multiples of 4, from 8 to 32? I used Notepad++ to check my results, hence the other changes in the expression.

Obviously it only works in the situation you described, where the valid lengths are all the multiples of 4 in the range 8 to 32.

It seems you already have the correct regex; you could probably shorten it:

-(?:[A-Za-z0-9]{4}){2,8}$
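
As a quick sanity check, the sketch below (Python, with made-up input) lists every identifier length from 1 to 40 that this pattern accepts. It relies on the valid lengths being exactly the multiples of 4 from 8 to 32.

```python
import re

pattern = re.compile(r"-(?:[A-Za-z0-9]{4}){2,8}$")

# Collect every identifier length from 1 to 40 that the pattern accepts.
accepted = [n for n in range(1, 41) if pattern.search("name-" + "a" * n)]
print(accepted)
```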

Can't think of anything else.

Details are in this link.
