javascript - How can I make this regular expression not result in "catastrophic backtracking"? - Stack Overflow

admin • 2025-04-19 03:32:05 • questions • 2 views

I'm trying to use a URL matching regular expression that I got from http://daringfireball.net/2010/07/improved_regex_for_matching_urls

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

Based on the answers to another question, it appears that there are cases that cause this regex to backtrack catastrophically. For example:

var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google./?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA)")

... can take a really long time to execute (e.g. in Chrome)

It seems to me that the problem lies in this part of the code:

(?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+

... which seems to be roughly equivalent to (.+|\((.+|(\(.+\)))*\))+, which looks like it contains (.+)+

Is there a change I can make that will avoid that?

asked Apr 18, 2012 at 21:52 by David Ingersol; edited May 23, 2017 at 12:16 by Community Bot

Really, you should throw this regex away and come up with one that does what you need. I haven't seen an application yet that is both fluffy enough to be using a regex for URL parsing (instead of a real parser) and serious enough that it needs to handle nested parentheses in a URL. Starting with "https?://" and ending at the first character that should be %-encoded in a proper URL but isn't will handle nearly everything, and doesn't cause the regex matcher to go exponential. – Kyle Jones Commented Apr 18, 2012 at 22:28
Have you tried Rubular? It has a handy cheat sheet below it, and you can add all kinds of test expressions to make sure it works. (P.S. I'm aware this is for JS, but it is a handy resource nonetheless.) rubular.com – Edwin Commented Apr 18, 2012 at 22:28

1 Answer

Changing it to the following should prevent the catastrophic backtracking:

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)

The only change that was made was to remove the + after the first [^\s()<>] in each of the "balanced parens" portions of the regex.

Here is the one-line version for testing with JS:

var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google./?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

The problem portion of the original regex is the balanced parentheses section. To simplify the explanation of why the backtracking occurs, I am going to completely remove the nested parentheses portion of it, because it isn't relevant here:

\(([^\s()<>]+|(\([^\s()<>]+\)))*\)    # original
\(([^\s()<>]+)*\)                     # expanded below

\(                # literal '('
(                 # start group, repeat zero or more times
    [^\s()<>]+        # one or more non-special characters
)*                # end group
\)                # literal ')'

Consider what happens here with the string '(AAAAA'. The literal ( matches, AAAAA is consumed by the group, and then the ) fails to match. At this point the group gives up one A, leaving AAAA captured, and the engine tries to continue the match from there. Because the group is followed by *, it can match multiple times, so now ([^\s()<>]+)* matches AAAA on the first pass and A on the second. When this fails, an additional A is given up by the first capture and consumed by the second.

This goes on for a long while, producing the following attempts to match, where each comma-separated entry is a separate iteration of the group and shows how many characters that iteration matched:

AAAAA
AAAA, A
AAA, AA
AAA, A, A
AA, AAA
AA, AA, A
AA, A, AA
AA, A, A, A
....

I may have counted wrong, but I'm pretty sure it adds up to 16 steps before it is determined that the regex cannot match. As you continue to add additional characters to the string the number of steps to figure this out grows exponentially.
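
The count above can be checked with a small sketch. The function name `splits` is an assumption made for illustration; it lists every way the repeated group could divide a run of n characters between its iterations, trying the longest first piece first, which mirrors the order in which a greedy backtracking engine gives characters back.

```javascript
// Hypothetical helper: list every way ([^\s()<>]+)* can divide a run of
// n matching characters among successive iterations of the group.
// Each result is an array of piece lengths, e.g. [4, 1] means "AAAA, A".
function splits(n) {
  if (n === 0) return [[]]; // nothing left to divide: one (empty) way
  const out = [];
  // Greedy order: the first iteration takes as much as it can, then less.
  for (let first = n; first >= 1; first--) {
    for (const rest of splits(n - first)) {
      out.push([first, ...rest]);
    }
  }
  return out;
}

// Print the attempts for 'AAAAA' in the same notation as the list above.
for (const s of splits(5)) {
  console.log(s.map((len) => "A".repeat(len)).join(", "));
}
console.log(splits(5).length); // 16
```

Each extra character doubles the number of splits (2^(n-1) for a run of n characters), which is where the exponential growth comes from.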

By removing the + and changing this to \(([^\s()<>])*\), you would avoid this backtracking scenario.
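
A rough way to see the difference is to time both simplified forms on inputs that cannot match. This is only a sketch: `timeFailure` is a hypothetical helper, the lengths are kept small on purpose, and the actual timings depend on the engine and machine.

```javascript
// Simplified patterns from the explanation (nested-paren branch removed).
const withPlus = /\(([^\s()<>]+)*\)/;    // original form
const withoutPlus = /\(([^\s()<>])*\)/;  // inner + removed

// Hypothetical helper: time one failing match against "(" followed by n A's.
// There is no closing ")", so the engine must exhaust every alternative.
function timeFailure(re, n) {
  const input = "(" + "A".repeat(n);
  const start = Date.now();
  const matched = re.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Lengths are deliberately modest; the "with +" time grows very quickly
// as n increases, so raise n with care.
for (const n of [8, 12, 16, 20]) {
  console.log("with +   ", timeFailure(withPlus, n));
  console.log("without +", timeFailure(withoutPlus, n));
}
```

Both patterns still accept the same balanced input such as "(AAAAA)"; only the amount of work done on a failed match changes.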

Adding the alternation back in to check for the nested parentheses doesn't cause any problems.

Note that you may want to add some sort of anchor to the end of the string, because currently "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA" will match up to just before the (, so re.test(...) would return true because http://google.com/?q= matches.
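
One possible way to act on that note, sketched here under assumptions, is to leave the pattern alone and compare the matched text with the input rather than appending an anchor to the regex. `isWholeUrl` is a hypothetical helper, not part of the original pattern, and it only checks the first match the engine finds.

```javascript
// The modified one-line regex from above.
const re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;

const input = "http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

// exec() shows what was actually matched, not just whether something matched.
const m = re.exec(input);
console.log(m[0]); // "http://google.com/?q="

// Hypothetical helper: true only if the match covers the entire string.
function isWholeUrl(s) {
  const found = re.exec(s);
  return found !== null && found.index === 0 && found[0].length === s.length;
}

console.log(isWholeUrl(input));                          // false
console.log(isWholeUrl("http://google.com/?q=(AAAAA)")); // true
```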
