html - Javascript regex: Find all URLs outside <a> tags - Nested Tags

I have built this regex code:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)

The first group captures every link in the HTML. The negative lookahead that follows it rejects any match that sits inside a tag (as an attribute value) or inside an element (as content).

I would like only <a> tags to be excluded, so the solution could be to change just the last alternative to:

[^<>]*?<\/a>

But this breaks when there are nested tags, for example a <b></b> inside the <a>.

Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).

Negative lookahead is still tricky for me. I thought that the following should work, but it doesn't:

(?!<a.+?<\/a>)

https://regex101.com/r/hT1cG5/1

These are the most recent discussions that helped me:

Regex replace text outside html tags
Regex replace text but exclude when text is between specific tag

asked Feb 22, 2016 at 12:30 by Klaidonis (edited May 23, 2017 at 10:33)

Isn't that problem serious enough to stop relying on regex for extracting text from HTML? Use a DOM parser. – Wiktor Stribiżew Commented Feb 22, 2016 at 12:34
@WiktorStribiżew What exactly do you mean by a DOM parser? Something like this? simplehtmldom.sourceforge.net/manual.htm – Shafizadeh Commented Feb 22, 2016 at 12:47
I am just wondering whether that would be possible, since for this purpose my code is quite neat and simple with regex. – Klaidonis Commented Feb 22, 2016 at 12:49
@user2943191 I read somewhere that using regex should be the last option. – Shafizadeh Commented Feb 22, 2016 at 12:50
Here is a possible DOM based solution. – Wiktor Stribiżew Commented Feb 22, 2016 at 13:04

1 Answer


It turned out that the best solution is probably the following:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

It looks like the negative lookahead only works properly here when it starts with a quantified character class rather than a literal string, because it is evaluated at the position right after the URL. In practice, that means we can only scan forward from the end of the URL.

As before, the first alternative makes sure that nothing inside a tag's attributes is touched. The second alternative scans from the end of the URL to the next </a, stopping at the first " symbol (" is not a valid URL character, whereas < and > cannot serve as the stop because they appear in nested tags).
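To make the behaviour concrete, here is a minimal sketch that runs the pattern above against a small sample. The markup and the example hostnames are made up for illustration (they are not taken from the regex101 demo), and the g flag is added only so that matchAll can be used.

```javascript
// Pattern from the answer; the g flag is an addition so matchAll works.
const urlOutsideAnchors =
  /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample: one plain-text URL, one URL in an href attribute,
// and one URL inside a <b> that is nested in an <a>.
const html =
  '<p>Visit http://plain.example/page today.</p>' +
  '<a href="http://attr.example/x"><b>http://nested.example/y</b></a>';

// Group 1 holds the URL itself.
const found = Array.from(html.matchAll(urlOutsideAnchors), (m) => m[1]);

console.log(found); // [ 'http://plain.example/page' ]
```

The href URL is rejected by the first alternative (a > follows before any other angle bracket), and the nested URL is rejected by the second alternative (a </a is reachable without crossing a double quote). The plain URL survives only because the " of the following href stops the scan before any </a is reached.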

Nested tags inside <a> tags are now handled properly. The code is not perfect, but it should work with almost any simple HTML markup. You may just need to be a bit careful about the following:

placing quotes within <a> tags;
using this algorithm on <a> tags without any attributes (placeholders), which should be avoided;
using multiple nested tags/lines, which should be avoided unless the URL inside the <a> tag comes after a double quote.
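If these caveats are a problem, one alternative (a different technique from the single regex above, and not the DOM-based solution linked in the comments) is to split the markup into tags and text and keep a counter of open <a> elements. The sketch below is only an illustration: the function name findUrlsOutsideAnchors and the sample markup are hypothetical, and it assumes reasonably well-formed HTML without stray < characters in the text.

```javascript
// Hypothetical helper: collect URLs from text that is not inside any <a> element.
function findUrlsOutsideAnchors(html) {
  const urlPattern = /(https?|ftps?):\/\/[^"<\s]+/g;
  const urls = [];
  let anchorDepth = 0;

  // Splitting on a capturing group keeps the tags in the resulting array.
  for (const part of html.split(/(<[^>]*>)/)) {
    if (part.startsWith('<')) {
      if (/^<a[\s>]/i.test(part)) {
        anchorDepth += 1; // opening <a ...>
      } else if (/^<\/a\s*>/i.test(part)) {
        anchorDepth = Math.max(0, anchorDepth - 1); // closing </a>
      }
      continue; // tags and their attributes are never scanned for URLs
    }
    if (anchorDepth === 0) {
      urls.push(...(part.match(urlPattern) || []));
    }
  }
  return urls;
}

// Made-up sample: an attribute-less <a>, a nested <b>, and an <img> attribute.
const sample =
  'See http://a.example/1 and <a>http://b.example/2 <b>http://c.example/3</b></a> ' +
  'then <i>ftp://d.example/4</i> <img src="http://e.example/5.png">';

console.log(findUrlsOutsideAnchors(sample));
// [ 'http://a.example/1', 'ftp://d.example/4' ]
```

Because the scan no longer depends on where double quotes happen to fall, <a> tags without attributes and deeper nesting are handled, at the cost of a few more lines than the one-line regex.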

Here is a good, messy example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

Unfortunately, this lookahead does not work: (?!<a.*?<\/a>)
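A likely explanation, sketched below on made-up markup: a lookahead is tested only at the position where the URL ends, so (?!<a.*?<\/a>) merely asks whether an <a> tag begins immediately after the URL. It says nothing about whether the URL is enclosed in an <a> element.

```javascript
// Lookahead from the question/answer that was expected to work.
const naive = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.*?<\/a>)/g;
// Pattern from the answer, for comparison.
const answer = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample: the URL sits in a <b> nested inside an <a>.
const html = '<a href="#"><b>http://nested.example/y</b></a>';

const naiveMatches = Array.from(html.matchAll(naive), (m) => m[1]);
const answerMatches = Array.from(html.matchAll(answer), (m) => m[1]);

// The text right after the URL is "</b></a>", which does not start with "<a",
// so the naive negative lookahead passes and the URL is (wrongly) reported.
console.log(naiveMatches);  // [ 'http://nested.example/y' ]
console.log(answerMatches); // []
```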
