html - Javascript regex: Find all URLs outside <a> tags - Nested Tags - Stack Overflow

I have built this regex code:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)

The first group captures every link in the HTML. The negative lookahead that follows excludes links that sit inside a tag as an attribute value and links that appear as the content of a tag.

I would like only <a> tags to be excluded, so one solution could be to change just the last alternative to:

[^<>]*?<\/a>

But this breaks as soon as tags are nested, for example <b></b> inside <a>.
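
The nested-tag problem can be reproduced with a small sketch. The sample HTML and the hostnames below are made up for illustration and are not the data from the linked regex101 example.

```javascript
// Regex from the question with the last alternative narrowed to </a>.
const narrowed = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

// Hypothetical sample: a plain URL, a simple anchor,
// and an anchor whose text is followed by a nested <b>.
const nestedSample =
  'See http://plain.example/x for details. ' +
  '<a href="http://attr.example/">http://text.example/</a> ' +
  '<a href="http://attr2.example/">http://nested.example/ <b>bold</b></a>';

const narrowedMatches = nestedSample.match(narrowed) || [];
console.log(narrowedMatches);
// → [ 'http://plain.example/x', 'http://nested.example/' ]
```

The URL in the simple anchor is skipped, but the one followed by <b> is reported, because [^<>]*? cannot cross the nested tag to reach </a>.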

Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).

Negative lookahead is still tricky for me. I thought that the following should work, but it doesn't:

(?!<a.+?<\/a>)

https://regex101.com/r/hT1cG5/1

These are the most recent discussions that helped me:

  • Regex replace text outside html tags

  • Regex replace text but exclude when text is between specific tag

asked Feb 22, 2016 at 12:30 by Klaidonis, edited May 23, 2017 at 10:33
  • Isn't that problem serious enough to stop relying on regex for extracting text from HTML? Use a DOM parser. – Wiktor Stribiżew Commented Feb 22, 2016 at 12:34
  • @WiktorStribiżew What exactly do you mean by a DOM parser? Something like this? simplehtmldom.sourceforge/manual.htm – Shafizadeh Commented Feb 22, 2016 at 12:47
  • I am just wondering whether it would be possible, since for this purpose my code is quite neat and simple with regex. – Klaidonis Commented Feb 22, 2016 at 12:49
  • @user2943191 I read somewhere that using regex should be the last option. – Shafizadeh Commented Feb 22, 2016 at 12:50
  • Here is a possible DOM based solution. – Wiktor Stribiżew Commented Feb 22, 2016 at 13:04

1 Answer

It turned out that the best solution is probably the following:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

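It looks like the negative lookahead only works properly here when it starts with a quantified pattern rather than a literal string. A lookahead is tested at the position where the URL ends, so one that starts with the literal <a can only reject a URL that is immediately followed by <a. In practice, all we can do is scan forward from the URL, with backtracking, until a closing </a is reached.

As before, the first alternative, [^<>]*>, makes sure that URLs inside HTML tags (attribute values) are left alone. The second alternative, [^"]*?<\/a, scans forward from the URL to the next </a and rejects the URL if no " symbol lies in between. The " is used as the stop character because it is not a valid URL symbol, whereas < and > must be allowed so that nested tags can be crossed.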
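
A minimal sketch of the two alternatives at work, using made-up sample HTML and hostnames (not the data from the regex101 examples):

```javascript
// Regex from the answer: reject URLs inside a tag (attribute values)
// and URLs from which </a can be reached without crossing a double quote.
const outsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample: plain text, a <p>, and an <a> with a nested <b>.
const answerSample =
  'See http://plain.example/x for details. ' +
  '<p>http://para.example/</p> ' +
  '<a href="http://attr.example/">http://nested.example/ <b>bold</b></a>';

const answerMatches = answerSample.match(outsideAnchors) || [];
console.log(answerMatches);
// → [ 'http://plain.example/x', 'http://para.example/' ]
```

The two URLs outside the anchor survive because the quote in href=" stops the forward scan before any </a is found, while the URL inside the anchor reaches </a through the nested <b> and is rejected.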

Now nested tags inside <a> tags are also handled properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful:

  • avoid placing quotes within the content of <a> tags;
  • do not use this approach with <a> tags that have no attributes (placeholders);
  • avoid multiple nested tags or lines inside an <a> tag, unless the URL inside the <a> tag comes after every double quote.
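
Two of these caveats can be illustrated with a short sketch. The inputs are hypothetical and chosen only to trigger each failure mode.

```javascript
// Regex from the answer.
const answerRegex = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Caveat: an attribute-less <a> later in the text hides a plain URL,
// because no double quote lies between the URL and </a.
const bareAnchor = 'http://plain.example/ then <a>anchor</a>';
const missed = bareAnchor.match(answerRegex) || [];

// Caveat: a double quote between the URL and </a stops the forward scan,
// so a URL inside the anchor text is reported anyway.
const quotedNested =
  '<a href="http://attr.example/">http://inside.example/ <b class="x">bold</b></a>';
const falsePositive = quotedNested.match(answerRegex) || [];

console.log(missed);        // → []
console.log(falsePositive); // → [ 'http://inside.example/' ]
```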


Here is a good, messy example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
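
One likely explanation, sketched under the assumption that the lookahead was simply appended after the URL group (the pattern behind the regex101 link is not reproduced here): the lookahead is tested only at the position where the URL ends, so it rejects a URL only when <a begins at exactly that position. The sample string is made up.

```javascript
// Assumed placement: the literal-first lookahead appended after the URL group.
const appended = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.*?<\/a>)/g;

// Hypothetical anchor with one URL in the attribute and one in the text.
const anchorSample = '<a href="http://attr.example/">http://text.example/</a>';

const appendedMatches = anchorSample.match(appended) || [];
console.log(appendedMatches);
// → [ 'http://attr.example/', 'http://text.example/' ]
```

Neither URL is immediately followed by <a, so the lookahead never fires and both are matched.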
