I have built this regex code:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)
The first group captures every link in the HTML; the negative lookahead that follows excludes links that appear inside a tag as an attribute value or between tags as content.
I would like only <a> tags to be excluded, so one solution could be to change just the last alternative to:
[^<>]*?<\/a>
But then nested tags become a problem, for example <b></b> inside <a>.
Here is the example I am working on: https://regex101./r/lM3hC5/6 (should be 10 matches).
Negative lookahead is still tricky for me. I thought the following should work, but it doesn't:
(?!<a.+?<\/a>)
https://regex101./r/hT1cG5/1
These are the most recent discussions that helped me:
Regex replace text outside html tags
Regex replace text but exclude when text is between specific tag
- Isn't that problem enough serious to stop relying on regex for extracting texts from the HTML? Use a DOM parser. – Wiktor Stribiżew Commented Feb 22, 2016 at 12:34
- @WiktorStribiżew What do you mean DOM parser exactly? Something like this? simplehtmldom.sourceforge/manual.htm – Shafizadeh Commented Feb 22, 2016 at 12:47
- I am just wondering if that would be possible since for this purpose my code is quite neat and simple with regex. – Klaidonis Commented Feb 22, 2016 at 12:49
- @user2943191 I read somewhere using regex should be last option .. – Shafizadeh Commented Feb 22, 2016 at 12:50
- Here is a possible DOM based solution. – Wiktor Stribiżew Commented Feb 22, 2016 at 13:04
1 Answer
It turned out that probably the best solution is the following:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
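It looks like the negative lookahead only works properly here if it starts with a quantified pattern rather than a literal string. The lookahead is tested at the position right after the URL, so a literal such as <a would have to appear exactly there. In practice, the lookahead can only describe the text between the end of the URL and whatever we want to detect.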
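To illustrate why the literal-first lookahead has no effect, here is a minimal Python sketch. The helper name naive_urls and the sample markup are made up for illustration; only the pattern comes from the question.

```python
import re

# The question's attempted pattern: the lookahead begins with the literal "<a".
NAIVE = re.compile(r'((https?|ftps?)://[^"<\s]+)(?!<a.+?</a>)')

def naive_urls(html):
    """Return every URL matched by the naive pattern (hypothetical helper)."""
    return [m.group(1) for m in NAIVE.finditer(html)]

sample = '<a href="http://example.org/x">http://example.org/x</a>'

# Right after the first URL the text is '">...', and after the second it is
# '</a>'. Neither begins with "<a", so the negative lookahead always succeeds
# and nothing is excluded.
print(naive_urls(sample))
# → ['http://example.org/x', 'http://example.org/x']
```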
Again, the first alternative just makes sure that nothing inside HTML tags (attribute values) is matched. The second alternative then scans from the end of the URL up to </a, provided no " symbol lies in between (a double quote is not a valid URL character, whereas the < and > symbols must be allowed because of nested tags).
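As a rough check of the pattern above, the following Python sketch runs it over a small hand-written snippet. The sample markup and the helper name find_urls are assumptions for illustration, not taken from the regex101 example.

```python
import re

# The answer's pattern as a Python raw string ("/" needs no escaping here).
URL_OUTSIDE_ANCHOR = re.compile(
    r'((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)'
)

def find_urls(html):
    """Return URLs that are neither attribute values nor <a> content."""
    return [m.group(1) for m in URL_OUTSIDE_ANCHOR.finditer(html)]

sample = (
    '<p>Visit http://example.com/page now.</p> '
    '<a href="http://example.org/x">see <b>http://example.org/x</b></a> '
    '<img src="https://example.net/i.png"> '
    '<p>ftp://files.example.com/a.zip</p>'
)

# The href and src values are rejected by the first alternative, the URL
# nested in <b> inside <a> by the second; the two bare URLs remain.
print(find_urls(sample))
# → ['http://example.com/page', 'ftp://files.example.com/a.zip']
```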
Now URLs inside tags nested in <a> tags are also handled properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful with the following:
- placing quotes within <a> tags;
- do not use this algorithm on <a> tags without any attribute (placeholders);
- avoid multiple nested tags or lines unless the URL inside the <a> tag comes after a double quote.
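The caveat about <a> tags without attributes can be reproduced with a short Python sketch. The markup and the helper name find_urls are hypothetical; the pattern is the one from this answer.

```python
import re

URL_OUTSIDE_ANCHOR = re.compile(
    r'((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)'
)

def find_urls(html):
    return [m.group(1) for m in URL_OUTSIDE_ANCHOR.finditer(html)]

# No double quote separates the bare URL from the later "</a", so the second
# alternative of the lookahead matches and the URL is wrongly skipped.
without_attr = '<p>http://example.com/a</p> <a>plain anchor</a>'
print(find_urls(without_attr))
# → []

# With an attribute, the quote in href="..." stops the scan before "</a".
with_attr = '<p>http://example.com/a</p> <a href="#top">plain anchor</a>'
print(find_urls(with_attr))
# → ['http://example.com/a']
```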
Here is a very good and messy example (the last match should not be found but it is):
https://regex101./r/pC0jR7/2
It is a pity that this lookahead does not work: (?!<a.*?<\/a>)
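For comparison, the parser-based route suggested in the comments under the question could look roughly like the sketch below. It uses Python's standard html.parser module rather than the PHP library linked there, and the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser

URL_RE = re.compile(r'(?:https?|ftps?)://[^"<\s]+')

class BareUrlCollector(HTMLParser):
    """Collect URLs from text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while the parser is inside <a>...</a>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Attribute values never reach handle_data, so only text is scanned.
        if self.anchor_depth == 0:
            self.urls.extend(URL_RE.findall(data))

def bare_urls(html):
    collector = BareUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls

sample = (
    '<p>http://example.com/a</p> '
    '<a>plain http://example.org/b <b>http://example.org/c</b></a> '
    '<img src="https://example.net/i.png"> '
    '<p>ftp://files.example.com/z</p>'
)
print(bare_urls(sample))
# → ['http://example.com/a', 'ftp://files.example.com/z']
```

Because the parser tracks whether it is inside an <a> element, this version does not depend on quotes or attributes being present.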