console.log( html.match( /<a href="(.*?)">[^<]+<\/a>/g ));
Instead of returning just the urls like:
http://google, http://yahoo.
It's returning the entire tag:
<a href="http://google.">Google.</a>, <a href="http://yahoo.">Yahoo.</a>
Why is that the case?
asked Jun 5, 2011 at 8:47 by HyderA
- Because you're using regular expressions to parse HTML. It's also because the first array item always contains the entire match, and subsequently it has the bracketed matches. – Zirak Commented Jun 5, 2011 at 8:51
- @Zirak: That's the entire response, but in brackets. – HyderA Commented Jun 5, 2011 at 8:52
- And I understand the reasons not to use regex for parsing HTML tags, but this is on nodejs, so I don't have any options. I tried jsdom and apricot, but they are riddled with errors and have yet to mature. – HyderA Commented Jun 5, 2011 at 8:53
- "but this is on nodejs, so I don't have any options" You do. :-) You can create your own parsing logic (which may use some targeted regular expressions for bits and pieces), which can be much more intelligent for simple extractions like this. – T.J. Crowder Commented Jun 5, 2011 at 9:08
2 Answers
You want RegExp#exec and a loop accessing the element at the match result's 1 index, rather than String.match. String.match doesn't return the capture groups when there's a g flag, just an array of the elements at index 0 of each match, which is the whole matching string. (See Section 15.5.4.10 of the spec.)
So in essence:
var re, match, html;
re = /<a href="(.*?)">[^<]+<\/a>/g;
html = 'Testing <a href="http://yahoo.">one two three</a> <a href="http://google.">one two three</a> foo';
re.lastIndex = 0; // Work around literal bug in some implementations
// exec() must be given the string on every call; each call resumes at re.lastIndex
for (match = re.exec(html); match; match = re.exec(html)) {
    console.log(match[1]); // capture group 1: the href value
}
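To make the difference concrete, here is a small sketch comparing the approaches on a made-up input. The variable names and the example.test URLs are illustrative and not from the question. It also assumes String.prototype.matchAll is available (ES2020, so a reasonably recent Node or browser), which needs the g flag and yields full match arrays including capture groups.

```javascript
var sample = '<a href="http://a.example.test/">A</a> <a href="http://b.example.test/">B</a>';
var linkRe = /<a href="(.*?)">[^<]+<\/a>/g;

// With the g flag, match() returns only the whole matches (index 0 of each).
var wholeMatches = sample.match(linkRe);

// An exec() loop exposes the capture groups of every match.
var viaExec = [];
var m;
linkRe.lastIndex = 0;
while ((m = linkRe.exec(sample)) !== null) {
  viaExec.push(m[1]);
}

// matchAll() (assumed available) gives the same result without a manual loop.
var viaMatchAll = Array.from(sample.matchAll(linkRe), function (match) {
  return match[1];
});

console.log(wholeMatches); // the two complete <a ...>...</a> strings
console.log(viaExec);      // only the href values
console.log(viaMatchAll);  // only the href values
```

If matchAll is not available in the target runtime, the exec loop is the portable choice.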
But this is parsing HTML with regular expressions. Here There Be Dragons.
Update re dragons, here's a quick list of things that will defeat this regexp, off the top of my head:
- Anything other than exactly one space between the a and href, such as two spaces rather than one, a line break, class='foo', etc., etc.
- Using single quotes rather than double quotes around the href attribute.
- Not using quotes around the href attribute at all.
- Anything after the href attribute that also uses double quotes, e.g.: <a href="http://google." class="foo">
This is not to be down on your regexp, it's just to highlight that regular expressions can't be reliably used on their own to parse HTML. They can form part of the solution, helping you scan for tokens, but they can't reliably do the whole job.
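The failure modes listed above can be checked with a short sketch. The inputs and the example.test URL are made up for illustration. The first three variants simply fail to match, while the trailing-attribute case matches but captures too much.

```javascript
// The regexp from the question, without the g flag so each exec() starts fresh.
var fragile = /<a href="(.*?)">[^<]+<\/a>/;

var cases = {
  twoSpaces:    '<a  href="http://example.test/">x</a>',
  singleQuotes: "<a href='http://example.test/'>x</a>",
  noQuotes:     '<a href=http://example.test/>x</a>',
  trailingAttr: '<a href="http://example.test/" class="foo">x</a>'
};

var results = {};
Object.keys(cases).forEach(function (name) {
  var m = fragile.exec(cases[name]);
  // null means "no match"; otherwise record what group 1 captured.
  results[name] = m ? m[1] : null;
  console.log(name + ': ' + results[name]);
});
```

In the last case the lazy (.*?) keeps expanding until it finds "> and so swallows the class attribute as part of the "URL".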
While it is true you cannot reliably _parse_ HTML using regular expressions, this is not what the OP is asking.
Rather, the OP requires a way to extract anchor links from an HTML document, which is easily and admirably handled using regular expressions.
Of the four problems listed by the previous responder:
- multiple spaces between parts of the anchor
- using single rather than double quotation marks
- not using quotation marks at all to delimit the href attribute
- having other leading or trailing attributes other than href
Only number 3 poses significant problems for a single regular expression solution, but also happens to be completely non-standard HTML which should never appear in an HTML document. (Note if you find HTML that contains non-delimited tag properties, there is a regular expression that will match them, but I maintain they aren't worth extracting. YMMV - Your mileage may vary.)
To extract anchor links (hrefs) using regular expressions from HTML, you would use this regular expression (in commented form):
<          # a literal '<'
a          # a literal 'a'
[^>]+?     # one or more chars which are not '>' (non-greedy)
href=      # literal 'href='
('|")      # either a single or double quote, captured into group #1
([^\1]+?)  # one or more chars (non-greedy), captured into group #2; inside [...] the \1 is
           # not a backreference, so it is the \1 on the next line that stops the match
\1         # whatever capture group #1 matched (the closing quote)
which, without comments, is:
<a[^>]+?href=('|")([^\1]+?)\1
(Note that we do not need to match anything past the final delimiter, including the rest of the tag, since all we are interested in is the anchor link.)
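One detail of the pattern is worth a sketch: in a JavaScript regex without the u flag, \1 inside a character class is treated as a legacy octal escape (the character U+0001) rather than a backreference, so [^\1]+? effectively means "any characters, lazily", and the pattern works because the trailing \1 finds the closing quote. The variant below spells the two quote styles out as an alternation instead. The names hrefRe and extractHrefs and the demo string are hypothetical, and this is only one possible formulation, not the answer's own.

```javascript
// Inside [...], \1 is U+0001, not "whatever group 1 matched".
var insideClass = /^[^\1]$/;
console.log(insideClass.test('"'));      // a double quote is accepted
console.log(insideClass.test('\u0001')); // only U+0001 is rejected

// Alternation: double-quoted value lands in group 1, single-quoted in group 2.
// Requiring whitespace after "<a" keeps tags such as <abbr> from matching.
var hrefRe = /<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;

function extractHrefs(html) {
  var out = [];
  var m;
  hrefRe.lastIndex = 0; // the regex is shared, so reset it before each scan
  while ((m = hrefRe.exec(html)) !== null) {
    out.push(m[1] !== undefined ? m[1] : m[2]);
  }
  return out;
}

var demo = '<a href="double">\n' +
           "<a href='single'>\n" +
           '<a class="foo"\n href="it\'s here">\n' +
           '<abbr href="not-an-anchor">';
console.log(extractHrefs(demo));
```

This still is not an HTML parser; for instance, it would also pick up an attribute whose name merely ends in href, such as data-href.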
In JavaScript and assuming 'source' contains the HTML from which you wish to extract anchor links:
var source='<a href="double-quote test">\n'+
'<a href=\'single-quote test\'>\n'+
'<a class="foo" href="leading prop test">\n'+
'<a href="trailing prop test" class="foo">\n'+
'<a style="bar" link="baz" '+
'name="quux" '+
'href="multiple prop test" class="foo">\n'+
'<a class="foo"\n href="inline newline test"\n style="bar"\n />';
which, when printed to the console, reads as:
<a href="double-quote test">
<a href='single-quote test'>
<a class="foo" href="leading prop test">
<a href="trailing prop test" class="foo">
<a style="bar" link="baz" name="quux" href="multiple prop test" class="foo">
<a class="foo"
href="inline newline test"
style="bar"
/>
you would write the following:
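var RE = /<a[^>]+?href=('|")([^\1]+?)\1/gi, // a regex literal needs no new RegExp() wrapper
    match;
// exec() returns null once there are no further matches
while ((match = RE.exec(source)) !== null) {
    console.log(match[2]); // group #2 is the link; group #1 is the quote character
}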
which prints the following lines to the console:
double-quote test
single-quote test
leading prop test
trailing prop test
multiple prop test
inline newline test
Notes:
Code tested in nodejs v0.5.0-pre but should run under any modern JavaScript.
Since the regular expression uses capture group #1 to note the leading delimiting quote, the resulting link appears in capture group #2.
You might wish to validate the existence, type and length of match using:
if(match && typeof match === 'object' && match.length > 2) { console.log(match[2]); }
but it really shouldn't be necessary since RegExp.exec() returns 'null' on failure. Also, note that the correct typeof match is 'object', not 'Array'.
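The points in the last note can be checked directly. This is a minimal sketch with a made-up pattern and inputs (probe, hit and miss are illustrative names), showing what exec() hands back on success and on failure.

```javascript
var probe = /href=('|")(.+?)\1/;

var hit = probe.exec('<a href="x">');
var miss = probe.exec('<p>no link</p>');

console.log(miss === null);      // exec() signals failure with null
console.log(typeof hit);         // typeof reports 'object' for arrays
console.log(Array.isArray(hit)); // Array.isArray is the reliable array check
console.log(hit.length);         // whole match plus two capture groups
console.log(hit[2]);             // the link itself
```

Because a failed exec() yields null, a plain truthiness test in the loop condition is normally all the validation that is needed.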