javascript - Extract a string from HTML with NodeJS - Stack Overflow

Here is the html...

<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>

I'm using NodeJS. I'm trying to extract the trackID, in this case 11111111 following tracks%2F. What is the most stable method for performing this?

Should I use regex or some JS string method such as substring() or match()?

asked Jul 10, 2012 at 3:28 by mnort9

6 Answers

If you know tracks%2F is only going to show up once you could do:

var your_track_ID = src.split(/tracks%2F/)[1].split(/&amp/)[0];

There are probably better ways, but that should work fine for your purposes.
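A slightly more defensive take on the same split idea, as a minimal sketch. The name `trackIdBySplit` is hypothetical, and the sketch assumes the input is the `src` string (or the raw HTML) in which the ID is followed by `&`, whether written literally or as the entity `&amp;`.

```javascript
// Hypothetical helper: return the text between "tracks%2F" and the next "&".
function trackIdBySplit(src) {
  const parts = src.split('tracks%2F');
  // Marker not found: return null instead of failing on parts[1].
  if (parts.length < 2) return null;
  // Splitting on "&" handles both a literal "&" and the entity "&amp;".
  return parts[1].split('&')[0];
}

// Sample input assumed for illustration (shortened from the question's iframe).
const rawSrc = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false';
console.log(trackIdBySplit(rawSrc)); // 11111111
```

Returning `null` when the marker is missing lets the caller decide how to handle markup that does not look like the expected embed.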

Update for 2019...

This builds off of blueiur's answer and walks through a solution in more detail. JSDOM needs to be installed before you can use it:

npm install jsdom

Now, according to the documentation, you can load the JSDOM class like this:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

You've already got some HTML you want to parse, so I'll use your example and define it as a template literal:

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

Here's the fun part... parse the html in NodeJS:

const { document } = (new JSDOM(data)).window;

What's happening here? You're creating a new JSDOM object from the provided HTML and grabbing the document property of its window property. From this point on, you can use document.getElementsByTagName() and other similar functions just like you would in a browser.

To continue with your specific example, you want to extract the src attribute of the only iframe in the document. There are multiple ways to do that. One example is to use getElementsByTagName to pull the first iframe like this:

const src1 = document.getElementsByTagName('iframe')[0].src;

Now that we have the src attribute, we can split it apart and process the url query value. This is where we will use the URL class which comes with NodeJS. According to the documentation, we can get the search parameters by creating a URL object and accessing its searchParams property like this:

const params = (new URL(src1)).searchParams;

Now you've got the query string as a URLSearchParams object and you can access individual terms like this:

const scURL = params.get('url');

If you look at the contents of scURL now, you'll find it is the embedded URL which was passed as the url query parameter, so we can parse that with another URL object and extract the pathname property like this:

const src2 = (new URL(scURL)).pathname;

We're getting close now, and can split the path apart to get the value you wanted using JavaScript's standard string functions:

const val = src2.split('/')[2];

And print the result:

console.log(val);

... which produces this output:

11111111

To summarize, here is the complete code:

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const data = `<iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false
&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false
&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>`;

const { document } = (new JSDOM(data)).window;

const src1 = document.getElementsByTagName('iframe')[0].src;

const params = (new URL(src1)).searchParams;

const scURL = params.get('url');

const src2 = (new URL(scURL)).pathname;

const val = src2.split('/')[2];

console.log(val);

Feel free to consolidate that and eliminate intermediate values as desired.
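As one possible consolidation, the URL-handling steps can be folded into a single function. This is only a sketch: `trackIdFromSrc` is a hypothetical name, it assumes you already have the decoded `src` string (as `iframe.src` returns it, with `&` rather than `&amp;`), and it relies solely on the `URL` class built into NodeJS.

```javascript
// Hypothetical helper: src is the iframe's src attribute value.
function trackIdFromSrc(src) {
  // The player URL carries the API URL in its "url" query parameter;
  // searchParams.get() returns it already percent-decoded.
  const apiUrl = new URL(src).searchParams.get('url');
  if (!apiUrl) return null;
  // The pathname looks like "/tracks/<id>"; take the last non-empty segment.
  const segments = new URL(apiUrl).pathname.split('/').filter(Boolean);
  return segments.length ? segments[segments.length - 1] : null;
}

// Sample input assumed for illustration (shortened from the question's iframe).
const sampleSrc = 'http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&auto_play=false&show_artwork=true';
console.log(trackIdFromSrc(sampleSrc)); // 11111111
```

Note that `new URL()` throws on a string that is not a valid absolute URL, so callers handling untrusted markup may want to wrap the call in a try/catch.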

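You can find the track ID with the node modules url, jsdom and qs.

Try this (it uses the older jsdom.env() API, which newer jsdom releases have replaced with the JSDOM constructor):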

var jsdom = require('jsdom');
var url = require('url');
var qs = require('qs');

var str = '<iframe width="100%" height="166" scrolling="no" frameborder="no" '
  + 'src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false'
  + '&amp;show_artwork=true&amp;color=c3000d&amp;show_comments=false&amp;liking=false'
  + '&amp;download=false&amp;show_user=false&amp;show_playcount=false"></iframe>';

jsdom.env({
  html: str,
  scripts: [
    'http://code.jquery.com/jquery-1.5.min.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    var src = $('iframe').attr('src');
    var aRes = qs.parse(decodeURIComponent(url.parse(src).query)).url.split('/');
    var track_id = aRes[aRes.length-1];

    console.log("track_id =", track_id);
  }
});

The result is:

track_id = 11111111

It's generally a terribly bad idea to parse HTML with a regular expression, but this might be forgivable. I'd look for the complete URL for safety:

var pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/
  , trackID = (html.match(pattern) || [])[1]
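Wrapped in a function, a pattern like the one above might be used as in the sketch below. `trackIdByRegex` is a hypothetical name, and the sample string assumes the ID sits on the same line as the host name and is followed by `&`, since `.` does not match newlines.

```javascript
// Hypothetical helper: run the pattern against the raw HTML string.
function trackIdByRegex(html) {
  const pattern = /w\.soundcloud\.com.*tracks%2F(\d+)&/;
  const match = html.match(pattern);
  // match[1] holds the captured digits; null signals "not found".
  return match ? match[1] : null;
}

// Sample input assumed for illustration (shortened from the question's iframe).
const sampleHtml = '<iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F11111111&amp;auto_play=false"></iframe>';
console.log(trackIdByRegex(sampleHtml)); // 11111111
```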

If the track id is always 8 digits and the html doesn't change you can do this:

var trackId = (html.match(/\d{8}/) || [])[0]

The Right™ way to do this is to parse the HTML using some XML parser and get the URL that way and then use a reg-exp to parse the URL.

If for some reason you don't have an infinite amount of time and energy, one of the proposed purely reg-exp solutions would work.
