javascript - Webscraping without Node js possible? - Stack Overflow


I currently have a simple webpage that consists of just a .js, a .css, and an .html file. I do not want to use any Node.js stuff.

Given these limits, I would like to ask whether it is possible to search the content of external webpages using JavaScript (e.g. running a web worker in the background).

E.g. I would like to:

Get the first URL of a Google image search.

Edit:

I have now tried it and it worked fine; however, after two weeks I now get this error:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... (Reason: CORS header ‘Access-Control-Allow-Origin’ missing).

Any ideas how to solve that?


Here is the error described by Firefox: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin

edited Apr 25, 2019 at 20:10 by sqp_125 · asked Apr 13, 2019 at 10:01 by sqp_125
  • 1 If the website you're trying to scrape doesn't support CORS, you can't circumvent the issue without a server to proxy the request. – zero298 Commented Apr 25, 2019 at 20:00

3 Answers

Yes, this is possible. Just use the XMLHttpRequest API:

var request = new XMLHttpRequest();
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true);  // last parameter must be true
request.responseType = "document";
request.onload = function (e) {
  if (request.readyState === 4) {
    if (request.status === 200) {
      var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
      console.log(a.href);
      document.body.appendChild(a);
    } else {
      console.error(request.status, request.statusText);
    }
  }
};
request.onerror = function (e) {
  console.error(request.status, request.statusText);
};
request.send(null);  // not a POST request, so don't send extra data

Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.

Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions however and I would question why you wouldn’t choose a program that runs on a server or desktop instead.

Web workers are able to request HTML content using XMLHttpRequest, and then parse the incoming XML programmatically. Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain. You could then pick out content from the resulting HTML.
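One caveat: DOM helpers such as DOMParser are not exposed inside workers, so a worker would have to fall back to string processing. A deliberately naive sketch (the `firstHref` helper is hypothetical, and a regex is no substitute for a real HTML parser; on the main thread, prefer DOMParser or `responseType = "document"` as in the first answer):

```javascript
// Naive sketch: extract the href of the first anchor in an HTML string.
// Illustration only — regexes break on real-world HTML edge cases.
function firstHref(html) {
  const match = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/i.exec(html);
  return match ? match[1] : null;
}

// Example:
// firstHref('<div><a href="https://example.com">link</a></div>')
//   → "https://example.com"
```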

Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.

In short, the answer to your question is yes, because you have the tools to do a network request and a Turing-complete language with which to build any kind of parsing and scraping that you wanted. So technically anything is possible.

But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. For most cases I don’t see why you wouldn’t just write a server side program using e.g. headless Chrome.

If you don’t want to use Node - or aren’t able to deploy Node for some reason - there are many web scraping packages and prior art in languages such as Go, C, Java and Python. Search the package manager of your preferred programming language and you will likely find several.

I've heard about Python for scraping too, but Node.js + Puppeteer kicks ass... and it's pretty easy to learn.
