Use python to open a web browser (on windows), trigger javascript actions, and get the html contents? - Stack Overflow


Yes, that sounds overly complicated.

I am trying to mine data from pages on our intranet. The pages are secure. The connection is refused when I try to get the contents with urllib.urlopen().

So I would like to use python to open a web browser to open the site then click some links that trigger javascript pop ups containing tables of info that I want to collect.

Any suggestions on where to begin?

I know the format of the page. It is something like this:

<div id="list">
    <ul id="list item">
        <li><a onclick="Openpopup('1');">blah</a></li>
    </ul>
    <ul></ul>
    etc

Then a hidden frame becomes visible and the fields in the table within are filled.

<div>
    <table>
       <tr><td><span id="info_i_want">...

asked Jan 26, 2012 at 2:47 by sequoia

4 Answers

First off, I suggest that it's better to figure out what the page needs that JS is providing, and fake that - you'll have an easier time scraping the page if a browser isn't involved.

If it's just Javascript making an XMLHttpRequest, you can find the page from which the Javascript fetches the iframe data and connect directly to that.
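For example, if a traffic capture shows that Openpopup('1') just fetches a URL behind the scenes, you can request that URL directly. A minimal Python 2 sketch; the endpoint URL and parameter here are made-up placeholders, to be replaced with whatever the network trace shows:

    import urllib2

    # Hypothetical endpoint that Openpopup('1') fetches behind the scenes
    url = 'http://intranet.example.com/popup_data?id=1'
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    print html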

Even so, you may need a library that executes Javascript (if the reverse-engineering is too hard, or if the site uses challenge tokens). A web-rendering framework like Gecko or WebKit might be appropriate.

Take a good look at Selenium if you insist on using a true web browser or cannot get the programmatic methods to work.
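As a sketch, here is what the flow from the question might look like with the Selenium Python bindings of the era; the URL and selectors are guesses based on the HTML sketched above:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://intranet.example.com/page')  # hypothetical URL

    # Click the first list link; this fires the Openpopup('1') handler
    driver.find_element_by_css_selector('#list ul li a').click()

    # If the now-visible table lives in a real <iframe>, switch into it
    # first with driver.switch_to_frame(0); otherwise read it directly:
    print driver.find_element_by_id('info_i_want').text

    driver.quit()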

Once you've gotten the page contents via whatever method, you need an HTML parser (such as sgmllib or xml.dom). I suggest a DOM library: parse the document into a tree and extract the contents from the appropriate node.
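A sketch, assuming the fragment is well-formed enough for the standard library's DOM (real-world HTML often isn't, in which case a more forgiving parser like BeautifulSoup is safer):

    from xml.dom import minidom

    fragment = '''<div><table>
        <tr><td><span id="info_i_want">some value</span></td></tr>
    </table></div>'''

    doc = minidom.parseString(fragment)
    for span in doc.getElementsByTagName('span'):
        if span.getAttribute('id') == 'info_i_want':
            print span.firstChild.data  # -> some value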

"The connection is refused when I try to get the contents with urllib.urlopen()" probably means you have to make a POST request using Python's urllib module. I would suggest you use urllib2. You may also need to handle cookies, the referrer, and the user-agent from your Python code.
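A Python 2 sketch of such a request, with cookie handling and browser-like headers; the URL and form fields are placeholders, to be copied from a capture of the real browser traffic:

    import urllib, urllib2, cookielib

    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    data = urllib.urlencode({'field': 'value'})  # hypothetical form data
    req = urllib2.Request('http://intranet.example.com/login', data)
    req.add_header('User-Agent', 'Mozilla/5.0')
    req.add_header('Referer', 'http://intranet.example.com/')
    html = opener.open(req).read()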

To see all the POST requests fired from your browser, use Firefox's Live HTTP Headers extension.

For the javascript part,

Your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc. You will have to write your code in Javascript, but the benefit is that you can do whatever you want.

As @phihag mentioned, selenium is also a good option.

First of all, you should really find out why the connection is refused when you access the page with Python. Most likely, you'll have to perform HTTP authentication or specify a different User-Agent.
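If it does turn out to be HTTP authentication, urllib2 can handle that too; a minimal sketch, with placeholder URL and credentials:

    import urllib2

    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, 'http://intranet.example.com/', 'user', 'secret')
    opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # non-default UA
    html = opener.open('http://intranet.example.com/page').read()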

Firing up a browser, navigating, and getting the HTML back is a complex task. Luckily, you can implement it using selenium.

Consider taking a look at splinter, which is a simpler webdriver API than Selenium.
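A sketch of the same click-and-read flow in splinter; the URL and selectors are hypothetical, mirroring the HTML sketched in the question:

    from splinter import Browser

    browser = Browser()  # Firefox by default
    browser.visit('http://intranet.example.com/page')   # hypothetical URL
    browser.find_by_css('#list ul li a').first.click()  # fires Openpopup('1')
    print browser.find_by_id('info_i_want').first.text
    browser.quit()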
