Yes, that sounds overly complicated.
I am trying to mine data from pages on our intranet. The pages are secure. The connection is refused when I try to get the contents with urllib.urlopen().
So I would like to use Python to open a web browser, open the site, then click some links that trigger JavaScript pop-ups containing tables of info that I want to collect.
Any suggestions on where to begin?
I know the format of the page. It is something like this:
<div id="list">
<ul id="list item">
<li><a onclick="Openpopup('1');">blah</a></li>
</ul>
<ul></ul>
etc
Then a hidden frame becomes visible and the fields in the table within are filled.
<div>
<table>
<tr><td><span id="info_i_want">...
asked Jan 26, 2012 at 2:47 by sequoia
4 Answers
First off, I suggest that it's better to figure out what the page needs that JS is providing, and fake that - you'll have an easier time scraping the page if a browser isn't involved.
If it's just JavaScript making an XMLHttpRequest, you can find the page from which the JavaScript fetches the iframe data and connect directly to that.
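For example, a minimal Python 2 sketch of hitting that endpoint directly might look like this (the URL and parameter are hypothetical placeholders - find the real request with your browser's network inspector):

    import urllib2

    # Hypothetical endpoint the popup's XMLHttpRequest fetches; replace it
    # with whatever your browser's network inspector actually shows.
    url = "http://intranet.example/popup_data?id=1"
    req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    print urllib2.urlopen(req).read()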
Even so, you may need a library that does JavaScript execution (if the reverse-engineering is too hard or the site uses challenge tokens). A web-rendering framework like Gecko or WebKit might be appropriate.
Take a good look at Selenium if you insist on using a true web browser or cannot get the programmatic methods to work.
Once you've gotten the page contents via whatever method, you need an HTML parser (such as sgmllib or, if the markup is well-formed, xml.dom). I suggest a DOM library. Parse the DOM and extract the contents from the appropriate node in the resulting tree.
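As a rough sketch of the DOM approach with xml.dom.minidom (which only accepts well-formed markup; messy real-world HTML usually needs a more forgiving parser) - the fragment below is a stand-in for the table from the question:

    from xml.dom.minidom import parseString

    # Stand-in for the popup's table; in practice this is the fetched markup.
    fragment = '<table><tr><td><span id="info_i_want">42</span></td></tr></table>'
    doc = parseString(fragment)
    for span in doc.getElementsByTagName("span"):
        if span.getAttribute("id") == "info_i_want":
            print span.firstChild.data  # -> 42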
The connection is refused when I try to get the contents with urllib.urlopen().
probably means you have to make a POST request using Python's urllib module. I would suggest you use urllib2. You may also need to handle cookies, the Referer header, and the User-Agent from your Python code. To see all the POST requests fired from your browser, use Firefox's Live HTTP Headers extension.
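A minimal Python 2 sketch of that approach - the URL, form field, and header values are hypothetical, copied in real life from what Live HTTP Headers shows the browser sending:

    import cookielib
    import urllib
    import urllib2

    # Cookie-aware opener so the session survives across requests.
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    data = urllib.urlencode({"item": "1"})  # hypothetical form field
    req = urllib2.Request("http://intranet.example/data", data, {
        "User-Agent": "Mozilla/5.0",
        "Referer": "http://intranet.example/list",
    })
    print opener.open(req).read()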
For the JavaScript part, your best bet is to run a headless browser, e.g. PhantomJS, which understands all the intricacies of JavaScript, the DOM, etc., but you will have to write your code in JavaScript; the benefit is that you can do whatever you want. As @phihag mentioned, Selenium is also a good option.
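A minimal Selenium sketch (Python bindings, old-style element lookup; the URL and CSS selector are hypothetical) that clicks the link and reads back the rendered page:

    from selenium import webdriver

    # Drive a real Firefox: load the list page, click the link that fires
    # Openpopup, then read the rendered HTML. URL and selector are hypothetical.
    driver = webdriver.Firefox()
    driver.get("http://intranet.example/list")
    driver.find_element_by_css_selector("#list li a").click()
    html = driver.page_source  # now includes the filled-in popup table
    driver.quit()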
First of all, you should really find out why the connection is refused when you access the page with Python. Most likely, you'll have to perform HTTP authentication or specify a different User-Agent.
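If it turns out to be HTTP Basic authentication, a minimal Python 2 sketch with urllib2 (host and credentials are placeholders):

    import urllib2

    # Placeholders: swap in your intranet host and credentials.
    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, "http://intranet.example/", "user", "secret")
    opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    print opener.open("http://intranet.example/list").read()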
Firing up a browser, navigating, and getting the HTML back is a complex task. Luckily, you can implement it using Selenium.
Consider taking a look at splinter, which offers a simpler webdriver API than Selenium.
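The same click-and-scrape flow in splinter looks roughly like this (URL and selector are hypothetical):

    from splinter import Browser

    # Visit the list page, click the popup link, read the rendered HTML.
    browser = Browser("firefox")
    browser.visit("http://intranet.example/list")
    browser.find_by_css("#list li a").first.click()
    print browser.html  # rendered page, popup table included
    browser.quit()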