I am learning python and scrapy lately. I googled and searched around for a few days, but I don't seem to find any instruction on how to crawl multiple pages on a website with hidden urls - <a href="javascript:;"
. Basically each page contains 20 listings, each time you click on ">>" button, it will load the next 20 items. I can't figure out how to find the actual urls, below is the source code for your reference. Any pointers and help is greatly appreciated.
- From what I gather, the button is implemented with an anchor element? Without knowing what the website is, it would be difficult to help. My guess is that the anchor onClick event is getting bound to some javascript function that fires an AJAX call when the anchor is clicked. Depending on the site (and how obfuscated it is) tracing the location of the actual HTTP request could be challenging. – junnytony Commented May 29, 2015 at 19:44
- you can see the site at rent.591..hk, it's in Chinese, but you can switch it to English in the top right corner. Any further help is much appreciated junnytony – Pot Commented May 30, 2015 at 4:36
- Check the network tab in your browser developer tools to see which requests that site is doing. – Elias Dorneles Commented May 30, 2015 at 16:40
- I tried, the top of the Network tab shows "rent.591..hk/…". I copied this URL and pasted it into a browser, and an empty page was displayed. Under Network Tab -> Initiator, it shows jquery-1.8.2.min.js:2, not sure if that tells me anything. Btw, I went on to try Selenium to mimic clicking on the >> button for the next page; when I checked the URL (driver.current_url) after the click action, it always displays rent.591..hk/#list, so the URL is still hidden. I'd appreciate it a lot if someone could give me a hand to resolve this – Pot Commented May 30, 2015 at 21:04
1 Answer
Visiting the site in a web browser with the developer tools activated (the following screenshots were made with Firefox and the Firebug add-on), you should be able to analyze the network requests and responses. It will show you that the site's pagination buttons send requests like the following:
So the URL seems to be:
http://rent.591..hk/?m=home&c=search&a=rslist&type=1&shType=list&p=2&searchtype=1
But it's not a normal request: it's an XMLHttpRequest, as indicated under the Headers tab. And the response is in JSON:

So you don't need to scrape the data out of complicated nested HTML structures; you can get it directly from the JSON dict.
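To illustrate that last point, here is a minimal sketch of the parsing step. The response body shown is a stand-in (the real one comes from the XHR); the key names 'items', 'address' and 'purpose' are the ones used in the spider code below:

```python
import json

# Stand-in for the XHR response body; in practice this is what the
# server returns for each pagination request. The key names
# ('items', 'address', 'purpose') are taken from the spider code.
body = json.dumps({
    "items": [
        {"address": "1 Example Road", "purpose": "residential"},
        {"address": "2 Sample Street", "purpose": "commercial"},
    ]
})

# Decode the JSON and pull the fields straight out of the dict,
# no HTML parsing needed.
result_dict = json.loads(body)
listings = [(o["address"], o["purpose"]) for o in result_dict["items"]]
print(listings)
```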
I ended up with this scrapy code (with room for improvement):
import scrapy
import json


class RentObject(scrapy.Item):
    address = scrapy.Field()
    purpose = scrapy.Field()
    # Add more fields as needed


class ScrapeSpider(scrapy.Spider):
    name = "rent_hk"
    allowed_domains = ['591..hk']
    start_urls = ['http://rent.591..hk/?hl=en-us#list']

    page_number = 0
    page_num_max = 5  # for test purposes grab only up to 5 pages

    def parse(self, response):
        # Only the JSON requests carry 'page_number' in their meta;
        # the initial HTML page is skipped.
        if 'page_number' in response.meta:
            result_dict = json.loads(response.body)  # get data as dict
            for item in result_dict['items']:
                ro = RentObject()
                ro['address'] = item['address']
                ro['purpose'] = item['purpose']
                yield ro

        # Make request for (next page) JSON data
        self.page_number += 1
        payload = {
            'm': 'home',
            'c': 'search',
            'a': 'rslist',
            'type': '1',
            'shType': 'list',
            'p': str(self.page_number),
            'searchtype': '1'
        }
        if self.page_number < self.page_num_max:
            request = scrapy.FormRequest(url='http://rent.591..hk/',
                                         method='GET',
                                         formdata=payload,
                                         headers={'Referer': 'http://rent.591..hk/?hl=en-us',
                                                  'X-Requested-With': 'XMLHttpRequest'},
                                         callback=self.parse)
            request.meta['page_number'] = self.page_number
            yield request
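Since FormRequest with method='GET' serializes the formdata dict into the query string, you can sanity-check offline what URL the spider will actually request. A small sketch (the double-dot domain is kept exactly as it appears in this answer):

```python
from urllib.parse import urlencode

# Domain kept truncated as in the answer above.
BASE = 'http://rent.591..hk/'


def listing_url(page):
    # Same parameters the spider sends; only 'p' (the page) varies.
    payload = {
        'm': 'home',
        'c': 'search',
        'a': 'rslist',
        'type': '1',
        'shType': 'list',
        'p': str(page),
        'searchtype': '1',
    }
    return BASE + '?' + urlencode(payload)


print(listing_url(2))
```

For page 2 this reproduces the URL captured in the developer tools earlier in the answer.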
The site is really not easy for a scrapy beginner - so I compiled this detailed answer.