python - How to use scrapy to crawl a website which hides the url as href="javascript:;" in the next button


I have been learning Python and Scrapy lately. I googled and searched around for a few days, but I can't seem to find any instructions on how to crawl multiple pages on a website with hidden URLs - <a href="javascript:;">. Basically, each page contains 20 listings; each time you click the ">>" button, it loads the next 20 items. I can't figure out how to find the actual URLs. Below is the source code for your reference. Any pointers and help are greatly appreciated.


asked May 29, 2015 at 15:49 by Pot; edited Jan 27, 2021 at 6:36 by SparkAndShine
  • From what I gather, the button is implemented with an anchor element? Without knowing what the website is, it would be difficult to help. My guess is that the anchor onClick event is getting bound to some javascript function that fires an AJAX call when the anchor is clicked. Depending on the site (and how obfuscated it is) tracing the location of the actual HTTP request could be challenging. – junnytony Commented May 29, 2015 at 19:44
  • you can see the site at rent.591.com.hk, it's in Chinese, but you can switch it to English in the top right corner. Any further help is much appreciated, junnytony – Pot Commented May 30, 2015 at 4:36
  • Check the Network tab in your browser developer tools to see which requests the site is making. – Elias Dorneles Commented May 30, 2015 at 16:40
  • I tried; the top of the Network tab shows "rent.591.com.hk/…". I copied this URL and pasted it into a browser, and an empty page was displayed. Under Network tab -> Initiator, it shows jquery-1.8.2.min.js:2, not sure if that tells me anything. Btw, I went on to try Selenium to mimic clicking the ">>" button for the next page; when I checked the URL (driver.current_url) after the click action, it always displays rent.591.com.hk/#list, so the URL is still hidden. I'd appreciate it a lot if someone could give me a hand resolving this – Pot Commented May 30, 2015 at 21:04
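For reference, here is a minimal sketch of the Selenium approach the asker describes in the last comment. The button locator and the wait condition are assumptions (the page's real markup isn't shown here), and, as the asker observed, driver.current_url keeps showing #list because clicking the href="javascript:;" link fires an AJAX call instead of a navigation, so the new listings have to be read from the updated DOM:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://rent.591.com.hk/?hl=en-us#list')

# Hypothetical locator for the ">>" pagination link -- inspect the page
# to find the real one.
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, '>>'))
)
next_button.click()

# The URL stays at #list; the freshly loaded listings exist only in the DOM.
html = driver.page_source
driver.quit()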

1 Answer


Visiting the site with a web browser and activated developer tools (I used Firefox with the Firebug add-on), you can analyze the network requests and responses. This shows that the site's pagination buttons send requests to a URL like the following:

http://rent.591.com.hk/?m=home&c=search&a=rslist&type=1&shType=list&p=2&searchtype=1

But it's not a normal request; it's an XMLHttpRequest, as indicated under the Headers tab, and the response is JSON.
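You can verify this outside of Scrapy with a quick one-off request, for example with the requests library (a sketch: the query parameters are copied from the URL above, and the X-Requested-With header marks the request as an XMLHttpRequest):

import requests

params = {
    'm': 'home', 'c': 'search', 'a': 'rslist',
    'type': '1', 'shType': 'list', 'p': '2', 'searchtype': '1',
}
headers = {
    'Referer': 'http://rent.591.com.hk/?hl=en-us',
    'X-Requested-With': 'XMLHttpRequest',  # identify the request as AJAX
}
resp = requests.get('http://rent.591.com.hk/', params=params, headers=headers)
data = resp.json()  # the endpoint answers with JSON, parsed here into a dict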

So you don't need to grab the data from complicated nested HTML structures; you can get it directly from the JSON dict.

I ended up with this scrapy code (with room for improvement):

import scrapy
import json

class RentObject(scrapy.Item):
    address = scrapy.Field()
    purpose = scrapy.Field()
    # Add more fields as needed

class ScrapeSpider(scrapy.Spider):

    name = "rent_hk"
    allowed_domains = ['591.com.hk']
    start_urls = ['http://rent.591.com.hk/?hl=en-us#list']

    page_number = 0
    page_num_max = 5  # for test purposes grab only up to 5 pages

    def parse(self, response):

        # The first response (the start URL) is HTML; every later one is
        # the JSON answer to one of the requests generated below.
        if 'page_number' in response.meta:
            result_dict = json.loads(response.body)  # get data as dict
            for listing in result_dict['items']:
                ro = RentObject()
                ro['address'] = listing['address']
                ro['purpose'] = listing['purpose']
                yield ro

        # Make request for (next page) JSON data
        self.page_number += 1

        payload = {
            'm': 'home',
            'c': 'search',
            'a': 'rslist',
            'type': '1',
            'p': str(self.page_number),
            'searchtype': '1'
        }

        if self.page_number < self.page_num_max:
            # FormRequest with method='GET' encodes the payload as the query
            # string; the X-Requested-With header mimics the browser's AJAX call.
            request = scrapy.FormRequest(url='http://rent.591.com.hk/',
                                         method='GET',
                                         formdata=payload,
                                         headers={'Referer': 'http://rent.591.com.hk/?hl=en-us',
                                                  'X-Requested-With': 'XMLHttpRequest'},
                                         callback=self.parse)
            request.meta['page_number'] = self.page_number
            yield request

The site is really not easy for a Scrapy beginner, so I compiled this detailed answer.
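If you want to try the spider without setting up a full Scrapy project, a small runner script works too (a sketch; FEED_URI/FEED_FORMAT are the older-style feed settings matching Scrapy versions of that era):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',        # export scraped items as JSON
    'FEED_URI': 'listings.json',
})
process.crawl(ScrapeSpider)
process.start()  # blocks until the crawl is finished

Alternatively, save the spider to a file and run it with scrapy runspider spider.py -o listings.json.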
