python - How to use scrapy to crawl a website which hides the url as href="javascript:;" in the next button


I have been learning Python and Scrapy lately. I googled and searched around for a few days, but I can't seem to find any instructions on how to crawl multiple pages on a website with hidden URLs - <a href="javascript:;">. Basically, each page contains 20 listings; each time you click the ">>" button, it loads the next 20 items. I can't figure out how to find the actual URLs. Below is the source code for your reference. Any pointers and help are greatly appreciated.


asked May 29, 2015 at 15:49 by Pot; edited Jan 27, 2021 at 6:36 by SparkAndShine
  • From what I gather, the button is implemented with an anchor element? Without knowing what the website is, it would be difficult to help. My guess is that the anchor onClick event is getting bound to some javascript function that fires an AJAX call when the anchor is clicked. Depending on the site (and how obfuscated it is) tracing the location of the actual HTTP request could be challenging. – junnytony Commented May 29, 2015 at 19:44
  • you can see the site at rent.591.com.hk, it's in Chinese, but you can switch it to English in the top right corner. Any further help is much appreciated, junnytony – Pot Commented May 30, 2015 at 4:36
  • Check the Network tab in your browser developer tools to see which requests the site is making. – Elias Dorneles Commented May 30, 2015 at 16:40
  • I tried; the top of the Network tab shows "rent.591.com.hk/…". I copied this URL and pasted it into a browser, and an empty page was displayed. Under Network tab -> Initiator, it shows jquery-1.8.2.min.js:2, not sure if that tells me anything. Btw, I went on to try Selenium to mimic clicking the ">>" button for the next page; when I checked the URL (driver.current_url) after the click action, it always displays rent.591.com.hk/#list, so the URL is still hidden. I'd appreciate it a lot if someone could give me a hand resolving this – Pot Commented May 30, 2015 at 21:04
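For reference, here is a minimal sketch of the Selenium approach the asker describes in the last comment. The button locator and the wait condition are assumptions (the page's real markup isn't shown here), and, as the asker observed, driver.current_url keeps showing #list because clicking the href="javascript:;" link fires an AJAX call instead of a navigation, so the new listings have to be read from the updated DOM:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://rent.591.com.hk/?hl=en-us#list')

# Hypothetical locator for the ">>" pagination link -- inspect the page
# to find the real one.
next_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, '>>'))
)
next_button.click()

# The URL stays at #list; the freshly loaded listings exist only in the DOM.
html = driver.page_source
driver.quit()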

1 Answer


Visiting the site with a web browser and activated developer tools (I used Firefox with the Firebug add-on), you can analyze the network requests and responses. This shows that the site's pagination buttons send requests to a URL like the following:

http://rent.591.com.hk/?m=home&c=search&a=rslist&type=1&shType=list&p=2&searchtype=1

But it's not a normal request; it's an XMLHttpRequest, as indicated under the Headers tab, and the response is JSON.
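You can verify this outside of Scrapy with a quick one-off request, for example with the requests library (a sketch: the query parameters are copied from the URL above, and the X-Requested-With header marks the request as an XMLHttpRequest):

import requests

params = {
    'm': 'home', 'c': 'search', 'a': 'rslist',
    'type': '1', 'shType': 'list', 'p': '2', 'searchtype': '1',
}
headers = {
    'Referer': 'http://rent.591.com.hk/?hl=en-us',
    'X-Requested-With': 'XMLHttpRequest',  # identify the request as AJAX
}
resp = requests.get('http://rent.591.com.hk/', params=params, headers=headers)
data = resp.json()  # the endpoint answers with JSON, parsed here into a dict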

So you don't need to grab the data from complicated nested HTML structures; you can get it directly from the JSON dict.

I ended up with this scrapy code (with room for improvement):

import scrapy
import json

class RentObject(scrapy.Item):
    address = scrapy.Field()
    purpose = scrapy.Field()
    # Add more fields as needed

class ScrapeSpider(scrapy.Spider):

    name = "rent_hk"
    allowed_domains = ['591.com.hk']
    start_urls = ['http://rent.591.com.hk/?hl=en-us#list']

    page_number = 0
    page_num_max = 5  # for test purposes grab only up to 5 pages

    def parse(self, response):

        # The first response (the start URL) is HTML; every later one is
        # the JSON answer to one of the requests generated below.
        if 'page_number' in response.meta:
            result_dict = json.loads(response.body)  # get data as dict
            for listing in result_dict['items']:
                ro = RentObject()
                ro['address'] = listing['address']
                ro['purpose'] = listing['purpose']
                yield ro

        # Make request for (next page) JSON data
        self.page_number += 1

        payload = {
            'm': 'home',
            'c': 'search',
            'a': 'rslist',
            'type': '1',
            'p': str(self.page_number),
            'searchtype': '1'
        }

        if self.page_number < self.page_num_max:
            # FormRequest with method='GET' encodes the payload as the query
            # string; the X-Requested-With header mimics the browser's AJAX call.
            request = scrapy.FormRequest(url='http://rent.591.com.hk/',
                                         method='GET',
                                         formdata=payload,
                                         headers={'Referer': 'http://rent.591.com.hk/?hl=en-us',
                                                  'X-Requested-With': 'XMLHttpRequest'},
                                         callback=self.parse)
            request.meta['page_number'] = self.page_number
            yield request

The site is really not easy for a Scrapy beginner, so I compiled this detailed answer.
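If you want to try the spider without setting up a full Scrapy project, a small runner script works too (a sketch; FEED_URI/FEED_FORMAT are the older-style feed settings matching Scrapy versions of that era):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',        # export scraped items as JSON
    'FEED_URI': 'listings.json',
})
process.crawl(ScrapeSpider)
process.start()  # blocks until the crawl is finished

Alternatively, save the spider to a file and run it with scrapy runspider spider.py -o listings.json.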
