javascript - crawl dynamic webpage for data using scrapy

I am trying to get some data from the NBA's official stats site for data analysis. I use scrapy as my primary scraping tool. However, after inspecting the page elements I found that the page is generated dynamically with JavaScript. I am completely new to JavaScript and could not figure out how it actually works (which js file is invoked, how it loads the table of data into the page, and whether there is an easier way to obtain the data). I also found some JSON files in the Network tab, and I have no idea how they are used.

http://stats.nba.com/teamLineups.html?TeamID=1610612739&pageNo=1&rowsPerPage=100&Season=2008-09&sortField=MIN&sortOrder=DES&PerMode=Per48

Could anyone use the URL above to explain how the website actually loads the data, and how it processes the data so that it is displayed this way?

The key part is still how to obtain the data. I have seen answers that use the POST method to get the data back (sorry, I am not even familiar with GET/POST), but I still could not figure out how that applies in this context.

Thank you for your generous guidance!

Asked Jul 12, 2014 at 4:24 by ethanluoyc

3 Answers

In this example, JavaScript only lets the page send, receive and display content without reloading the whole page for each request. So you don't need to parse the JavaScript; you just have to find which information is requested, imitate that request, and parse the response. For that, you can use Firebug in Firefox, or the developer tools in Chrome (Ctrl+Shift+J on Windows, Cmd+Opt+J on Mac). In Chrome, click the "Network" tab and you will see the requests and responses as you click around the website.

In this particular example, when you want to get the stats for the Cleveland team for "2008-09", the JavaScript sends multiple requests. The request for lineups, which you are probably interested in, is this one: http://stats.nba.com/stats/teamdashlineups?PlusMinus=N&pageNo=1&GroupQuantity=5&TeamID=1610612739&GameID=&Location=&SeasonType=Regular+Season&Season=2008-09&PaceAdjust=N&DateFrom=&sortOrder=DES&VsConference=&OpponentTeamID=0&DateTo=&GameSegment=&LastNGames=0&VsDivision=&LeagueID=00&Outcome=&GameScope=&MeasureType=Base&PerMode=Per48&sortField=MIN&SeasonSegment=&Period=0&Rank=N&Month=0&rowsPerPage=100
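To see at a glance which parameters the page's JavaScript sends, it can help to split such a request URL into its parts. This is only a minimal sketch using Python's standard library; the URL is a shortened copy of the one above with a subset of its parameters, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the request URL seen in the Network tab
# (only a subset of the parameters, for readability)
url = ("http://stats.nba.com/stats/teamdashlineups"
       "?TeamID=1610612739&Season=2008-09&SeasonType=Regular+Season"
       "&PerMode=Per48&GameID=&rowsPerPage=100")

parts = urlsplit(url)

# keep_blank_values=True keeps parameters that are sent empty, such as GameID
query = parse_qs(parts.query, keep_blank_values=True)
params = {key: values[0] for key, values in query.items()}

print(parts.path)
for key, value in params.items():
    print(f"{key} = {value!r}")
```

A dictionary like params can later be passed back through urlencode to rebuild the same query string with a different team or season.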

Here's an example spider built on scrapy's base Spider class. You just need to define the LineupItem, and then you can execute it with scrapy crawl stats -o output.json.

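import json
from urllib.parse import urlencode  # Python 3; on Python 2 this was `from urllib import urlencode`

# Older Scrapy versions exposed these as scrapy.spider.Spider and scrapy.http.Request
from scrapy import Request, Spider
from nba.items import LineupItem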


class StatsSpider(Spider):
    name = "stats"
    allowed_domains = ["stats.nba.com"]
    start_urls = (
        'http://stats.nba.com/',
    )

    def parse(self, response):
        return self.get_lineup('1610612739','2008-09')

    def get_lineup(self, team_id, season):
        params = {
            'Season':         season,
            'SeasonType':     'Regular Season',
            'LeagueID':       '00',
            'TeamID':         team_id,
            'MeasureType':    'Base',
            'PerMode':        'Per48',
            'PlusMinus':      'N',
            'PaceAdjust':     'N',
            'Rank':           'N',
            'Outcome':        '',
            'Location':       '',
            'Month':          '0',
            'SeasonSegment':  '',
            'DateFrom':       '',
            'DateTo':         '',
            'OpponentTeamID': '0',
            'VsConference':   '',
            'VsDivision':     '',
            'GameSegment':    '',
            'Period':         '0',
            'LastNGames':     '0',
            'GroupQuantity':  '5',
            'GameScope':      '',
            'GameID':         '',
            'pageNo':         '1',
            'rowsPerPage':    '100',
            'sortField':      'MIN',
            'sortOrder':      'DES'
        }
        return Request(
            url="http://stats.nba.com/stats/teamdashlineups?" + urlencode(params),
            dont_filter=True,
            callback=self.parse_lineup
        )

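    def parse_lineup(self, response):
        data = json.loads(response.body)
        # The second result set holds the lineup rows; each row is a flat
        # list whose positions correspond to the fields assigned below
        for lineup in data['resultSets'][1]['rowSet']: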
            item = LineupItem()
            item['group_set'] = lineup[0]
            item['group_id'] = lineup[1]
            item['group_name'] = lineup[2]
            item['gp'] = lineup[3]
            item['w'] = lineup[4]
            item['l'] = lineup[5]
            item['w_pct'] = lineup[6]
            item['min'] = lineup[7]
            item['fgm'] = lineup[8]
            item['fga'] = lineup[9]
            item['fg_pct'] = lineup[10]
            item['fg3m'] = lineup[11]
            item['fg3a'] = lineup[12]
            item['fg3_pct'] = lineup[13]
            item['ftm'] = lineup[14]
            item['fta'] = lineup[15]
            item['ft_pct'] = lineup[16]
            item['oreb'] = lineup[17]
            item['dreb'] = lineup[18]
            item['reb'] = lineup[19]
            item['ast'] = lineup[20]
            item['tov'] = lineup[21]
            item['stl'] = lineup[22]
            item['blk'] = lineup[23]
            item['blka'] = lineup[24]
            item['pf'] = lineup[25]
            item['pfd'] = lineup[26]
            item['pts'] = lineup[27]
            item['plus_minus'] = lineup[28]
            yield item

This will produce JSON records such as this one:

{"gp": 30, "fg_pct": 0.491, "group_name": "Ilgauskas,Zydrunas - James,LeBron - Wallace,Ben - West,Delonte - Williams,Mo", "group_set": "Lineups", "w_pct": 0.833, "pts": 103.0, "min": 484.9866666666667, "tov": 13.3, "fta": 21.6, "pf": 16.0, "blk": 7.7, "reb": 44.2, "blka": 3.0, "ftm": 16.6, "ft_pct": 0.771, "fg3a": 18.7, "pfd": 17.2, "ast": 23.3, "fg3m": 7.4, "fgm": 39.5, "fg3_pct": 0.397, "dreb": 32.0, "fga": 80.4, "plus_minus": 18.4, "stl": 8.3, "l": 5, "oreb": 12.3, "w": 25, "group_id": "980 - 2544 - 1112 - 2753 - 2590"}
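The hard-coded indices in parse_lineup could be avoided if the response lists its column names. The sketch below assumes that each entry of resultSets carries a headers list alongside rowSet; that is an assumption about the response layout and should be checked against the real JSON. The sample data, the result set names and the rows_as_dicts helper are made up for illustration.

```python
def rows_as_dicts(result_set):
    """Pair each row with the column names, lower-cased to match the item fields."""
    keys = [name.lower() for name in result_set["headers"]]
    return [dict(zip(keys, row)) for row in result_set["rowSet"]]


# Made-up sample in the assumed layout; a real response has many more columns
sample = {
    "resultSets": [
        {"name": "Overall",
         "headers": ["GROUP_SET", "GP"],
         "rowSet": [["Overall", 82]]},
        {"name": "Lineups",
         "headers": ["GROUP_SET", "GP", "W", "L"],
         "rowSet": [["Lineups", 30, 25, 5]]},
    ]
}

lineups = rows_as_dicts(sample["resultSets"][1])
print(lineups)
```

With this approach the spider would keep working if columns were added or reordered, as long as the header names themselves stay the same.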

Scrapy can't run JavaScript, so you will have to either analyze the JavaScript code and do something similar in Python and Scrapy, or work out how the JavaScript gets data from the server (which URLs and parameters it uses) and use that in your script. It can be a lot of work: first with Firebug in Firefox, then with Python and Scrapy.

If you have no idea how to do this, then it is better to use Selenium (or something similar), which drives a real browser and can run JavaScript. You only have to tell Selenium which buttons to press on the page, what text to put in forms, etc.

import json
import pprint

import requests

# Send the same GET request that the page's JavaScript sends
response = requests.get('http://stats.nba.com/stats/teamdashlineups?Season=2008-09&SeasonType=Regular+Season&LeagueID=00&TeamID=1610612739&MeasureType=Base&PerMode=Per48&PlusMinus=N&PaceAdjust=N&Rank=N&Outcome=&Location=&Month=0&SeasonSegment=&DateFrom=&DateTo=&OpponentTeamID=0&VsConference=&VsDivision=&GameSegment=&Period=0&LastNGames=0&GroupQuantity=5&GameScope=&GameID=&pageNo=1&rowsPerPage=100&sortField=MIN&sortOrder=DES')

# Parse the JSON response body into a Python dictionary
data = json.loads(response.text)

# Pretty-print the whole response
pprint.pprint(data)

# Print only the data rows of each result set
for x in data['resultSets']:
    print(x['rowSet'])

I might not be able to answer your question in the detail that you want, but here is how I understand it.

When you go to a page, the browser GETs the source code of the page, the same source code that you see when you click "View page source" in Chrome. The browser interprets the code, and when it finds a "src" attribute that points to an external file, it fetches that file as well, again with a GET request.

<script src="/js/libs/modernizr.custom.16166.js"></script>

Once the JavaScript files are loaded, they can run themselves:

jsfile.js:

function myFunction() {
//
//do stuff
//
}

myFunction();

In the case of your NBA site, the imported files create the table and populate it using AJAX GET requests.

Your NBA site seems to get the table information from this link using an AJAX GET request from "jquery.statrequest.js" and "team-lineups.js". It's a mess, so you might still want to scrape the page normally.

If you decide to scrape the page, you will not be able to use urllib, because it just gets the page source: it doesn't fetch any external .js scripts and doesn't run the JavaScript code, so the table on the page won't be created and populated with the NBA stats.
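To illustrate the point, here is a small sketch using Python's standard html.parser. The HTML string is a made-up stand-in for the page source (only the script tag is taken from above, and the div id is hypothetical). The parser finds the script reference but no table cells, because the rows would only be added once a browser runs the JavaScript.

```python
from html.parser import HTMLParser


class SourceInspector(HTMLParser):
    """Collect external script URLs and count table cells in static HTML."""

    def __init__(self):
        super().__init__()
        self.script_sources = []
        self.table_cells = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and "src" in attributes:
            self.script_sources.append(attributes["src"])
        elif tag == "td":
            self.table_cells += 1


# Made-up page source: the table container stays empty until JavaScript fills it
page_source = """
<html><body>
<div id="lineups-table"></div>
<script src="/js/libs/modernizr.custom.16166.js"></script>
</body></html>
"""

inspector = SourceInspector()
inspector.feed(page_source)
print(inspector.script_sources)
print(inspector.table_cells)
```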

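You will need a tool that drives a real browser and runs JavaScript, such as Selenium. Note that Python's Mechanize emulates a browser's requests and form handling but does not execute JavaScript, so it will not populate the table either.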

I hope this gives you some idea of what you wanted to know; I'm not that familiar with the inner workings of browsers. You might want to look for a website that offers a free API for NBA game stats.

Here is another link from the NBA site that might be relevant to you.
