python - How to find all the JavaScript requests made from my browser when I'm accessing a site - Stack Overflow


I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.

Here is my code:

import requests
from bs4 import BeautifulSoup

class Linkedin:
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveResultToHtmlFile(self, nameOfFile=None):
        if nameOfFile is None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSignInPage(self):
        # Fetch the login page and pull the CSRF token out of the login form
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connectToMyLinkedin(self):
        # Submit the login form, with the CSRF token, in the same session
        self.form_data = {"session_key": "[email protected]",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)


url = "https://www.linkedin.com/"

linkedin_page = Linkedin(url)
linkedin_page.getSignInPage()
linkedin_page.connectToMyLinkedin()  # I'm connected, but the JavaScript-loaded content is still missing
linkedin_page.getAnyPage("https://www.linkedin.com/jobs/")
linkedin_page.saveResultToHtmlFile()

I'd like help getting past the JavaScript loading without using Selenium.

Asked Nov 14, 2019 by Ali BENALI; edited May 14, 2021 by DisappointedByUnaccountableMod.

3 Answers


Although it's technically possible to simulate all the calls from Python, for a dynamic page like LinkedIn I think it would be quite tedious and brittle.

In any case, open your browser's "developer tools" before you open LinkedIn and watch what the traffic looks like. You can filter for the requests made by JavaScript (in Firefox, the filter is called XHR).

You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data, such as JSON, to JavaScript, so you won't need to do as much HTML parsing.
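As a minimal sketch of why this pays off: suppose the page's JavaScript fetches job listings from a (hypothetical) JSON endpoint. Replaying that request in your logged-in session would give you a body like the one below, which is trivially parsed compared with scraping the rendered HTML. The endpoint shape and field names here are invented for illustration:

```python
import json

# A response body of the kind XHR endpoints typically return (hypothetical shape).
# With requests you would obtain it as: data = session.get(api_url, headers=headers).json()
raw = '{"elements": [{"title": "Data Engineer", "location": "Berlin"}]}'

# .json() is just json.loads() on the response body:
data = json.loads(raw)

# No HTML parsing needed; the fields are already structured.
for job in data["elements"]:
    print(job["title"], "-", job["location"])
```

This prints `Data Engineer - Berlin` without BeautifulSoup ever being involved; the work shifts from parsing markup to finding the right endpoint in the network tab.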

If you find you're not making much progress this way (it really depends on the particular site), you will probably have to use Selenium or an alternative such as:

  • https://robotframework.org/
  • https://miyakogi.github.io/pyppeteer/ (a port of Puppeteer to Python)

You should send all the XHR and JS requests manually, in the same session you created during login. Also pass all the header fields in each request (copy them from the network tools).

self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]

    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)

I called the postConnectionRequests() function after connecting and before saving the HTML content, and received the complete page. Hope this helps.

XHR requests are sent by JavaScript, and Python will not run any JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run its JavaScript. You can also use a headless browser.
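To see concretely what this answer means, here is a small self-contained demonstration (the HTML snippet is invented for illustration): BeautifulSoup sees the `<script>` tag as inert markup, so anything that script would have injected into the DOM simply isn't there.

```python
from bs4 import BeautifulSoup

# HTML as a scraper receives it: the <script> is present as text,
# but it is never executed, so the div it would fill stays empty.
html = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'loaded by JS';</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
app = soup.find(id="app")

print(repr(app.get_text()))      # prints '' -- the script never ran
print(soup.script is not None)   # prints True -- the tag is visible as markup only
```

A real browser (or Selenium, or a headless browser) would run the script and show "loaded by JS" inside the div; requests + bs4 never will.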
