python - How to find all the JavaScript requests made from my browser when I'm accessing a site - Stack Overflow


I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.

Here is my code:

import requests
from bs4 import BeautifulSoup

class Linkedin:
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveResultToHtmlFile(self, nameOfFile=None):
        if nameOfFile is None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSignInPage(self):
        # Fetch the login page and pull the CSRF token out of the login form
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connectToMyLinkedin(self):
        # Submit the login form, with the CSRF token, in the same session
        self.form_data = {"session_key": "[email protected]",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)


url = "https://www.linkedin.com/"

linkedin_page = Linkedin(url)
linkedin_page.getSignInPage()
linkedin_page.connectToMyLinkedin()  # I'm connected, but the JavaScript-loaded content is still missing
linkedin_page.getAnyPage("https://www.linkedin.com/jobs/")
linkedin_page.saveResultToHtmlFile()

I'd like help getting past the JavaScript loading without using Selenium.

Asked Nov 14, 2019 by Ali BENALI; edited May 14, 2021 by DisappointedByUnaccountableMod.

3 Answers


Although it's technically possible to simulate all the calls from Python, for a dynamic page like LinkedIn I think it would be quite tedious and brittle.

In any case, open your browser's "developer tools" before you open LinkedIn and watch what the traffic looks like. You can filter for the requests made by JavaScript (in Firefox, the filter is called XHR).

You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data, such as JSON, to JavaScript, so you won't need to do as much HTML parsing.
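As a minimal sketch of why this pays off: suppose the page's JavaScript fetches job listings from a (hypothetical) JSON endpoint. Replaying that request in your logged-in session would give you a body like the one below, which is trivially parsed compared with scraping the rendered HTML. The endpoint shape and field names here are invented for illustration:

```python
import json

# A response body of the kind XHR endpoints typically return (hypothetical shape).
# With requests you would obtain it as: data = session.get(api_url, headers=headers).json()
raw = '{"elements": [{"title": "Data Engineer", "location": "Berlin"}]}'

# .json() is just json.loads() on the response body:
data = json.loads(raw)

# No HTML parsing needed; the fields are already structured.
for job in data["elements"]:
    print(job["title"], "-", job["location"])
```

This prints `Data Engineer - Berlin` without BeautifulSoup ever being involved; the work shifts from parsing markup to finding the right endpoint in the network tab.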

If you find you're not making much progress this way (it really depends on the particular site), you will probably have to use Selenium or an alternative such as:

  • https://robotframework.org/
  • https://miyakogi.github.io/pyppeteer/ (a port of Puppeteer to Python)

You should send all the XHR and JS requests manually, in the same session you created during login. Also pass all the header fields in each request (copy them from the network tools).

self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]

    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)

I called the postConnectionRequests() function after connecting and before saving the HTML content, and received the complete page. Hope this helps.

XHR requests are sent by JavaScript, and Python will not run any JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run its JavaScript. You can also use a headless browser.
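To see concretely what this answer means, here is a small self-contained demonstration (the HTML snippet is invented for illustration): BeautifulSoup sees the `<script>` tag as inert markup, so anything that script would have injected into the DOM simply isn't there.

```python
from bs4 import BeautifulSoup

# HTML as a scraper receives it: the <script> is present as text,
# but it is never executed, so the div it would fill stays empty.
html = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'loaded by JS';</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
app = soup.find(id="app")

print(repr(app.get_text()))      # prints '' -- the script never ran
print(soup.script is not None)   # prints True -- the tag is visible as markup only
```

A real browser (or Selenium, or a headless browser) would run the script and show "loaded by JS" inside the div; requests + bs4 never will.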
