I am working on a web scraper that searches Google for certain things and then pulls text from the result page, and I am having an issue getting Puppeteer to return the text I need. What I want to return is an array of strings.
Let's say I have a couple nested divs within a div, and each has text like so:
<div class='mainDiv'>
<div>Mary Doe </div>
<div> James Dean </div>
</div>
In the DOM, I can do the following to get the result I need:
document.querySelectorAll('.mainDiv')[0].innerText.split('\n')
This yields: ["Mary Doe", "James Dean"].
I understand that Puppeteer doesn't return NodeLists, and instead it uses JSHandles, but I still can't figure out how to get any information using the prescribed methods. See below for what I have tried in Puppeteer and the corresponding console output:
In every scenario, I do await page.waitFor('selector') to start.
Scenario 1 (using .$$eval()):
const genreElements = await page.$$eval('div.mainDiv', el => el);
console.log(genreElements) // []
Scenario 2 (using evaluate):
function extractItems() {
  const extractedElements = document.querySelectorAll('div.mainDiv')[0].innerText.split('\n')
  return extractedElements
}
let items = await page.evaluate(extractItems)
console.log(items) // UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerText' of undefined
Scenario 3 (using evaluateHandle):
const selectorHandle = await page.evaluateHandle(() => document.querySelectorAll('div.mainDiv'))
const resultHandle = await page.evaluate(x => x[0], selectorHandle)
console.log(resultHandle) // undefined
Any help or guidance on what I am doing wrong, or on how to achieve what I am looking to do, is much appreciated. Thank you!
Asked Dec 5, 2018 at 21:16 by Nigel Finley. Edited Jun 8, 2021 at 12:07 by DisappointedByUnaccountableMod.
Comment: Instead of querySelectorAll()[0] (get all, then throw everything away but the first), why not querySelector() (get the first)? – ggorlen, Mar 14, 2023 at 20:34
3 Answers
Use page.$$eval() or page.evaluate():
You can use page.$$eval() or page.evaluate() to run Array.from(document.querySelectorAll()) within the page context and map() the innerText of each element to the result array:
const names_1 = await page.$$eval('.mainDiv > div', divs => divs.map(div => div.innerText));
const names_2 = await page.evaluate(() => Array.from(document.querySelectorAll('.mainDiv > div'), div => div.innerText));
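The two one-liners above can be sketched as a small reusable helper. This is only a sketch under assumptions: the names textsOf and scrapeNames are hypothetical, `page` is assumed to be an already opened Puppeteer Page, and the trimming and empty-string filtering are additions that the answer itself does not include. The callback returns plain strings because the result is serialized on its way back to Node, and DOM elements do not serialize into anything useful.

```javascript
// Runs inside the page: map element-like objects to trimmed, non-empty strings.
// It must not reference outer variables, because Puppeteer sends it to the
// browser as source text.
function textsOf(elements) {
  return Array.from(elements, el => el.innerText.trim())
    .filter(text => text.length > 0);
}

// Hypothetical wrapper; `page` is assumed to be an open Puppeteer Page.
async function scrapeNames(page, selector = '.mainDiv > div') {
  await page.waitForSelector(selector); // wait until at least one match exists
  return page.$$eval(selector, textsOf);
}
```

Inside an async function this would be called as `const names = await scrapeNames(page);`. Keeping the mapping in a named function also means it can be checked in Node with plain objects, without launching a browser.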
Note: Keep in mind that if you use Puppeteer to automate searches on Google, you may be temporarily blocked and end up with an "Unusual traffic from your computer network" notice, requiring you to solve a reCAPTCHA. This may break your web scraper, so proceed with caution.
Try it like this:
let names = await page.evaluate(() => [...document.querySelectorAll('.mainDiv div')].map(div => div.innerText))
That way you can test the expression inside evaluate() directly in the Chrome console.
Using page.$eval:
const names = await page.$eval('.mainDiv', (element) => {
  return element.innerText;
});
Here the element is retrieved by selector and directly passed to the function to be evaluated.
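Since page.$eval() on the container returns one string rather than the array the question asks for, the string still needs to be split. A minimal sketch of that step follows, assuming `page` is an open Puppeteer Page. The names splitLines and namesFromContainer are hypothetical, and trimming and dropping blank lines are assumptions about what counts as a clean result.

```javascript
// Split a block of text into trimmed, non-empty lines.
function splitLines(text) {
  return text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

// Hypothetical wrapper; `page` is assumed to be an open Puppeteer Page.
async function namesFromContainer(page, selector = '.mainDiv') {
  // The callback runs in the browser and returns a plain string.
  const text = await page.$eval(selector, element => element.innerText);
  return splitLines(text); // the splitting happens in Node, not in the browser
}
```

Doing the split on the Node side keeps the browser callback trivial, and splitLines can be checked on its own with an ordinary string.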
Using page.evaluate:
const namesElem = await page.$('.mainDiv');
const names = await page.evaluate(namesElem => namesElem.innerText, namesElem);
This is basically the first method split up into two steps. The interesting part is that ElementHandles can be passed as arguments in page.evaluate() and can be evaluated like JSHandles.
Note that for simplicity and clarity I used the methods for retrieving single elements. page.$$() and page.$$eval() work the same way, but they select multiple elements and return an array instead.
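The multi-element variant mentioned above might look roughly like this. It is a sketch under assumptions: textsFromHandles is a hypothetical name, `page` is assumed to be an open Puppeteer Page, and the trimming and the explicit dispose() call are additions for tidiness rather than something the answer requires.

```javascript
// Hypothetical helper; `page` is assumed to be an open Puppeteer Page.
async function textsFromHandles(page, selector = '.mainDiv > div') {
  const handles = await page.$$(selector); // array of ElementHandles, possibly empty
  const texts = [];
  for (const handle of handles) {
    // A handle passed as an argument arrives in the page as the DOM element itself.
    texts.push(await page.evaluate(el => el.innerText.trim(), handle));
    await handle.dispose(); // release the browser-side reference
  }
  return texts;
}
```

This makes one round trip per element, so page.$$eval() is usually the simpler choice when only the text is needed. Handles are more useful when the same elements are also clicked or inspected later.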