javascript - How to output proper JSON from a Puppeteer-scraped table? - Stack Overflow

I'm new to Puppeteer and don't know its full potential. I have the following code that returns results from a scrape, but each row comes back as one long tab-delimited string. I'm trying to get proper JSON.

(async () => {
    const browser = await puppeteer.launch({headless: true});
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: 'networkidle0'});

    let data = await page.evaluate(() => {
        const table = Array.from(document.querySelectorAll('table[id="gvM"] > tbody > tr ')); 
        return table.map(td => td.innerText);
    })

    console.log(data);
})();

Here is the html table:

<table cellspacing="0" cellpadding="4" rules="all" border="1" id="gvM" >
        <tr >
            <th scope="col">#</th><th scope="col">Resource</th><th scope="col">EM #</th><th scope="col">CVO</th><th scope="col">Start</th><th scope="col">End</th><th scope="col">Status</th><th scope="col">Assignment</th><th scope="col">&nbsp;</th>
        </tr>
        <tr >
            <td>31</td><td>Smith</td><td>618</td><td align="center"><span class="aspNetDisabled"><input id="gvM_ctl00_0" type="checkbox" name="gvM$ctl02$ctl00" disabled="disabled" /></span></td><td>&nbsp;</td><td>&nbsp;</td><td>AVAILABLE EXEC</td><td style="width:800px;">6F</td><td align="center"></td>
        </tr>
        <tr style="background-color:LightGreen;">
            <td>1</td><td>John</td><td>604</td><td align="center"><span class="aspNetDisabled"></span></td><td>1400</td><td>2200</td><td>AVAILABLE</td><td style="width:800px;">&nbsp;</td><td align="center"></td>
        </tr>
</table>

This is what I get:

[ '#\tResource\tEM #\tCVO\tStart\tEnd\tStatus\tAssignment\t ', '31\tSmith\t618\t\t \t \tAVAILABLE EXEC\t6F\t', '1\tJohn\t604\t\t1400\t2200\tAVAILABLE\t \t']

and this is what I want to get:

[{'#','Resource','EM', '#','CVO','Start','tEnd','Status', 'Assignment'}, {'31','Smith', '618',' ',' ',' ',' ','AVAILABLE EXEC','6F'}, {'1','John', '604',' ',' ','1400 ','2200','AVAILABLE', ' '}]

I applied the answer below, but I wasn't able to reproduce the results. Perhaps I'm doing something wrong. Could you explain where I'm going wrong?

const context = document.querySelectorAll('table[id="gvM"] > tbody > tr ');

const query = (selector, context) => Array.from(context.querySelectorAll(selector));
console.log( 
    query('tr', context).map(row => 
        query('td, th', row).map(cell => 
        cell.textContent))  
);

What does this error mean?

(node:6204) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6204) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

asked Mar 27, 2019 at 0:38 by vt2424253, edited Mar 27, 2019 at 17:15

  • The wanted json in your question is invalid. – Niloct, Mar 27, 2019 at 0:45

2 Answers

If you need an array of arrays from the table, you can map every row to an array of the text of its cells. This variant uses Array.from() with a mapping function as the second argument:

const data = await page.evaluate(
  () => Array.from(
    document.querySelectorAll('table[id="gvM"] > tbody > tr'),
    row => Array.from(row.querySelectorAll('th, td'), cell => cell.innerText)
  )
);
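If the goal is JSON with named fields rather than an array of arrays, one option is to treat the first row as the header and key every later row by it. The following is a minimal sketch, not part of Puppeteer: `rowsToObjects` and the `column<index>` fallback name are hypothetical, and the sample rows are copied from the question's output, assuming the whitespace-only cells hold a non-breaking space.

```javascript
// Rows in the shape returned by the page.evaluate() call above: the first row
// holds the header text, the remaining rows hold the cell text.
// '\u00a0' is assumed to be the text of a cell that contains only &nbsp;.
const rows = [
  ['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', '\u00a0'],
  ['31', 'Smith', '618', '', '\u00a0', '\u00a0', 'AVAILABLE EXEC', '6F', ''],
  ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', '\u00a0', ''],
];

// Hypothetical helper: turns [headers, ...records] into an array of objects.
function rowsToObjects([headers, ...records]) {
  // trim() also strips non-breaking spaces; a blank header gets a fallback name.
  const keys = headers.map((header, i) => header.trim() || `column${i}`);
  return records.map(cells =>
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] ?? '').trim()]))
  );
}

const objects = rowsToObjects(rows);
console.log(JSON.stringify(objects, null, 2));
```

Because the conversion runs on plain arrays, it can be done in Node.js after page.evaluate() has returned, which keeps the function passed to the browser small.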

I don't think this is related to Puppeteer, but rather to the way you "iterate" over your <table>.

In your attempt, you're simply dumping the text content of each entire row, which produces the result you're observing. Instead, for each <tr> you need to get all of its <td> (or <th>) elements:

const query = (selector, context) =>
  Array.from(context.querySelectorAll(selector));

console.log(
  query('tr', document).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
);

Sample HTML for the snippet above:

<table>
  <tr>
    <th>col 1</th>
    <th>col 2</th>
    <th>col 3</th>
  </tr>
  <tr>
    <td>a</td>
    <td>b</td>
    <td>c</td>
  </tr>
  <tr>
    <td>x</td>
    <td>y</td>
    <td>z</td>
  </tr>
</table>
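
On the warning quoted in the question: a likely cause (an assumption, since the full script is not shown) is that the edited attempt passes the result of document.querySelectorAll() as `context`. That result is a NodeList, which has no querySelectorAll method, so `context.querySelectorAll(selector)` throws a TypeError; the snippet above avoids this by passing `document`. When such an error is thrown inside an async function and nothing catches it, Node.js reports it as an unhandled promise rejection. The sketch below reproduces the pattern without a browser, using a plain array as a stand-in for the NodeList; `scrape` and `outcome` are hypothetical names.

```javascript
// Stand-in for the NodeList returned by document.querySelectorAll(): like a
// NodeList, a plain array has no querySelectorAll method.
const context = [];

const query = (selector, context) =>
  Array.from(context.querySelectorAll(selector));

// Hypothetical wrapper mirroring the question's async function.
async function scrape() {
  // Throws a TypeError because context.querySelectorAll is not a function;
  // inside an async function this becomes a rejected promise.
  return query('tr', context);
}

// Handling the rejection surfaces the real error instead of the generic warning.
const outcome = scrape()
  .then(rows => ({ ok: true, rows }))
  .catch(err => ({ ok: false, name: err.name, message: err.message }));

outcome.then(result => console.log(result));
```

In a real script, the same idea applies to the outer function: ending the async IIFE with `.catch(console.error)` makes the underlying error message visible.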
