javascript - Unable to parse page text, getting "ReferenceError: ReadableStream is not defined" - Stack Overfl

I am currently trying to create a util to parse annotations from a PDF. I can load the PDF file just fine, the annotation objects just fine, but I need to obtain the text that is related to those annotations (underlined, highlighted, etc.).

Things get hairy when I call the getTextContent() method, which fails. Below is the function where this happens:

/**
 * @param pdf The PDF document obtained upon `pdfjs.getDocument(pdf).promise` success.
 */
function getAllPages(pdf) {
  return new Promise((resolve, reject) => {
    const allPromises = [];
    // numPages is read from the loaded document
    for (let i = 0; i < pdf.numPages; i++) {
      const pageNumber = i + 1; // note: pages are 1-based
      const page = pdf.getPage(pageNumber)
        .then((pageContent) => {

          // testing with just one page to see what's up
          if (pageNumber === 1) {
            try {
              pageContent.getTextContent()
                .then((txt) => {
                  // THIS NEVER OCCURS
                  console.log('got text');
                })
                .catch((error) => {
                  // THIS IS WHERE THE ERROR SHOULD BE CAUGHT
                  console.error('in-promise error', error)
                });
            } catch (error) {
              // AT LEAST IT SHOULD BE CAUGHT HERE
              console.log('try/catch error:', error);
            }
          }
        })
        .catch(reject);

      allPromises.push(page);
    }
    Promise.all(allPromises)
      .then(() => {
        allPagesData.sort(sortByPageNumber);
        resolve(allPagesData);
      })
      .catch(reject);
  });
}
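For comparison, here is a minimal sketch of the same loop written without the outer `new Promise` wrapper, so that any failure in `getTextContent()` (thrown or rejected) surfaces as a single rejection. The name `getAllPagesText` and the `{ pageNumber, text }` result shape are hypothetical, chosen for illustration; the sketch only assumes that `pdf` exposes `numPages` and `getPage()`, and that the text content has an `items` array whose entries carry a `str` field.

```javascript
// Sketch only: `getAllPagesText` is a hypothetical name, not part of pdf.js.
// Assumes `pdf` exposes `numPages` and `getPage(pageNumber)` (1-based).
async function getAllPagesText(pdf) {
  const pagePromises = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    pagePromises.push(
      pdf.getPage(pageNumber).then(async (page) => {
        // Inside an async callback, a synchronous throw from
        // getTextContent() becomes a rejection of this promise.
        const textContent = await page.getTextContent();
        const text = textContent.items.map((item) => item.str).join(' ');
        return { pageNumber, text };
      })
    );
  }
  // Promise.all keeps the input order, so no sorting by page number is needed.
  return Promise.all(pagePromises);
}
```

A caller can then handle every failure in one place, for example `getAllPagesText(pdf).then(console.log).catch(console.error)`.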

When I call pageContent.getTextContent(), which should return a promise, the error "ReferenceError: ReadableStream is not defined" ends up in the catch block of the surrounding try/catch instead of in the promise's .catch() handler.

This is weird because I would have expected the pageContent.getTextContent().catch() to be able to, well, catch that. Also, I don't know what to do to resolve this.
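The behaviour described above is consistent with the method throwing synchronously, before it has returned a promise at all. That is an assumption about what happens inside the library, not something confirmed in this thread. The sketch below uses a hypothetical stand-in function, `brokenGetTextContent`, to show the difference: a synchronous throw never reaches a chained `.catch()`, but it does if the call is made inside a `.then()` callback.

```javascript
// Hypothetical stand-in for a method that fails before returning a promise.
function brokenGetTextContent() {
  // Thrown synchronously, as if a missing global had been referenced.
  throw new ReferenceError('ReadableStream is not defined');
}

// Direct call: the throw happens before `.catch()` can be attached,
// so only the surrounding try/catch sees it.
function callDirectly() {
  try {
    brokenGetTextContent().catch(() => 'promise .catch()');
    return 'no error';
  } catch (error) {
    return 'try/catch';
  }
}

// Call made inside a `.then()` callback: the throw is converted into a
// rejection, so the chained `.catch()` handles it.
function callInsideThen() {
  return Promise.resolve()
    .then(() => brokenGetTextContent())
    .then(() => 'no error')
    .catch(() => 'promise .catch()');
}

console.log(callDirectly()); // 'try/catch'
callInsideThen().then((handledBy) => console.log(handledBy)); // 'promise .catch()'
```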

Any help is appreciated.

Asked Jun 21, 2020 at 18:21 by jansensan; edited Jun 21, 2020 at 18:42.
  • You are using Mozilla's pdfjs, right? If so, are you using the pdfjs-dist/es5/build/pdf.js file? – Shihab Commented Jun 21, 2020 at 19:21
  • I am currently using const pdfjs = require('pdfjs-dist');, should I be requiring the one you mention instead? – jansensan Commented Jun 21, 2020 at 19:27
  • Yes, give that one a try. – Shihab Commented Jun 21, 2020 at 19:29
  • What version of Node.js are you using? – Quentin Commented Jun 21, 2020 at 20:09
  • I am currently using Node v.12.13.1 (not married to it, just what I had running currently). const pdfjs = require('pdfjs-dist/es5/build/pdf.js'); This did the trick indeed, thanks @Shihab, no error thrown, text content now available! – jansensan Commented Jun 21, 2020 at 21:46

3 Answers

I have noticed that requiring the default pdfjs-dist entry point causes the error.

Require pdfjs-dist/es5/build/pdf.js instead:

const pdfjs = require('pdfjs-dist/es5/build/pdf.js');

Update: in more recent versions of the package, use the legacy build path instead:

const pdfJs = require('pdfjs-dist/legacy/build/pdf')
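Since the working path differs between package versions, one option is to try the candidate paths in order. The following is only a sketch: `requireFirstAvailable` is a hypothetical helper, not part of pdfjs-dist, and the two paths in the usage comment are simply the ones mentioned in this thread.

```javascript
// Hypothetical helper: returns the first module path that can be loaded.
// `requireFn` is injectable so the lookup can be exercised without the package.
function requireFirstAvailable(candidates, requireFn = require) {
  const missing = [];
  for (const candidate of candidates) {
    try {
      return { path: candidate, module: requireFn(candidate) };
    } catch (error) {
      // Only swallow "this path does not exist" errors; rethrow anything else.
      const notFound =
        error.code === 'MODULE_NOT_FOUND' ||
        error.code === 'ERR_PACKAGE_PATH_NOT_EXPORTED';
      if (!notFound) throw error;
      missing.push(candidate);
    }
  }
  throw new Error(`None of these modules could be loaded: ${missing.join(', ')}`);
}

// Usage (requires pdfjs-dist to be installed):
// const { path, module: pdfjs } = requireFirstAvailable([
//   'pdfjs-dist/legacy/build/pdf',
//   'pdfjs-dist/es5/build/pdf.js',
// ]);
```

Pinning the package version and requiring a single known path is simpler; the fallback is mainly useful for a util that has to run against whichever version happens to be installed.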

After a more recent change to the package, the only path that worked for me was:

const pdfJs = require('pdfjs-dist/legacy/build/pdf')

I started a new project with pdfjs-dist and got the same ReadableStream error at getTextContent. I also have an older project that uses the same library and works fine. When I downgraded to an older version (2.0.943, to be precise), the error was gone. I don't really know why. Hope that helps.
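The error message itself suggests a likely reason: the build being loaded refers to a global `ReadableStream`, which the Node v12 runtime mentioned in the comments does not provide. A small sketch for checking the current runtime follows; `describeReadableStreamSupport` is a hypothetical helper written for this illustration, and the advice strings only restate the workarounds from the answers above.

```javascript
// Hypothetical helper: reports whether a global ReadableStream is available.
// `scope` defaults to the real global object and is injectable for testing.
function describeReadableStreamSupport(scope = globalThis) {
  const hasGlobal = typeof scope.ReadableStream === 'function';
  return {
    hasGlobal,
    advice: hasGlobal
      ? 'ReadableStream is defined; the default build may work here.'
      : 'ReadableStream is missing; try the es5/legacy build path or an older package version.',
  };
}

console.log(process.version, describeReadableStreamSupport());
```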
