I am currently trying to create a util to parse annotations from a PDF. I can load the PDF file just fine, the annotation objects just fine, but I need to obtain the text that is related to those annotations (underlined, highlighted, etc.).
This gets hairy when I try to use the getTextContent() method, which fails. Below is the method where this happens:
/**
 * @param pdf The PDF document obtained upon `pdfjs.getDocument(pdf).promise` success.
 */
function getAllPages(pdf) {
  return new Promise((resolve, reject) => {
    let allPromises = [];
    for (let i = 0; i < numPages; i++) {
      const pageNumber = i + 1; // note: pages are 1-based
      const page = pdf.getPage(pageNumber)
        .then((pageContent) => {
          // testing with just one page to see what's up
          if (pageNumber === 1) {
            try {
              pageContent.getTextContent()
                .then((txt) => {
                  // THIS NEVER OCCURS
                  console.log('got text');
                })
                .catch((error) => {
                  // THIS IS WHERE THE ERROR SHOULD BE CAUGHT
                  console.error('in-promise error', error)
                });
            } catch (error) {
              // AT LEAST IT SHOULD BE CAUGHT HERE
              console.log('try/catch error:', error);
            }
          }
        })
        .catch(reject);
      allPromises.push(page);
    }
    Promise.all(allPromises)
      .then(() => {
        allPagesData.sort(sortByPageNumber);
        resolve(allPagesData);
      })
      .catch(reject);
  });
}
When calling pageContent.getTextContent(), which should return a promise, the error "ReferenceError: ReadableStream is not defined" is thrown in the catch() part of the try.
This is weird because I would have expected the pageContent.getTextContent().catch() to be able to, well, catch that. Also, I don't know what to do to resolve this.
Any help is appreciated.
Asked Jun 21, 2020 at 18:21 by jansensan (edited Jun 21, 2020 at 18:42)
Comments
- You are using Mozilla's pdfjs, right? If yes, then are you using the pdfjs-dist/es5/build/pdf.js file? – Shihab, Jun 21, 2020 at 19:21
- I am currently using const pdfjs = require('pdfjs-dist'); should I be requiring the one you mention instead? – jansensan, Jun 21, 2020 at 19:27
- Yes, give that one a try. – Shihab, Jun 21, 2020 at 19:29
- What version of Node.js are you using? – Quentin, Jun 21, 2020 at 20:09
- I am currently using Node v12.13.1 (not married to it, just what I had running currently). const pdfjs = require('pdfjs-dist/es5/build/pdf.js'); This did the trick indeed, thanks @Shihab, no error thrown, text content now available! – jansensan, Jun 21, 2020 at 21:46
3 Answers
I have noticed that using pdfjs-dist causes the error. Use pdfjs-dist/es5/build/pdf.js instead.
const pdfjs = require('pdfjs-dist/es5/build/pdf.js');
Update: the path that works now is the legacy build:
const pdfJs = require('pdfjs-dist/legacy/build/pdf')
The package changed again; the only path that worked for me was this one:
const pdfJs = require('pdfjs-dist/legacy/build/pdf')
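Since the working path seems to depend on which pdfjs-dist version is installed, one option is to try the paths from the answers above in order. This is only a rough sketch: requireFirstAvailable and PDFJS_CANDIDATES are hypothetical names made up here, not part of pdfjs-dist, and the sketch assumes a missing build path surfaces as a MODULE_NOT_FOUND (or ERR_PACKAGE_PATH_NOT_EXPORTED) error from require.

```javascript
// Hypothetical helper (not part of pdfjs-dist): try each module path in order
// and return the first one that loads. `requireFn` is injectable so the helper
// can be exercised without pdfjs-dist being installed.
function requireFirstAvailable(candidates, requireFn = require) {
  const skippable = ['MODULE_NOT_FOUND', 'ERR_PACKAGE_PATH_NOT_EXPORTED'];
  const failures = [];
  for (const id of candidates) {
    try {
      return { id, module: requireFn(id) };
    } catch (error) {
      // Assumption: a path that does not exist in the installed version fails
      // with one of the codes above. Any other error is rethrown untouched.
      if (error && skippable.includes(error.code)) {
        failures.push(id);
        continue;
      }
      throw error;
    }
  }
  throw new Error(`None of these modules could be loaded: ${failures.join(', ')}`);
}

// Paths taken from the answers in this thread, newest first.
const PDFJS_CANDIDATES = [
  'pdfjs-dist/legacy/build/pdf',
  'pdfjs-dist/es5/build/pdf.js',
];

// Usage (requires pdfjs-dist to be installed):
// const { id, module: pdfjs } = requireFirstAvailable(PDFJS_CANDIDATES);
// console.log('loaded pdf.js from', id);
```

One caveat: a MODULE_NOT_FOUND raised by a missing dependency inside a build would also be skipped, so logging the paths that were skipped may be worth it while debugging.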
I started a new project with pdfjs-dist and got the same ReadableStream error at getTextContent. I also have an older project with the same lib that works fine. When I downgraded to an older version (2.0.943 to be precise), the error was gone. I don't really know why. Hope that helps.
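As for why the .catch() in the question never fired: the behaviour described is what you would see if getTextContent() throws synchronously, before it has returned a promise, so there is nothing yet for .catch() to attach to and only the surrounding try/catch sees the error. I have not traced the pdf.js source to confirm this, so treat it as a likely explanation. The sketch below reproduces the pattern with a hypothetical stand-in function (brokenAsyncCall, toPromise and MissingGlobalForDemo are made-up names) rather than pdf.js itself.

```javascript
// Hypothetical stand-in for an API that is meant to return a promise
// but fails synchronously before it gets that far.
function brokenAsyncCall() {
  // Referencing an undefined global throws a ReferenceError right here,
  // before any promise has been created.
  const stream = new MissingGlobalForDemo();
  return Promise.resolve(stream);
}

// Same shape as the code in the question: the chained .catch() is never
// reached, because the call throws before a promise exists.
const caughtBy = [];
try {
  brokenAsyncCall().catch(() => caughtBy.push('.catch()'));
} catch (error) {
  caughtBy.push('try/catch');
}
console.log(caughtBy); // logs [ 'try/catch' ]

// Running the call inside a Promise executor turns the synchronous throw
// into a rejection, so a single .catch() handles both kinds of failure.
function toPromise(fn) {
  return new Promise((resolve) => resolve(fn()));
}

toPromise(brokenAsyncCall).catch((error) => {
  console.log('in-promise error:', error.name); // logs "in-promise error: ReferenceError"
});
```

Wrapping the call this way (or calling it from inside an async function) is handy when a library may fail before its promise is created.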