Extract ALL text content from the PDF. #98

Open
lisenkaci opened this issue Jan 12, 2024 · 1 comment

Comments

@lisenkaci

I need to extract all the text content from a PDF as soon as it's loaded. I can't find the text in the onDocumentLoad props, and with renderPage, renderPageProps.textLayerRendered only gives me the text content of the page currently being scrolled. I need ALL the text in the PDF, for every page, as soon as it is available. Thank you.

@lisenkaci
Author

Going to answer my own question. I could not figure out how to do it using the library, so I implemented the following function using pdfjsLib:

function extractText(pdfUrl) {
  // getDocument returns a loading task; its `promise` resolves to the loaded document.
  var loadingTask = pdfjsLib.getDocument(pdfUrl);
  return loadingTask.promise.then(function (pdf) {
    var pagePromises = [];
    // PDF.js page numbers are 1-based.
    for (var pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
      pagePromises.push(
        pdf.getPage(pageNumber).then(function (page) {
          return page.getTextContent().then(function (textContent) {
            // Each item holds one run of text in its `str` field.
            return textContent.items
              .map(function (item) {
                return item.str;
              })
              .join("");
          });
        })
      );
    }

    // Promise.all keeps the results in page order, even if pages resolve out of order.
    return Promise.all(pagePromises).then(function (pageTexts) {
      return pageTexts.join("");
    });
  });
}

I call it in the onDocumentLoad handler. This works, and I now get all the text content from the PDF.
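
For anyone who wants word and page boundaries preserved, here is a minimal async/await sketch of the same idea. The function above joins everything with an empty string, which can glue adjacent words or pages together. This variant takes an already-loaded document object (anything with numPages and getPage, like the object that pdfjsLib.getDocument(...).promise resolves to) instead of a URL. The name extractAllText and the itemSeparator / pageSeparator options are my own assumptions for illustration, not part of pdfjsLib or the viewer library.

```javascript
// Sketch: collect the text of every page from an already-loaded PDF.js document.
// `pdfDocument` is assumed to expose `numPages` and `getPage(n)`, and each page
// is assumed to expose `getTextContent()`, as PDF.js document/page objects do.
// `itemSeparator` and `pageSeparator` are hypothetical options, not PDF.js API.
async function extractAllText(pdfDocument, options = {}) {
  const { itemSeparator = " ", pageSeparator = "\n" } = options;

  const pagePromises = [];
  // PDF.js page numbers are 1-based.
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    pagePromises.push(
      pdfDocument.getPage(pageNumber).then(async (page) => {
        const textContent = await page.getTextContent();
        return textContent.items
          // Skip entries that carry no `str` (for example marked-content markers).
          .filter((item) => typeof item.str === "string")
          .map((item) => item.str)
          .join(itemSeparator);
      })
    );
  }

  // Promise.all keeps results in page order regardless of resolution order.
  const pageTexts = await Promise.all(pagePromises);
  return pageTexts.join(pageSeparator);
}
```

With a URL, it could be called as pdfjsLib.getDocument(pdfUrl).promise.then(extractAllText). If the viewer's onDocumentLoad event exposes the loaded PDF.js document, that object could probably be passed in directly, which would avoid loading the file a second time; I have not verified that against the library, so treat it as an assumption.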
