Extroactor pdf 2 image #11909

ic-xu · 2024-12-20T10:40:12Z

Summary

Many users chat with a large model by uploading PDFs. The current approach extracts the text content from the PDF and passes it to the model. We found that this can lose important information in the PDF, such as layout, tables, and the relationships between elements. To give the model as much of that information as possible, I added a feature that converts PDFs to images for input.
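
For readers who want a feel for the PDF-to-image step, here is a minimal sketch. It is not the code in this PR: it assumes the third-party pypdfium2 package (with Pillow) as the renderer, and the function names `render_pdf_pages` and `to_data_url` are hypothetical. It renders each page to PNG bytes and wraps them as base64 data URLs, a form many multimodal model APIs accept for image input.

```python
import base64
import io
from typing import Iterator


def render_pdf_pages(pdf_path: str, scale: float = 2.0) -> Iterator[bytes]:
    """Yield each page of the PDF as PNG bytes (hypothetical helper)."""
    # Imported lazily so to_data_url() works without the renderer installed.
    # pypdfium2 is an assumption here; bitmap.to_pil() also requires Pillow.
    import pypdfium2

    pdf = pypdfium2.PdfDocument(pdf_path)
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            # scale=1.0 corresponds to 72 DPI, so 2.0 renders at 144 DPI.
            image = page.render(scale=scale).to_pil()
            buffer = io.BytesIO()
            image.save(buffer, format="PNG")
            yield buffer.getvalue()
    finally:
        pdf.close()


def to_data_url(png_bytes: bytes) -> str:
    """Encode PNG bytes as a data URL suitable for an image input field."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"
```

A caller would then do something like `urls = [to_data_url(p) for p in render_pdf_pages("report.pdf")]` and attach the URLs as image inputs. A higher `scale` keeps small table text legible at the cost of larger payloads.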

Checklist

Important

Please review the checklist below before submitting your pull request.

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

laipz8200 · 2024-12-27T17:25:11Z

Adding a special case for PDF-to-image conversion in the Document Extractor might confuse users. I think this feature should be implemented as a Tool instead.

Additionally, models that excel at image recognition, such as Gemini and Claude, also support direct PDF input.

ic-xu added 3 commits December 20, 2024 18:26

feat: add extractor image support in DocumentExtractorNode (3ca1292)

feat: add extractor image support in DocumentExtractorNode (9a8f1af)

feat: output format (f9cc275)

dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files), 🌊 feat:workflow (Workflow related stuff), and 💪 enhancement (New feature or request) labels Dec 20, 2024

ic-xu added 3 commits December 20, 2024 19:12

feat: output format (e4e0c59)

feat: reformat (c0a8a11)

feat: reformat (4846f94)

hjlarry reviewed Dec 20, 2024

api/libs/helper.py (outdated, resolved)

ic-xu and others added 9 commits December 20, 2024 19:24

Merge remote-tracking branch 'main/main' into extroactor_pdf_2_image (5e4d599)

feat: reformat (43bda33)

feat: reformat (5b5bab3)

feat: reformat (d0765e8)

feat: reformat (9da9f24)

feat: remove bugs (ff8aa40)

Merge branch 'main' into extroactor_pdf_2_image (51b3afb)

fix: styles (2d05ddb)

fix: lint (6b863eb)

crazywoola previously approved these changes Dec 27, 2024

dosubot bot added the lgtm label (This PR has been approved by a maintainer) Dec 27, 2024

fix：: Missing return statement [return] (714646f)

ic-xu dismissed crazywoola’s stale review via 714646f December 27, 2024 13:18
