pprados
Contributor

@pprados pprados commented Jan 7, 2025

  • Adds BlobParsers for images. These implementations can take an image and produce one or more documents per image. This interface can be used for exposing OCR capabilities.
  • Updates PyMuPDFParser and Loader to standardize metadata, handle images, improve table extraction, etc.
  • Twitter handle: pprados

This is one part of a larger pull request (PR) that is too large to be submitted all at once.
This part focuses on preparing the update of all parsers.

For more details, see PR 28970.


@pprados
Contributor Author

pprados commented Jan 7, 2025

@eyurtsev I rebased the code onto master ;-)

Collaborator

@eyurtsev eyurtsev left a comment

Great, will take a look in the AM.

@pprados pprados mentioned this pull request Jan 8, 2025
2 tasks
Collaborator

@eyurtsev eyurtsev left a comment

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

  1. Define the standardized structure of metadata
  2. Create a dedicated ImageParser which is a blob parser
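
To make the second point concrete, here is a minimal sketch of what a dedicated image blob parser could look like. It is not the code from this PR: `Blob`, `Document`, `BaseImageBlobParser`, and `ByteCountImageParser` are hypothetical stand-ins written with the standard library only, and the "analysis" step just reports the byte count where a real backend would run OCR or a similar model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Blob:
    """Stand-in for a raw binary payload (hypothetical, not LangChain's Blob)."""

    data: bytes
    source: str = ""


@dataclass
class Document:
    """Stand-in for a parsed document (hypothetical, not LangChain's Document)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseImageBlobParser:
    """Blob parser for images: one image blob in, one or more documents out."""

    def _analyze_image(self, data: bytes) -> str:
        # Subclasses plug in the actual image-to-text step (OCR, captioning, ...).
        raise NotImplementedError

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Yield documents lazily so callers can stream results.
        text = self._analyze_image(blob.data)
        yield Document(page_content=text, metadata={"source": blob.source})

    def parse(self, blob: Blob) -> List[Document]:
        # Eager convenience wrapper around lazy_parse.
        return list(self.lazy_parse(blob))


class ByteCountImageParser(BaseImageBlobParser):
    """Toy backend: reports the image size instead of running real OCR."""

    def _analyze_image(self, data: bytes) -> str:
        return f"<image of {len(data)} bytes>"
```

Keeping the image-to-text step in one overridable method means a PDF parser can accept any such parser as a parameter without knowing which OCR backend sits behind it.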

@pprados
Contributor Author

pprados commented Jan 17, 2025

yum is deprecated and has been replaced by dnf.
However, doc/Makefile still uses yum.
I cannot install yum on Ubuntu,
so it is difficult for me to fix a bug in the documentation.

@dosubot dosubot bot added the lgtm label Jan 20, 2025
@eyurtsev eyurtsev changed the title from "Refactoring PDF loaders: 02 PyMuPDF" to "community[minor]: Refactoring PDF loaders: 02 PyMuPDF" Jan 20, 2025
@eyurtsev eyurtsev changed the title from "community[minor]: Refactoring PDF loaders: 02 PyMuPDF" to "community[minor]: Refactoring PyMuPDF parser, loader and add image blob parsers" Jan 20, 2025
@eyurtsev eyurtsev merged commit 4efc509 into langchain-ai:master Jan 20, 2025
21 checks passed
v = str(v)
if k.startswith("/"):
    k = k[1:]
k = k.lower()
Collaborator

Did this break user workflows? #29470
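
For context on why the quoted lines might affect existing users, here is a minimal sketch of the key normalization they perform: a leading "/" is stripped and the key is lowercased. The helper name `normalize_metadata_keys` and the sample keys are hypothetical, the value handling is left out, and whether this is the actual cause of #29470 is not established here.

```python
from typing import Any, Dict


def normalize_metadata_keys(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Strip a leading "/" from each key and lowercase it (hypothetical helper)."""
    normalized: Dict[str, Any] = {}
    for k, v in metadata.items():
        if k.startswith("/"):
            k = k[1:]
        k = k.lower()
        normalized[k] = v
    return normalized


# Sample metadata with mixed-case and slash-prefixed keys (made-up values).
raw = {"/Author": "Jane", "CreationDate": "2025-01-07"}
clean = normalize_metadata_keys(raw)
```

Under this assumption, code that previously read `metadata["CreationDate"]` would now need `metadata["creationdate"]`, which is the kind of change that can break downstream workflows.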
