unstructured[minor]: 08 - Refactoring 17 unstructured loaders #17
Thanks @pprados! Unsure when I or someone else will get to this, but wanted to let you know we're aware.
@Coniferish I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look?
Hey @pprados, I no longer work at unstructured, so I'm not sure I can help out.
Hey @Coniferish
I'm unsure, sorry
Hey @efriis I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look? langchain-ai/langchain#29709 Issue langchain-ai/langchain#30454
@ccurme, can you help out this contributor? I'm no longer at unstructured and am unsure if I'm able to continue working on this.
@baskaryan can you review this PR or assign it to someone?
@badGarnet, What do you think?
@ccurme
@ccurme can you approve the workflow?
In this PR, we propose a migration of the various `Unstructured*Loader` implementations to the `langchain-unstructured` package.

Improvements
We’ve made several key improvements:
- [...] `langchain-community`)
- [...] `langchain-community` (see `test_migration.py`)
- `Loader` is split into a `Loader`/`Parser` to allow usage with `GenericLoader`
- [...] `#` prefixes) and tables in either Markdown or HTML format. It’s possible to revert to the original behavior by changing a few parameters.
- [...] `keep_header_footer=False`)
- `Path` objects or strings
- `web_url`
- `IO` object
- [...] `auto`, `fast`, `hi_res`, and `ocr_only`)
- [...] `lazy_load()`
- `UnstructuredLoader` additionally supports a list of PATHs in `file_path`. While we don’t consider this very clean (why only this loader? Why no plural? The user could just loop), we replicate the behavior from `langchain-community`.
- `langchain-unstructured` dependencies offer the same extras as `unstructured` (csv, pdf, docx, etc.). This allows specifying a dependency on `langchain-unstructured` limited to certain file types (`langchain-unstructured[pdf]`). The previous behavior pulled in all possible formats, resulting in a package too large for environments like AWS Lambda.

With this PR, it will be possible to mark 17 loaders as "deprecated". Five dependencies on `unstructured` will remain in `langchain-community`.

Once this version is released, we plan to propose a PR to `langchain-community` to mark all `Unstructured*Loader` classes as `@deprecated`. Any changes to default parameter values will be explained in the comments.

Other dependencies on Unstructured in `langchain-community`
These are not part of unstructured:
- [...] `Unstructured`, even though `UnstructuredCHMLoader` exists. The `langchain-community` version doesn’t work with the files we tested. We are leaving this loader as-is.
- `UnstructuredLakeFSLoader`
- `SeleniumURLLoader`
- `S3FileLoader`. Use `GenericLoader` + `CloudBlobLoader`
- `UnstructuredHtmlEvaluator`
pyproject.toml

`unstructured` is a framework that can pull in a large number of dependencies, depending on the file formats it needs to process. The framework offers various extras to include only the strictly necessary dependencies, for example: `unstructured[pdf,csv]`.

`langchain-unstructured` does not currently work this way. It pulls in all dependencies from `unstructured`, resulting in very large projects that are incompatible with environments that have size limitations, such as AWS Lambda.

The change to `pyproject.toml` replicates the different extras provided by `unstructured` and propagates them into `langchain-unstructured`.

PDF
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the `UnstructuredPDFParser` and `UnstructuredPDFLoader`.

For more details, see here
Note
I will not split this PR into multiple smaller PRs, each covering a single loader. That approach would take too much time for zero benefit (I’ve had some bad experiences with it). Either this PR works for you, and I’ll make the requested changes, or you can close it and ignore it. It will then be up to another contributor to migrate the various `Unstructured*Loader` to this project.