@pprados pprados commented Feb 27, 2025

In this PR, we propose a migration of the various Unstructured*Loader implementations to the langchain-unstructured package.

Improvements

We’ve made several key improvements:

  • Each loader is tested with all supported file types (unlike langchain-community)
  • The result is compared to the same execution using the original version from langchain-community (see test_migration.py)
  • Each Loader is split into a Loader/Parser to allow usage with GenericLoader
  • The output format is Markdown-oriented rather than text stream. This includes support for headings (with # prefixes) and tables in either Markdown or HTML format. It’s possible to revert to the original behavior by changing a few parameters.
  • Headers and footers can be excluded (keep_header_footer=False)
  • Loaders accept:
    • Path objects or strings
    • web_url
    • IO object
  • PDF processing complies with the specifications shared across other PDF parsers, and supports the four strategies (auto, fast, hi_res, and ocr_only)
  • Still compatible with lazy_load()
  • UnstructuredLoader additionally supports a list of PATHs in file_path. While we don’t consider this very clean (why only this loader? Why no plural? The user could just loop), we replicate the behavior from langchain-community.
  • Nearly 300 tests validate this PR
  • The langchain-unstructured dependencies offer the same extras as unstructured (csv, pdf, docx, etc.). This allows specifying a dependency on langchain-unstructured limited to certain file types (langchain-unstructured[pdf]). The previous behavior pulled in all possible formats, resulting in a package too large for environments like AWS Lambda.
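
To make the Loader/Parser split, the Markdown-oriented output, and the keep_header_footer switch more concrete, here is a minimal stdlib-only sketch. It is not the code from this PR and does not use the langchain-unstructured API: `Element`, `Document`, `SketchParser`, `SketchLoader`, and `toy_partition` are hypothetical stand-ins, and the partitioning rule is a toy assumption. Only the parameter name keep_header_footer and the method name lazy_load() come from the description above.

```python
import io
from dataclasses import dataclass
from pathlib import Path
from typing import IO, Iterator, List, Union


@dataclass
class Element:
    """Hypothetical stand-in for one partitioned element."""
    category: str  # "Title", "NarrativeText", "Header" or "Footer"
    text: str


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict


def toy_partition(text: str) -> List[Element]:
    """Toy rule (assumption): 'Page N' lines are footers, ALL-CAPS lines are titles."""
    elements = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Page "):
            category = "Footer"
        elif line.isupper():
            category = "Title"
        else:
            category = "NarrativeText"
        elements.append(Element(category, line))
    return elements


class SketchParser:
    """Parser half: turns raw text into one Markdown-oriented document."""

    def __init__(self, keep_header_footer: bool = True) -> None:
        self.keep_header_footer = keep_header_footer

    def lazy_parse(self, text: str, source: str) -> Iterator[Document]:
        lines = []
        for el in toy_partition(text):
            if el.category in ("Header", "Footer") and not self.keep_header_footer:
                continue  # drop page furniture when requested
            # Titles become Markdown headings; everything else stays plain text.
            lines.append(f"# {el.text}" if el.category == "Title" else el.text)
        yield Document("\n\n".join(lines), {"source": source})


class SketchLoader:
    """Loader half: resolves the input (path or IO object) and delegates to the parser."""

    def __init__(self, file: Union[str, Path, IO[str]], **parser_kwargs) -> None:
        self.file = file
        self.parser = SketchParser(**parser_kwargs)

    def lazy_load(self) -> Iterator[Document]:
        if isinstance(self.file, (str, Path)):
            text = Path(self.file).read_text(encoding="utf-8")
            source = str(self.file)
        else:
            text = self.file.read()
            source = getattr(self.file, "name", "<stream>")
        yield from self.parser.lazy_parse(text, source)


sample = io.StringIO("QUARTERLY REPORT\nRevenue grew.\nPage 1\n")
docs = list(SketchLoader(sample, keep_header_footer=False).lazy_load())
print(docs[0].page_content)
```

Because the parser holds all format-specific logic, the same parser object could in principle be reused by any component that supplies the raw content, which is the property that makes a parser usable with GenericLoader.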

With this PR, it will be possible to mark 17 loaders as deprecated. Five dependencies on unstructured will remain in langchain-community.

Once this version is released, we plan to propose a PR to langchain-community to mark all Unstructured*Loader as @deprecated. Any changes to default parameter values will be explained in the comments.

Other dependencies on Unstructured in langchain-community

These are not part of unstructured:

  • CHM is not a format directly supported by Unstructured, even though UnstructuredCHMLoader exists. The langchain-community version doesn’t work with the files we tested. We are leaving this loader as-is.
  • UnstructuredLakeFSLoader
  • SeleniumURLLoader
  • S3FileLoader. Use GenericLoader + CloudBlobLoader
  • UnstructuredHtmlEvaluator

pyproject.toml

unstructured is a framework that can pull in a large number of dependencies, depending on the file formats it needs to process. The framework offers various extras to include only the strictly necessary dependencies, for example: unstructured[pdf,csv].

langchain-unstructured does not currently work this way. It pulls in all dependencies from unstructured, resulting in very large projects that are incompatible with environments that have size limitations, such as AWS Lambda.

The change to pyproject.toml replicates the different extras provided by unstructured and propagates them into langchain-unstructured.
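
As a rough illustration of what "propagating" extras means, the sketch below maps each extra name to a requirement on the same extra of unstructured and prints the result as a TOML table. It assumes a PEP 621-style `[project.optional-dependencies]` layout; the actual pyproject.toml in this PR may use a different build backend and syntax. The helper names `propagate_extras` and `to_toml` are hypothetical, and the extra names are the ones mentioned above (csv, pdf, docx).

```python
from typing import Dict, Iterable, List


def propagate_extras(extras: Iterable[str], upstream: str = "unstructured") -> Dict[str, List[str]]:
    """Map each extra name to a requirement on the same extra of the upstream package."""
    return {name: [f"{upstream}[{name}]"] for name in extras}


def to_toml(table: Dict[str, List[str]]) -> str:
    """Render the mapping as a PEP 621-style optional-dependencies table (assumed layout)."""
    lines = ["[project.optional-dependencies]"]
    for name in sorted(table):
        deps = ", ".join(f'"{dep}"' for dep in table[name])
        lines.append(f"{name} = [{deps}]")
    return "\n".join(lines)


extras_table = propagate_extras(["csv", "pdf", "docx"])
print(to_toml(extras_table))
```

With such a table in place, installing langchain-unstructured[pdf] would pull in only the PDF-related dependencies of unstructured rather than every supported format.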

PDF

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses on updating the UnstructuredPDFParser and UnstructuredPDFLoader.

For more details, see here

Note

I will not split this PR into multiple smaller PRs, each covering a single loader. That approach would take too much time for zero benefit (I’ve had some bad experiences with it). Either this PR works for you, and I’ll make the requested changes, or you can close it and ignore it. It will then be up to another contributor to migrate the various Unstructured*Loader to this project.

@Coniferish
Contributor

Thanks @pprados! Unsure when I or someone else will get to this, but wanted to let you know we're aware.

@pprados
Author

pprados commented Apr 15, 2025

@Coniferish
Could you please review this PR?

I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look?

@pprados pprados marked this pull request as ready for review April 15, 2025 11:06
@Coniferish
Contributor

Hey @pprados, I no longer work at unstructured, so I'm not sure I can help out.

@pprados
Author

pprados commented Apr 15, 2025

Hey @Coniferish
Can you suggest someone else who could review it?

@Coniferish
Contributor

I'm unsure, sorry

@pprados
Author

pprados commented Apr 15, 2025

Hey @efriis
Could you please review this PR?

I'm also having trouble getting 2 other PRs validated, which are currently blocked. I don't understand why. Can you take a look?

langchain-ai/langchain#29709 Issue langchain-ai/langchain#30454
langchain-ai/langchain#30094 Issue langchain-ai/langchain#30455

@Coniferish
Contributor

Coniferish commented Apr 15, 2025

@ccurme, can you help out this contributor? I'm no longer at unstructured and am unsure if I'm able to continue working on this.

@pprados pprados changed the title from "unstructured[minor]: 08 - Refactoring UnstructuredPDF" to "unstructured[minor]: 08 - Refactoring 17 unstructured loaders" Apr 17, 2025
@pprados
Author

pprados commented Apr 23, 2025

@baskaryan can you review this PR or assign it to someone?

@pprados
Author

pprados commented Apr 29, 2025

@badGarnet, what do you think?

@pprados
Author

pprados commented Apr 29, 2025

@ccurme
Now that you've moved langchain-community, it may be time to migrate unstructured as well.
Then it will be possible to clean up and deprecate all Unstructured*Loaders.

@pprados
Author

pprados commented May 14, 2025

@ccurme can you approve the workflow?

2 participants