Release v0.5 #532

Closed
wants to merge 6 commits into from

danmcp (Member) commented Feb 1, 2025

Avoids the more complicated changes that would otherwise be needed to stop using HF_TOKEN.

nathan-weinberg and others added 6 commits November 15, 2024 16:00
Signed-off-by: Nathan Weinberg <[email protected]>
(cherry picked from commit 9327327)
…e-v0.5/pr-380

ci: add large-size E2E CI job (backport instructlab#380)
When setting up our ingestion pipeline, explicitly check if tesserocr
is available and Docling can load it. If so, prefer that. Otherwise,
attempt the same for EasyOCR. If neither can load, log an error and
disable optical character recognition.

Fixes instructlab#352

Signed-off-by: Ben Browning <[email protected]>
(cherry picked from commit ba00454)
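
The fallback order described in this commit message (tesserocr first, then EasyOCR, otherwise no OCR) can be sketched roughly as below. This is an illustration only, not the code from the commit: `pick_ocr_backend`, `OCR_CANDIDATES`, and the `importer` parameter are hypothetical names, and the sketch only checks that each Python module imports, whereas the real change also verifies that Docling can load the engine.

```python
import importlib
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)

# Preference order from the commit message: tesserocr first, then EasyOCR.
OCR_CANDIDATES = ("tesserocr", "easyocr")


def pick_ocr_backend(
    candidates: Sequence[str] = OCR_CANDIDATES,
    importer: Callable[[str], object] = importlib.import_module,
) -> Optional[str]:
    """Return the first candidate module that imports cleanly, or None.

    Hypothetical helper: `importer` is injectable so the selection logic
    can be exercised without either OCR package installed.
    """
    for name in candidates:
        try:
            importer(name)
        except Exception as exc:  # ImportError, or a native library failing to load
            logger.debug("OCR backend %s is unavailable: %s", name, exc)
            continue
        return name
    # Neither backend could be loaded: report it and signal that OCR is disabled.
    logger.error("No OCR backend could be loaded; disabling OCR.")
    return None
```

A caller would treat a `None` result as "run the pipeline with OCR turned off" and otherwise configure the pipeline for the returned backend.
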
This borrows and adapts the `leanimports.py` script and test from the
InstructLab CLI repository to ensure within SDG we're not prematurely
loading the entirety of Torch into memory.

The CLI repo noticed we were doing this. Because this PR would have
made the problem worse by attempting to load the tesseract and easyocr
modules even earlier, this felt like the right time to address it. The
overall imports are unchanged, but we now import specific docling
pieces only when we are actually going to run chunking, instead of
triggering the whole PyTorch import chain as soon as someone imports
SDG.

Signed-off-by: Ben Browning <[email protected]>
(cherry picked from commit 791fc7f)
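
One way to check for premature imports of this kind is to import the package in a fresh interpreter and inspect `sys.modules` afterwards. The sketch below is not the `leanimports.py` script itself; `heavy_modules_loaded_by` is a hypothetical helper, and the module names passed to it are up to the caller.

```python
import subprocess
import sys
from typing import List, Sequence


def heavy_modules_loaded_by(module: str, forbidden: Sequence[str]) -> List[str]:
    """Import `module` in a fresh interpreter and list the forbidden modules it loaded.

    Hypothetical helper: a subprocess is used so that modules already
    imported by the current process do not affect the result.
    """
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"print('\\n'.join(m for m in {list(forbidden)!r} if m in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    # One module name per output line; empty output means nothing forbidden was loaded.
    return result.stdout.split()
```

A test could then assert that the list is empty for the package under test and a forbidden set such as `["torch"]`. The fix described in the commit message is the complementary half: the heavy imports move from module level into the functions that actually need them.
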
…e-v0.5/pr-369

Prefer tesserocr over easyocr, if available (backport instructlab#369)
danmcp closed this Feb 1, 2025
mergify bot added the CI/CD, documentation, testing, dependencies, and ci-failure labels Feb 1, 2025

mergify bot (Contributor) commented Feb 1, 2025

⚠️ The sha of the head commit of this PR conflicts with #534. Mergify cannot evaluate rules on this PR. ⚠️

Labels: CI/CD, ci-failure, dependencies, documentation, testing
3 participants