Only use CPU for the docling OCR models
Because GPU memory is extremely tight in many of our supported
hardware configurations, and because our GitHub Mac CI runners error
out when running the OCR models with MPS acceleration, let's just
explicitly pin the OCR models to the CPU.

See DS4SD/docling#286 for a bit more context.

Signed-off-by: Ben Browning <[email protected]>
bbrowning committed Nov 10, 2024
1 parent e0698d6 commit 848d9c8
Showing 1 changed file with 2 additions and 0 deletions.
src/instructlab/sdg/utils/chunkers.py
@@ -213,6 +213,8 @@ def chunk_documents(self) -> List:
 
         model_artifacts_path = StandardPdfPipeline.download_models_hf()
         pipeline_options = PdfPipelineOptions(artifacts_path=model_artifacts_path)
+        # Keep OCR models on the CPU instead of GPU
+        pipeline_options.ocr_options.use_gpu = False
         converter = DocumentConverter(
             format_options={
                 InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
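
For context, below is a minimal, self-contained sketch of how the pinned-to-CPU pipeline options fit into a full docling conversion. It is not part of the commit: the import paths, the ocr_options.use_gpu flag, and the convert() call reflect the docling 2.x API as of around this change, and the sample.pdf input path is purely illustrative.

# Minimal sketch (not from the commit): a standalone docling converter with the
# OCR models pinned to the CPU. Import paths assume the docling 2.x API from
# around the time of this change.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Download the layout/table/OCR model weights from Hugging Face and point the
# PDF pipeline at them.
model_artifacts_path = StandardPdfPipeline.download_models_hf()
pipeline_options = PdfPipelineOptions(artifacts_path=model_artifacts_path)

# The change from this commit: keep OCR on the CPU instead of GPU/MPS.
pipeline_options.ocr_options.use_gpu = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# "sample.pdf" is a hypothetical input used here only for illustration.
result = converter.convert("sample.pdf")
print(result.document.export_to_markdown())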
