Only use CPU for the docling OCR models
Because GPU memory is extremely tight in many of our supported
hardware configurations, and because our GitHub Mac CI runners error
out when running the OCR models with MPS acceleration, let's just
explicitly pin the OCR models to the CPU.

See DS4SD/docling#286 for a bit more context.

Signed-off-by: Ben Browning <[email protected]>
bbrowning committed Nov 10, 2024
1 parent e0698d6 commit 848d9c8
Showing 1 changed file with 2 additions and 0 deletions.
src/instructlab/sdg/utils/chunkers.py
@@ -213,6 +213,8 @@ def chunk_documents(self) -> List:
 
         model_artifacts_path = StandardPdfPipeline.download_models_hf()
         pipeline_options = PdfPipelineOptions(artifacts_path=model_artifacts_path)
+        # Keep OCR models on the CPU instead of GPU
+        pipeline_options.ocr_options.use_gpu = False
         converter = DocumentConverter(
             format_options={
                 InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
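
For context, below is a minimal, self-contained sketch of how the pinned-to-CPU pipeline options fit into a full docling conversion. It is not part of the commit: the import paths, the ocr_options.use_gpu flag, and the convert() call reflect the docling 2.x API as of around this change, and the sample.pdf input path is purely illustrative.

# Minimal sketch (not from the commit): a standalone docling converter with the
# OCR models pinned to the CPU. Import paths assume the docling 2.x API from
# around the time of this change.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Download the layout/table/OCR model weights from Hugging Face and point the
# PDF pipeline at them.
model_artifacts_path = StandardPdfPipeline.download_models_hf()
pipeline_options = PdfPipelineOptions(artifacts_path=model_artifacts_path)

# The change from this commit: keep OCR on the CPU instead of GPU/MPS.
pipeline_options.ocr_options.use_gpu = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# "sample.pdf" is a hypothetical input used here only for illustration.
result = converter.convert("sample.pdf")
print(result.document.export_to_markdown())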
