feat(documents): add docling (docling-serve) document-parsing backend #1006
yuisheaven wants to merge 3 commits into
Add docling-serve as an OCR-strong document-parsing backend alongside unstructured, for photographed/scanned/handwritten text.

- Images auto-route to a new images-only DoclingProcessor (priority 20).
- Scanned/no-text-layer PDFs escalate to docling via DOCUMENT_OCR_PROVIDER=docling, reusing the classifier's text-layer detection so born-digital PDFs stay on the cheap local tiers.
- nc_webdav_read_file gains a force_processor argument so the caller can re-parse a text-layer PDF (tables / partial text) with docling on demand.

One shared docling-serve HTTP client (document_processors/docling_serve.py) backs both the DoclingProcessor and the OCR-tier _DoclingServeBackend.

Config: ENABLE_DOCLING, DOCLING_API_URL, DOCLING_TIMEOUT, DOCLING_OCR_LANG, DOCLING_DO_OCR; docling added to the document_ocr_provider enum. New docling docker-compose profile. See docs/ADR-031.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- convert_file: raise ProcessorError (not JSONDecodeError/AttributeError) on a non-JSON or non-object docling-serve response, honoring the documented contract.
- docling OCR backend always sends do_ocr=true (it IS the OCR tier); DOCLING_DO_OCR now tunes only the image processor, so it can't silently no-OCR scanned PDFs. Dropped docling_do_ocr from Settings/env-map (kept in _DEFAULTS for the image path).
- Drop the misleading hardcoded docling_status metadata.
- Docs: correct the docling image MIME list (add gif/webp) + clarify DOCLING_DO_OCR scope.
- Tests: cover convert_file non-dict/non-JSON bodies and the progress-callback path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

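The convert_file change in this commit can be pictured with a minimal sketch, assuming the response body is available as text. The ProcessorError class below is a stand-in definition, and parse_docling_response is a hypothetical helper name, not the function in this PR.

```python
import json
from typing import Any


class ProcessorError(Exception):
    """Stand-in for the project's processor error type (hypothetical definition)."""


def parse_docling_response(body: str) -> dict[str, Any]:
    """Decode a docling-serve response body, normalizing failures to ProcessorError."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        # A non-JSON body (e.g. an HTML error page) must not leak JSONDecodeError.
        raise ProcessorError(f"docling-serve returned a non-JSON body: {exc}") from exc
    if not isinstance(payload, dict):
        # A JSON list, string, or null would otherwise fail later with AttributeError.
        raise ProcessorError(
            f"docling-serve returned {type(payload).__name__}, expected a JSON object"
        )
    return payload
```

With this shape, callers only need to handle one exception type regardless of how the upstream response is malformed.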
The mocked httpx client makes the URL scheme irrelevant, but literal http:// URLs tripped SonarCloud's python:S5332 and dropped the new-code Security Rating to B (Quality Gate requires A). Switch the docling test fixtures to https://, matching the existing convention (e.g. the gateway tests already use https://).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Hi @yuisheaven, thanks for the contribution! This is a meaty PR - will have to take some time to review.
It looks like this is actually the way you've set this up

Thanks! Yep, for the external-API path docling plugs in as an
One bit of transparency so it's not a surprise later: there's still one small

That's fine - I'm working on an image processor which is separate from generic document processing, I'll clean these up once that lands. Feel free to check out #800 if you're interested in that and leave a comment.

Thanks @yuisheaven! Superseded by #1009, which carries all of your commits from this branch plus a CI lane that actually exercises the docling OCR path (a
Because this PR is from a fork with maintainer edits disabled, the CI commit couldn't be pushed onto your branch, hence the new PR. Your original commits (and authorship) are preserved there. Closing this in favor of #1009.

Okay, very happy to hear it made it! Thanks a lot! I'll take note to enable the maintainer edits for envs next time if I get to create other features in the future.



What & why
unstructured handles photographed/scanned and especially handwritten text poorly. This adds docling (via an external docling-serve HTTP instance, DOCLING_API_URL) as an OCR-strong parsing backend, alongside unstructured, with no docling Python dependency added to the server (HTTP client only).
Three touchpoints (one shared client, document_processors/docling_serve.py)
- DoclingProcessor registers at priority 20 (above unstructured) for image MIME types when ENABLE_DOCLING=true + a URL is set.
- DOCUMENT_OCR_PROVIDER=docling plugs a _DoclingServeBackend into the existing OCR tier; the tier-0 classifier's text-layer detection means born-digital PDFs stay on the cheap local tiers.
- nc_webdav_read_file(force_processor="docling") re-parses any file with docling even when it has a text layer (tables/partial text).
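A rough sketch of how the three touchpoints above interact, for readers skimming the routing. Everything here is illustrative: the Processor dataclass, choose_processor, choose_pdf_tier, and the priority of 10 for unstructured are assumptions. Only the docling priority of 20, the image-only scope, and the force_processor name come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    priority: int
    mime_prefixes: tuple[str, ...]

    def supports(self, mime_type: str) -> bool:
        # str.startswith accepts a tuple of prefixes.
        return mime_type.startswith(self.mime_prefixes)


# Hypothetical registry: docling is images-only at priority 20; the priority
# of 10 for unstructured is an assumed value that is merely lower than 20.
REGISTRY = [
    Processor("unstructured", 10, ("image/", "application/")),
    Processor("docling", 20, ("image/",)),
]


def choose_processor(
    mime_type: str,
    processors: list[Processor],
    force_processor: str | None = None,
) -> Processor:
    """Pick the forced processor if given, else the highest-priority match."""
    if force_processor is not None:
        for proc in processors:
            if proc.name == force_processor:
                return proc  # the override bypasses the MIME-type check
        raise ValueError(f"unknown processor: {force_processor!r}")
    candidates = [p for p in processors if p.supports(mime_type)]
    if not candidates:
        raise ValueError(f"no processor for MIME type {mime_type!r}")
    return max(candidates, key=lambda p: p.priority)


def choose_pdf_tier(has_text_layer: bool, ocr_provider: str) -> str:
    """Born-digital PDFs stay local; only PDFs without a text layer escalate."""
    return "local" if has_text_layer else ocr_provider
```

For example, choose_processor("image/png", REGISTRY) picks docling, while a PDF goes to unstructured unless force_processor="docling" is passed.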
Config
ENABLE_DOCLING, DOCLING_API_URL, DOCLING_TIMEOUT, DOCLING_OCR_LANG, DOCLING_DO_OCR; docling added to the document_ocr_provider enum. New opt-in docling docker-compose profile. Off by default → no behavior change when unset.
Non-goals
Office formats stay with unstructured; auto never selects docling (needs an explicit self-hosted URL); sync-only v1 (async submit/poll is future work).
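To illustrate the "off by default" behavior of the settings listed under Config, here is a minimal sketch of reading them from the environment. The DoclingSettings class, load_docling_settings, and the 120-second timeout default are assumptions for illustration, not the project's actual Settings code. DOCLING_DO_OCR is left out for brevity.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DoclingSettings:
    enabled: bool
    api_url: str | None
    timeout: float
    ocr_lang: str | None

    @property
    def active(self) -> bool:
        # docling is used only when explicitly enabled AND a URL is configured.
        return self.enabled and bool(self.api_url)


def load_docling_settings(env: Mapping[str, str] = os.environ) -> DoclingSettings:
    """Read the docling variables; with nothing set, docling stays inactive."""
    return DoclingSettings(
        enabled=env.get("ENABLE_DOCLING", "false").strip().lower() == "true",
        api_url=env.get("DOCLING_API_URL") or None,
        timeout=float(env.get("DOCLING_TIMEOUT", "120")),  # assumed default
        ocr_lang=env.get("DOCLING_OCR_LANG") or None,
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to unit-test without touching the process environment.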
Testing
ruff/ty clean; unit tests for config, processor, OCR backend, routing, and the
force override; a gated live integration suite. Verified end-to-end against a real
docling-serve instance (image auto-route, forced text-layer PDF, scanned-PDF OCR
escalation). See docs/ADR-031.