feat(documents): add docling (docling-serve) document-parsing backend #1006

Closed
yuisheaven wants to merge 3 commits into cbcoutinho:master from yuisheaven:claude/nextcloud-docling-integration-vbg9jt

@yuisheaven

Contributor

What & why

unstructured handles photographed/scanned and especially handwritten text
poorly. This adds docling (via an external docling-serve HTTP instance,
DOCLING_API_URL) as an OCR-strong parsing backend — alongside unstructured,
with no docling Python dependency added to the server (HTTP client only).

Three touchpoints (one shared client, document_processors/docling_serve.py)

  • Images → docling (automatic): DoclingProcessor registers at priority 20
    (above unstructured) for image MIME types when ENABLE_DOCLING=true + a URL is set.
  • Scanned PDFs → docling (automatic, opt-in): DOCUMENT_OCR_PROVIDER=docling
    plugs a _DoclingServeBackend into the existing OCR tier; the tier-0 classifier's
    text-layer detection means born-digital PDFs stay on the cheap local tiers.
  • Text-layer PDFs → docling (on demand): nc_webdav_read_file(force_processor="docling")
    re-parses any file with docling even when it has a text layer (tables/partial text).
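
The image route and the on-demand override can be pictured as a priority-ordered registry with an explicit escape hatch. The following is a minimal sketch under stated assumptions, not the project's actual code: the `Processor` dataclass, this `find_processor` signature, the priority of 10 for unstructured, and the MIME sets are hypothetical stand-ins. Only the priority of 20 and the `force_processor` argument name come from the description above. The scanned-PDF route goes through the OCR tier and is not modelled here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Processor:
    """Hypothetical registry entry; not the project's real class."""
    name: str
    priority: int
    mime_types: FrozenSet[str]
    enabled: bool = True


def find_processor(
    processors: Iterable[Processor],
    mime_type: str,
    force_processor: Optional[str] = None,
) -> Processor:
    """Return the enabled processor with the highest priority for a MIME type.

    A non-None `force_processor` skips MIME matching and selects by name,
    mirroring the on-demand override described above.
    """
    candidates = [p for p in processors if p.enabled]
    if force_processor is not None:
        for p in candidates:
            if p.name == force_processor:
                return p
        raise LookupError(f"processor {force_processor!r} is not available")
    matching = [p for p in candidates if mime_type in p.mime_types]
    if not matching:
        raise LookupError(f"no processor registered for {mime_type!r}")
    return max(matching, key=lambda p: p.priority)


# Assumed example registry: docling handles images only, at priority 20.
UNSTRUCTURED = Processor(
    "unstructured", 10, frozenset({"image/png", "application/pdf"})
)
DOCLING = Processor("docling", 20, frozenset({"image/png"}))
REGISTRY = [UNSTRUCTURED, DOCLING]
```

With this registry an `image/png` file resolves to docling, a PDF resolves to unstructured unless `force_processor="docling"` is passed, and disabling docling sends images back to unstructured.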

Config

ENABLE_DOCLING, DOCLING_API_URL, DOCLING_TIMEOUT, DOCLING_OCR_LANG,
DOCLING_DO_OCR; docling added to the document_ocr_provider enum. New opt-in
docling docker-compose profile. Off by default → no behavior change when unset.
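
As an illustration of the "off by default" behaviour, the sketch below reads the five variables from an environment mapping. The variable names come from the list above; the `DoclingSettings` class, the `load_docling_settings` helper, the accepted truthy strings, and the fallback values (a 120-second timeout, OCR on) are assumptions made for the example, not the project's actual defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted truthy strings


@dataclass(frozen=True)
class DoclingSettings:
    """Hypothetical settings holder for the docling backend."""
    enabled: bool
    api_url: Optional[str]
    timeout: float
    ocr_lang: Optional[str]
    do_ocr: bool  # per the follow-up commit, tunes only the image processor

    @property
    def active(self) -> bool:
        # docling is used only when it is enabled AND a URL is configured.
        return self.enabled and bool(self.api_url)


def load_docling_settings(env: Mapping[str, str] = os.environ) -> DoclingSettings:
    def flag(name: str, default: bool) -> bool:
        raw = env.get(name)
        return default if raw is None else raw.strip().lower() in _TRUTHY

    return DoclingSettings(
        enabled=flag("ENABLE_DOCLING", False),
        api_url=env.get("DOCLING_API_URL") or None,
        timeout=float(env.get("DOCLING_TIMEOUT", "120")),  # assumed default
        ocr_lang=env.get("DOCLING_OCR_LANG") or None,
        do_ocr=flag("DOCLING_DO_OCR", True),  # assumed default
    )
```

Requiring both the flag and the URL in `active` is one way to guarantee that an unset environment changes nothing.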

Non-goals

Office formats stay with unstructured; auto never selects docling (needs an
explicit self-hosted URL); sync-only v1 (async submit/poll is future work).

Testing

ruff/ty clean; unit tests for config, processor, OCR backend, routing, and the
force override; a gated live integration suite. Verified end-to-end against a real
docling-serve instance (image auto-route, forced text-layer PDF, scanned-PDF OCR
escalation). See docs/ADR-031.

claude added 3 commits July 2, 2026 11:45
Add docling-serve as an OCR-strong document-parsing backend alongside
unstructured, for photographed/scanned/handwritten text.

- Images auto-route to a new images-only DoclingProcessor (priority 20).
- Scanned/no-text-layer PDFs escalate to docling via
  DOCUMENT_OCR_PROVIDER=docling, reusing the classifier's text-layer
  detection so born-digital PDFs stay on the cheap local tiers.
- nc_webdav_read_file gains a force_processor argument so the caller can
  re-parse a text-layer PDF (tables / partial text) with docling on demand.

One shared docling-serve HTTP client (document_processors/docling_serve.py)
backs both the DoclingProcessor and the OCR-tier _DoclingServeBackend.

Config: ENABLE_DOCLING, DOCLING_API_URL, DOCLING_TIMEOUT, DOCLING_OCR_LANG,
DOCLING_DO_OCR; docling added to the document_ocr_provider enum. New docling
docker-compose profile. See docs/ADR-031.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- convert_file: raise ProcessorError (not JSONDecodeError/AttributeError) on a
  non-JSON or non-object docling-serve response, honoring the documented contract.
- docling OCR backend always sends do_ocr=true (it IS the OCR tier); DOCLING_DO_OCR
  now tunes only the image processor, so it can't silently no-OCR scanned PDFs.
  Dropped docling_do_ocr from Settings/env-map (kept in _DEFAULTS for the image path).
- Drop the misleading hardcoded docling_status metadata.
- Docs: correct the docling image MIME list (add gif/webp) + clarify DOCLING_DO_OCR scope.
- Tests: cover convert_file non-dict/non-JSON bodies and the progress-callback path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
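
The first fix in the commit above, raising a single error type for malformed responses, can be sketched roughly as follows. This is an assumption-laden illustration rather than the code from the PR: `ProcessorError` here is a local stand-in for the project's exception, and `parse_convert_response` is a hypothetical helper name.

```python
import json
from typing import Any, Dict


class ProcessorError(Exception):
    """Local stand-in for the project's ProcessorError."""


def parse_convert_response(body: bytes) -> Dict[str, Any]:
    """Decode a docling-serve response body, normalising failures.

    Callers see ProcessorError instead of JSONDecodeError (non-JSON body)
    or a later AttributeError (JSON that is not an object).
    """
    try:
        payload = json.loads(body)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        raise ProcessorError("docling-serve returned a non-JSON body") from exc
    if not isinstance(payload, dict):
        raise ProcessorError(
            f"docling-serve returned a JSON {type(payload).__name__}, "
            "expected an object"
        )
    return payload
```

Checking the type immediately after decoding keeps the failure at the HTTP boundary, so downstream code can rely on receiving a dictionary.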
The mocked httpx client makes the URL scheme irrelevant, but literal
http:// URLs tripped SonarCloud's python:S5332 and dropped the new-code
Security Rating to B (Quality Gate requires A). Switch the docling test
fixtures to https://, matching the existing convention (e.g. the gateway
tests already use https://).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jul 2, 2026

@cbcoutinho

cbcoutinho commented Jul 2, 2026

Owner

Hi @yuisheaven, thanks for the contribution! This is a meaty PR - will have to take some time to review.

As you may have seen, I'm consolidating document processors that make requests to external APIs into an OcrProcessor, rather than individual processors per upstream tool. Would it be possible to route requests to docling using this same interface rather than creating a new processor? Happy to hear your thoughts 🙏

It looks like this is actually the way you've set this up

@yuisheaven

Contributor Author

Thanks! Yep — for the external-API path docling plugs in as an _OcrBackend
behind OcrProcessor (DOCUMENT_OCR_PROVIDER=docling), so scanned PDFs go
through the same interface as the gateway/Mistral backends.

One bit of transparency so it's not a surprise later: there's still one small
standalone processor for images (the photographed/handwritten case), because
OcrProcessor is PDF-only today and images don't enter the OCR tier — they route
via find_processor.

@cbcoutinho

Owner

That's fine. I'm working on an image processor that is separate from generic document processing, and I'll clean these up once that lands. Feel free to check out #800 and leave a comment if you're interested.

@cbcoutinho

Owner

Thanks @yuisheaven! Superseded by #1009, which carries all of your commits from this branch plus a CI lane that actually exercises the docling OCR path (a docling-integration job runs the gated -m docling suite against a docling-serve-cpu container).

Because this PR is from a fork with maintainer edits disabled, the CI commit couldn't be pushed onto your branch — hence the new PR. Your original commits (and authorship) are preserved there. Closing this in favor of #1009.

@cbcoutinho cbcoutinho closed this Jul 2, 2026
@yuisheaven

Contributor Author

Okay, very happy to hear it made it! Thanks a lot!

I'll take note to enable the maintainer edits for envs next time if I get to create other features in the future

@yuisheaven yuisheaven deleted the claude/nextcloud-docling-integration-vbg9jt branch July 3, 2026 06:28