feat(ingest): support Microsoft Word .docx/.doc blobs with chunking #178

Merged
prsasattms merged 1 commit into main from feat-word-docs on Jun 3, 2026

@prsasattms (Collaborator)

Adds first-class ingestion of Microsoft Word documents alongside the existing .pdf/.txt paths, with the same paragraph-aware chunking pipeline.

What's in

Pipeline worker (docgrok/pipeline-worker/)

  • word_extract.py (new):
    • extract_docx(path): python-docx-based; walks the body in document order, emits paragraph text, flattened table cells, and headers/footers, and splits on Word <w:sectPr> section breaks.
    • extract_doc(path): best-effort olefile-based reader for the legacy binary .doc format. Tries the FIB ccpText fast path, falls back to piece-table reconstruction for complex documents, and auto-decodes UTF-16LE/CP1252. Raises a clear 415 with a 'convert to .docx and retry' hint if it can't recover usable text.
  • worker.py:
    • New _classify_blob_doctype() helper (precedence: docx → doc → text → pdf), with word/docx substring detection on pipeline hints. _is_text_blob preserved by delegation.
    • _stage_extract gained doctype in ('docx','doc') branches: streams the blob to a temp file (no full-payload buffering), runs the extractor, populates ctx['page_texts'] (one entry per Word section), records pages_total/sections/chars.
    • New word-transform registered in BUILTIN_TRANSFORMS (priority 25, between PDF and text) wired to .docx/.doc with strategy=paragraph, max_chars=1500, overlap_chars=100 — same chunker the text-transform uses.
    • doctype enum in STAGE_CATALOG updated to enum[auto|pdf|text|docx|doc].
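
For reviewers who want a feel for the classifier precedence described above, here is a minimal sketch. It is not the code in worker.py: the function names, the `TEXT_EXTENSIONS` set, the `override` parameter, and the choice of `pdf` as the fallback are all assumptions made for illustration. It checks an explicit override first, then the extension in docx → doc → text → pdf order, then a 'word'/'docx' substring in the pipeline hint.

```python
import os

# Assumption: which extensions count as "text" is not stated in the PR.
TEXT_EXTENSIONS = {".txt", ".md", ".csv"}


def classify_blob_doctype(blob_name, pipeline_hint="", override="auto"):
    """Hypothetical stand-in for _classify_blob_doctype()."""
    # An explicit doctype (anything other than "auto") wins outright.
    if override and override != "auto":
        return override

    ext = os.path.splitext(blob_name.lower())[1]
    # Extension precedence: docx -> doc -> text -> pdf.
    if ext == ".docx":
        return "docx"
    if ext == ".doc":
        return "doc"
    if ext in TEXT_EXTENSIONS:
        return "text"
    if ext == ".pdf":
        return "pdf"

    # No usable extension: fall back to substring detection on the hint.
    hint = pipeline_hint.lower()
    if "docx" in hint or "word" in hint:
        return "docx"
    if "text" in hint:
        return "text"

    # Assumption: treat anything unrecognised as PDF, the pre-existing path.
    return "pdf"


def is_text_blob(blob_name, pipeline_hint=""):
    """Preserved helper that simply delegates to the classifier."""
    return classify_blob_doctype(blob_name, pipeline_hint) == "text"
```

Keeping `is_text_blob` as a thin wrapper means existing call sites keep working while all precedence rules live in one place.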

Blob connector (api/connectors/blob_connector.py)

  • .docx and .doc added to the default allowed_extensions in both list_blobs() and list_blobs_paginated() so blob sources pick them up out of the box.
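
As a rough illustration of the connector change, the sketch below filters blob names against a default extension list. `DEFAULT_ALLOWED_EXTENSIONS` and `filter_blob_names` are hypothetical names, and the full default list in blob_connector.py may contain more entries than shown here.

```python
# Assumption: the real default list may include other extensions as well.
DEFAULT_ALLOWED_EXTENSIONS = (".pdf", ".txt", ".docx", ".doc")


def filter_blob_names(names, allowed_extensions=DEFAULT_ALLOWED_EXTENSIONS):
    """Keep only blob names whose extension is allowed (case-insensitive)."""
    allowed = tuple(ext.lower() for ext in allowed_extensions)
    # str.endswith accepts a tuple, so one call covers every extension.
    return [name for name in names if name.lower().endswith(allowed)]
```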

Packaging

  • requirements.txt: python-docx==1.1.2 and olefile==0.47.
  • Dockerfile: COPY word_extract.py.

Verified

  • python word_extract.py round-trips a generated .docx with headings + paragraphs + 2×2 table + after-table paragraph; all 7 needles present, 1 section emitted.
  • 12/12 _classify_blob_doctype cases pass (extension + pipeline hint + override precedence).
  • python -m py_compile clean across all touched files.
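
The round-trip check can be approximated without python-docx. The sketch below is a standard-library stand-in, not the extractor in this PR: it reads word/document.xml with zipfile and ElementTree, walks the body in document order, flattens table rows, and starts a new section whenever a paragraph carries a `<w:sectPr>`. Headers and footers are not handled, nested tables are only flattened into their parent cell, and the generated archive contains just the one part the reader needs, so Word itself may refuse to open it. All function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
P_TAG, TBL_TAG, T_TAG = f"{{{W}}}p", f"{{{W}}}tbl", f"{{{W}}}t"


def _para_text(p):
    """Concatenate every <w:t> run inside a paragraph."""
    return "".join(t.text or "" for t in p.iter(T_TAG))


def extract_docx_sections(fileobj):
    """Return one text blob per Word section, in document order."""
    with zipfile.ZipFile(fileobj) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    sections, current = [], []
    for child in root.find("w:body", NS):
        if child.tag == P_TAG:
            text = _para_text(child)
            if text.strip():
                current.append(text)
            # A <w:sectPr> inside <w:pPr> marks the last paragraph of a section.
            if child.find("w:pPr/w:sectPr", NS) is not None:
                sections.append("\n\n".join(current))
                current = []
        elif child.tag == TBL_TAG:
            for row in child.findall("w:tr", NS):
                cells = [
                    " ".join(_para_text(p) for p in tc.iter(P_TAG)).strip()
                    for tc in row.findall("w:tc", NS)
                ]
                current.append(" | ".join(cells))
    if current or not sections:
        sections.append("\n\n".join(current))
    return sections


def build_sample_docx():
    """Build an in-memory archive holding only word/document.xml."""
    def p(text, section_break=False):
        ppr = "<w:pPr><w:sectPr/></w:pPr>" if section_break else ""
        return f"<w:p>{ppr}<w:r><w:t>{text}</w:t></w:r></w:p>"

    def row(*texts):
        return "<w:tr>" + "".join(f"<w:tc>{p(t)}</w:tc>" for t in texts) + "</w:tr>"

    body = (
        p("Intro paragraph")
        + "<w:tbl>" + row("A1", "B1") + row("A2", "B2") + "</w:tbl>"
        + p("After table", section_break=True)
        + p("Second section")
        + "<w:sectPr/>"
    )
    xml = f'<w:document xmlns:w="{W}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf
```

Calling `extract_docx_sections(build_sample_docx())` yields two sections: the first holds the intro paragraph, both table rows, and the after-table paragraph in that order; the second holds the final paragraph.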

Notes

  • .docx is fully supported; .doc is best-effort by design — perfect fidelity on legacy .doc requires either LibreOffice --headless or antiword, both of which would balloon the worker image. Callers get an actionable 415 when the OLE path can't recover text.
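
To make the "actionable 415" behaviour concrete, here is a small sketch of one way the decode-or-fail step could look. It is an assumption-heavy illustration, not the logic in word_extract.py: the `UnsupportedMediaError` class, the zero-byte heuristic for spotting UTF-16LE, and the 80% printable threshold are all invented for this example.

```python
class UnsupportedMediaError(Exception):
    """Hypothetical error type the worker could map to HTTP 415."""
    status_code = 415


def decode_doc_text(raw: bytes) -> str:
    """Decode a recovered .doc text stream or fail with a retry hint."""
    # Heuristic (assumption): UTF-16LE text in Latin scripts has 0x00 in
    # most odd byte positions; otherwise treat the bytes as CP1252.
    odd_bytes = raw[1::2]
    looks_utf16 = len(odd_bytes) > 0 and odd_bytes.count(0) > len(odd_bytes) * 0.5
    encoding = "utf-16-le" if looks_utf16 else "cp1252"
    text = raw.decode(encoding, errors="replace")

    # Reject empty or mostly non-printable output rather than index garbage.
    printable = sum(ch.isprintable() or ch in "\r\n\t" for ch in text)
    if not text.strip() or printable / len(text) < 0.8:
        raise UnsupportedMediaError(
            "Could not recover usable text from .doc; convert to .docx and retry"
        )
    return text
```

Failing loudly here keeps unreadable binary noise out of the chunker and gives callers a concrete next step.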

Adds first-class extraction of Microsoft Word documents alongside the existing .pdf/.txt support, with the same paragraph-aware chunking pipeline.

Pipeline worker:

* New docgrok/pipeline-worker/word_extract.py with extract_docx() (python-docx; paragraphs, tables, headers/footers, splits on Word section breaks) and extract_doc() (olefile-based best-effort: FIB ccpText + piece-table reconstruction for complex .doc, plus UTF-16/CP1252 auto-decode).

* worker.py: new _classify_blob_doctype() helper (docx > doc > text > pdf priority on extension, with 'word'/'docx' pipeline-hint substring detection). _is_text_blob() preserved via classifier delegation.

* _stage_extract gained 'docx' and 'doc' doctype branches: streams blob to temp file (no full-payload buffering), runs the extractor, exposes pages_total=#sections + chars in the step record. Legacy .doc returns 415 with a 'convert to .docx and retry' hint when extraction can't recover text.

* New word-transform in BUILTIN_TRANSFORMS (priority 25, between PDF and text) wired to .docx/.doc with strategy=paragraph, max_chars=1500, overlap_chars=100 — same chunker the text-transform uses.

* doctype enum advertised in STAGE_CATALOG: enum[auto|pdf|text|docx|doc].
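
For context on the strategy=paragraph, max_chars=1500, overlap_chars=100 settings, the following is a minimal sketch of a paragraph-aware chunker with character overlap. It is an independent illustration, not the shared chunker in worker.py; `chunk_paragraphs` is a hypothetical name, and splitting on blank lines and hard-splitting oversize paragraphs are assumptions.

```python
def chunk_paragraphs(text, max_chars=1500, overlap_chars=100):
    """Pack paragraphs into chunks of at most max_chars, carrying a tail overlap."""
    separator = "\n\n"
    # Leave room for the overlap tail plus separator so no chunk exceeds max_chars.
    budget = max_chars - overlap_chars - len(separator)
    if budget <= 0:
        raise ValueError("max_chars must exceed overlap_chars + 2")

    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Assumption: paragraphs longer than the budget are hard-split.
        pieces = [para[i:i + budget] for i in range(0, len(para), budget)]
        for piece in pieces:
            candidate = f"{current}{separator}{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            # Flush the full chunk and seed the next one with its tail.
            chunks.append(current)
            tail = current[-overlap_chars:] if overlap_chars else ""
            current = f"{tail}{separator}{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

With the defaults, each chunk after the first begins with the last 100 characters of the previous chunk, which helps retrieval when a sentence straddles a chunk boundary.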

Blob connector:

* api/connectors/blob_connector.py: .docx and .doc added to the default allowed_extensions in both list_blobs() and list_blobs_paginated().

Packaging:

* requirements.txt: python-docx==1.1.2, olefile==0.47.

* Dockerfile: COPY word_extract.py into the image.

Verified locally: extract_docx round-trips a generated .docx (paragraphs + table cells + after-table paragraph all present); 12/12 classifier cases pass; worker.py compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@prsasattms prsasattms merged commit d6aa60c into main Jun 3, 2026
10 checks passed
@prsasattms prsasattms deleted the feat-word-docs branch June 3, 2026 17:32