feat(ingest): support Microsoft Word .docx/.doc blobs with chunking by prsasattms · Pull Request #178 · AzureCosmosDB/OmniVec

prsasattms · 2026-06-03T17:28:38Z

Adds first-class ingestion of Microsoft Word documents alongside the existing .pdf/.txt paths, with the same paragraph-aware chunking pipeline.

What's in

Pipeline worker (docgrok/pipeline-worker/)

word_extract.py (new):
- extract_docx(path) — python-docx-based, walks the body in document order, emits paragraph text + flattened table cells + headers/footers, splits on Word <w:sectPr> section breaks.
- extract_doc(path) — best-effort olefile-based reader for the legacy binary .doc format. Tries the FIB ccpText fast path, falls back to piece-table reconstruction for complex documents, auto-decodes UTF-16LE/CP1252. Raises a clear 415 with a 'convert to .docx and retry' hint if it can't recover usable text.
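To make the "document order" and section-splitting behaviour concrete, here is a minimal standard-library sketch. It is not the PR's code: the PR uses python-docx, whereas this illustration reads word/document.xml directly, and the name extract_sections is hypothetical. Headers and footers are left out for brevity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (from the OOXML standard).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _para_text(p):
    """Concatenate every <w:t> run inside one paragraph element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def extract_sections(docx_bytes):
    """Hypothetical helper: return one text string per Word section."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("w:body", NS)
    sections, current = [], []
    # Iterating the body's children keeps paragraphs and tables interleaved
    # in the order they appear in the document.
    for child in body:
        if child.tag == f"{{{W}}}p":
            text = _para_text(child)
            if text.strip():
                current.append(text)
            # A <w:sectPr> inside paragraph properties closes a section.
            if child.find("w:pPr/w:sectPr", NS) is not None:
                sections.append("\n".join(current))
                current = []
        elif child.tag == f"{{{W}}}tbl":
            # Flatten each table row into a single "a | b" line.
            for row in child.iter(f"{{{W}}}tr"):
                cells = [
                    " ".join(_para_text(p) for p in tc.iter(f"{{{W}}}p")).strip()
                    for tc in row.findall("w:tc", NS)
                ]
                current.append(" | ".join(cells))
    # The body-level <w:sectPr> describes the final section; flush its text.
    if current or not sections:
        sections.append("\n".join(current))
    return sections
```

Each returned string would correspond to one entry in ctx['page_texts'] under the "one entry per Word section" convention described for worker.py.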
worker.py:
- New _classify_blob_doctype() helper (precedence: docx → doc → text → pdf), with word/docx substring detection on pipeline hints. _is_text_blob preserved by delegation.
- _stage_extract gained doctype in ('docx','doc') branches: streams the blob to a temp file (no full-payload buffering), runs the extractor, populates ctx['page_texts'] (one entry per Word section), records pages_total/sections/chars.
- New word-transform registered in BUILTIN_TRANSFORMS (priority 25, between PDF and text) wired to .docx/.doc with strategy=paragraph, max_chars=1500, overlap_chars=100 — same chunker the text-transform uses.
- doctype enum in STAGE_CATALOG updated to enum[auto|pdf|text|docx|doc].
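For readers unfamiliar with the strategy=paragraph chunker that the word-transform reuses, the following is a rough sketch of greedy paragraph packing with overlap. Only the parameter names max_chars and overlap_chars and their values come from this PR; the function chunk_paragraphs and its exact splitting rules are assumptions, not the worker's implementation.

```python
def chunk_paragraphs(text, max_chars=1500, overlap_chars=100):
    """Hypothetical paragraph-aware chunker (illustration only).

    Packs whole paragraphs into chunks of at most max_chars and seeds each
    new chunk with the last overlap_chars characters of the previous one.
    """
    # Room left for fresh text once the overlap tail and the "\n\n"
    # separator have been accounted for.
    step = max_chars - overlap_chars - 2
    if step <= 0:
        raise ValueError("overlap_chars must leave room inside max_chars")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split paragraphs longer than the per-chunk budget.
        pieces = [para[i:i + step] for i in range(0, len(para), step)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            # Close the full chunk and carry its tail into the next one.
            chunks.append(current)
            tail = current[-overlap_chars:] if overlap_chars else ""
            current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a paragraph slightly shorter than max_chars may still be split, because the budget reserves space for the overlap; a production chunker might prefer sentence boundaries instead.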

Blob connector (api/connectors/blob_connector.py)

.docx and .doc added to the default allowed_extensions in both list_blobs() and list_blobs_paginated() so blob sources pick them up out of the box.
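As a rough illustration of what an allowed_extensions default does, the sketch below filters blob names by suffix. The helper name filter_blob_names and the exact contents of the default tuple are assumptions; the PR only states that .docx and .doc were added alongside the existing .pdf/.txt handling.

```python
from pathlib import PurePosixPath

# Assumed default set; the real connector's list may contain more entries.
DEFAULT_ALLOWED_EXTENSIONS = (".pdf", ".txt", ".docx", ".doc")


def filter_blob_names(names, allowed_extensions=DEFAULT_ALLOWED_EXTENSIONS):
    """Hypothetical helper: keep blob names whose suffix is allowed."""
    allowed = {ext.lower() for ext in allowed_extensions}
    # Blob names use forward slashes, so PurePosixPath parses them on any OS.
    return [n for n in names if PurePosixPath(n).suffix.lower() in allowed]
```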

Packaging

requirements.txt: python-docx==1.1.2 and olefile==0.47.
Dockerfile: COPY word_extract.py.

Verified

python word_extract.py round-trips a generated .docx with headings + paragraphs + 2×2 table + after-table paragraph; all 7 needles present, 1 section emitted.
12/12 _classify_blob_doctype cases pass (extension + pipeline hint + override precedence).
python -m py_compile clean across all touched files.
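The classifier cases above (extension, pipeline hint, override precedence) suggest logic along the following lines. This is a sketch under stated assumptions, not the worker's _classify_blob_doctype(): the signature, the TEXT_EXTENSIONS tuple, the mapping of a 'word'/'docx' hint to docx, and the pdf fallback are all guesses made for illustration.

```python
# Assumed set of text-like extensions; the real worker may differ.
TEXT_EXTENSIONS = (".txt", ".md", ".csv")
EXPLICIT_DOCTYPES = {"pdf", "text", "docx", "doc"}


def classify_blob_doctype(blob_name, pipeline_hint="", override="auto"):
    """Hypothetical classifier returning 'docx', 'doc', 'text' or 'pdf'."""
    # An explicit doctype from the stage config wins over any detection.
    if override in EXPLICIT_DOCTYPES:
        return override
    name = blob_name.lower()
    # Extension checks in the stated precedence: docx, doc, text, pdf.
    # ".docx" must be tested before ".doc" only for readability here, since
    # endswith(".doc") does not match names ending in ".docx".
    if name.endswith(".docx"):
        return "docx"
    if name.endswith(".doc"):
        return "doc"
    if name.endswith(TEXT_EXTENSIONS):
        return "text"
    if name.endswith(".pdf"):
        return "pdf"
    # Extension inconclusive: fall back to substring hints (assumed mapping).
    hint = pipeline_hint.lower()
    if "word" in hint or "docx" in hint:
        return "docx"
    return "pdf"
```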

Notes

.docx is fully supported; .doc is best-effort by design — perfect fidelity on legacy .doc requires either LibreOffice --headless or antiword, both of which would balloon the worker image. Callers get an actionable 415 when the OLE path can't recover text.
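The "actionable 415" behaviour can be pictured with a small decode-or-reject sketch. Everything here is an assumption for illustration: the exception class, the zero-high-byte heuristic for choosing between UTF-16LE and CP1252, and the 0.85 printable-ratio threshold are not taken from word_extract.py.

```python
class UnsupportedMediaError(Exception):
    """Hypothetical error carrying an HTTP-style 415 status."""
    status_code = 415


def decode_doc_text(raw: bytes, min_printable_ratio: float = 0.85) -> str:
    """Guess the text encoding of a .doc text run and reject unusable output."""
    # In UTF-16LE, mostly-Latin text has 0x00 as every second byte.
    high_bytes = raw[1::2]
    looks_utf16 = bool(high_bytes) and high_bytes.count(0) / len(high_bytes) > 0.5
    encoding = "utf-16-le" if looks_utf16 else "cp1252"
    text = raw.decode(encoding, errors="replace")

    # Count characters a reader could plausibly use.
    printable = sum(ch.isprintable() or ch in "\r\n\t" for ch in text)
    if not text.strip() or printable / len(text) < min_printable_ratio:
        raise UnsupportedMediaError(
            "could not recover text from legacy .doc; convert to .docx and retry"
        )
    return text
```

A caller would map UnsupportedMediaError.status_code onto the step record or HTTP response, so the user sees the conversion hint rather than an empty extraction.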

Adds first-class extraction of Microsoft Word documents alongside the existing .pdf/.txt support, with the same paragraph-aware chunking pipeline.

Pipeline worker:
* New docgrok/pipeline-worker/word_extract.py with extract_docx() (python-docx; paragraphs, tables, headers/footers, splits on Word section breaks) and extract_doc() (olefile-based best-effort: FIB ccpText + piece-table reconstruction for complex .doc, plus UTF-16/CP1252 auto-decode).
* worker.py: new _classify_blob_doctype() helper (docx > doc > text > pdf priority on extension, with 'word'/'docx' pipeline-hint substring detection). _is_text_blob() preserved via classifier delegation.
* _stage_extract gained 'docx' and 'doc' doctype branches: streams blob to temp file (no full-payload buffering), runs the extractor, exposes pages_total=#sections + chars in the step record. Legacy .doc returns 415 with a 'convert to .docx and retry' hint when extraction can't recover text.
* New word-transform in BUILTIN_TRANSFORMS (priority 25, between PDF and text) wired to .docx/.doc with strategy=paragraph, max_chars=1500, overlap_chars=100 (the same chunker the text-transform uses).
* doctype enum advertised in STAGE_CATALOG: enum[auto|pdf|text|docx|doc].

Blob connector:
* api/connectors/blob_connector.py: .docx and .doc added to the default allowed_extensions in both list_blobs() and list_blobs_paginated().

Packaging:
* requirements.txt: python-docx==1.1.2, olefile==0.47.
* Dockerfile: COPY word_extract.py into the image.

Verified locally: extract_docx round-trips a generated .docx (paragraphs + table cells + after-table paragraph all present); 12/12 classifier cases pass; worker.py compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

prsasattms merged commit d6aa60c into main Jun 3, 2026
10 checks passed

prsasattms deleted the feat-word-docs branch June 3, 2026 17:32

prsasattms merged 1 commit into main from feat-word-docs
