feat(ingest): support Microsoft Word .docx/.doc blobs with chunking #178
Merged
Adds first-class extraction of Microsoft Word documents alongside the existing .pdf/.txt support, with the same paragraph-aware chunking pipeline.

Pipeline worker:
* New docgrok/pipeline-worker/word_extract.py with extract_docx() (python-docx; paragraphs, tables, headers/footers, splits on Word section breaks) and extract_doc() (olefile-based best-effort: FIB ccpText + piece-table reconstruction for complex .doc, plus UTF-16/CP1252 auto-decode).
* worker.py: new _classify_blob_doctype() helper (docx > doc > text > pdf priority on extension, with 'word'/'docx' pipeline-hint substring detection). _is_text_blob() preserved via classifier delegation.
* _stage_extract gained 'docx' and 'doc' doctype branches: streams blob to temp file (no full-payload buffering), runs the extractor, exposes pages_total=#sections + chars in the step record. Legacy .doc returns 415 with a 'convert to .docx and retry' hint when extraction can't recover text.
* New word-transform in BUILTIN_TRANSFORMS (priority 25, between PDF and text) wired to .docx/.doc with strategy=paragraph, max_chars=1500, overlap_chars=100 (same chunker the text-transform uses).
* doctype enum advertised in STAGE_CATALOG: enum[auto|pdf|text|docx|doc].

Blob connector:
* api/connectors/blob_connector.py: .docx and .doc added to the default allowed_extensions in both list_blobs() and list_blobs_paginated().

Packaging:
* requirements.txt: python-docx==1.1.2, olefile==0.47.
* Dockerfile: COPY word_extract.py into the image.

Verified locally: extract_docx round-trips a generated .docx (paragraphs + table cells + after-table paragraph all present); 12/12 classifier cases pass; worker.py compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
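For readers who want a feel for the classifier precedence described in the commit message, here is a minimal sketch. It is not the code from worker.py: the function name `classify_blob_doctype`, its parameters, the choice to check the extension before the hint, the mapping of a 'word'/'docx' hint to `docx`, and the `pdf` fallback are all assumptions made for illustration.

```python
import os
from typing import Optional


def classify_blob_doctype(
    blob_name: str,
    pipeline_hint: str = "",
    override: Optional[str] = None,
) -> str:
    """Illustrative stand-in for _classify_blob_doctype(); hypothetical signature."""
    # Assumption: an explicit doctype other than "auto" wins outright.
    if override and override != "auto":
        return override

    # Extension checks in the stated order: docx, then doc, then text, then pdf.
    ext = os.path.splitext(blob_name.lower())[1]
    if ext == ".docx":
        return "docx"
    if ext == ".doc":
        return "doc"
    if ext == ".txt":
        return "text"
    if ext == ".pdf":
        return "pdf"

    # Assumption: with no recognised extension, a pipeline hint that contains
    # "word" or "docx" selects the docx extractor.
    hint = pipeline_hint.lower()
    if "docx" in hint or "word" in hint:
        return "docx"

    # Assumption: anything else falls back to the PDF path.
    return "pdf"
```

Under these assumptions, `classify_blob_doctype("report.DOCX")` yields `"docx"` and an extensionless blob with the hint `"word-transform"` also yields `"docx"`. Plain substring matching is loose by nature (a hint such as "keyword" would match too), which may or may not be acceptable in the real worker.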
Adds first-class ingestion of Microsoft Word documents alongside the existing `.pdf`/`.txt` paths, with the same paragraph-aware chunking pipeline.

## What's in

### Pipeline worker (`docgrok/pipeline-worker/`)

- `word_extract.py` (new):
  - `extract_docx(path)`: `python-docx`-based, walks the body in document order, emits paragraph text + flattened table cells + headers/footers, splits on Word `<w:sectPr>` section breaks.
  - `extract_doc(path)`: best-effort `olefile`-based reader for the legacy binary `.doc` format. Tries the FIB `ccpText` fast path, falls back to piece-table reconstruction for complex documents, auto-decodes UTF-16LE/CP1252. Raises a clear 415 with a 'convert to .docx and retry' hint if it can't recover usable text.
- `worker.py`:
  - `_classify_blob_doctype()` helper (precedence: docx → doc → text → pdf), with `word`/`docx` substring detection on pipeline hints. `_is_text_blob` preserved by delegation.
  - `_stage_extract` gained `doctype in ('docx','doc')` branches: streams the blob to a temp file (no full-payload buffering), runs the extractor, populates `ctx['page_texts']` (one entry per Word section), records `pages_total` / `sections` / `chars`.
  - `word-transform` registered in `BUILTIN_TRANSFORMS` (priority 25, between PDF and text) wired to `.docx`/`.doc` with `strategy=paragraph, max_chars=1500, overlap_chars=100` (same chunker the text-transform uses).
  - `doctype` enum in `STAGE_CATALOG` updated to `enum[auto|pdf|text|docx|doc]`.

### Blob connector (`api/connectors/blob_connector.py`)

- `.docx` and `.doc` added to the default `allowed_extensions` in both `list_blobs()` and `list_blobs_paginated()` so blob sources pick them up out of the box.

### Packaging

- `requirements.txt`: `python-docx==1.1.2` and `olefile==0.47`.
- `Dockerfile`: `COPY word_extract.py`.

## Verified

- `python word_extract.py` round-trips a generated `.docx` with headings + paragraphs + 2×2 table + after-table paragraph; all 7 needles present, 1 section emitted.
- `_classify_blob_doctype` cases pass (extension + pipeline hint + override precedence).
- `python -m py_compile` clean across all touched files.

## Notes

`.docx` is fully supported; `.doc` is best-effort by design: perfect fidelity on legacy `.doc` requires either `LibreOffice --headless` or `antiword`, both of which would balloon the worker image. Callers get an actionable 415 when the OLE path can't recover text.
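
For context on the `strategy=paragraph, max_chars=1500, overlap_chars=100` settings mentioned above, the following is a rough sketch of what a paragraph-aware chunker with character overlap can look like. It is not the chunker shipped in this PR: `chunk_paragraphs`, the blank-line paragraph delimiter, and the hard split of oversized paragraphs are assumptions chosen to keep the example small.

```python
from typing import List

SEPARATOR = "\n\n"  # assumption: paragraphs are delimited by blank lines


def chunk_paragraphs(
    text: str, max_chars: int = 1500, overlap_chars: int = 100
) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Each chunk after the first starts with the last overlap_chars characters
    of the previous chunk. Hypothetical helper, for illustration only.
    """
    # Room left for new text once the overlap tail and separator are counted.
    budget = max_chars - overlap_chars - len(SEPARATOR)
    if overlap_chars < 0 or budget <= 0:
        raise ValueError("overlap_chars must be >= 0 and well below max_chars")

    # Split into paragraphs; hard-split any paragraph longer than the budget.
    pieces: List[str] = []
    for para in text.split(SEPARATOR):
        para = para.strip()
        for start in range(0, len(para), budget):
            pieces.append(para[start:start + budget])

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + SEPARATOR + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The piece does not fit: close the chunk and seed the next one
        # with the tail of the chunk just closed.
        chunks.append(current)
        tail = current[-overlap_chars:] if overlap_chars else ""
        current = tail + SEPARATOR + piece if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

With three 600-character paragraphs and the defaults, this sketch packs the first two into one chunk and starts the second chunk with the last 100 characters of the first. The overlap here is a raw character tail, so it can begin mid-word; a production chunker might align the tail to a sentence or word boundary instead.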