3 changes: 0 additions & 3 deletions .github/workflows/pages.yml
@@ -1,9 +1,6 @@
name: Deploy Demo to GitHub Pages

on:
  push:
    branches: [main]
    paths: [demo/**]
  workflow_dispatch:

permissions:
21 changes: 20 additions & 1 deletion .gitignore
@@ -7,4 +7,23 @@ invoices/
*.pyc
output/
node_modules/
demo/dist/
demo/dist/

# B0 hand-wired bootstrap scaffolding (caravan-conversion branch).
# The local-source pattern is transitional — gone once caravan-rpc lands on
# PyPI (post-M9-cloud) and the M1 compiler emits compose from caravan.yaml.
# Until then these files live on disk locally but stay out of git history.
infra/docker-compose.caravan-bootstrap.yaml
infra/rebuild-caravan-rpc-wheel.sh
services/processing/vendor/

# pip editable-install metadata
*.egg-info/

# B0 / M1 acceptance-gate run outputs
.b0-runs/
.m1-runs/

# caravan-emitted artifacts. Regenerated by `caravan compile --target=<name>`.
# Should never be hand-edited; should never be committed.
infra/*/generated/
41 changes: 38 additions & 3 deletions README.md
@@ -119,6 +119,40 @@ redis-cli XLEN queue:a && redis-cli XLEN queue:b
| Processing | Python | CPU/API-bound; OCR + LLM libraries are Python-native |
| Output | Rust | I/O-bound delivery; Excel gen is stable plumbing that rarely changes |

## Deployment via Caravan

invoice-parse adopts [Caravan](https://github.com/paulxiep/caravan) — an application-definition compiler — as its deployment composition layer (Phase 1 close, 2026-05-21). Three RPC dispatch boundaries are declared as Caravan **seams** via the `caravan-rpc` SDK (`@wagon` interface + `provide(I, impl)` registration + `client(I).method()` dispatch); data-plane primitives are declared as Caravan **resources**. One yaml + one `caravan compile --target=<name>` produces a per-target `docker-compose.override.generated.yaml` plus per-service patched `requirements.txt`. Source code is byte-identical across every target — yaml-line changes alone flip dispatch and composition. See [caravan.yaml](caravan.yaml) for the full descriptor and [docs/caravan-readiness.md](docs/caravan-readiness.md) for the pre-adoption structural analysis.
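
To make the seam vocabulary concrete, here is a minimal, self-contained sketch of the `provide(I, impl)` / `client(I).method()` pattern. It is not the `caravan-rpc` SDK: `SeamRegistry`, `_RemoteProxy`, `EchoExtractor`, and `fake_transport` are hypothetical names invented for illustration, and the real SDK's signatures and wire protocol may differ. The point it shows is that the call site stays the same while a per-seam mode decides between a local object and a remote proxy.

```python
from typing import Any, Callable, Dict, Protocol, Tuple


class LLMExtraction(Protocol):
    """Hypothetical seam interface; the real one lives in extraction.py."""

    def extract(self, text: str) -> dict: ...


# (seam name, method name, args, kwargs) -> result
Transport = Callable[[str, str, Tuple[Any, ...], Dict[str, Any]], Any]


class SeamRegistry:
    """Toy registry: maps an interface to a local impl or a remote proxy."""

    def __init__(self, modes: Dict[str, str], transport: Transport) -> None:
        self._modes = modes          # e.g. {"LLMExtraction": "container"}
        self._transport = transport  # stands in for the RPC client
        self._impls: Dict[type, Any] = {}

    def provide(self, interface: type, impl: Any) -> None:
        self._impls[interface] = impl

    def client(self, interface: type) -> Any:
        # Seams that are not listed fall back to in-process dispatch.
        mode = self._modes.get(interface.__name__, "inproc")
        if mode == "inproc":
            return self._impls[interface]
        return _RemoteProxy(interface.__name__, self._transport)


class _RemoteProxy:
    """Forwards any method call to the transport instead of a local object."""

    def __init__(self, seam: str, transport: Transport) -> None:
        self._seam = seam
        self._transport = transport

    def __getattr__(self, method: str) -> Callable[..., Any]:
        def call(*args: Any, **kwargs: Any) -> Any:
            return self._transport(self._seam, method, args, kwargs)

        return call


class EchoExtractor:
    """Placeholder impl; a real one would call an LLM provider."""

    def extract(self, text: str) -> dict:
        return {"source": "inproc", "chars": len(text)}


def fake_transport(seam: str, method: str, args: tuple, kwargs: dict) -> dict:
    # A real transport would serialize the call and send it to the peer.
    return {"source": "remote", "seam": seam, "method": method}
```

With this sketch, `registry.client(LLMExtraction).extract(text)` is written once; constructing the registry with `{"LLMExtraction": "container"}` instead of `{}` is the only change needed to route the call through the transport.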

**Declared seams** (live in `services/processing/invoice_processing/`):

| Seam | Interface | Container peer service | Notes |
|---|---|---|---|
| `LLMExtraction` | `extraction.py` | `llm-extractor` | Gemini-backed; provider-swappable via `provide()`. |
| `OCRText` | `ocr.py` | `ocr-text` | PaddleOCR; impl eager-loads the model in `__init__` before binding the TCP port. |
| `OCRLayout` | `table_extract.py` | `ocr-layout` | Per-target impl choice: inproc default uses `SpatialClusterExtractor` (CPU-light); container peer uses `PPStructureExtractor` (PPStructureV3 model). |

**Declared resources:** `invoice_queue` (queue, default `redis-streams`, can flip to `rabbitmq` via yaml composition override); `invoice_db` (Postgres). `invoice_blobs` is **intentionally not yet declared** — Caravan's only OSS-local bucket variant is MinIO, which would flip just the consumer it's declared on (`processing`) to S3-protocol calls while ingestion + output continue talking to the hand-authored `blobdata` volume mount; the storage-backend mismatch would break the queue pipeline. Re-introducing it requires declaring all three deploy units' `uses:` together (see the deferred-work comment in [caravan.yaml](caravan.yaml)).
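
The `caravan.yaml` comments note that the queue adapter routes on the `QUEUE_URL` scheme, with `amqp://` selecting RabbitMQ. A minimal sketch of that routing step follows. The function name `queue_backend_for`, the scheme table, and the example URLs are assumptions for illustration; the actual adapter in `libs/shared-py` may be organised differently.

```python
from urllib.parse import urlsplit

# Assumed scheme-to-backend table; only amqp:// is confirmed by the descriptor.
_BACKENDS = {
    "redis": "redis-streams",
    "rediss": "redis-streams",
    "amqp": "rabbitmq",
    "amqps": "rabbitmq",
}


def queue_backend_for(url: str) -> str:
    """Return the backend name implied by the URL scheme."""
    scheme = urlsplit(url).scheme.lower()
    try:
        return _BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported queue URL scheme: {scheme!r}") from None
```

Keying on the scheme keeps the choice in configuration: the same worker code handles either engine, and an unknown scheme fails at startup rather than on the first message.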

**Demo targets** (each is a one-line yaml change):

| Target | LLMExtraction | OCRText | OCRLayout | Queue |
|---|---|---|---|---|
| `dev-bootstrap` | container | container | container | redis-streams |
| `dev-split-llm` | container | inproc | inproc | redis-streams |
| `dev-inproc` | inproc | inproc | inproc | redis-streams |
| `dev-rabbitmq-flip` | container | container | container | **rabbitmq** |
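
The table can be read as a lookup with a default: a target lists only the seams it moves out of process, and anything omitted stays `inproc`. The sketch below models that rule on plain dicts shaped like the `targets:` block. `resolve_dispatch` is a hypothetical helper, not part of the Caravan compiler.

```python
SEAMS = ("LLMExtraction", "OCRText", "OCRLayout")


def resolve_dispatch(target: dict) -> dict:
    """Fill in the dispatch mode for every seam, defaulting to inproc."""
    declared = target.get("seams") or {}
    unknown = set(declared) - set(SEAMS)
    if unknown:
        raise ValueError(f"unknown seams: {sorted(unknown)}")
    return {seam: declared.get(seam, "inproc") for seam in SEAMS}


# Shaped like the dev-split-llm and dev-inproc targets in caravan.yaml.
dev_split_llm = {
    "seams": {
        "LLMExtraction": "container",
        "OCRText": "inproc",
        "OCRLayout": "inproc",
    }
}
dev_inproc = {}  # seams omitted
```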

```bash
# Install caravan, then from invoice-parse repo root:
caravan compile --target=dev-bootstrap
docker compose \
  -f infra/docker-compose.yaml \
  -f infra/dev-bootstrap/generated/docker-compose.override.generated.yaml \
  --profile app up
```

`git diff -- services/ libs/` between target compiles is empty, which is the thesis claim in concrete form: yaml-line dispatch and composition flips require zero source edits. The Caravan-emitted override layers atop the hand-authored [infra/docker-compose.yaml](infra/docker-compose.yaml); the base compose still works standalone for the no-Caravan local-dev path (all seams default to inproc, all resources default to the hand-authored Postgres + Redis containers).

## Design Decisions

| Decision | Rationale |
@@ -127,7 +161,7 @@ redis-cli XLEN queue:a && redis-cli XLEN queue:b
| **Deterministic pipeline, not agentic** | Invoice extraction is structured and repeatable. Conditional fallbacks (vision model, provider failover, VAT derivation) provide adaptability without LLM decision loops. |
| **Queue-based async** | Redis Streams locally, SQS in production. Workers scale independently. Queue absorbs bursts. |
| **Confidence from validation, not LLM self-report** | 50% validation checks + 30% OCR confidence + 20% field completeness. Deterministic and auditable. |
| **Adapter pattern** | `BlobStore` (local FS / S3) and `MessageQueue` (Redis / SQS) abstractions. Same code, swap config. |
| **Adapter pattern** | `BlobStore` (local FS / S3) and `MessageQueue` (Redis / RabbitMQ / SQS) abstractions. Selection at startup via env-var presence (Caravan-injected) with YAML-config fallback; see [Deployment via Caravan](#deployment-via-caravan). Same code, swap target. |
| **Layout-aware OCR** | PaddleOCR with spatial clustering preserves table row/column structure critical for invoices. |
| **LLM result cache** | SHA-256 hash of input file → cached extraction. First run hits LLM, subsequent runs are free. Enables load testing without API cost. |
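
As a worked illustration of the confidence weighting in the table (50% validation checks, 30% OCR confidence, 20% field completeness), the sketch below combines three inputs that are each assumed to be normalised to the range 0 to 1. The function name and the normalisation are assumptions; the README states only the weights.

```python
def confidence_score(
    validation_pass_rate: float,
    ocr_confidence: float,
    field_completeness: float,
) -> float:
    """Weighted sum using the 50/30/20 split; every input must be in [0, 1]."""
    inputs = {
        "validation_pass_rate": validation_pass_rate,
        "ocr_confidence": ocr_confidence,
        "field_completeness": field_completeness,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    return (
        0.5 * validation_pass_rate
        + 0.3 * ocr_confidence
        + 0.2 * field_completeness
    )
```

For example, an invoice that passes 80% of its validation checks, has a mean OCR confidence of 0.9, and fills half of the expected fields scores 0.5 × 0.8 + 0.3 × 0.9 + 0.2 × 0.5 = 0.77.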

@@ -155,6 +189,7 @@ redis-cli XLEN queue:a && redis-cli XLEN queue:b

```
invoice_parse/
├── caravan.yaml Caravan application descriptor (entries + seams + resources + targets)
├── services/
│ ├── ingestion/ Rust — IMAP poll, file ingest, job creation
│ ├── processing/ Python — OCR, table extraction, LLM, validation
@@ -164,10 +199,10 @@ invoice_parse/
│ ├── shared-rs/ Rust — models, DB, blob store, queue adapters
│ └── shared-py/ Python — models, DB, blob store, queue adapters
├── config/ YAML config (local.yaml, docker.yaml, production.yaml)
├── infra/ Docker Compose (Postgres, Redis)
├── infra/ Docker Compose (hand-authored base) + per-target Caravan-emitted overrides under infra/<target>/generated/
├── migrations/ SQL schema
├── invoices/ Sample invoice files for testing
└── docs/ Architecture, service plans, devlog
└── docs/ Architecture, service plans, devlog, caravan-readiness analysis
```

**Mono-repo for POC, multi-repo in production.** Each service under `services/` has its own `Cargo.toml` or `pyproject.toml`, `Dockerfile`, and dependency closure — no Cargo workspace, no shared virtualenv. Shared libraries under `libs/` use path dependencies locally and would publish to a private crate registry / PyPI in production. This means any service can be extracted to its own repo and CI pipeline without restructuring.
178 changes: 178 additions & 0 deletions caravan.yaml
@@ -0,0 +1,178 @@
# Caravan application descriptor for invoice-parse.
#
# This file is the user-authored input to `caravan compile --target=<name>`.
# It maps the existing invoice-parse code (Python processing service + Rust
# ingestion/output) onto Caravan's vocabulary: entries (deploy units),
# seams (synchronous abstraction boundaries inside a single language),
# resources (data-plane primitives), and per-target dispatch overrides.
#
# Status: M0 seed. Only the LLMExtraction seam + the processing entry are
# fully declared — enough to drive the dev-bootstrap target that matches
# B0's hand-edited override. Other invoice-parse services (output, ingest,
# dashboard, model-init) and resources (postgres, redis, blob storage) are
# stubs / unmentioned and stay hand-authored in infra/docker-compose.yaml
# until M6 brings them under caravan declaration.

name: invoice-parse
default_target: dev-bootstrap
# Pin Caravan's per-target write root to the pre-existing `infra/<target>/generated/`
# layout (otherwise the compiler defaults to `caravan-out/`). All committed
# generated artifacts + the layering hints in infra/docker-compose.yaml assume
# `infra/`.
output_dir: infra

entries:
  processing:
    path: services/processing
    dockerfile: services/processing/Dockerfile
    triggers:
      - queue: { from: invoice_queue }
    uses: [invoice_queue, invoice_db, gemini_key]

  # ingest is a one-shot Rust CLI that enqueues PDFs from a mounted
  # directory onto invoice_queue and writes the raw input blobs to
  # invoice_blobs. The compose service is hand-authored in
  # infra/docker-compose.yaml; declaring it here makes Caravan emit the
  # data-plane env vars (S3_*, QUEUE_URL) onto the same service name in
  # the generated override. No seams of its own — single-binary CLI.
  ingest:
    path: services/ingestion
    dockerfile: services/ingestion/Dockerfile
    uses: [invoice_queue]

  # output is a Rust queue-consumer that reads completed extractions
  # from invoice_queue, fetches the input blob, generates an Excel,
  # writes it back to invoice_blobs, and records final state in
  # invoice_db. Compose service is hand-authored; this declaration
  # unifies its env-var emission with processing's.
  output:
    path: services/output
    dockerfile: services/output/Dockerfile
    triggers:
      - queue: { from: invoice_queue }
    uses: [invoice_queue, invoice_db]

seams:
  LLMExtraction:
    # Same code lives in the same module as the processing entry; the
    # peer service in container-mode reuses the processing image with an
    # overridden command (B0's pattern, now compiler-emitted at M1).
    path: services/processing
    dockerfile: services/processing/Dockerfile
    uses: [gemini_key]
    # `impl` tells the M1 compose-emitter which class to load when this
    # seam dispatches as a container peer. Language-agnostic shape; the
    # Python emitter parses `module:Class`.
    impl: invoice_processing.extraction:GeminiExtractor
    # Override the default kebab-case naming (`llm-extraction`) so the
    # generated compose service name matches B0's hand-edit byte-for-byte.
    service_name: llm-extractor
    # The llm-extractor peer needs GEMINI_API_KEY to call out to
    # Gemini. invoice-parse runs compose from `invoice-parse/`, so
    # `../.env` from the generated override file resolves to the user's
    # repo-level .env. Code-rag's Rust peers don't need any envvars
    # beyond CARAVAN_RPC_SHARED_SECRET, so they omit this field.
    env_file: ../.env

  OCRText:
    # M3 second seam — PaddleOCR-backed raw OCR text extraction.
    # Same module/image as LLMExtraction; differs by impl class and
    # service_name. Container-mode peer eagerly loads PaddleOCR in its
    # __init__ before binding the TCP port (avoids a cold-start race
    # with consumer dispatch).
    path: services/processing
    dockerfile: services/processing/Dockerfile
    uses: []
    impl: invoice_processing.ocr:PaddleOCRTextImpl
    service_name: ocr-text

  OCRLayout:
    # M6 third seam — table/layout extraction (PPStructureV3 model).
    # Container-mode peer eagerly loads PPStructureV3 in its __init__.
    # The worker's local provide() registers SpatialClusterExtractor
    # (CPU-only, no model) for the inproc default; this `impl:` ref
    # selects PPStructureExtractor for the container-mode peer where
    # the heavy-model accuracy is worth its load cost. inproc and
    # container thus run different impls of the same seam — both
    # satisfy the OCRLayout contract; the choice is per-target.
    path: services/processing
    dockerfile: services/processing/Dockerfile
    uses: []
    impl: invoice_processing.table_extract:PPStructureExtractor
    service_name: ocr-layout

resources:
  invoice_queue: { type: queue, composition: oss-local }
  # Credentials match the hand-authored infra/docker-compose.yaml's
  # postgres service (POSTGRES_USER/PASSWORD/DB) so the Caravan-emitted
  # DATABASE_URL points at the same engine. When Caravan also emits the
  # postgres container (no collision), POSTGRES_* env on that container
  # tracks these same values.
  invoice_db: { type: db.sql, composition: oss-local, user: invoice, password: invoice, dbname: invoice_parse }
  # invoice_blobs intentionally NOT declared as a Caravan resource yet.
  # Caravan's only OSS-local bucket variant today is MinIO, which would
  # flip just the consumer it's declared on (`processing` originally) to
  # S3 wire calls while ingest + output continue to use the LocalFs
  # adapter against the hand-authored `blobdata` volume mount —
  # storage-backend mismatch breaks the queue pipeline (NoSuchKey).
  # Re-introduce `invoice_blobs: { type: bucket, composition: oss-local }`
  # AND add it to ingest + output + processing's `uses:` lists in the same
  # change once all three deploy units consume MinIO consistently. A
  # `variant: localfs` for buckets would be the IR-pure way to declare
  # "Caravan-managed, but bound to the shared volume the user provides";
  # not in the compiler today.

secrets:
  gemini_key: { from: env, path: GEMINI_API_KEY }

targets:
  # dev-bootstrap is M6's canonical demo: all three seams as peer
  # containers (llm-extractor, ocr-text, ocr-layout). Stresses the
  # multi-seam emit + per-seam env-var wiring end-to-end.
  dev-bootstrap:
    runtime: docker-compose
    default_composition: oss-local
    entries: { processing: container, ingest: container, output: container }
    seams:
      LLMExtraction: container
      OCRText: container
      OCRLayout: container

  # dev-split-llm is the mix-and-match demo: LLMExtraction runs as a
  # peer container while the OCR seams stay inproc inside the
  # processing service. Proves per-seam dispatch flips independently —
  # the load-bearing thesis claim. Source code is identical to
  # dev-bootstrap; only this target's seams block differs.
  dev-split-llm:
    runtime: docker-compose
    default_composition: oss-local
    entries: { processing: container, ingest: container, output: container }
    seams:
      LLMExtraction: container
      OCRText: inproc
      OCRLayout: inproc

  # dev-inproc keeps everything in-process. Same source code; flip yaml
  # lines and the inproc/http dispatch toggles. This is the thesis.
  dev-inproc:
    runtime: docker-compose
    default_composition: oss-local
    entries: { processing: container, ingest: container, output: container }
    # seams omitted → all three seams default to inproc

  # dev-rabbitmq-flip is M4's composition-orthogonality demo. Same
  # source, same per-seam dispatch as dev-bootstrap (all three seams
  # container); only the queue engine swaps from redis-streams (default)
  # to rabbitmq. Caravan emits a new `rabbitmq:` service and flips
  # QUEUE_URL to amqp:// on the processing consumer. invoice-parse's
  # queue adapter routes on URL scheme.
  dev-rabbitmq-flip:
    runtime: docker-compose
    default_composition: oss-local
    entries: { processing: container, ingest: container, output: container }
    seams:
      LLMExtraction: container
      OCRText: container
      OCRLayout: container
    composition:
      invoice_queue: { mode: oss-local, kind: rabbitmq }