This is a starter implementation for a governed, shared S3-compatible store where files land in an ingestion prefix and are then:
- discovered by a periodic or on-demand scan,
- triaged by an LLM adapter,
- optionally organized into an archive prefix,
- indexed for retrieval,
- fully audited, and
- snapshotted into reproducible point-in-time manifests.
The default dev stack uses MinIO as the S3-compatible object store and a local SQLite database for state. For production, swap SQLite for Postgres and the simple text index for a dedicated search backend.
Features:

- Remote object storage via S3/MinIO.
- Direct upload flow via presigned URLs to the ingestion prefix.
- Periodic scan loop and on-demand scan endpoint.
- LLM triage adapter interface with a deterministic `noop` adapter included (see the sketch after the repository layout below).
- Indexing hook with a simple search implementation.
- Audit log for every ingest, triage, move, index, snapshot, and human override.
- Logical snapshots with manifest hashes.
- lakeFS hook stub for stronger repo-style snapshots later.
- Human-in-the-loop override endpoint.
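
The triage adapter is the main extension point. A minimal sketch of what the contract in `app/adapters/llm.py` could look like (class and field names here are illustrative, not the actual code):

```python
# Hypothetical sketch of the triage adapter contract; names are illustrative.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TriageResult:
    title: str
    summary: str
    tags: list[str] = field(default_factory=list)
    sensitivity: str = "unknown"          # e.g. "public" / "internal" / "restricted"
    suggested_collection: str | None = None
    requires_review: bool = True


class TriageAdapter(Protocol):
    def triage(self, key: str, text_preview: str) -> TriageResult:
        """Produce triage metadata for one object."""
        ...


class NoopTriageAdapter:
    """Deterministic heuristic adapter so the stack runs without an external model."""

    def triage(self, key: str, text_preview: str) -> TriageResult:
        name = key.rsplit("/", 1)[-1]
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        return TriageResult(
            title=name,
            summary=text_preview[:200],
            tags=[ext] if ext else [],
            requires_review=True,
        )
```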
Repository layout:

```
app/
  adapters/
    llm.py            # LLM adapter interface + noop heuristic adapter
    snapshot.py       # external snapshot backend hook
    storage.py        # S3/MinIO adapter
  services/
    audit.py
    indexing.py
    ingest.py
    snapshots.py
    tasks.py
    triage.py
  api.py
  cli.py
  config.py
  db.py
  main.py
  models.py
  schemas.py
  schemas_s3.py       # typed S3 event payload validation
  security.py         # API-key and S3 event authorization helpers
  utils.py
docs/
  CONSIDERATIONS.md
  comparable_projects.md
  requirements_interview.md
  testing.md
packages/
  security-core/      # secret scanning and review-scope enforcement helpers
tests/
  conftest.py
  integration/
  unit/
  README_TESTS.md     # full testing guide and fixture documentation
```
Quickstart:

- Copy the env file.

  ```bash
  cp .env.example .env
  ```

- Start the dev stack.

  ```bash
  docker compose up --build
  ```

- Open:
  - API: http://localhost:8080/docs
  - MinIO API: http://localhost:9000
  - MinIO console: http://localhost:9001
- Upload into the ingestion prefix:
  - through the MinIO UI / an S3 client, or
  - by requesting a presigned URL from the API.

```bash
curl -s -X POST http://localhost:8080/uploads/presign \
  -H 'content-type: application/json' \
  -d '{"filename":"acme-invoice-001.txt","content_type":"text/plain"}'
```

Use the returned `url` to PUT the file bytes. The object key will land under the configured ingestion prefix.
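
For example, a minimal Python sketch of the presign-then-PUT flow (the `url` and `key` response field names are assumptions; check the actual response shape):

```python
# Minimal sketch: request a presigned URL, then PUT the file bytes to it.
# Field names ("url", "key") are assumptions about the presign response.
import requests

resp = requests.post(
    "http://localhost:8080/uploads/presign",
    json={"filename": "acme-invoice-001.txt", "content_type": "text/plain"},
    timeout=10,
)
resp.raise_for_status()
presigned = resp.json()

with open("acme-invoice-001.txt", "rb") as f:
    put = requests.put(
        presigned["url"],
        data=f,
        headers={"content-type": "text/plain"},
        timeout=30,
    )
put.raise_for_status()
print("uploaded to", presigned.get("key"))
```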
Trigger an on-demand scan:

```bash
curl -s -X POST http://localhost:8080/jobs/scan
```

If you configure MinIO/S3 bucket notifications to POST events to the API, you can ingest specific objects without scanning whole prefixes:
```bash
curl -s -X POST http://localhost:8080/events/s3 \
  -H 'content-type: application/json' \
  -d '{"Records":[{"s3":{"bucket":{"name":"llm-ingest"},"object":{"key":"ingestion/acme-invoice-001.txt"}}}]}'
```
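
The payload follows the standard S3 notification shape, and `app/schemas_s3.py` validates it. A hedged sketch of how that validation could be modeled with Pydantic (model names are illustrative):

```python
# Illustrative Pydantic models for the subset of the S3 event payload used above.
from pydantic import BaseModel


class _Bucket(BaseModel):
    name: str


class _Object(BaseModel):
    key: str


class _S3Entity(BaseModel):
    bucket: _Bucket
    object: _Object


class S3Record(BaseModel):
    s3: _S3Entity


class S3EventPayload(BaseModel):
    Records: list[S3Record]


payload = S3EventPayload.model_validate(
    {"Records": [{"s3": {"bucket": {"name": "llm-ingest"},
                         "object": {"key": "ingestion/acme-invoice-001.txt"}}}]}
)
print(payload.Records[0].s3.object.key)
```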
Search:

```bash
# Latest-state search
curl -s "http://localhost:8080/search?q=invoice"

# Search restricted to a snapshot (logical time-travel filter)
curl -s "http://localhost:8080/search?q=invoice&snapshot_id=<SNAPSHOT_ID>"
```

Create a snapshot:

```bash
curl -s -X POST http://localhost:8080/snapshots \
  -H 'content-type: application/json' \
  -d '{"label":"baseline"}'
```

Diff two snapshots:

```bash
curl -s "http://localhost:8080/snapshots/diff?from=<SNAPSHOT_ID_A>&to=<SNAPSHOT_ID_B>"
```

How it works:

- A file lands in `SOURCE_BUCKET/INGESTION_PREFIX`.
- The scheduler enqueues a `scan_prefix` task, or you call the scan endpoint.
- The worker discovers new or changed objects and creates:
  - a `Document` logical record,
  - a `DocumentVersion` record for the observed object version.
- The worker runs triage:
  - generates title / summary / tags / sensitivity / collection suggestion,
  - decides whether review is required,
  - optionally moves the object into the archive prefix.
- The worker indexes the resulting metadata and text preview.
- Every step emits immutable audit rows.
- Snapshots capture the current logical corpus state and hash it.
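
One way to hash a manifest deterministically is to serialize the corpus entries in a stable order before digesting them; a minimal sketch of the idea (not necessarily the project's exact scheme):

```python
# Sketch: hash a point-in-time manifest of (key, version, etag) entries deterministically.
import hashlib
import json


def manifest_hash(entries: list[dict]) -> str:
    """entries: e.g. [{"key": "...", "version_id": "...", "etag": "..."}, ...]"""
    canonical = json.dumps(
        sorted(entries, key=lambda e: e["key"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(manifest_hash([
    {"key": "archive/acme-invoice-001.txt", "version_id": "v1", "etag": "abc123"},
]))
```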
Notes and limitations:

- This is intentionally a skeleton, not a production-complete DMS.
- The included search is a simple text search over metadata and previews.
- The included LLM adapter is deterministic so the project runs without an external model.
- The lakeFS integration point is a stub. The skeleton already produces logical snapshots; use the adapter hook when you adopt repo-style object-store commits.
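
The hook in `app/adapters/snapshot.py` only needs a narrow surface; a speculative sketch of the kind of interface a lakeFS-backed implementation would fill in (method names are assumptions):

```python
# Hypothetical external snapshot backend hook; a lakeFS adapter would implement this.
from typing import Protocol


class SnapshotBackend(Protocol):
    def commit(self, label: str, manifest_hash: str) -> str:
        """Record a repo-style commit for the snapshot; return an external ref/commit id."""
        ...


class NoopSnapshotBackend:
    """Default stub: logical snapshots only, no external commit."""

    def commit(self, label: str, manifest_hash: str) -> str:
        return f"noop:{manifest_hash[:12]}"
```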
For a production deployment, consider:

- Postgres instead of SQLite.
- Dedicated search / vector backend (OpenSearch, Vespa, Qdrant, pgvector).
- Real OCR / content extraction pipeline.
- Stronger task queue (Celery, Dramatiq, Arq, Temporal, or a managed queue).
- Policy engine for retention / legal hold / auto-approval thresholds.
- External audit sink and SIEM integration.
- lakeFS or equivalent if full namespace time-travel is mandatory.