LLM-Governed Object Store Skeleton

This is a starter implementation for a governed, shared S3-compatible store where files land in an ingestion prefix and are then:

  1. discovered by a periodic or on-demand scan,
  2. triaged by an LLM adapter,
  3. optionally organized into an archive prefix,
  4. indexed for retrieval,
  5. fully audited, and
  6. snapshotted into reproducible point-in-time manifests.

The default dev stack uses MinIO as the S3-compatible object store and a local SQLite database for state. For production, swap SQLite for Postgres and the simple text index for a dedicated search backend.

What is included

  • Remote object storage via S3/MinIO.
  • Direct upload flow via presigned URLs to the ingestion prefix.
  • Periodic scan loop and on-demand scan endpoint.
  • LLM triage adapter interface with a deterministic noop adapter included (a sketch of the interface follows this list).
  • Indexing hook with a simple search implementation.
  • Audit log for every ingest, triage, move, index, snapshot, and human override.
  • Logical snapshots with manifest hashes.
  • lakeFS hook stub for stronger repo-style snapshots later.
  • Human-in-the-loop override endpoint.
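
The triage contract is intentionally small: an adapter receives an object's key and a text preview and returns structured triage metadata. Below is a minimal sketch of what that contract could look like; the names (TriageResult, TriageAdapter, NoopAdapter) and fields are illustrative assumptions, not the actual app/adapters/llm.py implementation.

# Hypothetical sketch of the triage adapter contract; names and fields are
# assumptions and may differ from app/adapters/llm.py.
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class TriageResult:
    title: str
    summary: str
    tags: list[str] = field(default_factory=list)
    sensitivity: str = "low"           # e.g. low / medium / high
    collection: Optional[str] = None   # suggested archive collection
    needs_review: bool = False         # route to the human override endpoint


class TriageAdapter(Protocol):
    def triage(self, key: str, text_preview: str) -> TriageResult: ...


class NoopAdapter:
    """Deterministic heuristic adapter so the pipeline runs without a model."""

    def triage(self, key: str, text_preview: str) -> TriageResult:
        return TriageResult(
            title=key.rsplit("/", 1)[-1],
            summary=text_preview[:200],
            tags=["invoice"] if "invoice" in key.lower() else [],
            needs_review="confidential" in text_preview.lower(),
        )

Keeping the contract behind a Protocol means a real LLM-backed adapter can be swapped in without touching the worker code.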

Repository layout

app/
  adapters/
    llm.py          # LLM adapter interface + noop heuristic adapter
    snapshot.py     # external snapshot backend hook
    storage.py      # S3/MinIO adapter
  services/
    audit.py
    indexing.py
    ingest.py
    snapshots.py
    tasks.py
    triage.py
  api.py
  cli.py
  config.py
  db.py
  main.py
  models.py
  schemas.py
  schemas_s3.py     # typed S3 event payload validation
  security.py       # API-key and S3 event authorization helpers
  utils.py
docs/
  CONSIDERATIONS.md
  comparable_projects.md
  requirements_interview.md
  testing.md
packages/
  security-core/    # secret scanning and review-scope enforcement helpers
tests/
  conftest.py
  integration/
  unit/
README_TESTS.md     # full testing guide and fixture documentation

Quick start

  1. Copy the env file.
cp .env.example .env
  2. Start the dev stack.
docker compose up --build
  3. Open:
  • API: http://localhost:8080/docs
  • MinIO API: http://localhost:9000
  • MinIO console: http://localhost:9001
  4. Upload into the ingestion prefix:
  • through the MinIO UI / an S3 client, or
  • by requesting a presigned URL from the API.

Presign an upload

curl -s -X POST http://localhost:8080/uploads/presign \
  -H 'content-type: application/json' \
  -d '{"filename":"acme-invoice-001.txt","content_type":"text/plain"}'

Use the returned URL to PUT the file bytes. The object key will land under the configured ingestion prefix.
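
For reference, here is a minimal Python sketch of the full round trip (presign, then PUT) using requests; the response field names ("url" and "key") are assumptions about the API's payload, so confirm them against the /docs schema.

# Presign an upload, then PUT the file bytes to the returned URL.
# Response field names ("url", "key") are assumptions about the payload.
import requests

resp = requests.post(
    "http://localhost:8080/uploads/presign",
    json={"filename": "acme-invoice-001.txt", "content_type": "text/plain"},
)
resp.raise_for_status()
presign = resp.json()

with open("acme-invoice-001.txt", "rb") as f:
    put = requests.put(presign["url"], data=f,
                       headers={"content-type": "text/plain"})
put.raise_for_status()
print("uploaded as", presign.get("key"))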

Trigger a scan immediately

curl -s -X POST http://localhost:8080/jobs/scan

Event-driven ingestion (S3 notifications)

If you configure MinIO/S3 bucket notifications to POST events to the API, you can ingest specific objects without scanning whole prefixes.

curl -s -X POST http://localhost:8080/events/s3 \
  -H 'content-type: application/json' \
  -d '{"Records":[{"s3":{"bucket":{"name":"llm-ingest"},"object":{"key":"ingestion/acme-invoice-001.txt"}}}]}'

Search

# Latest-state search
curl -s "http://localhost:8080/search?q=invoice"

# Search restricted to a snapshot (logical time-travel filter)
curl -s "http://localhost:8080/search?q=invoice&snapshot_id=<SNAPSHOT_ID>"

Create a logical snapshot

curl -s -X POST http://localhost:8080/snapshots \
  -H 'content-type: application/json' \
  -d '{"label":"baseline"}'

Diff snapshots

curl -s "http://localhost:8080/snapshots/diff?from=<SNAPSHOT_ID_A>&to=<SNAPSHOT_ID_B>"

Processing flow

  1. A file lands in SOURCE_BUCKET/INGESTION_PREFIX.
  2. The scheduler enqueues a scan_prefix task, or you call the scan endpoint.
  3. The worker discovers new or changed objects and creates:
    • a Document logical record,
    • a DocumentVersion record for the observed object version.
  4. The worker runs triage:
    • generates title / summary / tags / sensitivity / collection suggestion,
    • decides whether review is required,
    • optionally moves the object into the archive prefix.
  5. The worker indexes the resulting metadata and text preview.
  6. Every step emits immutable audit rows.
  7. Snapshots capture the current logical corpus state and hash it (sketched below).
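
One way to make step 7 reproducible is to hash a canonical serialization of the corpus state. A hedged sketch, assuming (hypothetically) that each manifest entry carries a key and an ETag:

# Sketch of a reproducible manifest hash: sort entries canonically, then
# hash the serialized form. Field names ("key", "etag") are assumptions.
import hashlib
import json


def manifest_hash(entries: list[dict]) -> str:
    canonical = json.dumps(
        sorted(entries, key=lambda e: e["key"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(manifest_hash([
    {"key": "archive/acme-invoice-001.txt", "etag": "9a0364b9e99bb480dd25e1f0284c8555"},
]))

Sorting the entries and fixing the JSON separators keeps the hash stable across runs regardless of insertion order.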

Important notes

  • This is intentionally a skeleton, not a production-complete document management system (DMS).
  • The included search is a simple text search over metadata and previews.
  • The included LLM adapter is deterministic so the project runs without an external model.
  • The lakeFS integration point is a stub. The skeleton already produces logical snapshots; use the adapter hook when you adopt repo-style object-store commits.

Suggested production upgrades

  • Postgres instead of SQLite.
  • Dedicated search / vector backend (OpenSearch, Vespa, Qdrant, pgvector).
  • Real OCR / content extraction pipeline.
  • Stronger task queue (Celery, Dramatiq, Arq, Temporal, or a managed queue).
  • Policy engine for retention / legal hold / auto-approval thresholds.
  • External audit sink and SIEM integration.
  • lakeFS or equivalent if full namespace time-travel is mandatory.
