This is a starter implementation for a governed, shared S3-compatible store where files land in an ingestion prefix and are then:
- discovered by a periodic or on-demand scan,
- triaged by an LLM adapter,
- optionally organized into an archive prefix,
- indexed for retrieval,
- fully audited, and
- snapshotted into reproducible point-in-time manifests.
The default dev stack uses MinIO as the S3-compatible object store and a local SQLite database for state. For production, swap SQLite for Postgres and the simple text index for a dedicated search backend.
Features:

- Remote object storage via S3/MinIO.
- Direct upload flow via presigned URLs to the ingestion prefix.
- Periodic scan loop and on-demand scan endpoint.
- LLM triage adapter interface with a deterministic `noop` adapter included (see the sketch after the repository layout below).
- Indexing hook with a simple search implementation.
- Audit log for every ingest, triage, move, index, snapshot, and human override.
- Logical snapshots with manifest hashes.
- lakeFS hook stub for stronger repo-style snapshots later.
- Human-in-the-loop override endpoint.
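
The triage adapter is the main extension point. A minimal sketch of what the contract in `app/adapters/llm.py` could look like (class and field names here are illustrative, not the actual code):

```python
# Hypothetical sketch of the triage adapter contract; names are illustrative.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TriageResult:
    title: str
    summary: str
    tags: list[str] = field(default_factory=list)
    sensitivity: str = "unknown"          # e.g. "public" / "internal" / "restricted"
    suggested_collection: str | None = None
    requires_review: bool = True


class TriageAdapter(Protocol):
    def triage(self, key: str, text_preview: str) -> TriageResult:
        """Produce triage metadata for one object."""
        ...


class NoopTriageAdapter:
    """Deterministic heuristic adapter so the stack runs without an external model."""

    def triage(self, key: str, text_preview: str) -> TriageResult:
        name = key.rsplit("/", 1)[-1]
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        return TriageResult(
            title=name,
            summary=text_preview[:200],
            tags=[ext] if ext else [],
            requires_review=True,
        )
```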
Repository layout:

```
app/
  adapters/
    llm.py            # LLM adapter interface + noop heuristic adapter
    snapshot.py       # external snapshot backend hook
    storage.py        # S3/MinIO adapter
  services/
    audit.py
    indexing.py
    ingest.py
    snapshots.py
    tasks.py
    triage.py
  api.py
  cli.py
  config.py
  db.py
  main.py
  models.py
  schemas.py
  schemas_s3.py       # typed S3 event payload validation
  security.py         # API-key and S3 event authorization helpers
  utils.py
docs/
  CONSIDERATIONS.md
  comparable_projects.md
  requirements_interview.md
  testing.md
packages/
  security-core/      # secret scanning and review-scope enforcement helpers
tests/
  conftest.py
  integration/
  unit/
  README_TESTS.md     # full testing guide and fixture documentation
```
Quickstart:

- Copy the env file.

  ```bash
  cp .env.example .env
  ```

- Start the dev stack.

  ```bash
  docker compose up --build
  ```

- Open:
  - API: http://localhost:8080/docs
  - MinIO API: http://localhost:9000
  - MinIO console: http://localhost:9001
- Upload into the ingestion prefix:
  - through the MinIO UI / an S3 client, or
  - by requesting a presigned URL from the API.

```bash
curl -s -X POST http://localhost:8080/uploads/presign \
  -H 'content-type: application/json' \
  -d '{"filename":"acme-invoice-001.txt","content_type":"text/plain"}'
```

Use the returned `url` to PUT the file bytes. The object key will land under the configured ingestion prefix.
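
For example, a minimal Python sketch of the presign-then-PUT flow (the `url` and `key` response field names are assumptions; check the actual response shape):

```python
# Minimal sketch: request a presigned URL, then PUT the file bytes to it.
# Field names ("url", "key") are assumptions about the presign response.
import requests

resp = requests.post(
    "http://localhost:8080/uploads/presign",
    json={"filename": "acme-invoice-001.txt", "content_type": "text/plain"},
    timeout=10,
)
resp.raise_for_status()
presigned = resp.json()

with open("acme-invoice-001.txt", "rb") as f:
    put = requests.put(
        presigned["url"],
        data=f,
        headers={"content-type": "text/plain"},
        timeout=30,
    )
put.raise_for_status()
print("uploaded to", presigned.get("key"))
```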
Trigger an on-demand scan:

```bash
curl -s -X POST http://localhost:8080/jobs/scan
```

If you configure MinIO/S3 bucket notifications to POST events to the API, you can ingest specific objects without scanning whole prefixes:
```bash
curl -s -X POST http://localhost:8080/events/s3 \
  -H 'content-type: application/json' \
  -d '{"Records":[{"s3":{"bucket":{"name":"llm-ingest"},"object":{"key":"ingestion/acme-invoice-001.txt"}}}]}'
```
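
The payload follows the standard S3 notification shape, and `app/schemas_s3.py` validates it. A hedged sketch of how that validation could be modeled with Pydantic (model names are illustrative):

```python
# Illustrative Pydantic models for the subset of the S3 event payload used above.
from pydantic import BaseModel


class _Bucket(BaseModel):
    name: str


class _Object(BaseModel):
    key: str


class _S3Entity(BaseModel):
    bucket: _Bucket
    object: _Object


class S3Record(BaseModel):
    s3: _S3Entity


class S3EventPayload(BaseModel):
    Records: list[S3Record]


payload = S3EventPayload.model_validate(
    {"Records": [{"s3": {"bucket": {"name": "llm-ingest"},
                         "object": {"key": "ingestion/acme-invoice-001.txt"}}}]}
)
print(payload.Records[0].s3.object.key)
```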
Search:

```bash
# Latest-state search
curl -s "http://localhost:8080/search?q=invoice"

# Search restricted to a snapshot (logical time-travel filter)
curl -s "http://localhost:8080/search?q=invoice&snapshot_id=<SNAPSHOT_ID>"
```

Create a snapshot:

```bash
curl -s -X POST http://localhost:8080/snapshots \
  -H 'content-type: application/json' \
  -d '{"label":"baseline"}'
```

Diff two snapshots:

```bash
curl -s "http://localhost:8080/snapshots/diff?from=<SNAPSHOT_ID_A>&to=<SNAPSHOT_ID_B>"
```

How it works:

- A file lands in `SOURCE_BUCKET/INGESTION_PREFIX`.
- The scheduler enqueues a `scan_prefix` task, or you call the scan endpoint.
- The worker discovers new or changed objects and creates:
  - a `Document` logical record,
  - a `DocumentVersion` record for the observed object version.
- The worker runs triage:
  - generates title / summary / tags / sensitivity / collection suggestion,
  - decides whether review is required,
  - optionally moves the object into the archive prefix.
- The worker indexes the resulting metadata and text preview.
- Every step emits immutable audit rows.
- Snapshots capture the current logical corpus state and hash it.
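
One way to hash a manifest deterministically is to serialize the corpus entries in a stable order before digesting them; a minimal sketch of the idea (not necessarily the project's exact scheme):

```python
# Sketch: hash a point-in-time manifest of (key, version, etag) entries deterministically.
import hashlib
import json


def manifest_hash(entries: list[dict]) -> str:
    """entries: e.g. [{"key": "...", "version_id": "...", "etag": "..."}, ...]"""
    canonical = json.dumps(
        sorted(entries, key=lambda e: e["key"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(manifest_hash([
    {"key": "archive/acme-invoice-001.txt", "version_id": "v1", "etag": "abc123"},
]))
```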
Notes and limitations:

- This is intentionally a skeleton, not a production-complete DMS.
- The included search is a simple text search over metadata and previews.
- The included LLM adapter is deterministic so the project runs without an external model.
- The lakeFS integration point is a stub. The skeleton already produces logical snapshots; use the adapter hook when you adopt repo-style object-store commits.
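
The hook in `app/adapters/snapshot.py` only needs a narrow surface; a speculative sketch of the kind of interface a lakeFS-backed implementation would fill in (method names are assumptions):

```python
# Hypothetical external snapshot backend hook; a lakeFS adapter would implement this.
from typing import Protocol


class SnapshotBackend(Protocol):
    def commit(self, label: str, manifest_hash: str) -> str:
        """Record a repo-style commit for the snapshot; return an external ref/commit id."""
        ...


class NoopSnapshotBackend:
    """Default stub: logical snapshots only, no external commit."""

    def commit(self, label: str, manifest_hash: str) -> str:
        return f"noop:{manifest_hash[:12]}"
```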
For a production deployment, consider:

- Postgres instead of SQLite.
- Dedicated search / vector backend (OpenSearch, Vespa, Qdrant, pgvector).
- Real OCR / content extraction pipeline.
- Stronger task queue (Celery, Dramatiq, Arq, Temporal, or a managed queue).
- Policy engine for retention / legal hold / auto-approval thresholds.
- External audit sink and SIEM integration.
- lakeFS or equivalent if full namespace time-travel is mandatory.