fix(extractors): store state in brain DB, not JSON files (closes #63) #91

Open
rajkripal wants to merge 1 commit into main from fix/extractor-state-in-db

@rajkripal (Owner)

Problem

Extractor state files lived at <data_dir>/extractor_state/<name>.json, independent of graph.db. Wiping the DB while keeping the state directory made a reingest silently skip everything: the stale state marked all sources as already processed, so the run produced 0 nodes and raised no error.

Solution

Move extractor state into the brain DB as an extractor_state table:

CREATE TABLE extractor_state (
    extractor_name TEXT PRIMARY KEY,
    state_json     TEXT NOT NULL,
    updated_at     TEXT NOT NULL
)

State lifecycle now matches DB lifecycle. Wipe the DB, state is gone, reingest works.
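A minimal sketch of how state could be read from and written to this table, using Python's standard sqlite3 module. The helper names save_state and load_state, the extractor name "notes", and the shape of the state dict are assumptions made for illustration; they are not taken from the PR's code. An in-memory database stands in for the brain DB file.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Schema as given in the PR description. IF NOT EXISTS is added here
# (an assumption) so the statement is safe to run on every start-up.
SCHEMA = """
CREATE TABLE IF NOT EXISTS extractor_state (
    extractor_name TEXT PRIMARY KEY,
    state_json     TEXT NOT NULL,
    updated_at     TEXT NOT NULL
)
"""


def save_state(conn, name, state):
    """Insert or overwrite the single row that holds one extractor's state."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO extractor_state "
            "(extractor_name, state_json, updated_at) VALUES (?, ?, ?)",
            (name, json.dumps(state), datetime.now(timezone.utc).isoformat()),
        )


def load_state(conn, name):
    """Return the stored state dict, or None if the extractor has no row."""
    row = conn.execute(
        "SELECT state_json FROM extractor_state WHERE extractor_name = ?",
        (name,),
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # stand-in for the brain DB file
conn.execute(SCHEMA)
save_state(conn, "notes", {"processed": ["a.md", "b.md"]})
print(load_state(conn, "notes"))  # {'processed': ['a.md', 'b.md']}

# "Wiping the DB" is modelled here as opening a brand-new database.
fresh = sqlite3.connect(":memory:")
fresh.execute(SCHEMA)
print(load_state(fresh, "notes"))  # None: no stale state survives the wipe
```

Because the state row lives in the same database as the ingested data, deleting that database removes both together, which is the property the PR is after.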

Changes

  • ExtractorRegistry.__init__ accepts optional db_path; creates the table on init
  • _save_state / _load_state use DB when db_path is set, JSON file otherwise
  • Backwards-compat shim: on first _load_state miss, checks for legacy JSON file and imports it into the DB
  • cmd_ingest in cashew_cli.py passes db_path to registry
  • Four new tests: DB persistence, DB lifecycle matches DB wipe, JSON fallback (no db_path), legacy shim import
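The load and save paths listed above could look roughly like the following sketch. StateStore is a hypothetical stand-in for the state handling inside ExtractorRegistry; the state_dir parameter, the method names without leading underscores, and the empty-dict default are assumptions, not the PR's actual code.

```python
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class StateStore:
    """Hypothetical stand-in for the state handling in ExtractorRegistry."""

    def __init__(self, state_dir: Path, db_path: Optional[Path] = None):
        self.state_dir = Path(state_dir)
        self.db_path = db_path
        if db_path is not None:
            # Create the table on init, as the PR describes.
            self._execute(
                "CREATE TABLE IF NOT EXISTS extractor_state ("
                "extractor_name TEXT PRIMARY KEY, "
                "state_json TEXT NOT NULL, "
                "updated_at TEXT NOT NULL)"
            )

    def _execute(self, sql, params=()):
        """Run one statement on a short-lived connection and return its rows."""
        conn = sqlite3.connect(str(self.db_path))
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params).fetchall()
        finally:
            conn.close()

    def _legacy_path(self, name: str) -> Path:
        return self.state_dir / f"{name}.json"

    def save_state(self, name: str, state: dict) -> None:
        if self.db_path is None:
            # JSON fallback when no db_path was given.
            self.state_dir.mkdir(parents=True, exist_ok=True)
            self._legacy_path(name).write_text(json.dumps(state))
            return
        self._execute(
            "INSERT OR REPLACE INTO extractor_state "
            "(extractor_name, state_json, updated_at) VALUES (?, ?, ?)",
            (name, json.dumps(state), datetime.now(timezone.utc).isoformat()),
        )

    def load_state(self, name: str) -> dict:
        legacy = self._legacy_path(name)
        if self.db_path is None:
            return json.loads(legacy.read_text()) if legacy.exists() else {}
        rows = self._execute(
            "SELECT state_json FROM extractor_state WHERE extractor_name = ?",
            (name,),
        )
        if rows:
            return json.loads(rows[0][0])
        if legacy.exists():
            # Backwards-compat shim: import the legacy file on a DB miss.
            state = json.loads(legacy.read_text())
            self.save_state(name, state)
            return state
        return {}
```

The PR text does not say whether the legacy JSON file is deleted or renamed after import. This sketch leaves it in place, so a later DB wipe would cause the old file to be imported again; a real implementation may want to handle that case.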

Test results

38 passed in tests/test_extractors.py
285 passed overall (1 pre-existing failure in test_prepare_ingest.py unrelated to this PR)

Move extractor_state from <data_dir>/extractor_state/<name>.json into an
extractor_state table in brain.db so that state lifecycle matches DB
lifecycle. Wiping the DB now also resets extractor progress, eliminating
silent skip-all-reprocessing on reingest.

- ExtractorRegistry.__init__ accepts optional db_path; creates
  extractor_state table (extractor_name TEXT PK, state_json TEXT, updated_at TEXT)
- _save_state / _load_state use DB when db_path is set, JSON otherwise
- Backwards-compat shim: on first load, if DB row is missing, import from
  legacy JSON file and migrate into DB
- CLI (cmd_ingest) passes db_path to registry
- Four new tests: DB persistence, DB lifecycle matches DB wipe, JSON
  fallback, legacy shim import

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>