Status (2026-06-11): RFC endorsed with the rollout order inverted — the derived index ships first, the source-of-truth flip becomes an explicit per-user opt-in. This text reflects the agreed order. Phase 1 is implemented in #919.
Roadmap
Problem
data/applications.md is a markdown table acting as a database. At small scale it works; at real scale it degrades in ways that are structural, not user error. Evidence from my own tracker (934 entries, accumulated over ~2 months of heavy use):
node verify-pipeline.mjs reports 422 errors, all of them artifacts of many-writers-plus-no-schema:
- Encoding corruption — dozens of rows where an em-dash became
鈥? (UTF-8 read as GBK at some write point); once it is in, every later read propagates it
- Column drift — 80+ rows with a score (
3.5/5) sitting in the status column; markdown tables silently accept any cell content
- Delimiter fragility — any
| inside a company name or note shifts every column after it
- No referential integrity — report links point at files that do not exist
None of these are possible in a real store: columns have types, statuses can be constrained to templates/states.yml, links can be validated on write.
The deeper problem: data correctness currently depends on model intelligence
career-ops is CLI-agnostic by design — Claude Code, Codex, Gemini, OpenCode, Qwen, Kimi. But the tracker is queried by agents doing regex/grep over a markdown table, and how reliably that works depends on which model is doing the grepping. Stronger models mostly get it right; weaker models mis-parse rows, miss matches on corrupted cells, then burn tokens retrying or — worse — silently act on wrong data (e.g. re-applying to a company because a grep missed the existing row).
A data layer should not be an intelligence test. SELECT company, role, status FROM applications WHERE status = 'Applied' returns the same rows for every model on every CLI. That makes storage the natural place to equalize the multi-CLI experience — and it also stops wasting tokens: one query instead of reading a 900-row table into context.
Proposal: SQLite index first, source-of-truth flip as a later opt-in
Markdown does not go away — and in phase 1 it doesn't even change owners. The rollout is two phases, derived-index first (order inverted from the original draft per the review):
Phase 1 — DB as derived index (#919)
data/applications.md ← source of truth (unchanged, all writes go here)
│
└── sync: node tracker.mjs sync → data/applications.db (rebuilt, safe to delete)
│
├── query: node tracker.mjs query --status Applied (agents read this)
├── history: status transitions accumulated across syncs
└── export: node tracker.mjs export (inverse of sync — round-trip tested)
- Zero-keys, local-first compliant — SQLite is a single local file, no server, no account, no token
- Zero new dependencies — Node 22.5+ ships
node:sqlite built in
- Corruption detected at sync time instead of propagating —
sync diagnoses every class above (mojibake, column drift, non-canonical statuses, duplicate ids) and normalizes them in the index only; the markdown is never modified
- No stale reads —
query/history auto-resync when the markdown changed since the last sync
- Status history for free — a
status_events table records Applied → Interview → Rejected transitions across syncs, so analyze-patterns.mjs gets a real funnel instead of only the current snapshot
- User Layer —
data/applications.db is gitignored with a DATA_CONTRACT.md entry; it's user data, and it's disposable
Phase 2 — opt-in flip (later, separate issue + PR)
For users who explicitly migrate, the DB becomes source of truth and applications.md becomes a rendered read-only view in the exact current table format — so modes, dashboard, merge-tracker semantics, human readability, and git diffs all survive. Reversible (export back to markdown), documented, never the default. Scope and conditions tracked in the roadmap above; implementation starts once phase 1 has earned it.
Validation dataset: my 964-row real-world tracker — exactly the scale where the current format breaks. One phase-1 sync against it surfaces every corruption class as diagnostics while leaving the file untouched.
Related: #230 (the scanner side already writes through shared functions, so pointing writers at a DB is one seam, not many).
Roadmap
applications.mdstays the source of truth; the SQLite DB is rebuilt from it; tools read through the index. Users notice nothing except fewer bugs.applications.mdbecomes the rendered view. Gated on:migrateships with its inverse (export back to markdown, documented rollback)tracker.mjs query; teachmerge-tracker.mjsto write through the DBProblem
data/applications.mdis a markdown table acting as a database. At small scale it works; at real scale it degrades in ways that are structural, not user error. Evidence from my own tracker (934 entries, accumulated over ~2 months of heavy use):node verify-pipeline.mjsreports 422 errors, all of them artifacts of many-writers-plus-no-schema:鈥?(UTF-8 read as GBK at some write point); once it is in, every later read propagates it3.5/5) sitting in the status column; markdown tables silently accept any cell content|inside a company name or note shifts every column after itNone of these are possible in a real store: columns have types, statuses can be constrained to
templates/states.yml, links can be validated on write.The deeper problem: data correctness currently depends on model intelligence
career-ops is CLI-agnostic by design — Claude Code, Codex, Gemini, OpenCode, Qwen, Kimi. But the tracker is queried by agents doing regex/grep over a markdown table, and how reliably that works depends on which model is doing the grepping. Stronger models mostly get it right; weaker models mis-parse rows, miss matches on corrupted cells, then burn tokens retrying or — worse — silently act on wrong data (e.g. re-applying to a company because a grep missed the existing row).
A data layer should not be an intelligence test.
SELECT company, role, status FROM applications WHERE status = 'Applied'returns the same rows for every model on every CLI. That makes storage the natural place to equalize the multi-CLI experience — and it also stops wasting tokens: one query instead of reading a 900-row table into context.Proposal: SQLite index first, source-of-truth flip as a later opt-in
Markdown does not go away — and in phase 1 it doesn't even change owners. The rollout is two phases, derived-index first (order inverted from the original draft per the review):
Phase 1 — DB as derived index (#919)
node:sqlitebuilt insyncdiagnoses every class above (mojibake, column drift, non-canonical statuses, duplicate ids) and normalizes them in the index only; the markdown is never modifiedquery/historyauto-resync when the markdown changed since the last syncstatus_eventstable records Applied → Interview → Rejected transitions across syncs, soanalyze-patterns.mjsgets a real funnel instead of only the current snapshotdata/applications.dbis gitignored with a DATA_CONTRACT.md entry; it's user data, and it's disposablePhase 2 — opt-in flip (later, separate issue + PR)
For users who explicitly migrate, the DB becomes source of truth and
applications.mdbecomes a rendered read-only view in the exact current table format — so modes, dashboard, merge-tracker semantics, human readability, and git diffs all survive. Reversible (export back to markdown), documented, never the default. Scope and conditions tracked in the roadmap above; implementation starts once phase 1 has earned it.Validation dataset: my 964-row real-world tracker — exactly the scale where the current format breaks. One phase-1
syncagainst it surfaces every corruption class as diagnostics while leaving the file untouched.Related: #230 (the scanner side already writes through shared functions, so pointing writers at a DB is one seam, not many).