RFC: SQLite storage layer for the tracker — applications.md becomes a rendered view

> **Status (2026-06-11):** RFC [endorsed](https://github.com/santifer/career-ops/issues/918#issuecomment-4678262108) with the rollout order inverted — the derived index ships first, the source-of-truth flip becomes an explicit per-user opt-in. This text reflects the agreed order. Phase 1 is implemented in #919.

## Roadmap

- [x] **Phase 1 — DB as derived index** → #919 (in review). `applications.md` stays the source of truth; the SQLite DB is rebuilt from it; tools read through the index. Users notice nothing except fewer bugs.
- [ ] **Phase 2 — opt-in flip.** A user runs an explicit migrate command; for them the DB becomes source of truth and `applications.md` becomes the rendered view. Gated on:
  - [ ] Phase 1 merged, released, and lived with in the wild
  - [ ] Round-trip tests (md → db → md lossless) holding in CI — shipped early in #919
  - [ ] `migrate` ships with its inverse (export back to markdown, documented rollback)
  - [ ] Never the default — opt-in only, reversible, documented
  - [ ] Point modes/docs at `tracker.mjs query`; teach `merge-tracker.mjs` to write through the DB

## Problem

`data/applications.md` is a markdown table acting as a database. At small scale it works; at real scale it degrades in ways that are structural, not user error. Evidence from my own tracker (934 entries, accumulated over ~2 months of heavy use):

`node verify-pipeline.mjs` reports **422 errors**, all of them artifacts of many-writers-plus-no-schema:

- **Encoding corruption** — dozens of rows where an em-dash became `鈥?` (UTF-8 read as GBK at some write point); once it is in, every later read propagates it
- **Column drift** — 80+ rows with a score (`3.5/5`) sitting in the *status* column; markdown tables silently accept any cell content
- **Delimiter fragility** — any `|` inside a company name or note shifts every column after it
- **No referential integrity** — report links point at files that do not exist

None of these are *possible* in a real store: columns have types, statuses can be constrained to `templates/states.yml`, links can be validated on write.

## The deeper problem: data correctness currently depends on model intelligence

career-ops is CLI-agnostic by design — Claude Code, Codex, Gemini, OpenCode, Qwen, Kimi. But the tracker is queried by agents doing regex/grep over a markdown table, and **how reliably that works depends on which model is doing the grepping**. Stronger models mostly get it right; weaker models mis-parse rows, miss matches on corrupted cells, then burn tokens retrying or — worse — silently act on wrong data (e.g. re-applying to a company because a grep missed the existing row).

A data layer should not be an intelligence test. `SELECT company, role, status FROM applications WHERE status = 'Applied'` returns the same rows for every model on every CLI. That makes storage the natural place to equalize the multi-CLI experience — and it also stops wasting tokens: one query instead of reading a 900-row table into context.

## Proposal: SQLite index first, source-of-truth flip as a later opt-in

**Markdown does not go away — and in phase 1 it doesn't even change owners.** The rollout is two phases, derived-index first (order inverted from the original draft per the review):

### Phase 1 — DB as derived index (#919)

```
data/applications.md          ← source of truth (unchanged, all writes go here)
        │
        └── sync:   node tracker.mjs sync      → data/applications.db (rebuilt, safe to delete)
                    │
                    ├── query:  node tracker.mjs query --status Applied   (agents read this)
                    ├── history: status transitions accumulated across syncs
                    └── export: node tracker.mjs export   (inverse of sync — round-trip tested)
```

- **Zero-keys, local-first compliant** — SQLite is a single local file, no server, no account, no token
- **Zero new dependencies** — Node 22.5+ ships `node:sqlite` built in
- **Corruption detected at sync time instead of propagating** — `sync` diagnoses every class above (mojibake, column drift, non-canonical statuses, duplicate ids) and normalizes them *in the index only*; the markdown is never modified
- **No stale reads** — `query`/`history` auto-resync when the markdown changed since the last sync
- **Status history for free** — a `status_events` table records Applied → Interview → Rejected transitions across syncs, so `analyze-patterns.mjs` gets a real funnel instead of only the current snapshot
- **User Layer** — `data/applications.db` is gitignored with a DATA_CONTRACT.md entry; it's user data, and it's disposable

### Phase 2 — opt-in flip (later, separate issue + PR)

For users who explicitly migrate, the DB becomes source of truth and `applications.md` becomes a rendered read-only view in the exact current table format — so modes, dashboard, merge-tracker semantics, human readability, and git diffs all survive. Reversible (export back to markdown), documented, never the default. Scope and conditions tracked in the roadmap above; implementation starts once phase 1 has earned it.

Validation dataset: my 964-row real-world tracker — exactly the scale where the current format breaks. One phase-1 `sync` against it surfaces every corruption class as diagnostics while leaving the file untouched.

Related: #230 (the scanner side already writes through shared functions, so pointing writers at a DB is one seam, not many).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RFC: SQLite storage layer for the tracker — applications.md becomes a rendered view #918

Roadmap

Problem

The deeper problem: data correctness currently depends on model intelligence

Proposal: SQLite index first, source-of-truth flip as a later opt-in

Phase 1 — DB as derived index (#919)

Phase 2 — opt-in flip (later, separate issue + PR)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

RFC: SQLite storage layer for the tracker — applications.md becomes a rendered view #918

Description

Roadmap

Problem

The deeper problem: data correctness currently depends on model intelligence

Proposal: SQLite index first, source-of-truth flip as a later opt-in

Phase 1 — DB as derived index (#919)

Phase 2 — opt-in flip (later, separate issue + PR)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions