Skip to content

RFC: SQLite storage layer for the tracker — applications.md becomes a rendered view #918

@Schlaflied

Description

@Schlaflied

Status (2026-06-11): RFC endorsed with the rollout order inverted — the derived index ships first, the source-of-truth flip becomes an explicit per-user opt-in. This text reflects the agreed order. Phase 1 is implemented in #919.

Roadmap

  • Phase 1 — DB as derived indexfeat(tracker): SQLite derived index over applications.md — phase 1 of RFC #918 #919 (in review). applications.md stays the source of truth; the SQLite DB is rebuilt from it; tools read through the index. Users notice nothing except fewer bugs.
  • Phase 2 — opt-in flip. A user runs an explicit migrate command; for them the DB becomes source of truth and applications.md becomes the rendered view. Gated on:
    • Phase 1 merged, released, and lived with in the wild
    • Round-trip tests (md → db → md lossless) holding in CI — shipped early in feat(tracker): SQLite derived index over applications.md — phase 1 of RFC #918 #919
    • migrate ships with its inverse (export back to markdown, documented rollback)
    • Never the default — opt-in only, reversible, documented
    • Point modes/docs at tracker.mjs query; teach merge-tracker.mjs to write through the DB

Problem

data/applications.md is a markdown table acting as a database. At small scale it works; at real scale it degrades in ways that are structural, not user error. Evidence from my own tracker (934 entries, accumulated over ~2 months of heavy use):

node verify-pipeline.mjs reports 422 errors, all of them artifacts of many-writers-plus-no-schema:

  • Encoding corruption — dozens of rows where an em-dash became 鈥? (UTF-8 read as GBK at some write point); once it is in, every later read propagates it
  • Column drift — 80+ rows with a score (3.5/5) sitting in the status column; markdown tables silently accept any cell content
  • Delimiter fragility — any | inside a company name or note shifts every column after it
  • No referential integrity — report links point at files that do not exist

None of these are possible in a real store: columns have types, statuses can be constrained to templates/states.yml, links can be validated on write.

The deeper problem: data correctness currently depends on model intelligence

career-ops is CLI-agnostic by design — Claude Code, Codex, Gemini, OpenCode, Qwen, Kimi. But the tracker is queried by agents doing regex/grep over a markdown table, and how reliably that works depends on which model is doing the grepping. Stronger models mostly get it right; weaker models mis-parse rows, miss matches on corrupted cells, then burn tokens retrying or — worse — silently act on wrong data (e.g. re-applying to a company because a grep missed the existing row).

A data layer should not be an intelligence test. SELECT company, role, status FROM applications WHERE status = 'Applied' returns the same rows for every model on every CLI. That makes storage the natural place to equalize the multi-CLI experience — and it also stops wasting tokens: one query instead of reading a 900-row table into context.

Proposal: SQLite index first, source-of-truth flip as a later opt-in

Markdown does not go away — and in phase 1 it doesn't even change owners. The rollout is two phases, derived-index first (order inverted from the original draft per the review):

Phase 1 — DB as derived index (#919)

data/applications.md          ← source of truth (unchanged, all writes go here)
        │
        └── sync:   node tracker.mjs sync      → data/applications.db (rebuilt, safe to delete)
                    │
                    ├── query:  node tracker.mjs query --status Applied   (agents read this)
                    ├── history: status transitions accumulated across syncs
                    └── export: node tracker.mjs export   (inverse of sync — round-trip tested)
  • Zero-keys, local-first compliant — SQLite is a single local file, no server, no account, no token
  • Zero new dependencies — Node 22.5+ ships node:sqlite built in
  • Corruption detected at sync time instead of propagatingsync diagnoses every class above (mojibake, column drift, non-canonical statuses, duplicate ids) and normalizes them in the index only; the markdown is never modified
  • No stale readsquery/history auto-resync when the markdown changed since the last sync
  • Status history for free — a status_events table records Applied → Interview → Rejected transitions across syncs, so analyze-patterns.mjs gets a real funnel instead of only the current snapshot
  • User Layerdata/applications.db is gitignored with a DATA_CONTRACT.md entry; it's user data, and it's disposable

Phase 2 — opt-in flip (later, separate issue + PR)

For users who explicitly migrate, the DB becomes source of truth and applications.md becomes a rendered read-only view in the exact current table format — so modes, dashboard, merge-tracker semantics, human readability, and git diffs all survive. Reversible (export back to markdown), documented, never the default. Scope and conditions tracked in the roadmap above; implementation starts once phase 1 has earned it.

Validation dataset: my 964-row real-world tracker — exactly the scale where the current format breaks. One phase-1 sync against it surfaces every corruption class as diagnostics while leaving the file untouched.

Related: #230 (the scanner side already writes through shared functions, so pointing writers at a DB is one seam, not many).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions