- Set up local overnight tech-debt agent (adapted from morty-v2 pattern)
  - Rewrote `docs/OVERNIGHT_AGENT.md` for Claude Desktop local task (was remote trigger)
  - Created `prompts/overnight-agent.md`: thin wrapper with JSON log spec
  - Created `state/` directory (gitignored) for run logs
  - Updated `.gitignore` with `state/` entry
  - Updated `CLAUDE.md` with Overnight Agent section
- Previously removed remote CCR trigger (freed slot for higher-priority projects)
- Desktop task not yet created — config reported for manual setup
- None
- Create the Desktop task manually (config below)
- Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify unattended modes
- Run Phase 1: Gemini CLI + Gemini 3 Pro
- Use the template in `docs/OVERNIGHT_AGENT.md` when filing new tech-debt issues
- Migrated KNOWN-ISSUES.md to GitHub Issues (3 issues filed with automation-ready templates)
  - #1: Sonnet CSV category "4" anomaly → `tech-debt`
  - #2: Tectonic bibtex warnings → `deferred`
  - #3: Author contact info → `needs-design-decision`
- Created 4 GitHub labels for overnight agent triage: `tech-debt`, `needs-design-decision`, `deferred`, `in-progress`
- Wrote `docs/OVERNIGHT_AGENT.md`: overnight agent prompt adapted from CatRunner template
- Created scheduled overnight trigger (2am Pacific daily), then removed it per priority constraints (max 3 triggers)
- Deleted KNOWN-ISSUES.md (single source of truth is now GitHub Issues)
- V2 experiment execution not yet started — Phase 0 (infrastructure setup) still next
- Remote trigger API supports create but not delete — deletion requires web UI at claude.ai/code/scheduled
- Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify all tools' unattended mode
- Run Phase 1: Gemini CLI + Gemini 3 Pro (1 overnight run)
- If filing new tech-debt issues, use the template in `docs/OVERNIGHT_AGENT.md`
- Overnight agent can be re-enabled when a trigger slot opens up
- Loaded all 5 experiments.csv files (819 experiments total across Opus 200K, Opus 1M, Sonnet 1M, Gemini Flash, GPT-5.4)
- Built unified dataset with CSV parsing fixes for embedded commas (Sonnet, Gemini)
- Generated 8 publication-quality charts (PDF + PNG) in analysis/
- Ran all 7 hypothesis tests with quantitative evidence
- Qualitative analysis: signature moves, failure recovery, stuck loops, creative interventions
- Wrote comprehensive FINDINGS.md with updated rankings (Opus 1M is #1 at 0.5757)
- Drafted full LaTeX paper (~3,800 words, 10 pages) with arxiv preprint style
- Compiled paper to PDF using tectonic (installed via Homebrew)
- Installed Python dependencies (pandas, matplotlib, seaborn, numpy), tectonic, poppler
- None — both phases complete
- CSVs from Sonnet and Gemini had embedded commas in `change_description` fields (e.g., `ADAM_BETAS (0.8,0.95)->(0.9,0.95)`). Fixed with a custom parser that splits from both ends.
- No LaTeX distribution installed; the `basictex` cask required sudo. Used `tectonic` (Rust-based) as an alternative, which worked perfectly.
- `pdftoppm` installed via Homebrew but the Read tool didn't detect it; used manual conversion as a workaround.
- Tectonic shows "internal consistency problem when checking if main.bbl changed" warnings on repeated bibtex runs. These are cosmetic only; PDF output is correct.
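The "splits from both ends" fix for embedded commas can be sketched roughly as follows. This is a minimal illustration, not the parser actually used: the function name, the column count, and the position of the free-text column are all assumptions, and it only works when exactly one column per row can contain unquoted commas.

```python
def split_from_both_ends(line: str, n_cols: int, text_col: int) -> list[str]:
    """Split a CSV line whose only comma-bearing field sits at index text_col.

    Hypothetical helper: n_cols and text_col are assumptions about the file
    layout, not values taken from the real experiments.csv files.
    """
    n_right = n_cols - text_col - 1  # fixed fields to the right of the text field

    # Peel the fixed fields off the left; the remainder stays in one piece.
    left = line.split(",", text_col)
    if len(left) != text_col + 1:
        raise ValueError(f"too few fields on the left: {line!r}")
    head, rest = left[:text_col], left[text_col]

    if n_right == 0:
        return head + [rest]

    # Peel the fixed fields off the right; whatever is left is the text field.
    right = rest.rsplit(",", n_right)
    if len(right) != n_right + 1:
        raise ValueError(f"too few fields on the right: {line!r}")
    return head + right


# Illustrative row with made-up values and a made-up 5-column layout.
row = "12,optimizer,ADAM_BETAS (0.8,0.95)->(0.9,0.95),0.61,kept"
fields = split_from_both_ends(row, n_cols=5, text_col=2)
```

Here `fields[2]` keeps the whole description, commas included, because the fixed columns are consumed from each side first.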
- Review the paper draft for accuracy and tone — particularly the Discussion section
- Consider adding a Figure showing the Gemini micro-optimization collapse in detail
- Add author contact info to the paper
- Run `arxiv-latex-cleaner` before arXiv submission
- Consider posting to HuggingFace Paper Pages after arXiv
- The Sonnet CSV has a category value "4" that should be investigated — may be a data artifact
- Consider adding a table of all agents' top 5 kept changes for the appendix
- Bootstrapped project docs: CLAUDE.md, KNOWN-ISSUES.md
- Brainstormed V2 experiment direction: expanding from one-shot paper to controlled factorial benchmark
- Decided paper framing: "A controlled benchmark for agentic autonomy" with tool-vs-model and strategy findings as supporting evidence
- Researched all available CLI tools: Claude Code, Codex CLI, Gemini CLI, Cursor Agent, OpenCode, Goose, Aider
- Built full Tool × Model compatibility matrix with web research
- Identified autonomous/unattended flags for all 7 tools
- Identified best local model: Qwen3-Coder 30B (Q4_K_M, fits in 48GB M4 Pro)
- Wrote comprehensive EXPERIMENT-DESIGN-V2.md with:
  - Full factorial matrix: 5 models × 7 tools, 26 total runs (21 new + 5 V1)
  - Principled stopping criterion (150 experiments or 30 consecutive no-improvement)
  - Phased execution plan (4 phases, ~3 weeks)
  - Statistical analysis plan (two-way ANOVA, home turf tests, strategy consistency)
  - Budget estimate ($204-440 API costs)
  - Paper V2 outline
  - Dataset publication plan (HuggingFace)
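The stopping criterion listed above (150 experiments or 30 consecutive no-improvement) could be checked with something like the sketch below. The function name and the `lower_is_better` flag are assumptions, as is the reading of "no-improvement" as "did not beat the best score seen so far"; EXPERIMENT-DESIGN-V2.md remains the authority on the exact rule.

```python
def should_stop(
    scores: list[float],
    max_experiments: int = 150,
    patience: int = 30,
    lower_is_better: bool = True,  # assumption: the metric direction is not stated here
) -> bool:
    """Return True once the run hits the experiment cap or stalls for `patience` runs."""
    if len(scores) >= max_experiments:
        return True

    best = None
    stale = 0  # consecutive experiments that failed to beat the best so far
    for score in scores:
        if best is None:
            improved = True
        elif lower_is_better:
            improved = score < best
        else:
            improved = score > best

        if improved:
            best, stale = score, 0
        else:
            stale += 1

    return stale >= patience
```

Calling `should_stop(scores)` after each experiment keeps the rule identical across every tool and model pairing, which is the point of fixing it in advance.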
- V2 experiment execution not yet started — Phase 0 (infrastructure setup) is next
- V1 paper is NOT being published yet — V2 data will make it significantly stronger
- GitHub remains the living document; arXiv/Twitter/etc. wait for V2 results
- Gemini 3 Pro replaces Flash for fair model-tier comparison (Flash data kept as supplementary)
- Universal tools (Cursor, OpenCode, Goose, Aider) use API keys directly — no proxy hacks
- Local model included as "entry hacker" angle — even last place is a data point
- Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify all tools' unattended mode
- Run Phase 1: Gemini CLI + Gemini 3 Pro (1 overnight run)
- Pick first universal tool (OpenCode or Cursor) and run Phase 2
- V1 paper items (accuracy review, author info, Sonnet category "4" bug) can be addressed in parallel