Progress — LLM Agents as Autonomous ML Researchers


Session: 2026-03-30 09:10

Completed

  • Set up local overnight tech-debt agent (adapted from morty-v2 pattern)
    • Rewrote docs/OVERNIGHT_AGENT.md for Claude Desktop local task (was remote trigger)
    • Created prompts/overnight-agent.md — thin wrapper with JSON log spec
    • Created state/ directory (gitignored) for run logs
    • Updated .gitignore with state/ entry
    • Updated CLAUDE.md with Overnight Agent section
  • Previously removed remote CCR trigger (freed slot for higher-priority projects)

In Progress

  • Desktop task not yet created — config reported for manual setup

Issues Encountered

  • None

Next Session Should

  1. Create the Desktop task manually (using the config reported this session)
  2. Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify unattended modes
  3. Run Phase 1: Gemini CLI + Gemini 3 Pro
  4. Use template in docs/OVERNIGHT_AGENT.md when filing new tech-debt issues

Session: 2026-03-30 08:15

Completed

  • Migrated KNOWN-ISSUES.md to GitHub Issues (3 issues filed with automation-ready templates)
    • #1: Sonnet CSV category "4" anomaly → tech-debt
    • #2: Tectonic bibtex warnings → deferred
    • #3: Author contact info → needs-design-decision
  • Created 4 GitHub labels for overnight agent triage: tech-debt, needs-design-decision, deferred, in-progress
  • Wrote docs/OVERNIGHT_AGENT.md — overnight agent prompt adapted from CatRunner template
  • Created scheduled overnight trigger (2am Pacific daily) — then removed per priority constraints (max 3 triggers)
  • Deleted KNOWN-ISSUES.md (single source of truth is now GitHub Issues)

In Progress

  • V2 experiment execution not yet started — Phase 0 (infrastructure setup) still next

Issues Encountered

  • Remote trigger API supports create but not delete — deletion requires web UI at claude.ai/code/scheduled

Next Session Should

  1. Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify all tools' unattended mode
  2. Run Phase 1: Gemini CLI + Gemini 3 Pro (1 overnight run)
  3. If filing new tech-debt issues, use the template in docs/OVERNIGHT_AGENT.md
  4. Overnight agent can be re-enabled when a trigger slot opens up

Session: 2026-03-20 08:30

Completed

  • Loaded all 5 experiments.csv files (819 experiments total across Opus 200K, Opus 1M, Sonnet 1M, Gemini Flash, GPT-5.4)
  • Built unified dataset with CSV parsing fixes for embedded commas (Sonnet, Gemini)
  • Generated 8 publication-quality charts (PDF + PNG) in analysis/
  • Ran all 7 hypothesis tests with quantitative evidence
  • Qualitative analysis: signature moves, failure recovery, stuck loops, creative interventions
  • Wrote comprehensive FINDINGS.md with updated rankings (Opus 1M is #1 at 0.5757)
  • Drafted full LaTeX paper (~3,800 words, 10 pages) with arxiv preprint style
  • Compiled paper to PDF using tectonic (installed via Homebrew)
  • Installed Python dependencies (pandas, matplotlib, seaborn, numpy), tectonic, poppler

In Progress

  • None — both phases complete

Issues Encountered

  • CSVs from Sonnet and Gemini had embedded commas in change_description fields (e.g., ADAM_BETAS (0.8,0.95)->(0.9,0.95)). Fixed with custom parser that splits from both ends.
  • No LaTeX distribution installed; basictex cask required sudo. Used tectonic (Rust-based) as alternative — worked perfectly.
  • pdftoppm installed via Homebrew but Read tool didn't detect it; used manual conversion as workaround.
  • Tectonic shows "internal consistency problem when checking if main.bbl changed" warnings on repeated bibtex runs — cosmetic only, PDF output is correct.
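
The "splits from both ends" fix for embedded commas can be sketched roughly as follows. This is a minimal illustration, not the parser actually used in the analysis: the function name `split_row`, the column counts, and the sample row are all hypothetical, and it assumes exactly one free-text field (here `change_description`) sits between a fixed number of leading and trailing columns.

```python
def split_row(line, n_lead, n_trail):
    """Split a CSV line whose single free-text field may contain unquoted commas.

    n_lead  -- number of well-behaved columns before the free-text field (assumed)
    n_trail -- number of well-behaved columns after the free-text field (assumed)
    """
    parts = line.rstrip("\n").split(",")
    if len(parts) < n_lead + n_trail + 1:
        raise ValueError(f"too few fields in row: {line!r}")
    cut = len(parts) - n_trail
    lead = parts[:n_lead]    # taken from the left end
    trail = parts[cut:]      # taken from the right end
    # Whatever is left in the middle is the free-text field; re-join its commas.
    middle = ",".join(parts[n_lead:cut])
    return lead + [middle] + trail


# Hypothetical row layout: id, category, change_description, score, status
row = "12,optimizer,ADAM_BETAS (0.8,0.95)->(0.9,0.95),0.5,keep"
fields = split_row(row, n_lead=2, n_trail=2)
```

The approach only works when the free-text field is the sole column that can contain commas; if a second column can, quoting the fields at write time is the more robust fix.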

Next Session Should

  1. Review the paper draft for accuracy and tone — particularly the Discussion section
  2. Consider adding a figure showing the Gemini micro-optimization collapse in detail
  3. Add author contact info to the paper
  4. Run arxiv-latex-cleaner before arXiv submission
  5. Consider posting to HuggingFace Paper Pages after arXiv
  6. The Sonnet CSV has a category value "4" that should be investigated — may be a data artifact
  7. Consider adding a table of all agents' top 5 kept changes for the appendix

Session: 2026-03-22 ~15:00

Completed

  • Bootstrapped project docs: CLAUDE.md, KNOWN-ISSUES.md
  • Brainstormed V2 experiment direction: expanding from one-shot paper to controlled factorial benchmark
  • Decided paper framing: "A controlled benchmark for agentic autonomy" with tool-vs-model and strategy findings as supporting evidence
  • Researched all available CLI tools: Claude Code, Codex CLI, Gemini CLI, Cursor Agent, OpenCode, Goose, Aider
  • Built full Tool × Model compatibility matrix with web research
  • Identified autonomous/unattended flags for all 7 tools
  • Identified best local model: Qwen3-Coder 30B (Q4_K_M, fits in 48GB M4 Pro)
  • Wrote comprehensive EXPERIMENT-DESIGN-V2.md with:
    • Full factorial matrix: 5 models × 7 tools, 26 total runs (21 new + 5 V1)
    • Principled stopping criterion (150 experiments or 30 consecutive no-improvement)
    • Phased execution plan (4 phases, ~3 weeks)
    • Statistical analysis plan (two-way ANOVA, home turf tests, strategy consistency)
    • Budget estimate ($204-440 API costs)
    • Paper V2 outline
    • Dataset publication plan (HuggingFace)
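
The stopping criterion listed above (150 experiments or 30 consecutive no-improvement) could be checked along these lines. This is a sketch under stated assumptions: the function name `should_stop` is hypothetical, "no improvement" is taken to mean "did not strictly beat the best score so far", and the metric direction is left as a parameter because these notes do not state whether higher or lower is better.

```python
def should_stop(scores, max_experiments=150, patience=30, higher_is_better=True):
    """Return True once a run has hit either limit of the stopping rule.

    scores -- per-experiment metric values in the order they were run
    """
    if len(scores) >= max_experiments:
        return True

    best = None
    stale = 0  # consecutive experiments since the last strict improvement
    for score in scores:
        if best is None:
            improved = True
        elif higher_is_better:
            improved = score > best
        else:
            improved = score < best

        if improved:
            best = score
            stale = 0
        else:
            stale += 1

    return stale >= patience


# One improvement followed by 30 flat experiments triggers the patience limit.
history = [1.0] + [0.5] * 30
stop_now = should_stop(history)
```

Counting staleness against the running best, rather than against the previous experiment, keeps a run of small regressions and recoveries from resetting the counter.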

In Progress

  • V2 experiment execution not yet started — Phase 0 (infrastructure setup) is next

Decisions Made

  • V1 paper is NOT being published yet — V2 data will make it significantly stronger
  • GitHub remains the living document; arXiv/Twitter/etc. wait for V2 results
  • Gemini 3 Pro replaces Flash for fair model-tier comparison (Flash data kept as supplementary)
  • Universal tools (Cursor, OpenCode, Goose, Aider) use API keys directly — no proxy hacks
  • Local model included as "entry hacker" angle — even last place is a data point

Next Session Should

  1. Execute Phase 0: install Goose + Aider, download Qwen3-Coder 30B, verify all tools' unattended mode
  2. Run Phase 1: Gemini CLI + Gemini 3 Pro (1 overnight run)
  3. Pick first universal tool (OpenCode or Cursor) and run Phase 2
  4. V1 paper items (accuracy review, author info, Sonnet category "4" bug) can be addressed in parallel