Add experiment-runner agent for ML experiments by andyxhadji · Pull Request #1 · andyxhadji/roborev

andyxhadji · 2026-01-09T03:56:48Z

New agent that runs ML experiments instead of code reviews:

- Detects changed files and maps to appropriate experiment scripts
- Runs experiments via poetry (uses repo's poetry environment)
- Parses MLFlow run_id and experiment_id from output
- Constructs Databricks MLFlow URLs (requires DATABRICKS_HOST env var)
- Parses F1, Precision, Recall metrics from output
- Shows detailed logs on failure (last 100 lines stdout, 50 lines stderr)
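
The failure-log behavior above can be sketched roughly as follows. This is only an illustration, not the agent's actual code: `tailLines` is a hypothetical helper name, and the 100/50 limits are taken from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// tailLines returns at most the last n lines of s.
// Hypothetical helper; the real agent may trim output differently.
func tailLines(s string, n int) string {
	if n <= 0 {
		return ""
	}
	lines := strings.Split(strings.TrimRight(s, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n")
}

func main() {
	stdout := "epoch 1\nepoch 2\nTraceback ...\n"
	stderr := "warning\nerror: experiment failed\n"
	// Limits from the PR description: 100 lines of stdout, 50 of stderr.
	fmt.Println(tailLines(stdout, 100))
	fmt.Println(tailLines(stderr, 50))
}
```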

Experiment detection logic:

- If a .py file in experiments/ changed, run that experiment
- Otherwise run experiments/baseline_experiment.py
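
A minimal sketch of the detection rule above, for illustration only. `selectExperiment` is a hypothetical name, and picking the first matching file when several experiments changed is an assumption; the description does not say how that case is handled.

```go
package main

import (
	"fmt"
	"strings"
)

const baselineExperiment = "experiments/baseline_experiment.py"

// selectExperiment returns the first changed .py file under experiments/,
// falling back to the baseline script. Taking the first match is an
// assumption, not something the PR description specifies.
func selectExperiment(changedFiles []string) string {
	for _, f := range changedFiles {
		if strings.HasPrefix(f, "experiments/") && strings.HasSuffix(f, ".py") {
			return f
		}
	}
	return baselineExperiment
}

func main() {
	fmt.Println(selectExperiment([]string{"README.md", "experiments/cohort_v2.py"}))
	fmt.Println(selectExperiment([]string{"src/model.py"}))
}
```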

<img width="570" height="164" alt="image" src="https://github.com/user-attachments/assets/8a80823b-67db-413e-b689-61ed0305c612" /> <img width="778" height="637" alt="image" src="https://github.com/user-attachments/assets/6c8078ee-ce6d-4e2c-b57b-395636046dfa" />

### Repo Filter Modal (`f` key)

Press `f` in the queue view to open a searchable filter modal:

- Lists all repos with job counts
- Type to search/filter repos
- Arrow keys or `j`/`k` to navigate
- `Enter` to select, `Esc` to cancel
- `Esc` in queue view clears active filter
- Filter indicator shown in title: `[f: reponame]`

When filtered:

- Queue shows only jobs from selected repo
- Status counts reflect filtered view
- Navigation skips non-matching jobs
- API fetches full history for filtered repo (`limit=0`)

## Bug Fixes

- **Dirty build restart logic**: Only restart daemon when versions actually differ, not on every dirty build
- **`getVisibleSelectedIdx` return value**: Return `-1` when no valid selection instead of `0`
- **`/api/jobs` limit parameter**: Validate and clamp to `[0, 10000]` range

## API Changes

- `GET /api/jobs?repo=<path>` - Filter by repo root path (was name)
- `GET /api/repos` - Now returns `root_path` field in addition to `name`
- `GET /api/status` - Returns daemon version in response

## Test Coverage

- Filter modal keyboard navigation
- Filter selection and clearing
- Filtered queue navigation
- Zero visible jobs handling
- API repo filter and limit parameters
- `getVisibleSelectedIdx` edge cases
- `/api/status` version field

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
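
For illustration, the `limit` validation described under Bug Fixes might look something like the sketch below. `parseLimit` and `defaultJobLimit` are hypothetical names, and the default of 50 is an assumption; only the `[0, 10000]` clamp and the meaning of `limit=0` (full history) come from the description.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	maxJobLimit     = 10000 // upper bound from the PR description
	defaultJobLimit = 50    // assumed default, not confirmed by the PR
)

// parseLimit validates the raw "limit" query value and clamps it to
// [0, maxJobLimit]. A result of 0 is left for the caller to interpret
// as "no limit" (full history).
func parseLimit(raw string) (int, error) {
	if raw == "" {
		return defaultJobLimit, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid limit %q: %w", raw, err)
	}
	if n < 0 {
		n = 0
	}
	if n > maxJobLimit {
		n = maxJobLimit
	}
	return n, nil
}

func main() {
	for _, raw := range []string{"", "0", "-3", "25", "20000", "abc"} {
		n, err := parseLimit(raw)
		fmt.Printf("%q -> %d, err=%v\n", raw, n, err)
	}
}
```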

…he review list (roborev-dev#17)

- Add j/k navigation between reviews in the TUI
- Add left/right arrow key navigation between reviews
- Show review number and repo name in the review screen

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Replaces fixed column widths with dynamic sizing based on terminal width.

Changes:

- Added columnWidths struct to track dynamic widths for ref, repo, agent
- Added calculateColumnWidths() to distribute available space proportionally
- Updated renderJobLine() to accept and use dynamic column widths
- Increased separator line max width to 200 chars

This allows the TUI to properly utilize wide terminals while maintaining readability on narrow terminals.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>

I've been using this to build a clinical data extraction workflow with a custom agent, and it's useful to have longer-running agent jobs! Thought I would contribute this piece back.

Add job_timeout_minutes to both global and per-repo config, with ResolveJobTimeout() function following the same priority pattern as ResolveAgent(). Default timeout remains 10 minutes (preserving existing behavior).

Priority order:

1. Per-repo config (.roborev.toml)
2. Global config (~/.roborev/config.toml)
3. Default (10 minutes)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>

Replaces fixed column widths with dynamic sizing based on terminal width.

Changes:

- Added columnWidths struct to track dynamic widths for ref, repo, agent
- Added calculateColumnWidths() to distribute available space proportionally
- Updated renderJobLine() to accept and use dynamic column widths
- Increased separator line max width to 200 chars

Column distribution:

- Ref (40%), Repo (35%), Agent (25%) of available space
- Fixed columns (Status, Queued, Elapsed, Addr'd) maintain constant widths

This allows the TUI to properly utilize wide terminals while maintaining readability on narrow terminals.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
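
The 40/35/25 split can be sketched as below. The struct and function names follow the commit message, but the signature, the `minFlexible` floor, and giving the rounding remainder to the agent column are all assumptions made for this example.

```go
package main

import "fmt"

// columnWidths holds the dynamic widths named in the commit message.
type columnWidths struct {
	ref, repo, agent int
}

// minFlexible is a hypothetical floor so narrow terminals still get
// usable columns; the actual minimum is not stated in the PR.
const minFlexible = 30

// calculateColumnWidths splits the space left after the fixed columns
// into Ref (40%), Repo (35%), and Agent (the remainder, about 25%).
func calculateColumnWidths(termWidth, fixedWidth int) columnWidths {
	avail := termWidth - fixedWidth
	if avail < minFlexible {
		avail = minFlexible
	}
	ref := avail * 40 / 100
	repo := avail * 35 / 100
	return columnWidths{ref: ref, repo: repo, agent: avail - ref - repo}
}

func main() {
	fmt.Printf("%+v\n", calculateColumnWidths(140, 40))
	fmt.Printf("%+v\n", calculateColumnWidths(60, 40))
}
```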

Add job_timeout_minutes to both global and per-repo config, with ResolveJobTimeout() function following the same priority pattern as ResolveAgent(). Default timeout remains 10 minutes (preserving existing behavior).

Priority order:

1. Per-repo config (.roborev.toml)
2. Global config (~/.roborev/config.toml)
3. Default (10 minutes)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add "evaluating cohort" as trigger for evaluation section
- Detect markdown table lines (containing |)
- Stop capturing at "=== " section headers
- Prevent premature section end on empty lines within tables

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove the following sections from experiment review output:

- Header section: "# Experiment Results", "## Changes Detected", and "Running" line
- Evaluation Details section at the bottom with verbose table output

This simplifies the review output to show only:

- Experiment name and status
- MLFlow experiment link
- Metrics summary

Co-Authored-By: Claude <noreply@anthropic.com>

Allows specifying an alternative data directory instead of ~/.roborev. Useful for testing, running multiple instances, or custom deployments.

Changes:

- Add config.DataDir() that checks ROBOREV_DATA_DIR first
- Update GlobalConfigPath, DefaultDBPath, RuntimePath, GetCacheDir
- Update init command to use config.DataDir()
- Add tests for DataDir with and without env var
- Document in CLAUDE.md

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

- Add commit message column to queue view between Ref and Repo
- Adjust column widths: Message gets 50%, Ref/Repo/Agent reduced
- Fix logs view auto-refresh and scroll calculation
- Fix header line calculations for review and logs views
- Hard-code poetry run python in experiment runner agent
- Add evaluation details section to experiment output
- Remove unused PythonCmd field from ExperimentRunnerAgent

Co-Authored-By: Claude <noreply@anthropic.com>

- Update review prompts to ask for "No issues found." statement
- Add verdict parser that looks for "no issues", "no findings"
- Display P (green) / F (red) column in TUI queue view
- Default to F on uncertainty (only clear positive signals give P)

Closes roborev-dev#14

<img width="853" height="335" alt="image" src="https://github.com/user-attachments/assets/6cbc8b3e-fb4d-447a-844c-aa964279b77c" />

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
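
The "default to F" rule can be illustrated with a deliberately naive sketch. `parseVerdict` is a hypothetical name, and a plain case-insensitive substring match is an assumption; the actual parser may be stricter about where and how the phrases appear.

```go
package main

import (
	"fmt"
	"strings"
)

// parseVerdict returns "P" only when the review text contains a clear
// positive signal, and "F" otherwise. Substring matching is a
// simplification chosen for this sketch.
func parseVerdict(review string) string {
	text := strings.ToLower(review)
	for _, phrase := range []string{"no issues", "no findings"} {
		if strings.Contains(text, phrase) {
			return "P"
		}
	}
	return "F" // uncertain or empty output counts as a fail
}

func main() {
	fmt.Println(parseVerdict("No issues found."))
	fmt.Println(parseVerdict("Possible nil dereference in handler."))
	fmt.Println(parseVerdict(""))
}
```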

wesm and others added 12 commits January 9, 2026 11:35

Shake and bake the executable path to avoid esoteric hook issues (rob…

22ac34c

…orev-dev#15) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Capture section headers in experiment output

53ba02d

- Detect and capture sections starting with "=== Header ==="
- This captures commit messages and other structured output
- Save previous section before starting new one

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
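
One way to picture the section capture described in this commit is the sketch below. The `section` type and `parseSections` are hypothetical, and dropping any output that appears before the first header is an assumption made to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// section is a hypothetical container for one "=== Header ===" block.
type section struct {
	Title string
	Lines []string
}

// headerTitle reports whether line looks like "=== Title ===" and, if so,
// returns the title.
func headerTitle(line string) (string, bool) {
	t := strings.TrimSpace(line)
	if !strings.HasPrefix(t, "=== ") || !strings.HasSuffix(t, " ===") {
		return "", false
	}
	title := strings.TrimSpace(strings.Trim(t, "="))
	return title, title != ""
}

// parseSections saves the previous section whenever a new header starts.
func parseSections(output string) []section {
	var sections []section
	var cur *section
	for _, line := range strings.Split(strings.TrimRight(output, "\n"), "\n") {
		if title, ok := headerTitle(line); ok {
			if cur != nil {
				sections = append(sections, *cur)
			}
			cur = &section{Title: title}
			continue
		}
		if cur != nil {
			cur.Lines = append(cur.Lines, line)
		}
	}
	if cur != nil {
		sections = append(sections, *cur)
	}
	return sections
}

func main() {
	out := "starting\n=== Commit ===\nfix tokenizer\n=== Evaluation ===\n| f1 | 0.9 |\n"
	for _, s := range parseSections(out) {
		fmt.Printf("%s: %q\n", s.Title, s.Lines)
	}
}
```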

Make logs view use full terminal width

ac33ae7

- Changed from max 120 chars to dynamic width (max 80, m.width-4)
- Consistent with review and prompt views

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

andyxhadji force-pushed the experiment-runner-agent branch from 06cbae9 to 1114224 January 9, 2026 19:24

andyxhadji and others added 13 commits January 9, 2026 20:12

fix: re-add filter

61e63ac

Add detailed header to logs view matching review view format

bff4f2e

- Display job ID, repo name, git ref, and agent in title
- Show commit subject below title for additional context
- Matches the header format used in review view for consistency

Co-Authored-By: Claude <noreply@anthropic.com>

Implement pagination in TUI (roborev-dev#20)

26cb8b8

Currently only up to 50 recent reviews are shown; this implements the necessary pagination.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

README updates for upcoming 0.5.0 release (roborev-dev#22)

6826bd7

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Update queue screenshot for 0.5.0 (roborev-dev#23)

9b27b48

Merge upstream/main into experiment-runner-agent

f76a494

Resolved conflicts in tui.go by combining:

- Upstream pagination loading state check
- Our logs auto-refresh functionality

Both features now work together correctly.

Remove MLFlow tracking and metrics parsing from experiment runner

037b46f

Simplifies the experiment runner by removing MLFlow URL parsing and metrics extraction, keeping only evaluation details parsing. Co-Authored-By: Claude <noreply@anthropic.com>

Merge upstream/main: Add P/F verdict column to TUI

e15d25e

Resolved merge conflict in tui.go by accepting upstream changes that add a Pass/Fail verdict column to the queue view. Co-Authored-By: Claude <noreply@anthropic.com>

Fix TUI formatting bug in job line rendering

fa59eae

Add missing width specifier for agent column in format string to fix fmt.Sprintf argument mismatch that was causing display errors. Co-Authored-By: Claude <noreply@anthropic.com>
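
For context on this class of bug: with a `*` width, each `%-*s` verb consumes two arguments (the width, then the value), so a verb written without the `*` leaves its width argument to be formatted as a value and shifts everything after it. The sketch below shows only the corrected shape; `renderJobLine`'s real signature and column set are not shown in this PR page, so the one here is an assumption.

```go
package main

import "fmt"

// columnWidths mirrors the struct named in the earlier commit.
type columnWidths struct {
	ref, repo, agent int
}

// renderJobLine pads each dynamic column with %-*s, passing a width
// argument before every value. Simplified, hypothetical signature.
func renderJobLine(ref, repo, agent string, w columnWidths) string {
	return fmt.Sprintf("%-*s %-*s %-*s",
		w.ref, ref,
		w.repo, repo,
		w.agent, agent, // the agent column needs its width argument too
	)
}

func main() {
	w := columnWidths{ref: 10, repo: 12, agent: 8}
	fmt.Printf("[%s]\n", renderJobLine("abc1234", "roborev", "codex", w))
}
```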

Fix TUI header formatting bug

01c8b7b

Add missing width specifier for agent column in header format string. Co-Authored-By: Claude <noreply@anthropic.com>

andyxhadji wants to merge 25 commits into main from experiment-runner-agent

andyxhadji commented Jan 9, 2026

3 participants
