Add experiment-runner agent for ML experiments #1

Open
andyxhadji wants to merge 25 commits into main from experiment-runner-agent

@andyxhadji
Owner

New agent that runs ML experiments instead of code reviews:

  • Detects changed files and maps to appropriate experiment scripts
  • Runs experiments via poetry (uses repo's poetry environment)
  • Parses MLFlow run_id and experiment_id from output
  • Constructs Databricks MLFlow URLs (requires DATABRICKS_HOST env var)
  • Parses F1, Precision, Recall metrics from output
  • Shows detailed logs on failure (last 100 lines stdout, 50 lines stderr)

Experiment detection logic:

  • If a .py file in experiments/ changed, run that experiment
  • Otherwise run experiments/baseline_experiment.py
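For reviewers, the detection rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `selectExperiment` is a hypothetical helper, and both picking the first match when several experiment files changed and treating subdirectories of `experiments/` as matches are assumptions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// baselineScript is the fallback script named in the PR description.
const baselineScript = "experiments/baseline_experiment.py"

// selectExperiment returns the first changed .py file under experiments/,
// or the baseline script when no such file changed.
// Assumption: paths are repo-relative and use forward slashes.
func selectExperiment(changedFiles []string) string {
	for _, f := range changedFiles {
		clean := path.Clean(f)
		if strings.HasPrefix(clean, "experiments/") && path.Ext(clean) == ".py" {
			return clean
		}
	}
	return baselineScript
}

func main() {
	fmt.Println(selectExperiment([]string{"README.md", "experiments/cohort_v2.py"}))
	fmt.Println(selectExperiment([]string{"internal/tui/tui.go"}))
}
```

The chosen script would then be run through poetry in the repo's environment, as described above.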

wesm and others added 12 commits January 9, 2026 11:35
<img width="570" height="164" alt="image"
src="https://github.com/user-attachments/assets/8a80823b-67db-413e-b689-61ed0305c612"
/>

<img width="778" height="637" alt="image"
src="https://github.com/user-attachments/assets/6c8078ee-ce6d-4e2c-b57b-395636046dfa"
/>

### Repo Filter Modal (`f` key)

Press `f` in the queue view to open a searchable filter modal:
- Lists all repos with job counts
- Type to search/filter repos
- Arrow keys or `j`/`k` to navigate
- `Enter` to select, `Esc` to cancel
- `Esc` in queue view clears active filter
- Filter indicator shown in title: `[f: reponame]`

When filtered:
- Queue shows only jobs from selected repo
- Status counts reflect filtered view
- Navigation skips non-matching jobs
- API fetches full history for filtered repo (`limit=0`)
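The "navigation skips non-matching jobs" behavior could look roughly like the sketch below. The `job` type, the `visible` and `nextVisible` helpers, and the use of an empty string for "no filter" are all assumptions made for illustration; the real TUI model may be organized differently.

```go
package main

import "fmt"

// job is a hypothetical, trimmed-down queue entry.
type job struct {
	ID       int
	RepoPath string
}

// visible reports whether a job passes the active repo filter.
// An empty filter means no filter is active.
func visible(j job, filter string) bool {
	return filter == "" || j.RepoPath == filter
}

// nextVisible returns the index of the first visible job after idx,
// or -1 when there is none. Pass idx = -1 to search from the start.
func nextVisible(jobs []job, idx int, filter string) int {
	if idx < -1 {
		idx = -1
	}
	for i := idx + 1; i < len(jobs); i++ {
		if visible(jobs[i], filter) {
			return i
		}
	}
	return -1
}

func main() {
	jobs := []job{{1, "/src/a"}, {2, "/src/b"}, {3, "/src/a"}}
	fmt.Println(nextVisible(jobs, 0, "/src/a")) // skips the /src/b job
	fmt.Println(nextVisible(jobs, 2, "/src/a")) // nothing left to select
}
```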

## Bug Fixes
- **Dirty build restart logic**: Only restart daemon when versions
actually differ, not on every dirty build
- **`getVisibleSelectedIdx` return value**: Return `-1` when no valid
selection instead of `0`
- **`/api/jobs` limit parameter**: Validate and clamp to `[0, 10000]`
range
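As a rough sketch of the limit validation described above: `parseLimit` is a hypothetical helper, and the fallback used when the parameter is absent is an assumption passed in by the caller. Only the `[0, 10000]` bounds come from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxLimit is the upper bound stated in the commit message.
const maxLimit = 10000

// parseLimit validates the raw "limit" query value and clamps it to
// [0, maxLimit]. An empty value yields def (an assumed default).
func parseLimit(raw string, def int) (int, error) {
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid limit %q: %w", raw, err)
	}
	if n < 0 {
		n = 0
	}
	if n > maxLimit {
		n = maxLimit
	}
	return n, nil
}

func main() {
	for _, raw := range []string{"", "25", "-3", "999999", "abc"} {
		n, err := parseLimit(raw, 50)
		fmt.Println(raw, n, err)
	}
}
```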
## API Changes
- `GET /api/jobs?repo=<path>` - Filter by repo root path (was name)
- `GET /api/repos` - Now returns `root_path` field in addition to `name`
- `GET /api/status` - Returns daemon version in response

## Test Coverage
- Filter modal keyboard navigation
- Filter selection and clearing
- Filtered queue navigation
- Zero visible jobs handling
- API repo filter and limit parameters
- `getVisibleSelectedIdx` edge cases
- `/api/status` version field

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…he review list (roborev-dev#17)

- Add j/k navigation between reviews in the TUI.
- Add left/right arrow key navigation between reviews.
- Show review number and repo name in the review screen

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Replaces fixed column widths with dynamic sizing based on terminal
width.

Changes:
- Added columnWidths struct to track dynamic widths for ref, repo, agent
- Added calculateColumnWidths() to distribute available space
proportionally
- Updated renderJobLine() to accept and use dynamic column widths
- Increased separator line max width to 200 chars

This allows the TUI to properly utilize wide terminals while maintaining
readability on narrow terminals.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
I've been using this to build a clinical data extraction workflow with a
custom agent, and it's useful to have longer-running agent jobs! Thought
I would contribute this piece back.

Add job_timeout_minutes to both global and per-repo config, with
ResolveJobTimeout() function following the same priority pattern as
ResolveAgent(). Default timeout remains 10 minutes (preserving existing
behavior).

Priority order:
1. Per-repo config (.roborev.toml)
2. Global config (~/.roborev/config.toml)
3. Default (10 minutes)
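The priority order above might be implemented along these lines. This is a sketch under assumptions: the config struct shapes, the lower-case `resolveJobTimeout` name, and treating a non-positive value as "not set" are illustrative and not taken from the actual `ResolveJobTimeout()` code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical config shapes; only the job_timeout_minutes field is modeled.
type repoConfig struct{ JobTimeoutMinutes int }
type globalConfig struct{ JobTimeoutMinutes int }

// defaultJobTimeout matches the 10-minute default stated above.
const defaultJobTimeout = 10 * time.Minute

// resolveJobTimeout applies the order: per-repo, then global, then default.
// Assumption: zero or negative values mean the setting is absent.
func resolveJobTimeout(repo *repoConfig, global *globalConfig) time.Duration {
	if repo != nil && repo.JobTimeoutMinutes > 0 {
		return time.Duration(repo.JobTimeoutMinutes) * time.Minute
	}
	if global != nil && global.JobTimeoutMinutes > 0 {
		return time.Duration(global.JobTimeoutMinutes) * time.Minute
	}
	return defaultJobTimeout
}

func main() {
	fmt.Println(resolveJobTimeout(&repoConfig{60}, &globalConfig{30}))
	fmt.Println(resolveJobTimeout(nil, &globalConfig{30}))
	fmt.Println(resolveJobTimeout(nil, nil))
}
```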

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
New agent that runs ML experiments instead of code reviews:
- Detects changed files and maps to appropriate experiment scripts
- Runs experiments via poetry (uses repo's poetry environment)
- Parses MLFlow run_id and experiment_id from output
- Constructs Databricks MLFlow URLs (requires DATABRICKS_HOST env var)
- Parses F1, Precision, Recall metrics from output
- Shows detailed logs on failure (last 100 lines stdout, 50 lines stderr)
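The failure-log truncation mentioned in the last bullet can be illustrated with a small tail helper. `lastLines` is a hypothetical name; the sketch assumes the captured output is held in memory as a single string.

```go
package main

import (
	"fmt"
	"strings"
)

// lastLines returns at most the final n lines of output,
// ignoring trailing newlines.
func lastLines(output string, n int) string {
	if n <= 0 {
		return ""
	}
	lines := strings.Split(strings.TrimRight(output, "\n"), "\n")
	if len(lines) > n {
		lines = lines[len(lines)-n:]
	}
	return strings.Join(lines, "\n")
}

func main() {
	stdout := "epoch 1\nepoch 2\nTraceback ...\nValueError: bad cohort\n"
	// The agent description uses 100 lines for stdout and 50 for stderr.
	fmt.Println(lastLines(stdout, 2))
}
```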

Experiment detection logic:
- If a .py file in experiments/ changed, run that experiment
- Otherwise run experiments/baseline_experiment.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaces fixed column widths with dynamic sizing based on terminal width.

Changes:
- Added columnWidths struct to track dynamic widths for ref, repo, agent
- Added calculateColumnWidths() to distribute available space proportionally
- Updated renderJobLine() to accept and use dynamic column widths
- Increased separator line max width to 200 chars

Column distribution: Ref (40%), Repo (35%), Agent (25%) of available space
Fixed columns (Status, Queued, Elapsed, Addr'd) maintain constant widths

This allows the TUI to properly utilize wide terminals while maintaining
readability on narrow terminals.
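A minimal sketch of the 40/35/25 split described above. The function signature, the `fixedWidth` parameter, and the floor applied on very narrow terminals are assumptions; only the struct name, the function name, and the percentages come from the commit message.

```go
package main

import "fmt"

// columnWidths holds the widths of the three flexible columns.
type columnWidths struct {
	ref, repo, agent int
}

// calculateColumnWidths splits the space left after the fixed columns
// into Ref (40%), Repo (35%), and Agent (the remainder, roughly 25%).
// Assumption: each flexible column gets at least one cell.
func calculateColumnWidths(termWidth, fixedWidth int) columnWidths {
	avail := termWidth - fixedWidth
	if avail < 3 {
		avail = 3
	}
	ref := avail * 40 / 100
	repo := avail * 35 / 100
	// Give the rounding remainder to the last column so widths sum to avail.
	agent := avail - ref - repo
	return columnWidths{ref: ref, repo: repo, agent: agent}
}

func main() {
	fmt.Printf("%+v\n", calculateColumnWidths(120, 20))
	fmt.Printf("%+v\n", calculateColumnWidths(200, 20))
}
```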

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add job_timeout_minutes to both global and per-repo config, with
ResolveJobTimeout() function following the same priority pattern as
ResolveAgent(). Default timeout remains 10 minutes (preserving
existing behavior).

Priority order:
1. Per-repo config (.roborev.toml)
2. Global config (~/.roborev/config.toml)
3. Default (10 minutes)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add "evaluating cohort" as trigger for evaluation section
- Detect markdown table lines (containing |)
- Stop capturing at "=== " section headers
- Prevent premature section end on empty lines within tables

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Detect and capture sections starting with "=== Header ==="
- This captures commit messages and other structured output
- Save previous section before starting new one
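The `=== Header ===` capture described above could be sketched like this. The `section` type and `parseSections` name are hypothetical, and discarding output that appears before the first header is an assumption for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// section is a hypothetical container for one captured block of output.
type section struct {
	Title string
	Lines []string
}

// parseSections splits output into sections introduced by "=== Title ===".
// The previous section is saved whenever a new header is seen.
func parseSections(output string) []section {
	var sections []section
	var cur *section
	for _, line := range strings.Split(output, "\n") {
		trimmed := strings.TrimSpace(line)
		isHeader := len(trimmed) > 8 &&
			strings.HasPrefix(trimmed, "=== ") &&
			strings.HasSuffix(trimmed, " ===")
		if isHeader {
			if cur != nil {
				sections = append(sections, *cur)
			}
			title := strings.TrimSpace(strings.Trim(trimmed, "="))
			cur = &section{Title: title}
			continue
		}
		if cur != nil {
			cur.Lines = append(cur.Lines, line)
		}
	}
	if cur != nil {
		sections = append(sections, *cur)
	}
	return sections
}

func main() {
	out := "noise\n=== Commit ===\nfix bug\n=== Evaluation ===\n| f1 | 0.9 |"
	for _, s := range parseSections(out) {
		fmt.Println(s.Title, len(s.Lines))
	}
}
```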

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Changed from max 120 chars to dynamic width (max 80, m.width-4)
- Consistent with review and prompt views

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the following sections from experiment review output:
- Header section: "# Experiment Results", "## Changes Detected", and "Running" line
- Evaluation Details section at the bottom with verbose table output

This simplifies the review output to show only:
- Experiment name and status
- MLFlow experiment link
- Metrics summary

Co-Authored-By: Claude <noreply@anthropic.com>
andyxhadji force-pushed the experiment-runner-agent branch from 06cbae9 to 1114224 on January 9, 2026 19:24
andyxhadji and others added 13 commits January 9, 2026 20:12
- Display job ID, repo name, git ref, and agent in title
- Show commit subject below title for additional context
- Matches the header format used in review view for consistency

Co-Authored-By: Claude <noreply@anthropic.com>
Currently only up to 50 recent reviews are shown; this implements the
necessary pagination.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Allows specifying an alternative data directory instead of ~/.roborev.
Useful for testing, running multiple instances, or custom deployments.

Changes:
- Add config.DataDir() that checks ROBOREV_DATA_DIR first
- Update GlobalConfigPath, DefaultDBPath, RuntimePath, GetCacheDir
- Update init command to use config.DataDir()
- Add tests for DataDir with and without env var
- Document in CLAUDE.md
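A rough sketch of the lookup order behind `config.DataDir()`, assuming an unset or empty `ROBOREV_DATA_DIR` falls back to `~/.roborev`. `resolveDataDir` is a hypothetical helper; it takes the environment lookup and home directory as parameters so it can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveDataDir prefers ROBOREV_DATA_DIR and otherwise uses <home>/.roborev.
// getenv is injected (os.Getenv in production) to keep the logic testable.
func resolveDataDir(getenv func(string) string, home string) string {
	if dir := getenv("ROBOREV_DATA_DIR"); dir != "" {
		return dir
	}
	return filepath.Join(home, ".roborev")
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine home directory:", err)
		os.Exit(1)
	}
	fmt.Println(resolveDataDir(os.Getenv, home))
}
```

Paths such as the global config, database, and cache locations would then be derived from this one directory.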

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add commit message column to queue view between Ref and Repo
- Adjust column widths: Message gets 50%, Ref/Repo/Agent reduced
- Fix logs view auto-refresh and scroll calculation
- Fix header line calculations for review and logs views
- Hard-code poetry run python in experiment runner agent
- Add evaluation details section to experiment output
- Remove unused PythonCmd field from ExperimentRunnerAgent

Co-Authored-By: Claude <noreply@anthropic.com>
Resolved conflicts in tui.go by combining:
- Upstream pagination loading state check
- Our logs auto-refresh functionality

Both features now work together correctly.
Simplifies the experiment runner by removing MLFlow URL parsing and metrics
extraction, keeping only evaluation details parsing.

Co-Authored-By: Claude <noreply@anthropic.com>
- Update review prompts to ask for "No issues found." statement
- Add verdict parser that looks for "no issues", "no findings"
- Display P (green) / F (red) column in TUI queue view
- Default to F on uncertainty (only clear positive signals give P)
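The verdict rule above might be sketched as follows. `parseVerdict` is a hypothetical name, and plain case-insensitive substring matching is an assumption; the actual parser may be stricter about where the phrase appears.

```go
package main

import (
	"fmt"
	"strings"
)

// passPhrases are the positive signals named in the commit message.
var passPhrases = []string{"no issues", "no findings"}

// parseVerdict returns "P" only on a clear positive signal and "F" otherwise,
// so uncertain or empty output is treated as a failure.
func parseVerdict(review string) string {
	lower := strings.ToLower(review)
	for _, phrase := range passPhrases {
		if strings.Contains(lower, phrase) {
			return "P"
		}
	}
	return "F"
}

func main() {
	fmt.Println(parseVerdict("No issues found."))
	fmt.Println(parseVerdict("Possible nil dereference in handler."))
}
```

Substring matching like this would also pass a review that says "no issues with X, but ...", which is one reason a real parser may need tighter rules.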

Closes roborev-dev#14

<img width="853" height="335" alt="image"
src="https://github.com/user-attachments/assets/6cbc8b3e-fb4d-447a-844c-aa964279b77c"
/>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Resolved merge conflict in tui.go by accepting upstream changes that add
a Pass/Fail verdict column to the queue view.

Co-Authored-By: Claude <noreply@anthropic.com>
Add missing width specifier for agent column in format string to fix
fmt.Sprintf argument mismatch that was causing display errors.
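For context, this class of bug arises when a format string uses `*` for dynamic widths: every `%-*s` verb consumes two arguments (width, then value), so a column written as plain `%s` leaves one argument unmatched. The sketch below shows the corrected shape; `formatRow` and its parameters are hypothetical and not the actual `tui.go` code.

```go
package main

import "fmt"

// formatRow left-aligns three columns using dynamic widths.
// Each %-*s takes a width argument followed by the string to print.
func formatRow(refW, repoW, agentW int, ref, repo, agent string) string {
	return fmt.Sprintf("%-*s %-*s %-*s", refW, ref, repoW, repo, agentW, agent)
}

func main() {
	fmt.Printf("[%s]\n", formatRow(8, 10, 6, "abc1234", "roborev", "codex"))
}
```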

Co-Authored-By: Claude <noreply@anthropic.com>
Add missing width specifier for agent column in header format string.

Co-Authored-By: Claude <noreply@anthropic.com>