Real-time training monitor for Parameter Golf runs. Single Python file, no build step, no configuration needed.
pip install -r requirements.txt
python3 dashboard.py --ssh "root@host -p 1234"

Open the URL printed to stdout (it includes the auth token). Replace `root@host -p 1234` with your RunPod TCP SSH connection string.
No changes to your training script needed — the dashboard parses standard train_gpt.py output.
# Monitor remote RunPod pod (auto-detects latest .log in /workspace/)
python3 dashboard.py --ssh "root@host -p 1234"
# Monitor specific log file on remote pod
python3 dashboard.py --ssh "root@host -p 1234" --remote-log /workspace/train.log
# Monitor multiple pods simultaneously
python3 dashboard.py --ssh "root@pod1 -p 1234" --ssh "root@pod2 -p 5678"
# View local log file
python3 dashboard.py train.log
# Compare two runs (overlay on chart)
python3 dashboard.py train.log --compare baseline.log
# Custom baseline reference line on chart
python3 dashboard.py train.log --baseline 1.10
# Auto-stop pod when training completes (saves log locally first)
python3 dashboard.py --ssh "root@host -p 1234" --auto-stop --save-dir ./logs
# Discord/Slack notifications on completion or BPB threshold
python3 dashboard.py --ssh "root@host -p 1234" \
--notify-webhook https://discord.com/api/webhooks/... \
--notify-threshold 1.10
# Custom host and port
python3 dashboard.py --host 0.0.0.0 --port 3000

You can also upload .log files via the + button in the browser, or drag and drop them onto the page.
| Flag | Default | Description |
|---|---|---|
| `logs` (positional) | - | Local .log files to display |
| `--ssh` | - | SSH connection string (repeatable for multi-pod) |
| `--remote-log` | `LATEST` | Log path on remote host |
| `--compare` | - | Comparison log files (overlay on chart) |
| `--baseline` | auto | BPB baseline reference line (auto-derives from first val_bpb if unset) |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8050` | Server port |
| `--refresh` | `180` | Auto-refresh interval in seconds |
| `--auto-stop` | off | Save log and stop RunPod pod when training completes |
| `--save-dir` | `.` | Directory to save logs on auto-stop |
| `--notify-webhook` | - | Discord/Slack webhook URL for notifications |
| `--notify-threshold` | - | Send notification when BPB crosses below this value |
| `--insecure-host-key` | off | Use `StrictHostKeyChecking=no` (not recommended) |
- Val BPB chart with baseline reference line (configurable via `--baseline`)
- Step count, current BPB, speed (ms/step)
- Progress bar counting down from max wallclock (10 min default)
- Warmup detection
- Progress bar switches to 10-min eval budget countdown
- Turns red if eval exceeds the time limit
- Stage tracker — each phase appears as it completes:
● Training 10m 00s · 4,600 steps
● Eval 4m 15s
● SWA 43 ckpts
● Roundtrip 1.2168 · 14s
● Sliding Window 1.1832 · 76s
● Hedge Mixer 1.1324 · 159s
● Artifact 15.40 MB
Only stages present in the log are shown — works with any eval pipeline.
Automatically detects OOM errors, tracebacks, and killed processes. Shows the error excerpt in the UI and sets the progress bar to red.
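
As a rough illustration of this kind of detection, the sketch below scans log text for three failure signatures and returns a short excerpt. The patterns, the `find_error` name, and the three-line excerpt length are assumptions made for the example, not the dashboard's actual rules.

```python
import re

# Hypothetical patterns; the dashboard's real detection rules may differ.
ERROR_PATTERNS = {
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "traceback": re.compile(r"^Traceback \(most recent call last\):"),
    "killed": re.compile(r"^Killed\b"),
}


def find_error(log_text, context=3):
    """Return (kind, excerpt) for the first line matching a pattern, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        for kind, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                # Keep the matching line plus a few following lines as the excerpt.
                excerpt = "\n".join(lines[i:i + context])
                return kind, excerpt
    return None
```

Scanning line by line keeps the excerpt anchored to the first failure, which is usually the most informative part of a long traceback.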
When monitoring via SSH, a status pill shows connection health:
- Green — connected and fetching
- Amber — stale (2+ consecutive failures)
- Red — failing (5+ consecutive failures), shows error text
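
The thresholds above amount to a small mapping from a failure counter to a colour. A minimal sketch, assuming a hypothetical `connection_status` helper and a counter that resets to zero after each successful fetch:

```python
def connection_status(consecutive_failures):
    """Map a count of consecutive SSH fetch failures to a pill colour."""
    if consecutive_failures >= 5:
        return "red"      # failing: the UI also shows the error text
    if consecutive_failures >= 2:
        return "amber"    # stale: data may be out of date
    return "green"        # connected and fetching
```

Under this reading, a single failed fetch still shows green, so one transient network hiccup does not change the pill.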
All chart lines are monochrome (matching the current theme) with different dash styles (solid, dash, dot, dashdot) to distinguish runs. Primary run is full opacity, overlays are dimmed.
Click the logo to cycle: dark (default) → kitty (pink/black) → emerald (green/peach). Each theme has its own font and chart colors. Preference saved to localStorage.
--remote-log LATEST (the default) finds the most recently modified .log in /workspace/. When a new run starts, the dashboard switches to it automatically — no restart needed.
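
For intuition, here is a local analogue of the "most recently modified .log" rule. The dashboard applies it on the remote host over SSH; the `latest_log` helper below is a hypothetical sketch that applies the same rule to a local directory.

```python
from pathlib import Path


def latest_log(directory):
    """Return the most recently modified *.log file in directory, or None."""
    logs = list(Path(directory).glob("*.log"))
    if not logs:
        return None
    # Pick the file with the newest modification time.
    return max(logs, key=lambda p: p.stat().st_mtime)
```

Re-evaluating this on every refresh is what would let a monitor follow a new run without a restart: once a newer .log appears, it becomes the result.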
With `--auto-stop`, the dashboard saves the log locally and then stops the RunPod pod when training completes. The pod is matched by SSH host IP, so only the pod being monitored is stopped. Requires `runpodctl` installed locally.
All data endpoints require a `?token=` query parameter; `/healthz` does not. The token is generated on startup and printed to stdout.
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/` | GET | no | Dashboard UI (token embedded in page) |
| `/api/data` | GET | yes | All run data + config |
| `/api/upload` | POST | yes | Upload a .log file (max 50 MB) |
| `/api/remove` | POST | yes | Remove a run by key |
| `/api/log` | GET | yes | Raw log text (plain text) |
| `/healthz` | GET | no | Health check: version, uptime, SSH status |
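
A script can poll the data endpoint using only the standard library. This is a sketch under assumptions: the `api_url` and `fetch_data` helpers are hypothetical, the response is assumed to be JSON, and its schema is not documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def api_url(base, path, token):
    """Build an authenticated URL such as http://127.0.0.1:8050/api/data?token=..."""
    return f"{base.rstrip('/')}{path}?{urlencode({'token': token})}"


def fetch_data(base, token, timeout=10):
    """GET /api/data and decode the body, assuming the server returns JSON."""
    with urlopen(api_url(base, "/api/data", token), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_data("http://127.0.0.1:8050", token)` would be called with the token printed at startup. `urlencode` escapes the token, so it is safe to pass even if it contains characters that are special in URLs.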
- Binds to `127.0.0.1` by default, so it is only accessible from your machine
- Auth token required on all data endpoints
- No shell interpolation — all SSH commands use argv lists
- Upload filenames sanitized server-side
- SSH host key verification enabled (`StrictHostKeyChecking=accept-new`)
- XSS prevention via DOM API (no `innerHTML` with user data)
- Python 3.10+
- `fastapi`, `uvicorn`, `python-multipart` (`pip install -r requirements.txt`)
- Internet for CDN (Plotly.js, Google Fonts) on first page load
- SSH access to remote pod (optional, for remote monitoring)
Parses standard train_gpt.py output from the parameter-golf challenge:
model_params:23662344
world_size:8 gpu:NVIDIA H100 80GB HBM3
max_wallclock_seconds:600.000
warmup_step:5/20
step:100/20000 train_loss:2.5000 train_time:10000ms step_avg:100.00ms
step:100/20000 val_loss:2.3000 val_bpb:1.3600
swa: averaging 44 checkpoints
Serialized model int8+zstd22: 15403955 bytes
final_roundtrip_exact val_loss:2.0545 val_bpb:1.2168
final_sliding_window_exact val_loss:1.9978 val_bpb:1.1832
final_hedge_mixer_exact val_loss:1.9120 val_bpb:1.1324
Unrecognized lines are silently ignored — safe to use with any fork or modified script.
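
To show how lines in this format can be parsed while skipping everything else, here is a minimal regex-based sketch. The `parse_line` helper and its output keys are assumptions for illustration; the dashboard's own parser may be structured differently.

```python
import re

# Matches "step:<current>/<total> <key:value pairs...>".
STEP_RE = re.compile(r"^step:(\d+)/(\d+)\s+(.*)$")
# Matches "key:number" with an optional "ms" suffix, e.g. "step_avg:100.00ms".
FIELD_RE = re.compile(r"(\w+):(\d+(?:\.\d+)?)(?:ms)?")


def parse_line(line):
    """Parse one step line into a dict, or return None for unrecognized lines."""
    m = STEP_RE.match(line.strip())
    if not m:
        return None
    record = {"step": int(m.group(1)), "total_steps": int(m.group(2))}
    for name, value in FIELD_RE.findall(m.group(3)):
        record[name] = float(value)
    return record
```

Returning `None` instead of raising is what makes this style of parser tolerant of forks: any line that does not match is simply dropped.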
- Use the TCP SSH connection from RunPod (not the proxy one), e.g. `root@203.0.113.1 -p 12345`
- Refresh rate: 10s for short runs, 3min for long runs (auto-detected)
- If SSH drops, dashboard shows cached data until reconnection
- Logs are fetched incrementally (`tail -c`) to minimize bandwidth on long runs
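
One way to picture an incremental fetch is as an argv list passed to `ssh`, asking the remote `tail` for only the bytes not yet seen. This is a sketch under assumptions: the `tail_argv` helper is hypothetical, and the use of `tail -c +N` (start output at byte N) is a guess at how the offset is expressed, not a description of the dashboard's exact command.

```python
import shlex


def tail_argv(ssh_target, remote_path, offset):
    """Build an ssh argv list that prints remote_path from byte offset onward."""
    # Split the connection string without a shell, e.g. ["root@host", "-p", "1234"].
    target_args = shlex.split(ssh_target)
    # tail -c +N starts at byte N (1-indexed), so skipping `offset` bytes needs +offset+1.
    # ssh hands the remote command to the remote shell, so the path is quoted.
    remote_cmd = f"tail -c +{int(offset) + 1} {shlex.quote(remote_path)}"
    return ["ssh", "-o", "StrictHostKeyChecking=accept-new", *target_args, remote_cmd]
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True)` with no local shell involved; the caller would then add the number of bytes received to `offset` before the next fetch.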
MIT
