
fix(warm-pool): Wire dispatch path, fix stale GCS paths #46

Open
lukacf wants to merge 4 commits into main from fix/warm-pool-wiring

Conversation


@lukacf lukacf commented Mar 20, 2026

Summary

Fixes critical issues found in honest review of v0.3.0 warm pool:

What was broken

  1. Dispatch path was dead code — WarmPoolManager existed but nobody called it. Enabling warm_pool: true had zero effect.
  2. GCS log paths were stale — second job on a warm instance would upload logs to the first job's GCS location, corrupting both runs' data.
  3. Exit code path was stale — EXIT_CODE_FILE pointed to the first run's gcsfuse mount, never updated for subsequent jobs.
  4. Poll scan was O(n) — _get_run_handle scanned all running warm instances every 5s. Now a single-row lookup by backend_handle.

What's fixed

  • GCERunBackend.launch() passes warm_pool_idle_timeout_seconds through to startup script
  • register_instance() called after fresh launch when pool has capacity
  • Idle loop updates GCS_STDOUT_PATH, GCS_STDERR_PATH, GCS_EXIT_CODE_PATH from new run's run_path
  • Exit code written directly to GCS via gsutil cp, not stale local path
  • GCS_BUCKET exported for idle loop path construction
  • 3 new regression tests for the fixed behaviors

Test plan

  • 25 warm pool unit tests pass (14 Phase 1 + 11 Phase 2)
  • All 2508 unit tests pass
  • Lint + mypy clean
  • Integration: 1209/1210 pass (1 pre-existing flaky CAS concurrency test)

🤖 Generated with Claude Code

Fixes from honest review of v0.3.0 warm pool implementation:

1. DISPATCH WIRED: GCERunBackend.launch() now passes warm_pool_idle_timeout
   through gce_launcher to build_startup_script. register_instance() called
   after fresh launch when pool has capacity. Previously the entire dispatch
   path was dead code — enabling warm pool had no effect.

2. GCS PATHS FIXED: Idle loop now updates GCS_STDOUT_PATH, GCS_STDERR_PATH,
   GCS_EXIT_CODE_PATH from the new run's run_path. Previously all subsequent
   jobs would upload logs to the FIRST run's GCS location.

3. EXIT CODE FIXED: Uses direct gsutil cp to GCS instead of stale
   EXIT_CODE_FILE local path that pointed to the first run's gcsfuse mount.

4. GCS_BUCKET EXPORTED: Idle loop needs bucket name to construct paths for
   subsequent runs. Now exported before idle loop definition.

5. POLL SCAN FIXED: _get_run_handle no longer scans all running warm
   instances. Uses single-row lookup by backend_handle (O(1) not O(n)).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

claude Bot commented Mar 20, 2026

Code Review — fix(warm-pool): Wire dispatch path, fix stale GCS paths

Good set of targeted fixes overall. The dispatch wiring, O(n)→O(1) poll fix, and instance metadata exit-code approach are all solid improvements. Found one functional bug that will cause all warm-pool re-runs to write logs to wrong GCS paths.


Bug: Double runs/ prefix in idle loop GCS paths

File: src/goldfish/cloud/adapters/gcp/startup_builder.py, lines 891–895

In gce_launcher.py (line 171), run_path is constructed as:

run_path = f"runs/{stage_run_id}"

With bucket_prefix="", the canonical GCS path becomes gs://{bucket}/runs/{stage_run_id}/logs/....

This same value ("runs/stage-abc123") is what gets written into $spec_dir/run_path by warm_pool.py and later read as $new_run_path in the idle loop.

But the new code then constructs:

GCS_STDOUT_PATH="gs://${GCS_BUCKET:-}/runs/$new_run_path/logs/stdout.log"

Which expands to:

gs://{bucket}/runs/runs/stage-abc123/logs/stdout.log
                   ^^^^
                   double prefix

The fix is to drop the hardcoded runs/ prefix since it's already embedded in $new_run_path:

GCS_STDOUT_PATH="gs://${GCS_BUCKET:-}/$new_run_path/logs/stdout.log"
GCS_STDERR_PATH="gs://${GCS_BUCKET:-}/$new_run_path/logs/stderr.log"
GCS_METRICS_PATH="gs://${GCS_BUCKET:-}/$new_run_path/logs/metrics.jsonl"
GCS_SVS_DURING_PATH="gs://${GCS_BUCKET:-}/$new_run_path/outputs/.goldfish/svs_findings_during.json"
GCS_EXIT_CODE_PATH="gs://${GCS_BUCKET:-}/$new_run_path/logs/exit_code.txt"

The three new regression tests don't catch this because test_idle_loop_updates_gcs_paths_for_new_run only asserts that "GCS_STDOUT_PATH=" and "new_run_path" appear in the script — it doesn't verify the path doesn't have a double prefix. A stronger assertion would be:

assert "gs://${GCS_BUCKET:-}/runs/$new_run_path" not in script  # no double prefix
assert "gs://${GCS_BUCKET:-}/$new_run_path" in script

Minor: Private attribute access in run_backend.py

File: src/goldfish/cloud/adapters/gcp/run_backend.py, line 242

warm_pool_timeout = self._warm_pool._config.idle_timeout_minutes * 60

_config is an internal implementation detail of WarmPoolManager. Consider exposing a method:

# In WarmPoolManager
def idle_timeout_seconds(self) -> int:
    return self._config.idle_timeout_minutes * 60

This avoids run_backend reaching into warm_pool's internals and makes refactoring WarmPoolConfig safer.


Minor: Falsy check on warm_pool_timeout

File: src/goldfish/cloud/adapters/gcp/run_backend.py, line 261

if warm_pool_timeout and self._warm_pool:

warm_pool_timeout is int | None. A timeout of 0 (unlikely, but valid) would skip registration. Prefer:

if warm_pool_timeout is not None and self._warm_pool:

Looks good

  • O(1) single-row DB lookup in _get_run_handle is a clean fix for the scan problem
  • Writing exit code via gsutil cp - instead of the stale gcsfuse path is the right approach
  • Adding instance metadata goldfish_exit_code gives a fast-path fallback for status polling
  • GCS_BUCKET export is correctly placed before the idle loop function definition
  • Test structure is clean and the docstrings explain the bugs well

The runs/ double-prefix bug is the only blocking issue; the rest are non-blocking suggestions.

Fixes all issues from honest review:

1. DISPATCH WIRED IN HOT PATH: warm_pool_manager passed into
   gce_launcher.launch_instance(). Warm pool claim happens AFTER
   scripts are built but BEFORE instance creation — where all the
   pieces (env_map, pre_run_cmds, post_run_cmds, docker_cmd) are
   available. New _build_docker_cmd_script() generates standalone
   Docker run script for warm pool reuse.

2. GCS PATHS UPDATED PER JOB: Idle loop now updates GCS_STDOUT_PATH,
   GCS_STDERR_PATH, GCS_EXIT_CODE_PATH from new run's run_path.
   Exit code written directly to GCS via gsutil, not stale local path.

3. BACKGROUND PROCESSES RESET: Idle loop kills watchdog, supervisor,
   log syncer, and metadata syncer PIDs between jobs. Metadata syncer
   restarted for each new job.

4. POLL SCAN O(1): _get_run_handle uses single-row lookup by
   backend_handle instead of scanning all running warm instances.

5. INTEGRATION TESTS: 8 new tests covering full lifecycle — register,
   claim, ACK, timeout, release, reap, emergency cleanup, pool cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

claude Bot commented Mar 20, 2026

test


claude Bot commented Mar 20, 2026

Goldfish Development Guide

For AI assistants working on this codebase. Everything you need to develop Goldfish effectively—compact, scannable, and action-oriented.

Quick Reference

# Development cycle
make lint              # Ruff + mypy via pre-commit - run before commits
make test              # Fast unit tests (<1s) - run frequently
make test-integration  # Integration tests (~2min) - before pushing
make ci                # Full CI suite (lint + all tests)

# First-time setup
uv pip install -e ".[dev]"
make install-hooks     # REQUIRED: installs pre-commit hooks

Golden rule: Never suppress lint errors—always fix the source.


TDD: Test-Driven Development

This codebase uses TDD. Write tests BEFORE implementation.

RED    → Write failing test that defines expected behavior
GREEN  → Write minimal code to make test pass
REFACTOR → Clean up while keeping tests green

Why TDD matters for LLMs:

  • Tests encode intent unambiguously—no guessing what "done" means
  • Failing tests provide immediate feedback on implementation correctness
  • Test-first prevents over-engineering (you only build what's tested)
  • Regression safety: refactoring is safe when tests pass

Workflow:

  1. Understand the requirement
  2. Write test(s) in tests/unit/ or tests/integration/
  3. Run make test → confirm RED (test fails)
  4. Implement the feature
  5. Run make test → confirm GREEN (test passes)
  6. Run make lint → fix any issues
  7. Refactor if needed, keeping tests green

Test naming: test_<what>_<condition>_<expected> e.g., test_get_logs_when_file_missing_returns_none

No exceptions: Even "quick fixes" get tests first. The test documents the bug and prevents regression.
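To make the cycle concrete, here is a minimal RED-then-GREEN pair following the naming convention (get_logs here is a hypothetical stand-in for illustration, not actual Goldfish code):

```python
from pathlib import Path
from typing import Optional


def get_logs(path: Path) -> Optional[str]:
    """GREEN phase: the minimal implementation that satisfies the test."""
    if not path.exists():
        return None
    return path.read_text()


def test_get_logs_when_file_missing_returns_none() -> None:
    """RED first: this test is written before get_logs exists."""
    assert get_logs(Path("/nonexistent/log.txt")) is None
```

Run make test once to see it fail, implement, then run again to confirm green.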


What is Goldfish?

An MCP server enabling Claude Code to conduct ML experiments by managing:

  • Workspaces = isolated experiment environments (copy-based, NO git in user workspace)
  • Versions = immutable snapshots (auto-created on every run, 100% provenance)
  • Pipelines = YAML workflows (stage definitions + signal wiring)
  • Stages = Python modules (run in Docker containers)
  • Signals = typed data flow (dataset, npy, csv, directory, file)

Core invariants:

  • All infrastructure (Docker, GCS, GCE) is hidden from the MCP client
  • User workspace is plain files (no .git) - all versioning in dev repo
  • Every run() syncs and commits BEFORE execution (100% provenance)

Architecture at a Glance

MCP Client (Claude) ─── JSON-RPC ───▶ server.py
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    ▼                     ▼                     ▼
             server_tools/*         context.py            db/database.py
             (40+ MCP tools)     (ServerContext DI)         (SQLite)
                    │                     │
        ┌───────────┼───────────┐         │
        ▼           ▼           ▼         ▼
   workspace/    jobs/      pipeline/   cloud/
   manager.py   stage_      parser.py   protocols.py    ◄── Backend-agnostic
   git_layer.py executor.py             contracts.py        interfaces
                    │                   factory.py
                    │                       │
                    └───────────────────────┤
                                            ▼
                                    cloud/adapters/
                                    ├── local/       ◄── Docker
                                    └── gcp/         ◄── GCE + GCS

Key Files

File | Purpose
jobs/stage_executor.py | Core: Stage execution + sync + provenance + review
cloud/protocols.py | RunBackend, ObjectStorage, ImageBuilder interfaces
cloud/contracts.py | BackendCapabilities, RunSpec, RunHandle, BackendStatus
cloud/factory.py | AdapterFactory for backend instantiation
cloud/adapters/local/ | LocalRunBackend (Docker-based execution)
cloud/adapters/gcp/ | GCERunBackend (GCE instances), GCSStorage
pre_run_review.py | Pre-run code review using Claude Agent SDK
db/database.py | All database operations
workspace/manager.py | Workspace CRUD + copy-based mounting
workspace/git_layer.py | Git ops + sync_slot_to_branch
server.py | MCP server initialization

The Nine Abstractions

1. Workspaces = Copy-Based Isolation

MOUNT:  dev-repo/branch ──copy──▶ user/workspaces/w1/ (plain files, NO .git)
WORK:   Claude edits user/workspaces/w1/
RUN:    user/w1/ ──sync──▶ dev-repo/branch ──commit──▶ execute

Key operations: create_workspace(), mount(), hibernate(), checkpoint()

2. Versions = Git Tags (100% Provenance)

Every run() syncs changes back to dev repo, commits, THEN creates version:

user edits ──sync──▶ commit in dev-repo ──tag──▶ baseline_lstm-v1

Stored in workspace_versions table with created_by: run|checkpoint|manual

3. Pipelines = YAML

stages:
  - name: preprocess
    inputs: {raw: {type: dataset, dataset: sales_v1}}
    outputs: {features: {type: npy}}
  - name: train
    inputs: {features: {from_stage: preprocess, signal: features}}

Parser validates: unique names, type compatibility, no cycles, datasets exist.
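The no-cycles check can be sketched as a three-color DFS over the from_stage edges (illustrative only; the actual pipeline/parser.py implementation may differ):

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return a dependency cycle as a list of stage names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {name: WHITE for name in deps}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            if color[dep] == GRAY:  # back edge: cycle found
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for name in deps:
        if color[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None
```

For the YAML above, find_cycle({"preprocess": [], "train": ["preprocess"]}) returns None, so the pipeline is accepted.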

4. Stages = Docker Containers

# modules/train.py - runs in container
from goldfish.io import load_input, save_output
features = load_input("features")  # from /mnt/inputs/
save_output("model", model_dir)    # to /mnt/outputs/

5. Signals = Data Flow

Type | Format | Use Case
dataset | External | Registered project data
npy | NumPy | Arrays, embeddings
csv | Pandas | Tabular data
directory | Dir | Model checkpoints
file | Single file | Configs, small outputs
Tracked in signal_lineage table for full provenance.

6. Resource Profiles

# configs/train.yaml
compute:
  profile: "h100-spot"  # Claude writes this

Goldfish resolves to: a3-highgpu-1g, H100 GPU, spot pricing, multi-zone.

Built-in: cpu-small, cpu-large, h100-spot, h100-on-demand, a100-spot, a100-on-demand

7. SVS (Semantic Validation System)

Core System: validation.py, src/goldfish/svs/, integrated into jobs/stage_executor.py

SVS provides defense-in-depth through three phases:

  1. Pre-Run Review: AI-driven static analysis of code/config/diff using the Claude Agent SDK (pre_run_review.py).
  2. Schema Contracts: Mechanistic validation of stage outputs against pipeline.yaml definitions (shape, dtype, kind).
  3. Output Stats: Automatic computation of statistical properties (entropy, null ratio, unique counts) for every signal, stored in signal_lineage.stats_json.

Key Patterns:

  • Enforcement Modes: warning (log only) or blocking (fail stage).
  • Reservoir Sampling: Stats are computed on 10k samples to handle large tensors/CSVs.
  • Fail-Open: AI reviews approve by default on timeout or API error to avoid blocking developer velocity.
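The reservoir-sampling step is standard Algorithm R: keep the first k items, then replace existing entries with decreasing probability. A sketch (illustrative; the actual SVS stats code may differ):

```python
import random
from typing import Iterable, Optional


def reservoir_sample(stream: Iterable, k: int = 10_000,
                     seed: Optional[int] = None) -> list:
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir: list = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

This gives a fixed memory footprint regardless of tensor or CSV size, which is why stats stay cheap on large signals.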

Security:

  • Path traversal protection in pre_run_review.py.
  • File size limits (100KB/file) for review context.

8. Cloud Abstraction Layer

Core System: src/goldfish/cloud/ - Backend-agnostic execution and storage.

The cloud abstraction layer isolates provider-specific code (GCP, AWS, local) from core Goldfish logic:

cloud/
├── protocols.py    # Interfaces: RunBackend, ObjectStorage, ImageBuilder
├── contracts.py    # Data types: BackendCapabilities, RunSpec, BackendStatus
├── factory.py      # AdapterFactory for DI
└── adapters/
    ├── local/      # LocalRunBackend (Docker containers)
    └── gcp/        # GCERunBackend (GCE instances), GCSStorage

Key Protocols:

  • RunBackend: Unified interface for execution (launch(), get_status(), terminate(), get_logs())
  • ObjectStorage: Blob storage operations (put(), get(), exists(), delete())
  • ImageBuilder: Docker image building (build(), push())
  • SignalBus: Inter-stage signal coordination and messaging
  • InstanceIdentity: Cloud instance metadata abstraction
  • ImageRegistry: Container image registry operations

BackendCapabilities - Behavior configuration instead of conditionals:

@dataclass
class BackendCapabilities:
    ack_timeout_seconds: float = 1.0    # How long to wait for ACK
    has_launch_delay: bool = False       # GCE has startup delay, local doesn't
    timeout_becomes_pending: bool = False # GCE timeout = sync pending, local = failure
    logs_unavailable_message: str = "Logs not available"
    zone_resolution_method: str = "config"  # "config" or "handle"

Usage Pattern - Always use protocol, never direct launcher:

# GOOD: Protocol-based
result = self.run_backend.launch(spec)
status = self.run_backend.get_status(handle)
logs = self.run_backend.get_logs(handle)

# BAD: Direct launcher access (violates abstraction)
self.gce_launcher.launch_instance(...)  # NEVER do this

Adding a New Backend:

  1. Implement RunBackend protocol in cloud/adapters/your_provider/run_backend.py
  2. Set appropriate BackendCapabilities values
  3. Register in cloud/factory.py
  4. No changes needed in stage_executor.py or execution_tools.py

9. Configuration Flexibility

Defaults Section - Global settings for stage execution:

# goldfish.yaml
defaults:
  timeout_seconds: 7200    # 2 hours (default: 3600)
  log_sync_interval: 15    # Sync logs every 15 seconds (default: 10)
  backend: gce             # Default compute backend: local, gce, kubernetes

Storage Backend Configuration - Multi-provider storage support:

# goldfish.yaml
storage:
  backend: "gcs"  # or "s3", "azure", "local"

  # GCS configuration (when backend: gcs)
  gcs:
    bucket: "my-bucket"
    sources_prefix: "sources/"
    artifacts_prefix: "artifacts/"

  # S3 configuration (when backend: s3) - adapter coming soon
  s3:
    bucket: "my-bucket"
    region: "us-east-1"
    endpoint_url: "http://localhost:9000"  # For MinIO/S3-compatible

  # Azure configuration (when backend: azure) - adapter coming soon
  azure:
    container: "my-container"
    account: "mystorageaccount"

Backend Selection Priority:

  1. New storage: section takes precedence if present
  2. Falls back to legacy gcs: section for backwards compatibility
  3. AdapterFactory.create_storage() handles resolution automatically
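That priority order can be sketched as a small resolver (resolve_storage_backend is a hypothetical name for illustration, not the actual AdapterFactory code):

```python
def resolve_storage_backend(config: dict) -> str:
    """Pick a storage backend name following the priority order above."""
    storage = config.get("storage")
    if storage and storage.get("backend"):
        return storage["backend"]  # 1. new storage: section takes precedence
    if config.get("gcs"):
        return "gcs"               # 2. legacy gcs: section for backwards compat
    return "local"                 # 3. default when nothing is configured
```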

Per-Profile Backend Selection - Different compute backends per profile:

# goldfish.yaml
gce:
  project_id: my-project
  profile_overrides:
    # GPU workloads on GCE
    h100-spot:
      zones: ["us-central1-a"]
    # CPU workloads could use different config
    cpu-large:
      zones: ["us-west1-a", "us-west1-b"]

Config Model Hierarchy:

GoldfishConfig
├── defaults: DefaultsConfig          # Global execution defaults
├── storage: StorageConfig | None     # Multi-backend storage (new)
│   ├── backend: "gcs" | "s3" | "azure" | "local"
│   ├── gcs: GCSConfig | None
│   ├── s3: S3StorageConfig | None
│   └── azure: AzureStorageConfig | None
├── gcs: GCSConfig | None             # Legacy GCS config (backwards compat)
├── gce: GCEConfig | None             # GCE compute config
├── jobs: JobsConfig                  # Job execution settings
└── local: LocalConfig                # Local backend simulation

Critical Patterns

Database Access

# ALWAYS use context manager
with self.db._conn() as conn:
    conn.execute("INSERT INTO ...")
# Transaction auto-commits on success, auto-rollbacks on exception

Error Handling

# ALWAYS use specific error types with details
raise WorkspaceNotFoundError(
    f"Workspace '{name}' not found",
    details={"available": available_workspaces}
)

# NEVER expose git internals
# BAD:  raise Exception("fatal: not a valid object name")
# GOOD: raise WorkspaceNotFoundError("Workspace not found")

TypedDict Returns from Database

# When returning TypedDict, ALWAYS use cast()
from typing import cast
return cast(JobRow, dict(row)) if row else None

# For lists:
return [cast(SourceRow, dict(r)) for r in rows]

MCP Tool Pattern

@mcp.tool()
def my_tool(param: str) -> dict:
    """Docstring for Claude."""
    try:
        validate_workspace_name(param)           # 1. Validate
        result = manager.do_thing(param)         # 2. Execute
        ctx.db.record_audit("my_tool", {...})    # 3. Audit
        return {"success": True, "result": result}  # 4. Return
    except GoldfishError as e:
        return {"success": False, "error": e.message}

Security Model (4 Layers)

1. Input Validation (validation.py)

Input | Pattern | Example
Workspace name | ^[a-zA-Z0-9_-]+$ | baseline_lstm
Snapshot ID | ^snap-[a-f0-9]{8}-\d{8}-\d{6}$ | snap-abc12345-20251210-143000
Stage run ID | ^stage-[a-f0-9]+$ | stage-abc123

2. Path Traversal Protection

# ALWAYS validate paths
def validate_path_within_root(path: Path, root: Path) -> None:
    if not path.resolve().is_relative_to(root.resolve()):
        raise ValidationError("Path traversal")

# ALWAYS check symlinks (TOCTOU prevention)
if path.is_symlink():
    raise InvalidLogPathError("Symlink detected")

3. Docker Sandboxing (cloud/adapters/local/)

# Containers run with:
--memory 4g --cpus 2.0 --pids-limit 100
--user 1000:1000  # non-root
-v inputs:/mnt/inputs:ro  # read-only inputs

4. Git Error Translation (errors.py)

All git errors translated to Goldfish concepts before reaching Claude.


Stage Execution Flow

run("w1", stages=["train"])
         │
         ├─▶ 1. Validate workspace mounted
         ├─▶ 2. SYNC: Copy user/w1 → dev-repo/branch (with delete semantics)
         ├─▶ 3. COMMIT: Auto-commit changes in dev-repo
         ├─▶ 4. PUSH: Push to remote (for GCE execution)
         ├─▶ 5. Auto-version (create git tag from committed SHA)
         ├─▶ 6. Load pipeline, validate stage exists
         ├─▶ 7. Resolve inputs (datasets or upstream signals)
         ├─▶ 8. Build Docker image
         ├─▶ 9. Launch container (local or GCE)
         ├─▶ 10. Monitor status, stream logs
         └─▶ 11. Finalize: register outputs in signal_lineage

Key methods:

  • GitLayer.sync_slot_to_branch() - sync + commit (provenance guard)
  • StageExecutor.run_stage() in jobs/stage_executor.py

Database Schema (Key Tables)

workspace_versions(workspace_name, version, git_sha, created_by, created_at)
stage_runs(id, workspace_name, version, stage_name, status, backend_type, ...)
signal_lineage(stage_run_id, signal_name, signal_type, storage_location)
audit(operation, workspace, details_json, created_at)

Full schema: db/schema.sql


Testing

Structure

tests/
├── unit/           # Over 2400 tests, <1s, pure logic, all mocked
├── integration/    # Over 1200 tests, ~2min, real DB + git
├── e2e/            # Full Docker tests
│   └── deluxe/     # GCE tests (@pytest.mark.deluxe_gce)
└── conftest.py     # Fixtures: test_db, temp_git_repo

Key Fixtures

test_db        # Fresh SQLite with schema
temp_git_repo  # Initialized git repo with main branch
test_config    # GoldfishConfig for testing

Writing Tests

def test_feature(test_db, temp_git_repo):
    """What + Why in docstring."""
    manager = WorkspaceManager(db=test_db, ...)
    result = manager.create_workspace("test", "goal")
    assert result.name == "test"
    # Always verify DB state too
    with test_db._conn() as conn:
        row = conn.execute("SELECT ...").fetchone()
        assert row is not None

DO and DON'T

DO | DON'T
make lint before committing | # type: ignore (fix the issue)
Specific error types (WorkspaceNotFoundError) | Expose git terminology to MCP clients
cast() for TypedDict database returns | Bare except: (use except Exception:)
Validate all inputs before operations | raise X without from e when re-raising
Record audit log for user-facing operations | Commit with failing tests or lint
Write tests for new functionality | Skip input validation
Focus on what needs to be done, not when | Provide time estimates (AI is ~100x faster than you think)

Adding New Features

New MCP Tool

  1. Add to appropriate server_tools/*.py
  2. Follow the tool pattern (validate → execute → audit → return)
  3. Add tests in tests/integration/
  4. Update tool count in README if significant

New Database Table

  1. Add schema to db/schema.sql
  2. Add CRUD methods to db/database.py
  3. Add TypedDict to db/types.py
  4. Add tests

New Signal Type

  1. Update SignalDef in models.py
  2. Update pipeline/parser.py validation
  3. Update io/__init__.py load/save handling
  4. Add tests

Debugging

# Database state (in dev repo)
sqlite3 ../myproject-dev/.goldfish/goldfish.db "SELECT * FROM stage_runs ORDER BY started_at DESC LIMIT 5"

# Git state (dev repo has all branches/tags)
cd ../myproject-dev && git log --all --oneline --graph

# Check workspace mount metadata
cat workspaces/w1/.goldfish-mount

# Docker
docker ps                           # Running containers
docker logs goldfish-workspace-v1   # Container logs

# Verbose logging
import logging; logging.basicConfig(level=logging.DEBUG)

File Quick Reference

Component | Files
Entry | server.py, cli.py, __main__.py
Context | context.py (ServerContext DI)
Models | models.py (Pydantic), db/types.py (TypedDict)
Validation | validation.py, errors.py
Cloud | cloud/protocols.py, cloud/contracts.py, cloud/factory.py, cloud/adapters/
Workspace | workspace/manager.py, workspace/git_layer.py (copy-based + sync)
Execution | jobs/stage_executor.py, jobs/pipeline_executor.py
Pipeline | pipeline/parser.py, pipeline/manager.py
Infra | infra/docker_builder.py (use cloud/ adapters for execution)
Data | datasets/registry.py, sources/registry.py
State | state/state_md.py (per-workspace + global STATE.md)
IO | io/__init__.py (container load_input/save_output)
Tools | server_tools/*.py (40+ MCP tools)

Conventions

  • Ruff for linting/formatting (via pre-commit)
  • mypy strict mode for type checking
  • Google-style docstrings for public APIs
  • Semantic error types (not generic Exception)
  • Context managers for database transactions
  • cast() for TypedDict returns from SQLite

Common Fixes

Error | Fix
TypedDict return type mismatch | return cast(JobRow, dict(row))
Closure captures None-able var | Assign to local: registry = self.registry, then use in closure
no-any-return from mypy | Add explicit type annotation to the return variable
Forward reference error | Add from __future__ import annotations
E402 module import order | Move ALL imports to top, constants below

When getting ruff, mypy, or test errors: never cheat with ignores—always fix properly.


claude Bot commented Mar 20, 2026

PR Review: fix(warm-pool): Wire dispatch path, fix stale GCS paths

Good set of critical bug fixes — the dead dispatch path and stale GCS paths are real correctness issues. The startup script fixes are well-reasoned and the regression tests are a solid addition. A few issues worth addressing before merge.


CRITICAL

1. O(n) scan introduced in run_backend.py (the very anti-pattern fixed in _stage_executor_impl.py)

This fires on every launch() call and scans all running warm instances. The PR description calls out the O(n) scan as a fix in _stage_executor_impl.py, but then introduces the same pattern here. The simplest fix: have launch_instance() return a boolean flag indicating whether a warm instance was reused, or do a single-row SQL query like the _stage_executor_impl.py fix does. The misleading comment above this block (zone != empty string means it was warm) describes a check that never actually happens.

2. Private attribute access on WarmPoolManager

run_backend.py reaches into self._warm_pool._config and self._warm_pool._db directly. This breaks encapsulation and makes both classes fragile to internal refactors. WarmPoolManager should expose public methods/properties: e.g. idle_timeout_seconds_for(profile). The DB lookup should also live inside WarmPoolManager itself — an is_registered(instance_name) method would eliminate issues 1 and 2 together.


MEDIUM

3. Shell injection risk in _build_docker_cmd_script

Env var keys are not shell-quoted — a key with a space or dollar-sign breaks the command or enables injection. Validate keys against the pattern ^[A-Z_][A-Z0-9_]*$. Also, image and cmd are interpolated directly without quoting. Wrap with shlex.quote().
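A sketch of that validate-and-quote approach (names are illustrative, not the actual _build_docker_cmd_script internals):

```python
import re
import shlex

# Only conventional env var names pass; anything else is rejected outright.
ENV_KEY_RE = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def render_docker_env_flags(env: dict) -> str:
    """Validate env var keys and shell-quote values for a docker run line."""
    flags = []
    for key, value in env.items():
        if not ENV_KEY_RE.match(key):
            raise ValueError(f"Unsafe env var name: {key!r}")
        flags.append(f"-e {key}={shlex.quote(value)}")
    return " ".join(flags)
```

The same shlex.quote() treatment should apply to the interpolated image and cmd values.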

4. Hardcoded Docker mounts diverge from startup_builder.py

The warm pool docker command and the cold-start command are now maintained separately. If startup_builder.py ever adds a mount (e.g. for a new signal type), warm pool runs will silently lack it. Extract a shared _docker_mounts() helper used by both paths.

5. Any type for warm_pool_manager parameter

gce_launcher.py is in the same package as warm_pool.py. Use TYPE_CHECKING to import WarmPoolManager properly, or define a minimal Protocol with try_claim().
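For example, a sketch combining both suggestions: a TYPE_CHECKING import for annotations plus a minimal Protocol (the launch_instance signature and the try_claim profile argument are assumptions for illustration):

```python
from typing import TYPE_CHECKING, Optional, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Seen only by mypy; avoids importing the concrete class at runtime.
    from goldfish.cloud.adapters.gcp.warm_pool import WarmPoolManager


@runtime_checkable
class WarmPoolClaimer(Protocol):
    """Structural type: the launcher depends only on try_claim()."""

    def try_claim(self, profile: str) -> Optional[str]: ...


def launch_instance(warm_pool_manager: Optional[WarmPoolClaimer] = None) -> str:
    # Hypothetical shape: reuse a warm instance when one is claimable.
    if warm_pool_manager is not None:
        claimed = warm_pool_manager.try_claim("h100-spot")
        if claimed is not None:
            return claimed
    return "fresh-instance"
```

Either route removes the Any and keeps gce_launcher decoupled from WarmPoolManager internals.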


MINOR

6. import shlex inside method body — move to module-level imports per project conventions.

7. GCS_BUCKET not exported in bash

The comment says Export bucket name... but the bash line does not use export. While script-scope variables are visible to functions in the same script, use export GCS_BUCKET to match the stated intent.


TESTS

Coverage is good — lifecycle tests hit the real DB and all three startup script regressions are covered. Two observations:

  • test_register_and_claim_roundtrip calls test_db.claim_warm_instance() directly rather than manager.try_claim(). The comment says Claim (without signal bus, falls through) but try_claim is not invoked. This tests the DB layer (fine), but the name and comment are misleading. Rename or add a companion test that actually calls try_claim with no signal bus.

  • test_try_claim_with_ack_timeout_releases patches time.sleep but still iterates the full 30-iteration ACK poll loop. Consider making the poll count configurable or using a shorter count in tests to keep CI fast.
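One way to make the poll count injectable (names hypothetical; the real warm_pool.py loop may be shaped differently):

```python
import time
from typing import Callable


def wait_for_ack(check_ack: Callable[[], bool],
                 poll_count: int = 30,
                 poll_interval: float = 1.0) -> bool:
    """Poll for an ACK up to poll_count times; True if it arrived.

    Tests can pass poll_count=1 and poll_interval=0 instead of
    patching time.sleep and spinning through 30 iterations.
    """
    for _ in range(poll_count):
        if check_ack():
            return True
        time.sleep(poll_interval)
    return False
```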


SUMMARY

The core bugs are real and the approach is sound. The two critical items (O(n) scan and private attribute access) are easy fixes and should be addressed before merge — keeping them would undermine the quality improvements the PR is trying to make.

Review feedback: if we reach the code after launch_instance(), the warm
pool claim either wasn't attempted or returned None. This is always a
fresh launch — no need to scan warm_instances to check.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

claude Bot commented Mar 20, 2026

Hello from test

@claude
Copy link
Copy Markdown

claude Bot commented Mar 20, 2026

Goldfish Development Guide

For AI assistants working on this codebase. Everything you need to develop Goldfish effectively—compact, scannable, and action-oriented.

Quick Reference

# Development cycle
make lint              # Ruff + mypy via pre-commit - run before commits
make test              # Fast unit tests (<1s) - run frequently
make test-integration  # Integration tests (~2min) - before pushing
make ci                # Full CI suite (lint + all tests)

# First-time setup
uv pip install -e ".[dev]"
make install-hooks     # REQUIRED: installs pre-commit hooks

Golden rule: Never suppress lint errors—always fix the source.


TDD: Test-Driven Development

This codebase uses TDD. Write tests BEFORE implementation.

RED    → Write failing test that defines expected behavior
GREEN  → Write minimal code to make test pass
REFACTOR → Clean up while keeping tests green

Why TDD matters for LLMs:

  • Tests encode intent unambiguously—no guessing what "done" means
  • Failing tests provide immediate feedback on implementation correctness
  • Test-first prevents over-engineering (you only build what's tested)
  • Regression safety: refactoring is safe when tests pass

Workflow:

  1. Understand the requirement
  2. Write test(s) in tests/unit/ or tests/integration/
  3. Run make test → confirm RED (test fails)
  4. Implement the feature
  5. Run make test → confirm GREEN (test passes)
  6. Run make lint → fix any issues
  7. Refactor if needed, keeping tests green

Test naming: test_<what>_<condition>_<expected> e.g., test_get_logs_when_file_missing_returns_none

No exceptions: Even "quick fixes" get tests first. The test documents the bug and prevents regression.


What is Goldfish?

An MCP server enabling Claude Code to conduct ML experiments by managing:

  • Workspaces = isolated experiment environments (copy-based, NO git in user workspace)
  • Versions = immutable snapshots (auto-created on every run, 100% provenance)
  • Pipelines = YAML workflows (stage definitions + signal wiring)
  • Stages = Python modules (run in Docker containers)
  • Signals = typed data flow (dataset, npy, csv, directory, file)

Core invariants:

  • All infrastructure (Docker, GCS, GCE) is hidden from the MCP client
  • User workspace is plain files (no .git) - all versioning in dev repo
  • Every run() syncs and commits BEFORE execution (100% provenance)

Architecture at a Glance

MCP Client (Claude) ─── JSON-RPC ───▶ server.py
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    ▼                     ▼                     ▼
             server_tools/*         context.py            db/database.py
             (40+ MCP tools)     (ServerContext DI)         (SQLite)
                    │                     │
        ┌───────────┼───────────┐         │
        ▼           ▼           ▼         ▼
   workspace/    jobs/      pipeline/   cloud/
   manager.py   stage_      parser.py   protocols.py    ◄── Backend-agnostic
   git_layer.py executor.py             contracts.py        interfaces
                    │                   factory.py
                    │                       │
                    └───────────────────────┤
                                            ▼
                                    cloud/adapters/
                                    ├── local/       ◄── Docker
                                    └── gcp/         ◄── GCE + GCS

Key Files

File Purpose
jobs/stage_executor.py Core: Stage execution + sync + provenance + review
cloud/protocols.py RunBackend, ObjectStorage, ImageBuilder interfaces
cloud/contracts.py BackendCapabilities, RunSpec, RunHandle, BackendStatus
cloud/factory.py AdapterFactory for backend instantiation
cloud/adapters/local/ LocalRunBackend (Docker-based execution)
cloud/adapters/gcp/ GCERunBackend (GCE instances), GCSStorage
pre_run_review.py Pre-run code review using Claude Agent SDK
db/database.py All database operations
workspace/manager.py Workspace CRUD + copy-based mounting
workspace/git_layer.py Git ops + sync_slot_to_branch
server.py MCP server initialization

The Nine Abstractions

1. Workspaces = Copy-Based Isolation

MOUNT:  dev-repo/branch ──copy──▶ user/workspaces/w1/ (plain files, NO .git)
WORK:   Claude edits user/workspaces/w1/
RUN:    user/w1/ ──sync──▶ dev-repo/branch ──commit──▶ execute

Key operations: create_workspace(), mount(), hibernate(), checkpoint()
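The MOUNT step above can be sketched as a copy that drops .git; the real logic lives in workspace/manager.py, and the mount_copy name and paths here are illustrative only.

```python
# Sketch of copy-based mounting under stated assumptions: copy the branch
# checkout into the user workspace, dropping .git so the user sees plain
# files only. mount_copy is a hypothetical helper, not the real API.
import shutil
from pathlib import Path


def mount_copy(dev_branch_dir: Path, workspace_dir: Path) -> None:
    shutil.copytree(
        dev_branch_dir,
        workspace_dir,
        ignore=shutil.ignore_patterns(".git"),
    )
```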

2. Versions = Git Tags (100% Provenance)

Every run() syncs changes back to dev repo, commits, THEN creates version:

user edits ──sync──▶ commit in dev-repo ──tag──▶ baseline_lstm-v1

Stored in workspace_versions table with created_by: run|checkpoint|manual

3. Pipelines = YAML

stages:
  - name: preprocess
    inputs: {raw: {type: dataset, dataset: sales_v1}}
    outputs: {features: {type: npy}}
  - name: train
    inputs: {features: {from_stage: preprocess, signal: features}}

Parser validates: unique names, type compatibility, no cycles, datasets exist.
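The "no cycles" check can be sketched as a depth-first search over from_stage edges; the deps mapping shape is an assumption, and the real validation in pipeline/parser.py is more thorough.

```python
# DFS cycle detection over stage dependencies (illustrative sketch).
def has_cycle(deps: dict[str, list[str]]) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on stack / done
    color = {stage: WHITE for stage in deps}

    def visit(stage: str) -> bool:
        color[stage] = GRAY
        for dep in deps.get(stage, []):
            if color.get(dep) == GRAY:                  # back edge: cycle
                return True
            if color.get(dep) == WHITE and visit(dep):
                return True
        color[stage] = BLACK
        return False

    return any(color[s] == WHITE and visit(s) for s in deps)
```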

4. Stages = Docker Containers

# modules/train.py - runs in container
from goldfish.io import load_input, save_output
features = load_input("features")  # from /mnt/inputs/
save_output("model", model_dir)    # to /mnt/outputs/

5. Signals = Data Flow

Type Format Use Case
dataset External Registered project data
npy NumPy Arrays, embeddings
csv Pandas Tabular data
directory Dir Model checkpoints
file Single file Configs, small outputs

Tracked in signal_lineage table for full provenance.

6. Resource Profiles

# configs/train.yaml
compute:
  profile: "h100-spot"  # Claude writes this

Goldfish resolves to: a3-highgpu-1g, H100 GPU, spot pricing, multi-zone.

Built-in: cpu-small, cpu-large, h100-spot, h100-on-demand, a100-spot, a100-on-demand

7. SVS (Semantic Validation System)

Core System: validation.py, src/goldfish/svs/, integrated into jobs/stage_executor.py

SVS provides defense-in-depth through three phases:

  1. Pre-Run Review: AI-driven static analysis of code/config/diff using the Claude Agent SDK (pre_run_review.py).
  2. Schema Contracts: Mechanistic validation of stage outputs against pipeline.yaml definitions (shape, dtype, kind).
  3. Output Stats: Automatic computation of statistical properties (entropy, null ratio, unique counts) for every signal, stored in signal_lineage.stats_json.

Key Patterns:

  • Enforcement Modes: warning (log only) or blocking (fail stage).
  • Reservoir Sampling: Stats are computed on 10k samples to handle large tensors/CSVs.
  • Fail-Open: AI reviews approve by default on timeout or API error to avoid blocking developer velocity.

Security:

  • Path traversal protection in pre_run_review.py.
  • File size limits (100KB/file) for review context.

8. Cloud Abstraction Layer

Core System: src/goldfish/cloud/ - Backend-agnostic execution and storage.

The cloud abstraction layer isolates provider-specific code (GCP, AWS, local) from core Goldfish logic:

cloud/
├── protocols.py    # Interfaces: RunBackend, ObjectStorage, ImageBuilder
├── contracts.py    # Data types: BackendCapabilities, RunSpec, BackendStatus
├── factory.py      # AdapterFactory for DI
└── adapters/
    ├── local/      # LocalRunBackend (Docker containers)
    └── gcp/        # GCERunBackend (GCE instances), GCSStorage

Key Protocols:

  • RunBackend: Unified interface for execution (launch(), get_status(), terminate(), get_logs())
  • ObjectStorage: Blob storage operations (put(), get(), exists(), delete())
  • ImageBuilder: Docker image building (build(), push())
  • SignalBus: Inter-stage signal coordination and messaging
  • InstanceIdentity: Cloud instance metadata abstraction
  • ImageRegistry: Container image registry operations

BackendCapabilities - Behavior configuration instead of conditionals:

@dataclass
class BackendCapabilities:
    ack_timeout_seconds: float = 1.0    # How long to wait for ACK
    has_launch_delay: bool = False       # GCE has startup delay, local doesn't
    timeout_becomes_pending: bool = False # GCE timeout = sync pending, local = failure
    logs_unavailable_message: str = "Logs not available"
    zone_resolution_method: str = "config"  # "config" or "handle"
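A sketch of how these flags drive behavior instead of backend-type conditionals; Caps mirrors only the fields used here, and classify_ack_timeout is a hypothetical helper, not the executor's real code.

```python
# Capability-driven branching: the caller inspects flags, never the
# backend's concrete type.
from dataclasses import dataclass


@dataclass
class Caps:  # mirrors the BackendCapabilities fields used below
    ack_timeout_seconds: float = 1.0
    timeout_becomes_pending: bool = False


def classify_ack_timeout(caps: Caps, elapsed: float) -> str:
    if elapsed < caps.ack_timeout_seconds:
        return "waiting"
    # GCE-style backends treat an ACK timeout as still-pending (startup
    # delay); local backends treat it as a failure.
    return "pending" if caps.timeout_becomes_pending else "failed"
```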

Usage Pattern - Always use protocol, never direct launcher:

# GOOD: Protocol-based
result = self.run_backend.launch(spec)
status = self.run_backend.get_status(handle)
logs = self.run_backend.get_logs(handle)

# BAD: Direct launcher access (violates abstraction)
self.gce_launcher.launch_instance(...)  # NEVER do this

Adding a New Backend:

  1. Implement RunBackend protocol in cloud/adapters/your_provider/run_backend.py
  2. Set appropriate BackendCapabilities values
  3. Register in cloud/factory.py
  4. No changes needed in stage_executor.py or execution_tools.py
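Step 1 can be sketched as a do-nothing backend; the method names follow the RunBackend protocol listed earlier (launch/get_status/terminate/get_logs), but the RunHandle shape and status strings here are assumptions.

```python
# Hypothetical skeleton of a new RunBackend implementation.
from dataclasses import dataclass


@dataclass
class RunHandle:
    run_id: str


class NoopRunBackend:
    """Backend that runs nothing; useful as a wiring smoke test."""

    def launch(self, spec: object) -> RunHandle:
        return RunHandle(run_id=f"noop-{id(spec)}")

    def get_status(self, handle: RunHandle) -> str:
        return "succeeded"

    def terminate(self, handle: RunHandle) -> None:
        pass

    def get_logs(self, handle: RunHandle) -> str:
        return "Logs not available"
```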

9. Configuration Flexibility

Defaults Section - Global settings for stage execution:

# goldfish.yaml
defaults:
  timeout_seconds: 7200    # 2 hours (default: 3600)
  log_sync_interval: 15    # Sync logs every 15 seconds (default: 10)
  backend: gce             # Default compute backend: local, gce, kubernetes

Storage Backend Configuration - Multi-provider storage support:

# goldfish.yaml
storage:
  backend: "gcs"  # or "s3", "azure", "local"

  # GCS configuration (when backend: gcs)
  gcs:
    bucket: "my-bucket"
    sources_prefix: "sources/"
    artifacts_prefix: "artifacts/"

  # S3 configuration (when backend: s3) - adapter coming soon
  s3:
    bucket: "my-bucket"
    region: "us-east-1"
    endpoint_url: "http://localhost:9000"  # For MinIO/S3-compatible

  # Azure configuration (when backend: azure) - adapter coming soon
  azure:
    container: "my-container"
    account: "mystorageaccount"

Backend Selection Priority:

  1. New storage: section takes precedence if present
  2. Falls back to legacy gcs: section for backwards compatibility
  3. AdapterFactory.create_storage() handles resolution automatically
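The precedence above can be sketched as a small resolver; config shapes are plain dicts here rather than the real Pydantic models, and the "local" default is an assumption.

```python
# Illustrative sketch of the storage-backend precedence rules.
def resolve_storage_backend(config: dict) -> str:
    storage = config.get("storage")
    if storage and storage.get("backend"):
        return storage["backend"]   # 1. new storage: section wins
    if config.get("gcs"):
        return "gcs"                # 2. legacy gcs: fallback
    return "local"                  # 3. default when neither is present
```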

Per-Profile Backend Selection - Different compute backends per profile:

# goldfish.yaml
gce:
  project_id: my-project
  profile_overrides:
    # GPU workloads on GCE
    h100-spot:
      zones: ["us-central1-a"]
    # CPU workloads could use different config
    cpu-large:
      zones: ["us-west1-a", "us-west1-b"]

Config Model Hierarchy:

GoldfishConfig
├── defaults: DefaultsConfig          # Global execution defaults
├── storage: StorageConfig | None     # Multi-backend storage (new)
│   ├── backend: "gcs" | "s3" | "azure" | "local"
│   ├── gcs: GCSConfig | None
│   ├── s3: S3StorageConfig | None
│   └── azure: AzureStorageConfig | None
├── gcs: GCSConfig | None             # Legacy GCS config (backwards compat)
├── gce: GCEConfig | None             # GCE compute config
├── jobs: JobsConfig                  # Job execution settings
└── local: LocalConfig                # Local backend simulation

Critical Patterns

Database Access

# ALWAYS use context manager
with self.db._conn() as conn:
    conn.execute("INSERT INTO ...")
# Transaction auto-commits on success, auto-rollbacks on exception

Error Handling

# ALWAYS use specific error types with details
raise WorkspaceNotFoundError(
    f"Workspace '{name}' not found",
    details={"available": available_workspaces}
)

# NEVER expose git internals
# BAD:  raise Exception("fatal: not a valid object name")
# GOOD: raise WorkspaceNotFoundError("Workspace not found")

TypedDict Returns from Database

# When returning TypedDict, ALWAYS use cast()
from typing import cast
return cast(JobRow, dict(row)) if row else None

# For lists:
return [cast(SourceRow, dict(r)) for r in rows]

MCP Tool Pattern

@mcp.tool()
def my_tool(param: str) -> dict:
    """Docstring for Claude."""
    try:
        validate_workspace_name(param)           # 1. Validate
        result = manager.do_thing(param)         # 2. Execute
        ctx.db.record_audit("my_tool", {...})    # 3. Audit
        return {"success": True, "result": result}  # 4. Return
    except GoldfishError as e:
        return {"success": False, "error": e.message}

Security Model (4 Layers)

1. Input Validation (validation.py)

Input Pattern Example
Workspace name ^[a-zA-Z0-9_-]+$ baseline_lstm
Snapshot ID ^snap-[a-f0-9]{8}-\d{8}-\d{6}$ snap-abc12345-20251210-143000
Stage run ID ^stage-[a-f0-9]+$ stage-abc123
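The table's patterns translate directly into full-match checks; the regexes are copied from above, while the dict keys and the ValueError stand in for the real validation.py error types.

```python
# Anchored validation of user-supplied identifiers (re.fullmatch makes
# the ^...$ anchors from the table implicit).
import re

PATTERNS = {
    "workspace": r"[a-zA-Z0-9_-]+",
    "snapshot": r"snap-[a-f0-9]{8}-\d{8}-\d{6}",
    "stage_run": r"stage-[a-f0-9]+",
}


def validate(kind: str, value: str) -> None:
    if not re.fullmatch(PATTERNS[kind], value):
        raise ValueError(f"invalid {kind}: {value!r}")
```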

2. Path Traversal Protection

# ALWAYS validate paths
def validate_path_within_root(path: Path, root: Path) -> None:
    if not path.resolve().is_relative_to(root.resolve()):
        raise ValidationError("Path traversal")

# ALWAYS check symlinks (TOCTOU prevention)
if path.is_symlink():
    raise InvalidLogPathError("Symlink detected")

3. Docker Sandboxing (cloud/adapters/local/)

# Containers run with:
--memory 4g --cpus 2.0 --pids-limit 100
--user 1000:1000  # non-root
-v inputs:/mnt/inputs:ro  # read-only inputs

4. Git Error Translation (errors.py)

All git errors translated to Goldfish concepts before reaching Claude.


Stage Execution Flow

run("w1", stages=["train"])
         │
         ├─▶ 1. Validate workspace mounted
         ├─▶ 2. SYNC: Copy user/w1 → dev-repo/branch (with delete semantics)
         ├─▶ 3. COMMIT: Auto-commit changes in dev-repo
         ├─▶ 4. PUSH: Push to remote (for GCE execution)
         ├─▶ 5. Auto-version (create git tag from committed SHA)
         ├─▶ 6. Load pipeline, validate stage exists
         ├─▶ 7. Resolve inputs (datasets or upstream signals)
         ├─▶ 8. Build Docker image
         ├─▶ 9. Launch container (local or GCE)
         ├─▶ 10. Monitor status, stream logs
         └─▶ 11. Finalize: register outputs in signal_lineage

Key methods:

  • GitLayer.sync_slot_to_branch() - sync + commit (provenance guard)
  • StageExecutor.run_stage() in jobs/stage_executor.py

Database Schema (Key Tables)

workspace_versions(workspace_name, version, git_sha, created_by, created_at)
stage_runs(id, workspace_name, version, stage_name, status, backend_type, ...)
signal_lineage(stage_run_id, signal_name, signal_type, storage_location)
audit(operation, workspace, details_json, created_at)

Full schema: db/schema.sql


Testing

Structure

tests/
├── unit/           # Over 2400 tests, <1s, pure logic, all mocked
├── integration/    # Over 1200 tests, ~2min, real DB + git
├── e2e/            # Full Docker tests
│   └── deluxe/     # GCE tests (@pytest.mark.deluxe_gce)
└── conftest.py     # Fixtures: test_db, temp_git_repo

Key Fixtures

test_db        # Fresh SQLite with schema
temp_git_repo  # Initialized git repo with main branch
test_config    # GoldfishConfig for testing

Writing Tests

def test_feature(test_db, temp_git_repo):
    """What + Why in docstring."""
    manager = WorkspaceManager(db=test_db, ...)
    result = manager.create_workspace("test", "goal")
    assert result.name == "test"
    # Always verify DB state too
    with test_db._conn() as conn:
        row = conn.execute("SELECT ...").fetchone()
        assert row is not None

DO and DON'T

DO:

  • make lint before committing
  • Specific error types (WorkspaceNotFoundError)
  • cast() for TypedDict database returns
  • Validate all inputs before operations
  • Record audit log for user-facing operations
  • Write tests for new functionality
  • Focus on what needs to be done, not when

DON'T:

  • # type: ignore (fix the issue)
  • Expose git terminology to MCP clients
  • Bare except: (use except Exception:)
  • raise X without from e when re-raising
  • Commit with failing tests or lint
  • Skip input validation
  • Provide time estimates (AI is ~100x faster than you think)

Adding New Features

New MCP Tool

  1. Add to appropriate server_tools/*.py
  2. Follow the tool pattern (validate → execute → audit → return)
  3. Add tests in tests/integration/
  4. Update tool count in README if significant

New Database Table

  1. Add schema to db/schema.sql
  2. Add CRUD methods to db/database.py
  3. Add TypedDict to db/types.py
  4. Add tests

New Signal Type

  1. Update SignalDef in models.py
  2. Update pipeline/parser.py validation
  3. Update io/__init__.py load/save handling
  4. Add tests

Debugging

# Database state (in dev repo)
sqlite3 ../myproject-dev/.goldfish/goldfish.db "SELECT * FROM stage_runs ORDER BY started_at DESC LIMIT 5"

# Git state (dev repo has all branches/tags)
cd ../myproject-dev && git log --all --oneline --graph

# Check workspace mount metadata
cat workspaces/w1/.goldfish-mount

# Docker
docker ps                           # Running containers
docker logs goldfish-workspace-v1   # Container logs

# Verbose logging
import logging; logging.basicConfig(level=logging.DEBUG)

File Quick Reference

Component Files
Entry server.py, cli.py, __main__.py
Context context.py (ServerContext DI)
Models models.py (Pydantic), db/types.py (TypedDict)
Validation validation.py, errors.py
Cloud cloud/protocols.py, cloud/contracts.py, cloud/factory.py, cloud/adapters/
Workspace workspace/manager.py, workspace/git_layer.py (copy-based + sync)
Execution jobs/stage_executor.py, jobs/pipeline_executor.py
Pipeline pipeline/parser.py, pipeline/manager.py
Infra infra/docker_builder.py (use cloud/ adapters for execution)
Data datasets/registry.py, sources/registry.py
State state/state_md.py (per-workspace + global STATE.md)
IO io/__init__.py (container load_input/save_output)
Tools server_tools/*.py (40+ MCP tools)

Conventions

  • Ruff for linting/formatting (via pre-commit)
  • mypy strict mode for type checking
  • Google-style docstrings for public APIs
  • Semantic error types (not generic Exception)
  • Context managers for database transactions
  • cast() for TypedDict returns from SQLite

Common Fixes

Error → Fix

  • TypedDict return type mismatch → return cast(JobRow, dict(row))
  • Closure captures None-able var → assign to local (registry = self.registry) then use in closure
  • no-any-return from mypy → add explicit type annotation to the return variable
  • Forward reference error → add from __future__ import annotations
  • E402 module import order → move ALL imports to top, constants below

When getting ruff, mypy, or test errors: never cheat with ignores—always fix properly.

P1 fixes:
1. Status polling: get_status() checks GCS exit code for warm instances
   even when VM is RUNNING (idle loop keeps VM alive). Without this,
   runs stay RUNNING indefinitely.

2. Skip register on warm reuse: GCELaunchResult.warm_reuse flag prevents
   double-inserting the same instance_name (PK violation).

3. Register as 'running' not 'idle': Fresh launches registered as running
   so the reaper won't delete them mid-job. Release to idle happens when
   the startup script enters the idle loop.

4. Guard kill against PID 0: ${PID:-0} sends SIGTERM to entire process
   group. Now checks PID is set and non-zero before killing.

5. Backend_handle identity: Warm pool claims use stage_run_id for GCS
   path resolution so logs/exit code go to the correct run.

6. Double runs/ prefix: run_path is now just stage_run_id (not
   runs/stage_run_id), matching the idle loop's gs://bucket/runs/$path.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@claude

claude Bot commented Mar 20, 2026

Code Review — fix(warm-pool): Wire dispatch path, fix stale GCS paths

The four bugs targeted here are real and the fixes are well-motivated. Below is my independent analysis of the changed code.


Bug — Shell injection in _build_docker_cmd_script (gce_launcher.py)

env_flags = " ".join(f"-e {k}={shlex.quote(v)}" for k, v in env_map.items())

Only the value is quoted with shlex.quote; the key k is interpolated raw. Env var keys are typically safe, but if any key contains a space or shell metacharacter the generated bash script becomes malformed or injectable. Likewise image and cmd are interpolated directly into the heredoc without quoting. Minimal fix:

env_flags = " ".join(f"-e {shlex.quote(k)}={shlex.quote(v)}" for k, v in env_map.items())

Bug — Falsy check skips pool registration for timeout=0 (run_backend.py ~261)

if not is_warm and warm_pool_timeout and self._warm_pool:

warm_pool_timeout is int | None. A value of 0 is falsy, so a zero-second timeout would silently skip register_instance(). Zero is unusual here, but the intent is "not None":

if not is_warm and warm_pool_timeout is not None and self._warm_pool:

Design — Private attribute access on WarmPoolManager (run_backend.py ~240)

warm_pool_timeout = self._warm_pool._config.idle_timeout_minutes * 60

_config is a private attribute. WarmPoolManager should expose a public idle_timeout_seconds() -> int property, and run_backend.py should call that instead. The same principle applies to any other underscore-prefixed attribute accesses added by this PR.


Design — Any type annotation on warm_pool_manager (gce_launcher.py ~136)

warm_pool_manager: Any | None = None,

Any defeats mypy. Use TYPE_CHECKING to avoid circular imports and annotate with the concrete type WarmPoolManager | None.


Design — import shlex inside method body (gce_launcher.py ~409)

Move import shlex to module-level. Inline imports are a code smell and will be flagged by Ruff in CI.


Design — Hardcoded Docker mounts risk skew with startup_builder.py (gce_launcher.py ~428)

The mounts string (which hard-codes -v source:dest pairs for entrypoint.sh, gcs, inputs, and outputs) duplicates the mount list from startup_builder.py. If a new mount is added there (e.g. a secrets volume), warm pool reuse runs will silently lack it. Consider extracting a shared _docker_mount_flags() helper called from both sites.


Test coverage gaps

test_idle_loop_updates_gcs_paths_for_new_run only asserts that GCS_STDOUT_PATH= and new_run_path appear somewhere in the generated script — not that the assembled path is well-formed. A stronger assertion would also check for the double-prefix guard: assert "runs/runs/" not in script.

test_try_claim_with_ack_timeout_releases patches time.sleep but the ACK poll loop still iterates 30 times. Consider parameterising the retry count so the test is readable and the number is explicit.


Positive callouts

  • O(1) DB lookup in _stage_executor_impl.py — SELECT 1 ... WHERE instance_name = ? is a clean, correct fix.
  • gsutil cp - for the exit code write avoids the stale gcsfuse mount entirely — correct approach.
  • status='running' on registration is semantically right; the reaper will not touch a mid-job instance.
  • Process kill guards are defensive — guarding against kill 0 killing the whole process group is important.
  • GCS_BUCKET placement before idle_loop_section ensures the variable is in scope for all path construction in the idle loop.
  • Integration test structure in test_warm_pool_lifecycle.py is well-organised with clear docstrings and realistic state-machine transitions.

Summary

Severity Count Items
Bug 2 Shell injection in env key interpolation; falsy timeout=0 check
Design 4 Private attr access, Any type, inline import, mount skew risk
Test 2 Path-correctness assertion missing, ACK poll count

The dispatch wiring fix and GCS path staleness fix are both correct and necessary. The two bugs above should be addressed before merge; the design items are lower priority but improve long-term maintainability.
