fix: security hardening, steward container URL, migration numbering#403
- Add isAuthenticationError() helper for proper error classification
- Improve try/catch coverage in agent creation and pairing flows
- Update tests for new error handling paths
The second 0043 (seed_chain_data_pricing) is renumbered to 0044, and all subsequent migrations are cascaded +1.

Before: two files numbered 0043
After: sequential 0043-0053 (no gaps, no duplicates)
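A duplicate numeric prefix like the one fixed here can be detected mechanically. The following is a minimal sketch, assuming migration files are named `<prefix>_<name>.sql`; `findDuplicatePrefixes` is a hypothetical helper written for illustration and is not the repository's journal check script.

```typescript
// Hypothetical helper (not from the PR): group migration filenames by their
// numeric prefix and report every prefix that is used by more than one file.
function findDuplicatePrefixes(filenames: string[]): Map<string, string[]> {
  const byPrefix = new Map<string, string[]>();
  for (const name of filenames) {
    // Assumption: the prefix is everything before the first underscore, e.g. "0043".
    const prefix = name.split("_")[0] ?? name;
    const group = byPrefix.get(prefix) ?? [];
    group.push(name);
    byPrefix.set(prefix, group);
  }

  const duplicates = new Map<string, string[]>();
  byPrefix.forEach((group, prefix) => {
    if (group.length > 1) duplicates.set(prefix, group);
  });
  return duplicates;
}
```

Running a check of this kind in CI would flag the "two files numbered 0043" state before it reaches a shared branch.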
Containers on the milady-isolated bridge network cannot reach the host via localhost. Split STEWARD_API_URL into host-side (for orchestrator API calls) and container-side (for Docker env injection).

Host-side: http://localhost:3200 (orchestrator → Steward)
Container-side: http://host.docker.internal:3200 (container → Steward)

The registerAgentWithSteward() function runs Python via SSH on the Docker host, so it correctly uses the host-side URL. The container env var STEWARD_API_URL now uses STEWARD_CONTAINER_URL which defaults to http://host.docker.internal:3200. Configurable via STEWARD_CONTAINER_URL env var for custom setups.
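The host-side versus container-side split can be sketched as a small resolver. This is an illustration only: the function name `resolveContainerUrl`, its parameters, and the loopback rewriting rule are assumptions based on the description above, not the PR's actual code.

```typescript
// Default taken from the commit message above.
const DEFAULT_CONTAINER_URL = "http://host.docker.internal:3200";

// Hypothetical sketch: pick the URL a container should use to reach Steward.
// `override` stands in for the STEWARD_CONTAINER_URL env var.
function resolveContainerUrl(hostUrl: string, override?: string): string {
  // An explicit override always wins.
  if (override) return override;
  try {
    const url = new URL(hostUrl);
    // Loopback addresses point at the container itself, so swap in the
    // Docker-provided hostname for the host machine.
    if (url.hostname === "localhost" || url.hostname === "127.0.0.1") {
      url.hostname = "host.docker.internal";
    }
    return url.origin;
  } catch {
    // Assumption: a malformed host URL falls back to the documented default.
    return DEFAULT_CONTAINER_URL;
  }
}
```

Under these assumptions, `resolveContainerUrl("http://localhost:3200")` yields the container-side default, while a non-loopback host URL is passed through unchanged.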
…x/steward-security-migrations
…fix/steward-security-migrations
PR #403 Review: Security Hardening, Steward URL Fix, Migration Renumbering

Good work combining three focused fixes. The overall direction is solid: defense-in-depth for Docker inputs, the bridge-network URL split is clearly correct, and the migration renumbering resolves a real sequencing issue. A few things need attention before merge.

Critical

Migration renaming will break production if already applied

Drizzle tracks applied migrations by filename in

Before merging, verify that none of the renamed files exist in

Bugs / Issues
// docker-sandbox-utils.ts
export function validateEnvKey(key: string): void {
  if (hasControlChars(key) || !/^[A-Z_][A-Z0-9_]*$/.test(key)) {

If any caller passes lowercase env var keys (e.g.

Steward registration failure leaves a dangling agent record

const stewardAgentToken = await registerAgentWithSteward(ssh, agentId, agentName);
// ...
await ssh.exec(`docker start ${shellQuote(containerName)}`, DOCKER_CMD_TIMEOUT_MS);

On

Hardcoded VPS IP in CI workflow

# .github/workflows/deploy-backend.yml line 219
host: 89.167.63.246

This appears twice (deploy + health check steps). A hardcoded IP is an operational risk: if the host changes, every CI run will silently break. Move to a GitHub secret (

CI Workflow Concerns

The

GitHub Actions version pinning

uses: sarisia/actions-status-discord@v1
uses: appleboy/ssh-action@v1.0.3

Third-party actions should be pinned to a full commit SHA, not a mutable tag, to prevent supply chain attacks (SLSA/OpenSSF guidance). This is particularly important for an action that runs with SSH private key access.

Minor / Suggestions
If JSON parsing fails, the raw trimmed stdout is used as the token. This silently accepts malformed responses (error messages, partial output). Consider logging a warning when falling through to the plain-text path so issues are observable.
function isAuthenticationError(message: string): boolean {
return (
    message.includes("Unauthorized") || ...

This converts 500s to 401s by matching lowercase strings. If upstream auth middleware changes its error wording, auth failures will silently become 500s. Preferably check an error type/code, or add a unit test that exhaustively covers the matched strings.

Error messages leak internal details

In the catch blocks in

url.hostname.replace(new RegExp(`${a.replace(".", "\\.")}$`), b)

The

What's Good
The migration concern is the blocker. The CI failures should be addressed before this lands on |
Addressed the review items in 46da775. Migration note / clarification:
Also fixed in this push:
Validation:
Code Review

Overall this is a solid PR that addresses real security gaps. The defense-in-depth approach with input validators, the

🐛 Bug:
…ture
- Remove old milady-discovery and milady-provisioning-worker services
- Single service restart: eliza-cloud (Next.js on port 3334)
- Health check hits /api/health on port 3334
- Add NEXT_DIST_DIR and PORT env vars to build step
- Add trigger paths: packages/lib, packages/services, app/api
Code Review

Overall this is a solid, well-structured PR. The three concerns are clearly scoped, the new validators have good coverage, and pinning Actions to commit SHAs is excellent supply-chain hygiene. A few issues worth addressing before merge:

Bug: Regex only escapes first dot in domain alias replacement
url.hostname = url.hostname.replace(new RegExp(`${a.replace(".", "\\.")}$`), b);
Fix:

const toRegex = (domain: string) =>
  new RegExp(`${domain.replace(/\./g, "\\.")}$`);
url.hostname = url.hostname.replace(toRegex(a), b);

Security: Raw error messages returned to API callers
const errorMessage = getErrorMessage(error, "Failed to list agents");
const authError = isAuthenticationError(errorMessage);
return NextResponse.json(
{ success: false, error: authError ? "Unauthorized" : errorMessage },
...
);

When

Recommended: replace unrecognized errors with a generic message on 500s and log the full error server-side only.

return NextResponse.json(
  { success: false, error: authError ? "Unauthorized" : "Internal server error" },
  { status: authError ? 401 : 500 },
);

CI:
| Issue | Severity |
|---|---|
| Regex escaping bug in getAlternateDomainOrigin | Bug (should fix) |
| Raw error message in API 500 responses | Security (should fix) |
| typecheck CI job runs against pre-existing errors | CI reliability (should address) |
| repair-drizzle-journal.ts idx 43 mapping | Minor (clarify or fix) |
| extractStewardToken raw fallback | Minor |
| No Steward deregistration on rollback | Follow-up |
The security hardening (validators, execFile, pinned Actions) and the Steward URL split are both well done. The migration renumbering is clean and the journal check script is a great addition to CI.
Code Review

Overall this is solid work: the security validators are well-structured, tests are comprehensive, and the Steward URL split is the right approach. A few issues worth addressing before merge:

Issues

1. Migration renaming may break existing DBs (high risk)

Renaming already-applied migrations from
2.

function getDockerHealthCmd(port: string): string {
  return `sh -lc 'wget -qO- "http://127.0.0.1:${port}/health" ...'`;
}

3. Steward registration has no rollback path

The catch block logs a warning about potential orphaned Steward registrations but makes no cleanup attempt:

logger.warn(`[docker-sandbox] Steward agent ${agentId} may already be registered...`);

This will accumulate stale Steward registrations over time if containers fail to start. Consider adding a best-effort

4.

paths:
  - 'packages/backend/**'
  - 'packages/shared/**'
  - 'packages/services/**'

The repo structure has

5.

Per CLAUDE.md: "

Minor
The script hardcodes idx→tag mappings that are only valid for this specific rename. Committing it as a permanent
function isAuthenticationError(message: string): boolean {
return message.includes("Unauthorized") || message.includes("Invalid or expired API key") || ...
}

If
url.hostname.replace(new RegExp(`${a.replace(".", "\\.")}$`), b)
Positives
PR #403 Review: Security hardening, steward container URL, migration numbering

This is a well-structured and substantive PR with real security improvements. The combination of three focused fixes is well-justified and the implementation is generally solid. Below are my findings organized by severity.

Security

[Positive] Timing-safe token comparison (
[Positive] Cross-org pairing fix (
[Positive] Token leak fix in managed launch (
[Positive] DEV origins restricted to non-production (
[Medium] STEWARD_AGENT_TOKEN is not persisted to the DB, but
[Low] Steward registration is not idempotent on retry (
[Low] Python script injects

Correctness

[Medium]
[Low] Re-provision retry loop and Steward registration (
[Low]

Migration numbering

[Positive] The renumbering from
[Note] The

CI / Workflows

[Positive] Pinning GitHub Actions to commit SHAs (
[Medium]
[Low]

Minor / Nits
Summary

The three core fixes (injection hardening with
PR 403 Review - Security Hardening, Steward Container URL, Migration Numbering.

Overall solid work. A few issues worth addressing before merge.

ISSUES:
(1) Duplicate 0048 migration prefix - after renaming there are 5 files with a 0048_ prefix including 0048_00-03 elite_rumiko_fujikawa sub-files plus the newly renamed 0048_add_token_agent_linkage.sql. check-migration-journal.ts splits on underscore and uses first segment as prefix key, so if any 0048_0x files are journal-tracked the new CI guard will immediately fail. Consider renaming 0048_add_token_agent_linkage.sql to the next available slot.
(2) ci.yml build job will fail - bun run build fails on missing env vars per CLAUDE.md (ELIZA_APP_DISCORD_BOT_TOKEN etc.). Remove the build job or add env var stubs.

CONCERNS:
(3) No automated Steward rollback on container failure - registers with Steward before docker create/start; if container fails the catch block warns but does not deregister. Fresh token minted for orphaned registration on retry. Consider automated deregister in catch block or ops runbook.
(4) resolveStewardContainerUrl silently swallows bad config - a malformed STEWARD_API_URL silently falls back to host.docker.internal:3200. Add a warning log before the fallback.
(5) Overlapping CI workflows - ci.yml and tests.yml both run lint and typecheck on push/PR to dev/main, causing duplicate runs. Consolidate.

POSITIVE HIGHLIGHTS: timingSafeEqual on headscale-ip prevents timing attacks. findByIdAndOrg replacing findById closes cross-org access vector. Rate limiting on /api/auth/pair with STRICT preset is well-placed. Validator functions centralized with tests is the right pattern. Dev-only CORS origins gated by NODE_ENV is a clean fix. Cache unavailability throwing 503 instead of leaking token in URL params was a real security hole fix. Backup ownership IDOR guard is good. SSH pool MAX_POOL_SIZE=50 with LRU eviction is solid. Migration journal CI guard is a great guardrail. All Actions SHA-pinned.

MINOR NITS: extractStewardToken has 512-char max but no enforced minimum. Python heredoc in registerAgentWithSteward works but hard to unit-test. deploy-backend.yml only runs migrate-db for production.
…xport resolution)
PR #403 Code Review

This is a well-structured combined PR that addresses three distinct production issues. Overall the changes are high quality with good test coverage. Below are my findings organized by area.

1. Migration Numbering Fix

Correct approach. The duplicate

Issue: The repair script at

Note on the

2. Docker Provisioning Security Hardening

Positive changes:

Issue: In

Minor:

`--health-cmd ${shellQuote(getDockerHealthCmd(allEnv.MILADY_PORT || DEFAULT_MILADY_PORT))}`

Positive: The

3. Steward Container URL Fix

Correct fix. Using

Tests are comprehensive and cover localhost, 127.0.0.1, non-loopback, and explicit override scenarios.

4. Security Fixes (auth/pair, headscale-ip, milady-sandbox)

Timing-safe token comparison in

Backup cross-agent restore check: the new guard

5.
| Area | Status |
|---|---|
| Migration numbering fix | ✅ Correct, but repair script references a non-existent file |
| 0048_0x split migrations not in journal | |
| Docker security validators + tests | ✅ Solid |
| Steward URL split | ✅ Correct |
| exec → execFile | ✅ Correct |
| auth/pair cross-org fix + rate limit | ✅ Good |
| Timing-safe token comparison | ✅ Good |
| Backup IDOR guard | ✅ Good (add comment that it's belt-and-suspenders) |
| Cache-failure hard error | ✅ Correct (verify UI handles 503) |
| Legacy isLegacyDockerSandboxId removal | ✅ Good (verify no live containers depend on it) |
| CI workflows | ✅ Good supply-chain pinning |
The two items most worth resolving before merge are the repair script referencing a missing migration file and clarifying whether the four 0048_0x SQL files should be Drizzle-journal-tracked.
When Steward agent registration succeeds but docker create/start fails, attempt to DELETE the Steward agent record before rolling back. Uses the host-side Steward URL (localhost) since the container never started. Failures are logged but do not block the rollback flow.
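The best-effort rollback described in this commit message can be sketched as follows. The `ProvisionDeps` interface and `provisionWithRollback` function are hypothetical names introduced for illustration; the real code talks to Steward and Docker over SSH, which is abstracted away here behind injected callbacks.

```typescript
// Hypothetical dependency bundle so the flow can be exercised without SSH or Docker.
interface ProvisionDeps {
  register: (agentId: string) => Promise<string>; // resolves to the agent token
  startContainer: (agentId: string, token: string) => Promise<void>;
  deregister: (agentId: string) => Promise<void>;
  warn: (message: string) => void;
}

async function provisionWithRollback(
  agentId: string,
  deps: ProvisionDeps,
): Promise<void> {
  // Registration happens first, so a later failure can leave an orphaned record.
  const token = await deps.register(agentId);
  try {
    await deps.startContainer(agentId, token);
  } catch (error) {
    // Best-effort cleanup: a failed deregistration is logged but never
    // replaces the original error or blocks the rest of the rollback.
    try {
      await deps.deregister(agentId);
    } catch (cleanupError) {
      deps.warn(`Steward cleanup failed for ${agentId}: ${String(cleanupError)}`);
    }
    throw error;
  }
}
```

Rethrowing the original error keeps the caller's existing rollback path intact; only the cleanup failure is downgraded to a warning.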
…ction
- isAuthenticationError now checks error.status (401/403) instead of fragile string matching on error messages
- Sanitize 500 error responses: generic messages for unexpected errors, full details logged server-side only
- Validation errors (400) still pass through domain-specific messages since they don't leak internals
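A status-based check combined with sanitized 500 responses might look like the sketch below. It assumes upstream errors carry a numeric `status` property; `toErrorResponse` and the `ErrorResponse` shape are hypothetical stand-ins for the route handlers' `NextResponse.json` calls.

```typescript
// Assumption: auth failures surface as error objects with a numeric `status`.
function isAuthenticationError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const status = (error as { status?: unknown }).status;
  return status === 401 || status === 403;
}

// Hypothetical response shape, standing in for a framework response object.
interface ErrorResponse {
  status: number;
  body: { success: false; error: string };
}

function toErrorResponse(error: unknown): ErrorResponse {
  if (isAuthenticationError(error)) {
    return { status: 401, body: { success: false, error: "Unauthorized" } };
  }
  // Full details stay in server logs; the caller only sees a generic message.
  console.error("[api] unexpected error:", error);
  return { status: 500, body: { success: false, error: "Internal server error" } };
}
```

Because classification no longer depends on message wording, a change in upstream error text cannot silently turn an auth failure into a 500.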
Remove speculative field hunting (accessToken, value, nested data.*).
Since we control the Steward API and it returns { token: '...' },
only check that field plus one fallback for older Steward builds.
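A narrowed token extractor along these lines could look like the sketch below. The JSON `token` field comes from the commit message; the plain-text fallback and the 512-character bound are taken from the review discussion and are assumptions about the actual implementation.

```typescript
// Bound quoted in the review thread; treat the exact value as an assumption.
const MAX_TOKEN_LENGTH = 512;

function extractStewardToken(stdout: string): string {
  const trimmed = stdout.trim();
  if (trimmed.length === 0) {
    throw new Error("Steward returned an empty response");
  }

  let parsed: unknown;
  let isJson = true;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    isJson = false;
  }

  if (isJson) {
    // JSON path: only the documented `token` field is accepted.
    const token =
      typeof parsed === "object" && parsed !== null
        ? (parsed as { token?: unknown }).token
        : undefined;
    if (typeof token !== "string" || token.length === 0) {
      throw new Error("Steward JSON response has no usable token field");
    }
    return token;
  }

  // Assumed fallback for older builds: the whole output is the token.
  // Reject anything that looks like an error message instead of a token.
  if (trimmed.length > MAX_TOKEN_LENGTH || /\s/.test(trimmed)) {
    throw new Error("Steward response does not look like a token");
  }
  return trimmed;
}
```

Rejecting whitespace on the fallback path is one way to keep error output such as "connection refused" from being stored as a token.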
Both typecheck and build jobs now have continue-on-error: true so pre-existing repo-wide type/build errors don't block PR merges. Typecheck already had this; build was missing it.
Code Review PR #403

Overall this is a solid security hardening PR with good test coverage. A few issues need attention before merge.

Critical

Orphaned migration files not in the journal

Files

This means either:

Also:

PR description claims exec() to execFile() migration

The description says "Replace exec() with execFile()" but no

Security Issues
All other steps use SHA-pinned actions (e.g.
The build job runs
What is Good
Input validators are thorough --
SSH connection pool cap (
Import path typo fixes ( Minor Notes
Code Review

Overall this is a solid, well-structured PR. The security hardening is meaningful and the split between host-side and container-side Steward URLs is the right fix. A few issues worth addressing before merge.

Bugs / Issues

1. Deploy-before-migrate ordering in

The

migrate-db:
  needs: [determine-env, deploy]

This means the new application code goes live on the VPS before schema migrations run. If the new code depends on a column or table added in a migration (like

2.

if (trimmed.length > 512) {
throw new Error(
"[docker-sandbox] Steward token response exceeds 512 chars — likely not a valid token",
);
}

JWTs with moderate payloads commonly exceed 512 chars (a standard HS256 JWT with a few claims is 200–400 bytes, but RS256 tokens can be 800+). If Steward ever switches to asymmetric tokens this will silently break provisioning. 1024 or 2048 would be safer bounds, or check the JSON path explicitly and skip the length guard on the JSON branch (which is already extracted and validated as a non-empty string).

3. Minor:

Security

Positives:
One thing to verify: The cleanup curl command in the error path: `curl -s -X DELETE ${shellQuote(`${STEWARD_HOST_URL}/agents/${agentId}`)} || true`
Missing test coverage

The

Nits
Summary
PR Review

Overall this is a solid, well-structured set of fixes. The security work is well-reasoned and the test coverage is good. A few things worth flagging:

Issues

1. SSH pool eviction is FIFO, not LRU (docker-ssh.ts)
// set at creation:
client.lastActivityMs = Date.now();
// but exec() never calls: this.lastActivityMs = Date.now()

This is minor in practice (both FIFO and LRU bound pool size), but the name

2. The removal of

3. Cache unavailability now returns 503 instead of degrading gracefully (milady-managed-launch.ts)

This is the right security call (prevents token URL leakage), but it's a behavioral breaking change. If the cache is flaky, users will see 503 instead of a less-secure but functional launch. The change is worth keeping, but ops should be aware that cache health is now a hard dependency on this path.

Observations / Minor Notes
Python heredoc in Steward deregistration on failure uses
Positives
The |
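The FIFO versus LRU point raised in this review comes down to whether `lastActivityMs` is refreshed on every use. The following is a minimal sketch of a pool that does refresh it; `LruClientPool` and `PooledClient` are hypothetical names and do not mirror the actual docker-ssh.ts implementation.

```typescript
interface PooledClient {
  lastActivityMs: number;
  close(): void;
}

// Hypothetical pool: eviction picks the entry with the oldest lastActivityMs.
class LruClientPool<T extends PooledClient> {
  private readonly clients = new Map<string, T>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  // Look up a client and mark it as recently used. Without this update,
  // eviction order would depend only on creation time (FIFO).
  get(key: string, now: number = Date.now()): T | undefined {
    const client = this.clients.get(key);
    if (client) client.lastActivityMs = now;
    return client;
  }

  // Add a client, evicting the least recently used one when the pool is full.
  add(key: string, client: T, now: number = Date.now()): void {
    if (!this.clients.has(key) && this.clients.size >= this.maxSize) {
      this.evictLeastRecentlyUsed();
    }
    client.lastActivityMs = now;
    this.clients.set(key, client);
  }

  has(key: string): boolean {
    return this.clients.has(key);
  }

  // O(n) scan, which is acceptable for a small bounded pool.
  private evictLeastRecentlyUsed(): void {
    let oldestKey: string | undefined;
    let oldestMs = Infinity;
    for (const [key, client] of Array.from(this.clients)) {
      if (client.lastActivityMs < oldestMs) {
        oldestMs = client.lastActivityMs;
        oldestKey = key;
      }
    }
    if (oldestKey !== undefined) {
      this.clients.get(oldestKey)?.close();
      this.clients.delete(oldestKey);
    }
  }
}
```

In the real pool the same effect would come from updating the timestamp inside `exec()`, as the review suggests.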
- agent-flavors.ts: milady flavor now uses process.env.MILADY_DOCKER_IMAGE instead of a hardcoded image tag, with v2.0.0-steward-5 as fallback
- docker-sandbox-provider.ts: update fallback to ghcr.io/milady-ai/agent:v2.0.0-steward-5 (was milady/agent:v2.0.0-steward-2)

Root cause: provisioning used hardcoded/stale image tags instead of the configured MILADY_DOCKER_IMAGE env var. The env var now always wins.
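The "env var always wins" rule can be expressed as a single resolver shared by both call sites. This is a sketch under the assumption that an unset or blank variable should fall back; `resolveMiladyImage` is a hypothetical name.

```typescript
// Fallback tag quoted from the commit message above.
const FALLBACK_IMAGE = "ghcr.io/milady-ai/agent:v2.0.0-steward-5";

// Hypothetical helper: the env map is passed in so the rule is easy to test.
function resolveMiladyImage(env: Record<string, string | undefined>): string {
  const configured = env.MILADY_DOCKER_IMAGE?.trim();
  // Assumption: a blank value is treated the same as an unset one.
  return configured ? configured : FALLBACK_IMAGE;
}
```

Routing every lookup through one function of this kind avoids the drift between files that caused the stale tag.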
PR Review: Security Hardening, Steward Container URL, Migration Numbering. Overall a solid set of fixes. Full review details follow in replies.
Security - Positive Changes:
(1) timingSafeEqual in headscale-ip/route.ts is correct - prior !== was vulnerable to timing attacks.
(2) findByIdAndOrg in pair/route.ts prevents cross-org sandbox access via stolen pairing tokens.
(3) RateLimitPresets.STRICT on the pairing endpoint is appropriate.
(4) validateEnvKey + validateEnvValue + validateContainerName + validateVolumePath give good defense-in-depth on top of shellQuote.
(5) Status-code-based auth error detection in billing/settings/route.ts is cleaner than string matching.
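For reference, a timing-safe comparison in Node.js can be sketched as below. `tokensMatch` is a hypothetical helper, and hashing both inputs first is one common way to satisfy the equal-length requirement of `timingSafeEqual`; the route's actual code may handle length differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper: compare two secrets without an early exit on the
// first differing byte, which is what a plain !== comparison may do.
function tokensMatch(provided: string, expected: string): boolean {
  // timingSafeEqual throws if the buffers differ in length, so both values
  // are hashed to fixed-size digests before comparison.
  const providedDigest = createHash("sha256").update(provided).digest();
  const expectedDigest = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

Comparing digests also avoids leaking the expected token's length through an early length check.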
Issue 1 - Migration renaming risk (needs confirmation before merge): Drizzle identifies applied migrations by tag name in the journal. If migrations 0043-0053 have already been applied in production with their old tag names, updating the journal would cause Drizzle to treat them as unapplied and attempt to re-run them. Please confirm: have any of the renamed migrations (0043 through 0053) been applied to production? If yes, only new migrations should get corrected numbering - already-applied ones must keep their current names.
Issue 2 - docker create output parsing: The .trim().slice(0, 12) on docker create output could silently capture a wrong value if Docker emits deprecation warnings to stdout before the container ID. Consider splitting on newlines and taking the last line, or validating the result against a 12-char hex pattern.

Issue 3 - CI workflow duplication: The new ci.yml adds lint/typecheck/test/build, but tests.yml already exists and runs tests. This will double-run tests on every push/PR. Consolidate or remove duplicate steps.

Issue 4 - Unpinned actions in ci.yml: setup-bun is pinned to commit SHA (good) but actions/checkout@v4 and actions/cache@v4 use version tags - inconsistent with deploy-backend.yml which pins all actions.

Issue 5 - Staging schema drift: deploy-backend.yml runs db:migrate only for production. Staging will lag behind and migrations will not be validated before hitting production.
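The suggestion in Issue 2 (take the last line, then validate it) can be sketched as a small parser. `parseContainerId` is a hypothetical name; the accepted length range of 12 to 64 hex characters is an assumption covering both short and full container IDs.

```typescript
// Hypothetical helper: extract a 12-char container ID from `docker create` output.
function parseContainerId(stdout: string): string {
  // Ignore blank lines and anything printed before the ID (e.g. warnings).
  const lines = stdout
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const lastLine = lines[lines.length - 1] ?? "";

  // Assumption: a valid ID is 12 to 64 lowercase hex characters.
  if (!/^[0-9a-f]{12,64}$/.test(lastLine)) {
    throw new Error("docker create did not return a container ID");
  }
  return lastLine.slice(0, 12);
}
```

Failing loudly on unexpected output is preferable to storing the first twelve characters of a warning as the container ID.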
Minor notes: validateEnvValue blocking newlines is documented (PEM values not supported) - worth tracking as a known limitation. Billing balance refresh via two getOrgBalance calls after warning/shutdown is correct. Typo fixes (milaidy-sandbox -> milady-sandbox) are correct. The Python heredoc injection in registerAgentWithSteward using JSON.stringify() is safe since JSON strings are valid Python string literals and inputs are validated upstream. SSH pool O(n) eviction at MAX_POOL_SIZE=50 is fine.

Summary: security hardening passes, cross-org fix passes, timing-safe comparison passes, Steward URL split passes. Blockers/concerns: migration renaming needs production DB confirmation before merge, docker create parsing has minor reliability risk, CI duplication will double test runs.
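The env validators mentioned across these reviews can be sketched as follows. The key pattern is the one quoted in the first review; the body of `hasControlChars` and the exact rules in `validateEnvValue` are assumptions made for illustration.

```typescript
// Assumed definition: any ASCII control character, including newline and tab.
function hasControlChars(value: string): boolean {
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code < 0x20 || code === 0x7f) return true;
  }
  return false;
}

// Key pattern as quoted in the review: uppercase letters, digits, underscore.
function validateEnvKey(key: string): void {
  if (hasControlChars(key) || !/^[A-Z_][A-Z0-9_]*$/.test(key)) {
    throw new Error(`Invalid environment variable name: ${JSON.stringify(key)}`);
  }
}

// Assumption: values reject all control characters, which is why multi-line
// values such as PEM blocks are not supported.
function validateEnvValue(value: string): void {
  if (hasControlChars(value)) {
    throw new Error("Environment variable value contains control characters");
  }
}
```

These checks sit in front of shell quoting rather than replacing it, which is the defense-in-depth arrangement the reviews describe.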
Combined PR (replaces #400, #401, #402)
This PR combines three related fixes for production readiness:
1. Docker Provisioning Security Hardening (was PR #400)
- Replace exec() with execFile() to prevent command injection

2. Steward Container URL Fix (was PR #401)
- Split STEWARD_API_URL into host-side and container-side URLs
- Host-side (localhost:3200): used by orchestrator for Steward API calls
- Container-side (host.docker.internal:3200): injected into container env vars

3. Migration Numbering Fix (was PR #402)
- Duplicate migration number 0043
- 0043_seed_chain_data_pricing.sql → 0044 and cascaded through 0053

Testing
Closes
Closes #400, closes #401, closes #402