
replace openai-access-gateway with litellm#2012

Open
jordanstephens wants to merge 21 commits into main from jordan/litellm

Conversation


@jordanstephens jordanstephens commented Apr 2, 2026

Description

This PR replaces openai-access-gateway with litellm when users deploy compose files that use the Docker model provider.

For example:

  chat:
    provider:
      type: model
      options:
        model: chat-default
    x-defang-llm: true

will be transformed into:

  chat:
    command:
      - --drop_params
      - --model
      - chat-default
    environment:
      LITELLM_MASTER_KEY: ""
    image: litellm/litellm:latest
    networks:
      model_provider_private: null
    ports:
      - mode: host
        target: 4000
        protocol: tcp
    x-defang-llm: true
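For context, a service that depends on this provider would reach it over the shared model_provider_private network. A hypothetical dependent is sketched below; the app service name, its image, and the environment variable names are illustrative assumptions (the fixup injects the actual endpoint and model variables), with the port-4000 /v1 endpoint taken from the review discussion:

```yaml
services:
  app:
    image: my-app:latest            # hypothetical consumer image
    depends_on:
      - chat
    environment:
      # LiteLLM serves an OpenAI-compatible API on port 4000
      CHAT_API_URL: http://chat:4000/v1/   # env var name is an assumption
      CHAT_MODEL: chat-default
    networks:
      model_provider_private: null
```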

Linked Issues

Checklist

  • I have performed a self-review of my code
  • I have added appropriate tests
  • I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

  • New Features

    • Switched model/LLM runtime to LiteLLM with updated images, service URL/host port (4000), provider-aware model identifiers, and LITELLM_MASTER_KEY / API-key propagation.
  • Bug Fixes

    • Improved rollout feedback with per-iteration pending-service reporting and a deployment-complete message.
    • More reliable GCE instance-group audit handling via live label lookup and safer handling when labels are missing or unavailable.
  • Tests

    • Added and expanded unit tests and fixtures for LiteLLM behavior and GCP audit-log parsing; new warnings for sensitive env vars.
  • Chores

    • Minor README wording and build metadata updates.

@jordanstephens jordanstephens added the wip Work in progress; don't merge label Apr 2, 2026

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Fetch provider AccountInfo once and thread it into model/provider fixups; switch model-provider services to LiteLLM (image, command, port 4000, LITELLM_MASTER_KEY behavior); change GCP BYOC log parsing to query live instance-group-manager labels; update test fixtures, warnings, docs, and supporting mocks.

Changes

Cohort / File(s) Summary
Core Fixup Logic
src/pkg/cli/compose/fixup.go
Call provider.AccountInfo(ctx) once; thread *client.AccountInfo into fixupModel/fixupModelProvider and makeAccessGatewayService; add modelWithProvider; change provider-service generation to use litellm image, command args, port=4000, and LITELLM_MASTER_KEY/OPENAI_API_KEY propagation.
LLM Test Fixtures
src/testdata/llm/compose.yaml, src/testdata/llm/compose.yaml.fixup, src/testdata/llm/compose.yaml.golden
Replace access-gateway images with litellm/... variants, add/update alt-repo, and change host-exposed target ports from 80 → 4000 where applicable.
Model & Provider Test Fixtures
src/testdata/models/compose.yaml.fixup, src/testdata/provider/compose.yaml.fixup, src/testdata/models/compose.yaml.warnings, src/testdata/provider/compose.yaml.warnings
Update images to litellm/...:v1.82.3-stable.patch.3, set explicit --drop_params --model ... --alias ... commands, change exposed target ports to 4000, replace seeded OPENAI_API_KEY with LITELLM_MASTER_KEY, and add related warnings.
GCP BYOC & Logs
src/pkg/cli/client/byoc/gcp/query.go, src/pkg/cli/client/byoc/gcp/stream.go, src/pkg/cli/client/byoc/gcp/query_test.go, src/pkg/cli/client/byoc/gcp/stream_test.go, src/pkg/cli/client/byoc/gcp/byoc_test.go
Remove payload-label filters from insert/patch query; add GetInstanceGroupManagerLabels to client interface and mocks; change log parsing to derive project/region/manager from resource labels and fetch live IG manager labels; add tests for these flows.
GCP Cloud Implementation
src/pkg/clouds/gcp/compute.go
Add exported GetInstanceGroupManagerLabels(ctx, project, region, name) returning labels from AllInstancesConfig.Properties.Labels and wrap errors with context.
Client & Mocks
src/pkg/cli/client/mock.go
Add MockProvider.AccountInfo(context.Context) (*AccountInfo, error) stub returning empty AccountInfo.
Runtime/UX tweaks
src/pkg/cli/subscribe.go, src/pkg/cli/tailAndMonitor.go
Recompute/log pending services each loop and log deployment-complete info when CD task exits successfully.
Misc Tests & Build/Docs
src/go.mod, pkgs/defang/cli.nix, pkgs/npm/README.md, src/README.md
Move a genproto require entry, update Nix vendorHash, and minor README wording tweaks.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (FixupServices)
    participant Provider as Provider (AccountInfo)
    participant Fixup as fixupModel / fixupModelProvider
    participant Maker as makeAccessGatewayService
    participant Service as Dependent Service(s)

    CLI->>Provider: AccountInfo(ctx)
    Provider-->>CLI: *AccountInfo (may be empty)
    CLI->>Fixup: fixupModel(..., accountInfo)
    Fixup->>Maker: makeAccessGatewayService(..., accountInfo)
    Maker->>Maker: choose image/port/command (litellm, port=4000) and derive LITELLM_MASTER_KEY
    Maker->>Service: inject envs (LITELLM_MASTER_KEY / OPENAI_API_KEY fallback) and endpoint URLs
    Service-->>CLI: updated service compose config

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

go, high

Suggested reviewers

  • lionello
  • edwardrf

Poem

🐰 I hopped through ports and keys tonight,

I braided models, set the light,
Four‑thousand footsteps down the lane,
Master key snug, no secrets slain,
Tests hum softly — spring's in sight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 29.41%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the PR title 'replace openai-access-gateway with litellm' directly and clearly describes the primary objective of the changeset: replacing the openai-access-gateway service with litellm across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jordanstephens jordanstephens marked this pull request as ready for review April 6, 2026 18:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pkg/cli/compose/fixup.go (1)

362-387: ⚠️ Potential issue | 🔴 Critical

The generated *_URL is missing the port and uses an incorrect API path.

Line 366 generates http://<service>/api/v1/, but after switching to LiteLLM on port 4000 (line 387), dependents need both the port and the correct OpenAI-compatible API path. On the shared model_provider_private network, the URL without an explicit port defaults to port 80, so clients cannot reach the LiteLLM listener. Additionally, LiteLLM's OpenAI-compatible API uses /v1/, not /api/v1/. Dependent services will fail unless they can reach the correct endpoint on port 4000.

💡 Suggested fix
-	urlVal := "http://" + svccfg.Name + "/api/v1/"
+	urlVal := fmt.Sprintf("http://%s:%d/v1", svccfg.Name, liteLLMPort)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 362 - 387, The generated endpoint
URL in makeAccessGatewayService is incorrect: change urlVal to include the
LiteLLM port and the OpenAI-compatible path (use "http://<service>:<port>/v1/"),
referencing liteLLMPort and svccfg.Name, and assign it into the service
environment under the endpointEnvVar (and ensure modelEnvVar is also set to the
passed model if needed); update the env map population logic in
makeAccessGatewayService so svccfg.Environment[endpointEnvVar] = urlVal (and
svccfg.Environment[modelEnvVar] = model) instead of leaving the URL without port
and using /api/v1/.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 232-239: The current fixupLLM uses GetImageRepo which strips parts
incorrectly when a registry includes a port (e.g.
registry.example:5000/litellm:latest), causing the "/litellm" suffix check to
fail; change fixupLLM to compute the image path without removing a registry port
by stripping only the image tag/digest (remove the substring after the last ':'
or '@' only if it occurs after the last '/'), then run strings.HasSuffix on that
sanitized path; update references in fixupLLM to use svccfg.Image (or a local
sanitizedImage variable) for the suffix check and preserve adding the host port
using liteLLMPort if the suffix matches and svccfg.Ports is empty.

---

Outside diff comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 362-387: The generated endpoint URL in makeAccessGatewayService is
incorrect: change urlVal to include the LiteLLM port and the OpenAI-compatible
path (use "http://<service>:<port>/v1/"), referencing liteLLMPort and
svccfg.Name, and assign it into the service environment under the endpointEnvVar
(and ensure modelEnvVar is also set to the passed model if needed); update the
env map population logic in makeAccessGatewayService so
svccfg.Environment[endpointEnvVar] = urlVal (and svccfg.Environment[modelEnvVar]
= model) instead of leaving the URL without port and using /api/v1/.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e90edff0-1afb-496e-b665-e6d2c7d6af51

📥 Commits

Reviewing files that changed from the base of the PR and between 9098072 and e614bb1.

📒 Files selected for processing (6)
  • src/pkg/cli/compose/fixup.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.fixup

Comment on lines 232 to 239
 func fixupLLM(svccfg *composeTypes.ServiceConfig) {
 	image := GetImageRepo(svccfg.Image)
-	if strings.HasSuffix(image, "/openai-access-gateway") && len(svccfg.Ports) == 0 {
+	if strings.HasSuffix(image, "/litellm") && len(svccfg.Ports) == 0 {
 		// HACK: we must have at least one host port to get a CNAME for the service
-		var port uint32 = 80
+		// litellm listens on 4000 by default
+		var port uint32 = liteLLMPort
 		term.Debugf("service %q: adding LLM host port %d", svccfg.Name, port)
 		svccfg.Ports = []composeTypes.ServicePortConfig{{Target: port, Mode: Mode_HOST, Protocol: Protocol_TCP}}


⚠️ Potential issue | 🟡 Minor

Handle registry ports before matching /litellm.

This suffix check inherits GetImageRepo's first-: split, so registry.example:5000/litellm:latest is reduced to registry.example before the match. Those LiteLLM images will skip the 4000-port fixup entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 232 - 239, The current fixupLLM
uses GetImageRepo which strips parts incorrectly when a registry includes a port
(e.g. registry.example:5000/litellm:latest), causing the "/litellm" suffix check
to fail; change fixupLLM to compute the image path without removing a registry
port by stripping only the image tag/digest (remove the substring after the last
':' or '@' only if it occurs after the last '/'), then run strings.HasSuffix on
that sanitized path; update references in fixupLLM to use svccfg.Image (or a
local sanitizedImage variable) for the suffix check and preserve adding the host
port using liteLLMPort if the suffix matches and svccfg.Ports is empty.
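The fix the bot describes, stripping only a trailing tag or digest so a registry host:port survives, can be sketched as follows. The function name is illustrative, not the actual helper in fixup.go; the rule is to treat ':' or '@' as a separator only when it appears after the last '/':

```go
package main

import (
	"fmt"
	"strings"
)

// stripTag removes a trailing tag or digest from an image reference without
// mangling a registry host:port. A ':' or '@' is only a tag/digest delimiter
// if it occurs after the last '/' in the reference.
func stripTag(image string) string {
	slash := strings.LastIndex(image, "/")
	if at := strings.LastIndex(image, "@"); at > slash {
		return image[:at] // digest form: repo@sha256:...
	}
	if colon := strings.LastIndex(image, ":"); colon > slash {
		return image[:colon] // tag form: repo:tag
	}
	return image // no tag or digest present
}

func main() {
	fmt.Println(stripTag("registry.example:5000/litellm:latest")) // registry.example:5000/litellm
	fmt.Println(stripTag("litellm/litellm:v1.82.3-stable.patch.3")) // litellm/litellm
	fmt.Println(stripTag("ghcr.io/org/litellm@sha256:abc"))       // ghcr.io/org/litellm
}
```

With this shape, a suffix check like strings.HasSuffix(stripTag(image), "/litellm") would match images from registries that include a port.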


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pkg/cli/compose/fixup.go (1)

437-452: ⚠️ Potential issue | 🟠 Major

Potential nil map assignment on dependency.Networks.

At line 442, dependency.Networks[modelProviderNetwork] = nil will panic if dependency.Networks is nil. The same pattern at line 466 has the same issue. The code at lines 400-406 correctly handles this case for svccfg.Networks.

🐛 Proposed fix
 	for _, dependency := range project.Services {
 		if _, ok := dependency.DependsOn[svccfg.Name]; ok {
 			if dependency.Environment == nil {
 				dependency.Environment = make(composeTypes.MappingWithEquals)
 			}
+			if dependency.Networks == nil {
+				dependency.Networks = make(map[string]*composeTypes.ServiceNetworkConfig)
+			}
 			dependency.Networks[modelProviderNetwork] = nil

And similarly for the dependency.Models block around line 466:

 			if dependency.Environment == nil {
 				dependency.Environment = make(composeTypes.MappingWithEquals)
 			}
+			if dependency.Networks == nil {
+				dependency.Networks = make(map[string]*composeTypes.ServiceNetworkConfig)
+			}
 			dependency.Networks[modelProviderNetwork] = nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 437 - 452, The code assigns into
dependency.Networks without ensuring the map is initialized (causing a panic
when nil); update the loop over project.Services that references
dependency.Networks to check if dependency.Networks == nil and if so set
dependency.Networks = make(composeTypes.MappingWithEquals) before setting
dependency.Networks[modelProviderNetwork] = nil, and apply the same nil-check +
initialization pattern used for svccfg.Networks to the other similar block that
modifies dependency.Models (the second occurrence around the later block) so
both maps are safely initialized before assignment.
🧹 Nitpick comments (3)
src/pkg/cli/subscribe.go (1)

78-85: Verbose logging on every message iteration.

This logs on every iteration of the subscribe loop, which could produce excessive output if the stream is chatty. Additionally, the log executes before the msg == nil check (line 87), so it prints even when there's nothing meaningful to process.

Consider either:

  1. Moving this logging after the msg == nil check, or
  2. Only logging when pendingServices changes from the previous iteration, or
  3. Logging at a lower level (e.g., term.Debug) since tailAndMonitor.go now also logs a waiting message.
♻️ Option: Move logging after nil check and reduce verbosity
+		if msg == nil {
+			continue
+		}
+
 		pendingServices := []string{}
 		for _, service := range services {
 			if serviceStates[service] != targetState {
 				pendingServices = append(pendingServices, service)
 			}
 		}
 
-		term.Infof("Waiting for %q to be in state %s...\n", pendingServices, targetState) // TODO: don't print in Go-routine
+		term.Debugf("Pending services waiting for state %s: %v\n", targetState, pendingServices) // TODO: don't print in Go-routine
 
-		if msg == nil {
-			continue
-		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/subscribe.go` around lines 78 - 85, The current loop logs via
term.Infof every iteration and does so before checking msg == nil, causing
excessive/irrelevant output; move the term.Infof call so it executes after the
nil-check for msg, and reduce verbosity by either using term.Debug (or a lower
log level) or only emitting when pendingServices (computed from services,
serviceStates, targetState) changes from its previous value; update the call
site where term.Infof is used (the pendingServices computation and logging
block) and ensure you track the lastPendingServices value to compare before
logging if choosing the “only when changed” approach.
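The "only when changed" option can be sketched with a fingerprint of the pending set carried between iterations. This is a standalone illustration, not the actual subscribe.go code; the helper and message format are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// pendingMessage returns a log line only when the pending-service set differs
// from the previous iteration, so a chatty subscribe loop does not repeat
// itself. It also returns the new fingerprint for the next iteration.
func pendingMessage(prev string, pending []string, targetState string) (msg, next string) {
	next = strings.Join(pending, ",")
	if next == prev {
		return "", next // unchanged: stay quiet
	}
	return fmt.Sprintf("Waiting for %q to be in state %s...", pending, targetState), next
}

func main() {
	prev := ""
	for _, pending := range [][]string{{"api", "db"}, {"api", "db"}, {"api"}} {
		var msg string
		msg, prev = pendingMessage(prev, pending, "DEPLOYMENT_COMPLETED")
		if msg != "" {
			fmt.Println(msg) // emitted only on the 1st and 3rd iterations
		}
	}
}
```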
src/pkg/clouds/gcp/compute.go (1)

14-18: Consider caching or reusing the Compute client.

A new compute.Service client is created on every call to GetInstanceGroupManagerLabels. If this method is called frequently (e.g., for each audit log entry during a deployment), this could introduce unnecessary latency and resource overhead from repeated client initialization.

This is likely acceptable for the current use case (sparse audit log entries), but if performance becomes a concern, consider passing a pre-initialized client or caching it at the Gcp struct level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/gcp/compute.go` around lines 14 - 18, The
GetInstanceGroupManagerLabels function creates a new compute.Service on every
call; to reduce repeated client initialization, add a reusable client on the Gcp
struct (e.g., a field computeSvc *compute.Service) or accept a pre-initialized
client, initialize it lazily in a method like ensureComputeService() and reuse
it in GetInstanceGroupManagerLabels, and protect lazy init with appropriate
synchronization (e.g., sync.Mutex or sync.Once) to make compute.Service creation
thread-safe.
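The lazy-init pattern the bot suggests can be sketched with sync.Once. The fakeComputeService type below stands in for *compute.Service from google.golang.org/api/compute/v1, and ensureComputeService is a hypothetical method name; only the caching pattern is the point:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeComputeService stands in for *compute.Service; only the pattern matters.
type fakeComputeService struct{}

type Gcp struct {
	once       sync.Once
	computeSvc *fakeComputeService
	initErr    error
	inits      int // counts constructions, to demonstrate reuse
}

// ensureComputeService builds the client exactly once; sync.Once makes the
// lazy initialization safe under concurrent callers.
func (g *Gcp) ensureComputeService() (*fakeComputeService, error) {
	g.once.Do(func() {
		g.inits++
		g.computeSvc = &fakeComputeService{} // real code: compute.NewService(ctx)
	})
	return g.computeSvc, g.initErr
}

func main() {
	g := &Gcp{}
	a, _ := g.ensureComputeService()
	b, _ := g.ensureComputeService()
	fmt.Println(a == b, g.inits) // → true 1
}
```

An init error captured into initErr inside the Do callback would be returned to every subsequent caller, which is the usual trade-off with sync.Once caching.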
src/pkg/cli/compose/fixup.go (1)

40-44: Consider escalating the warning when AccountInfo is needed for model services.

When AccountInfo fails and the project contains model services, the resulting empty AccountInfo will cause provider-specific environment variables (like AWS_REGION, VERTEXAI_PROJECT, VERTEXAI_LOCATION) to be omitted. This could lead to runtime failures that are difficult to diagnose since the error is only logged at debug level.

Consider logging a more visible warning when model services exist in the project and AccountInfo fails.

💡 Suggested improvement
 	accountInfo, err := provider.AccountInfo(ctx)
 	if err != nil {
 		term.Debugf("failed to get account info to fixup services: %v", err)
 		accountInfo = &client.AccountInfo{}
+		// Check if any model services exist that might need account info
+		hasModelServices := len(project.Models) > 0
+		if !hasModelServices {
+			for _, svc := range project.Services {
+				if svc.Provider != nil && svc.Provider.Type == "model" {
+					hasModelServices = true
+					break
+				}
+			}
+		}
+		if hasModelServices {
+			term.Warnf("failed to get account info; model services may not have required cloud provider configuration: %v", err)
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 40 - 44, When
provider.AccountInfo(ctx) fails, the code currently only calls term.Debugf and
falls back to an empty client.AccountInfo, which hides failures that matter when
the project contains model services; update the logic around
provider.AccountInfo (the call to provider.AccountInfo, the accountInfo variable
and the debug log) to check whether any model services are present and, if so,
log a visible warning (e.g., term.Warnf/term.Warningf) that includes the error
details and notes that provider-specific env vars (AWS_REGION, VERTEXAI_PROJECT,
VERTEXAI_LOCATION) may be omitted, while still falling back to accountInfo =
&client.AccountInfo{} to preserve current behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 411-434: The LITELLM_MASTER_KEY derivation loop can panic and
continues after a fatal inconsistency: ensure you nil-check
service.Environment["OPENAI_API_KEY"] before dereferencing (i.e., verify key !=
nil before using *key) when accumulating into openAIKey, and on detecting
multiple different OPENAI_API_KEY values (where currently term.Errorf is called)
stop further processing immediately instead of break — propagate the error or
return from the enclosing function (or set an explicit failure path) so
liteLLMMasterKey is not set from an inconsistent first value; update references
in this block (openAIKey, service.Environment["OPENAI_API_KEY"], term.Errorf,
liteLLMMasterKey, svccfg.Environment) accordingly.
- Line 398: The image tag assigned to svccfg.Image is incorrect; update the
value of svccfg.Image from "litellm/litellm:v1.82.3-stable.patch.3" to the
available tag "litellm/litellm:v1.82.3-stable.patch.2" so the code uses the
existing Docker image.

---

Outside diff comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 437-452: The code assigns into dependency.Networks without
ensuring the map is initialized (causing a panic when nil); update the loop over
project.Services that references dependency.Networks to check if
dependency.Networks == nil and if so set dependency.Networks =
make(composeTypes.MappingWithEquals) before setting
dependency.Networks[modelProviderNetwork] = nil, and apply the same nil-check +
initialization pattern used for svccfg.Networks to the other similar block that
modifies dependency.Models (the second occurrence around the later block) so
both maps are safely initialized before assignment.

---

Nitpick comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 40-44: When provider.AccountInfo(ctx) fails, the code currently
only calls term.Debugf and falls back to an empty client.AccountInfo, which
hides failures that matter when the project contains model services; update the
logic around provider.AccountInfo (the call to provider.AccountInfo, the
accountInfo variable and the debug log) to check whether any model services are
present and, if so, log a visible warning (e.g., term.Warnf/term.Warningf) that
includes the error details and notes that provider-specific env vars
(AWS_REGION, VERTEXAI_PROJECT, VERTEXAI_LOCATION) may be omitted, while still
falling back to accountInfo = &client.AccountInfo{} to preserve current
behavior.

In `@src/pkg/cli/subscribe.go`:
- Around line 78-85: The current loop logs via term.Infof every iteration and
does so before checking msg == nil, causing excessive/irrelevant output; move
the term.Infof call so it executes after the nil-check for msg, and reduce
verbosity by either using term.Debug (or a lower log level) or only emitting
when pendingServices (computed from services, serviceStates, targetState)
changes from its previous value; update the call site where term.Infof is used
(the pendingServices computation and logging block) and ensure you track the
lastPendingServices value to compare before logging if choosing the “only when
changed” approach.

In `@src/pkg/clouds/gcp/compute.go`:
- Around line 14-18: The GetInstanceGroupManagerLabels function creates a new
compute.Service on every call; to reduce repeated client initialization, add a
reusable client on the Gcp struct (e.g., a field computeSvc *compute.Service) or
accept a pre-initialized client, initialize it lazily in a method like
ensureComputeService() and reuse it in GetInstanceGroupManagerLabels, and
protect lazy init with appropriate synchronization (e.g., sync.Mutex or
sync.Once) to make compute.Service creation thread-safe.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 67f36a31-3342-44b2-a2d6-5a69183e29d3

📥 Commits

Reviewing files that changed from the base of the PR and between e614bb1 and eb7a99b.

📒 Files selected for processing (23)
  • pkgs/defang/cli.nix
  • pkgs/npm/README.md
  • src/README.md
  • src/go.mod
  • src/pkg/cli/client/byoc/gcp/byoc_test.go
  • src/pkg/cli/client/byoc/gcp/query.go
  • src/pkg/cli/client/byoc/gcp/query_test.go
  • src/pkg/cli/client/byoc/gcp/stream.go
  • src/pkg/cli/client/byoc/gcp/stream_test.go
  • src/pkg/cli/client/mock.go
  • src/pkg/cli/compose/fixup.go
  • src/pkg/cli/subscribe.go
  • src/pkg/cli/tailAndMonitor.go
  • src/pkg/clouds/aws/login.go
  • src/pkg/clouds/gcp/compute.go
  • src/pkg/clouds/gcp/login.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.warnings
✅ Files skipped from review due to trivial changes (6)
  • pkgs/npm/README.md
  • src/README.md
  • src/pkg/clouds/aws/login.go
  • pkgs/defang/cli.nix
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.warnings
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/llm/compose.yaml


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/pkg/cli/compose/fixup.go (2)

238-246: ⚠️ Potential issue | 🟡 Minor

Registry port in image names breaks /litellm suffix check.

GetImageRepo splits on the first :, so registry.example:5000/litellm:latest becomes registry.example, losing the /litellm path. Such images won't get the port 4000 fixup.

Consider splitting only on the last : that follows a / (to separate tag from path), or use a proper image reference parser.


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 238 - 246, The suffix check in
fixupLLM fails because GetImageRepo splits on the first ':' and drops the path
for registry URLs (e.g., registry:5000/...); update GetImageRepo usage or its
implementation so it preserves the repository path before checking suffix:
either change GetImageRepo to strip the image tag by splitting on the last ':'
only if that colon appears after the last '/' (so registry:port is not treated
as a tag), or replace GetImageRepo with a proper image reference parser and then
call fixupLLM using the parsed repository/name to correctly detect
strings.HasSuffix(repo, "/litellm") in the fixupLLM function.

424-446: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference risk when reading OPENAI_API_KEY.

At line 431, *key is dereferenced without checking if key is nil. In compose files, environment variables can be declared without a value (OPENAI_API_KEY: or OPENAI_API_KEY: null), resulting in a nil pointer that will panic on dereference.

Additionally, at line 433-434, after logging the error about multiple different keys, execution continues with break and uses whatever openAIKey was found first—potentially inconsistent behavior.


🐛 Proposed fix
 		if _, ok := service.DependsOn[svccfg.Name]; ok {
 			if key, ok := service.Environment["OPENAI_API_KEY"]; ok {
+				if key == nil {
+					continue // skip nil (unset) keys
+				}
 				if openAIKey == "" {
 					openAIKey = *key
 				} else if *key != openAIKey {
 					term.Errorf("multiple different OPENAI_API_KEY values found in services depending on %q", svccfg.Name)
-					break
+					return // abort with inconsistent configuration
 				}
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 424 - 446, The loop that reads
OPENAI_API_KEY from dependent services (involving svccfg, project.Services and
openAIKey) can dereference a nil *string (key) and may continue using an
inconsistent first-found value after detecting differing keys; fix it by
checking key != nil before using *key, skip nil/empty values, and when you
detect multiple different non-nil keys (the condition where term.Errorf is
called) clear openAIKey (e.g., set openAIKey = "") or otherwise mark it invalid
and break so you don't keep using the first value, then proceed to set
LITELLM_MASTER_KEY as before (falling back to the default when openAIKey is
empty).
🧹 Nitpick comments (1)
src/testdata/llm/compose.yaml (1)

2-10: Good addition of alt-repo test case for custom registry.

This tests the scenario of litellm images from alternate registries. Note that line 9 uses target: "4000" as a string—compose-go handles both, but using an unquoted integer would be more conventional:

target: 4000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/testdata/llm/compose.yaml` around lines 2 - 10, The ports entry for the
alt-repo service uses a string for target ("4000"); update the ports block for
service name alt-repo so target is an integer (target: 4000) instead of a quoted
string; locate the ports mapping under the alt-repo service in
src/testdata/llm/compose.yaml and remove the quotes around 4000 to make it an
unquoted integer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 238-246: The suffix check in fixupLLM fails because GetImageRepo
splits on the first ':' and drops the path for registry URLs (e.g.,
registry:5000/...); update GetImageRepo usage or its implementation so it
preserves the repository path before checking suffix: either change GetImageRepo
to strip the image tag by splitting on the last ':' only if that colon appears
after the last '/' (so registry:port is not treated as a tag), or replace
GetImageRepo with a proper image reference parser and then call fixupLLM using
the parsed repository/name to correctly detect strings.HasSuffix(repo,
"/litellm") in the fixupLLM function.
- Around line 424-446: The loop that reads OPENAI_API_KEY from dependent
services (involving svccfg, project.Services and openAIKey) can dereference a
nil *string (key) and may continue using an inconsistent first-found value after
detecting differing keys; fix it by checking key != nil before using *key, skip
nil/empty values, and when you detect multiple different non-nil keys (the
condition where term.Errorf is called) clear openAIKey (e.g., set openAIKey =
"") or otherwise mark it invalid and break so you don't keep using the first
value, then proceed to set LITELLM_MASTER_KEY as before (falling back to the
default when openAIKey is empty).

---

Nitpick comments:
In `@src/testdata/llm/compose.yaml`:
- Around line 2-10: The ports entry for the alt-repo service uses a string for
target ("4000"); update the ports block for service name alt-repo so target is
an integer (target: 4000) instead of a quoted string; locate the ports mapping
under the alt-repo service in src/testdata/llm/compose.yaml and remove the
quotes around 4000 to make it an unquoted integer.
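The corrected ports block would look like this (the `mode`/`protocol` values are assumed to mirror the existing fixture):

```yaml
services:
  alt-repo:
    ports:
      - mode: host
        target: 4000   # integer, not "4000"
        protocol: tcp
```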

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4094c22-2f2d-4ea5-a0af-a3b0af72a866

📥 Commits

Reviewing files that changed from the base of the PR and between 4013d22 and 67c00bf.

📒 Files selected for processing (10)
  • src/pkg/cli/client/mock.go
  • src/pkg/cli/compose/fixup.go
  • src/pkg/cli/subscribe.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.warnings
✅ Files skipped from review due to trivial changes (3)
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/pkg/cli/subscribe.go
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/models/compose.yaml.fixup

jordanstephens and others added 19 commits April 9, 2026 09:37
G101 is a gosec rule ID, not a standalone linter name. Using it in
//nolint directives caused golangci-lint to warn about unknown linters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE allInstancesConfig.properties.labels is a map<string,string>, not a
list of {key,value} structs. The query filters were using the list format
(labels.key="defang-service" / labels.value="...") which never matched any
audit log entries, so gce_instance_group_manager events were never returned
by Cloud Logging. Even if events had arrived, the parser was iterating over
the field as a list (GetListInStruct) which always returned nil, leaving the
computeEngineRootTriggers map empty. As a result, all gce_instance_group
addInstances events were silently dropped and WaitServiceState never
received DEPLOYMENT_COMPLETED for Compute Engine services.

Fix the query to use map-style key access:
  labels."defang-service"=~"^(svc)$"

Fix the parser to use GetValueInStruct with the label name as a path key,
replacing the 10-line list iteration with a single call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE audit logs for regionInstanceGroupManagers.patch only carry the
fields that changed (e.g. the new instance template version). The
allInstancesConfig.properties.labels — where the defang-service label
lives — is absent from the request body for every update after the
initial create. As a result, the computeEngineRootTriggers map was
never populated and all gce_instance_group addInstances events were
silently dropped, so WaitServiceState never received DEPLOYMENT_COMPLETED
for Compute Engine services.

Fix: instead of reading labels from the audit log request body, read
the instance group manager name, project, and region from the always-
present entry.Resource.Labels and call the GCE REST API to get the
live resource's allInstancesConfig.properties.labels. This mirrors the
fallback used by the server-side fabric_gcp.go implementation.

Add GetInstanceGroupManagerLabels to GcpLogsClient and implement it
using the already-present google.golang.org/api/compute/v1 dependency
(no new deps required).

Also add the missing isQuotaError helper to the gcpquota debug tool,
which was preventing the pre-commit lint check from passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PATCH requests for regionInstanceGroupManagers only carry changed fields
(e.g. a new instance template reference). When Pulumi re-deploys a CE
service, it patches the instance template without including
allInstancesConfig.properties.labels in the request body. The Cloud
Logging filter on those absent label fields never matched, so no
gce_instance_group_manager events were returned for re-deploys, leaving
computeEngineRootTriggers empty and causing all gce_instance_group
addInstances events to be silently dropped.

The parser already handles service-specific filtering by reading labels
from the live MIG resource via GetInstanceGroupManagerLabels, so the
query-level label filters are redundant and harmful. Remove them and keep
only the method name and operation.first filters, consistent with how
AddComputeEngineInstanceGroupAddInstances works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Cover the three bugs fixed in this branch:
- TestAddComputeEngineInstanceGroupInsertOrPatch: asserts the query
  contains no allInstancesConfig or defang-* label filters (guarding
  against the old list-format filters that never matched)
- TestActivityParser_GceInstanceGroupManager: table-driven tests for the
  gce_instance_group_manager parser path — happy path, API error, nil
  labels, missing defang-service label, and missing root_trigger_id
- TestActivityParser_GceInstanceGroupFlow: end-to-end test that a
  manager insert/patch entry populates the trigger map and a subsequent
  addInstances entry uses it to emit DEPLOYMENT_COMPLETED
- TestActivityParser_GceInstanceGroupDropsUnknownTrigger: events with
  an unrecognized root_trigger_id are silently dropped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of

 * Waiting for ["app" "worker" "chat" "embedding"] to be in state DEPLOYMENT_COMPLETED...
 * Waiting for ["app" "worker" "embedding"] to be in state DEPLOYMENT_COMPLETED...
 * Waiting for ["app" "worker"] to be in state DEPLOYMENT_COMPLETED...

I prefer

 * Waiting for services to finish deploying: ["app" "worker" "chat" "embedding"]
 * Waiting for services to finish deploying: ["app" "worker" "embedding"]
 * Waiting for services to finish deploying: ["app" "worker"]

message
I always find myself grepping for 'with state', which only feels
tangentially related. A better prefix is "Service update:".
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Co-authored-by: Jordan Stephens <jordan@stephens.io>
Range-iterating project.Services yields value copies, so assigning a newly
created DependsOn map to the local copy was silently dropped. Switch to
keyed iteration and write the service back whenever it was changed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
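The value-copy pitfall and its fix can be shown in isolation (hypothetical `Service` type; only the iteration pattern matters):

```go
package main

import "fmt"

type Service struct{ DependsOn map[string]string }

// brokenWire mirrors the bug: ranging over a map yields value copies,
// so assigning DependsOn on the copy is silently dropped.
func brokenWire(services map[string]Service) {
	for _, svc := range services {
		svc.DependsOn = map[string]string{"model": "service_started"}
	}
}

// fixedWire iterates by key and writes the modified copy back.
func fixedWire(services map[string]Service) {
	for name, svc := range services {
		svc.DependsOn = map[string]string{"model": "service_started"}
		services[name] = svc
	}
}

func main() {
	a := map[string]Service{"app": {}}
	brokenWire(a)
	fmt.Println(a["app"].DependsOn == nil) // true: the change was lost

	b := map[string]Service{"app": {}}
	fixedWire(b)
	fmt.Println(b["app"].DependsOn["model"]) // service_started
}
```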
jordanstephens and others added 2 commits April 9, 2026 09:37
Split makeAccessGatewayService into two focused helpers:
- configureAccessGateway: resolves the model for the target cloud provider,
  sets up the LiteLLM container (image, command, network, port), and
  derives LITELLM_MASTER_KEY
- wireDependentServices: injects URL/model/key env vars and network
  membership into every service that depends on the model service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

wip Work in progress; don't merge
