
replace openai-access-gateway with litellm#2012

Open
jordanstephens wants to merge 21 commits into main from jordan/litellm

Conversation


@jordanstephens jordanstephens commented Apr 2, 2026

Description

This PR replaces openai-access-gateway with litellm when users deploy compose files that use the Docker model provider.

For example:

  chat:
    provider:
      type: model
      options:
        model: chat-default
    x-defang-llm: true

will be transformed into:

  chat:
    command:
      - --drop_params
      - --model
      - chat-default
    environment:
      LITELLM_MASTER_KEY: ""
    image: litellm/litellm:latest
    networks:
      model_provider_private: null
    ports:
      - mode: host
        target: 4000
        protocol: tcp
    x-defang-llm: true
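For context, a service that depends on this provider would reach it over the shared model_provider_private network. A hypothetical dependent is sketched below; the app service name, its image, and the environment variable names are illustrative assumptions (the fixup injects the actual endpoint and model variables), with the port-4000 /v1 endpoint taken from the review discussion:

```yaml
services:
  app:
    image: my-app:latest            # hypothetical consumer image
    depends_on:
      - chat
    environment:
      # LiteLLM serves an OpenAI-compatible API on port 4000
      CHAT_API_URL: http://chat:4000/v1/   # env var name is an assumption
      CHAT_MODEL: chat-default
    networks:
      model_provider_private: null
```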

Linked Issues

Checklist

  • I have performed a self-review of my code
  • I have added appropriate tests
  • I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

  • New Features

    • Switched model/LLM runtime to LiteLLM with updated images, service URL/host port (4000), provider-aware model identifiers, and LITELLM_MASTER_KEY / API-key propagation.
  • Bug Fixes

    • Improved rollout feedback with per-iteration pending-service reporting and a deployment-complete message.
    • More reliable GCE instance-group audit handling via live label lookup and safer handling when labels are missing or unavailable.
  • Tests

    • Added and expanded unit tests and fixtures for LiteLLM behavior and GCP audit-log parsing; new warnings for sensitive env vars.
  • Chores

    • Minor README wording and build metadata updates.

@jordanstephens jordanstephens added the wip Work in progress; don't merge label Apr 2, 2026

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Fetch provider AccountInfo once and thread it into model/provider fixups; switch model-provider services to LiteLLM (image, command, port 4000, LITELLM_MASTER_KEY behavior); change GCP BYOC log parsing to query live instance-group-manager labels; update test fixtures, warnings, docs, and supporting mocks.

Changes

Cohort / File(s) Summary
Core Fixup Logic
src/pkg/cli/compose/fixup.go
Call provider.AccountInfo(ctx) once; thread *client.AccountInfo into fixupModel/fixupModelProvider and makeAccessGatewayService; add modelWithProvider; change provider-service generation to use litellm image, command args, port=4000, and LITELLM_MASTER_KEY/OPENAI_API_KEY propagation.
LLM Test Fixtures
src/testdata/llm/compose.yaml, src/testdata/llm/compose.yaml.fixup, src/testdata/llm/compose.yaml.golden
Replace access-gateway images with litellm/... variants, add/update alt-repo, and change host-exposed target ports from 80 → 4000 where applicable.
Model & Provider Test Fixtures
src/testdata/models/compose.yaml.fixup, src/testdata/provider/compose.yaml.fixup, src/testdata/models/compose.yaml.warnings, src/testdata/provider/compose.yaml.warnings
Update images to litellm/...:v1.82.3-stable.patch.3, set explicit --drop_params --model ... --alias ... commands, change exposed target ports to 4000, replace seeded OPENAI_API_KEY with LITELLM_MASTER_KEY, and add related warnings.
GCP BYOC & Logs
src/pkg/cli/client/byoc/gcp/query.go, src/pkg/cli/client/byoc/gcp/stream.go, src/pkg/cli/client/byoc/gcp/query_test.go, src/pkg/cli/client/byoc/gcp/stream_test.go, src/pkg/cli/client/byoc/gcp/byoc_test.go
Remove payload-label filters from insert/patch query; add GetInstanceGroupManagerLabels to client interface and mocks; change log parsing to derive project/region/manager from resource labels and fetch live IG manager labels; add tests for these flows.
GCP Cloud Implementation
src/pkg/clouds/gcp/compute.go
Add exported GetInstanceGroupManagerLabels(ctx, project, region, name) returning labels from AllInstancesConfig.Properties.Labels and wrap errors with context.
Client & Mocks
src/pkg/cli/client/mock.go
Add MockProvider.AccountInfo(context.Context) (*AccountInfo, error) stub returning empty AccountInfo.
Runtime/UX tweaks
src/pkg/cli/subscribe.go, src/pkg/cli/tailAndMonitor.go
Recompute/log pending services each loop and log deployment-complete info when CD task exits successfully.
Misc Tests & Build/Docs
src/go.mod, pkgs/defang/cli.nix, pkgs/npm/README.md, src/README.md
Move a genproto require entry, update Nix vendorHash, and minor README wording tweaks.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (FixupServices)
    participant Provider as Provider (AccountInfo)
    participant Fixup as fixupModel / fixupModelProvider
    participant Maker as makeAccessGatewayService
    participant Service as Dependent Service(s)

    CLI->>Provider: AccountInfo(ctx)
    Provider-->>CLI: *AccountInfo (may be empty)
    CLI->>Fixup: fixupModel(..., accountInfo)
    Fixup->>Maker: makeAccessGatewayService(..., accountInfo)
    Maker->>Maker: choose image/port/command (litellm, port=4000) and derive LITELLM_MASTER_KEY
    Maker->>Service: inject envs (LITELLM_MASTER_KEY / OPENAI_API_KEY fallback) and endpoint URLs
    Service-->>CLI: updated service compose config

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

go, high

Suggested reviewers

  • lionello
  • edwardrf

Poem

🐰 I hopped through ports and keys tonight,

I braided models, set the light,
Four‑thousand footsteps down the lane,
Master key snug, no secrets slain,
Tests hum softly — spring's in sight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 29.41%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the PR title 'replace openai-access-gateway with litellm' directly and clearly describes the primary objective of the changeset: replacing the openai-access-gateway service with litellm across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jordanstephens jordanstephens marked this pull request as ready for review April 6, 2026 18:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pkg/cli/compose/fixup.go (1)

362-387: ⚠️ Potential issue | 🔴 Critical

The generated *_URL is missing the port and uses an incorrect API path.

Line 366 generates http://<service>/api/v1/, but after switching to LiteLLM on port 4000 (line 387), dependents need both the port and the correct OpenAI-compatible API path. On the shared model_provider_private network, the URL without an explicit port defaults to port 80, so clients cannot reach the LiteLLM listener. Additionally, LiteLLM's OpenAI-compatible API uses /v1/, not /api/v1/. Dependent services will fail unless they can reach the correct endpoint on port 4000.

💡 Suggested fix
-	urlVal := "http://" + svccfg.Name + "/api/v1/"
+	urlVal := fmt.Sprintf("http://%s:%d/v1", svccfg.Name, liteLLMPort)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 362 - 387, The generated endpoint
URL in makeAccessGatewayService is incorrect: change urlVal to include the
LiteLLM port and the OpenAI-compatible path (use "http://<service>:<port>/v1/"),
referencing liteLLMPort and svccfg.Name, and assign it into the service
environment under the endpointEnvVar (and ensure modelEnvVar is also set to the
passed model if needed); update the env map population logic in
makeAccessGatewayService so svccfg.Environment[endpointEnvVar] = urlVal (and
svccfg.Environment[modelEnvVar] = model) instead of leaving the URL without port
and using /api/v1/.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 232-239: The current fixupLLM uses GetImageRepo which strips parts
incorrectly when a registry includes a port (e.g.
registry.example:5000/litellm:latest), causing the "/litellm" suffix check to
fail; change fixupLLM to compute the image path without removing a registry port
by stripping only the image tag/digest (remove the substring after the last ':'
or '@' only if it occurs after the last '/'), then run strings.HasSuffix on that
sanitized path; update references in fixupLLM to use svccfg.Image (or a local
sanitizedImage variable) for the suffix check and preserve adding the host port
using liteLLMPort if the suffix matches and svccfg.Ports is empty.

---

Outside diff comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 362-387: The generated endpoint URL in makeAccessGatewayService is
incorrect: change urlVal to include the LiteLLM port and the OpenAI-compatible
path (use "http://<service>:<port>/v1/"), referencing liteLLMPort and
svccfg.Name, and assign it into the service environment under the endpointEnvVar
(and ensure modelEnvVar is also set to the passed model if needed); update the
env map population logic in makeAccessGatewayService so
svccfg.Environment[endpointEnvVar] = urlVal (and svccfg.Environment[modelEnvVar]
= model) instead of leaving the URL without port and using /api/v1/.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e90edff0-1afb-496e-b665-e6d2c7d6af51

📥 Commits

Reviewing files that changed from the base of the PR and between 9098072 and e614bb1.

📒 Files selected for processing (6)
  • src/pkg/cli/compose/fixup.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.fixup

Comment on lines 232 to 239
 func fixupLLM(svccfg *composeTypes.ServiceConfig) {
 	image := GetImageRepo(svccfg.Image)
-	if strings.HasSuffix(image, "/openai-access-gateway") && len(svccfg.Ports) == 0 {
+	if strings.HasSuffix(image, "/litellm") && len(svccfg.Ports) == 0 {
 		// HACK: we must have at least one host port to get a CNAME for the service
-		var port uint32 = 80
+		// litellm listens on 4000 by default
+		var port uint32 = liteLLMPort
 		term.Debugf("service %q: adding LLM host port %d", svccfg.Name, port)
 		svccfg.Ports = []composeTypes.ServicePortConfig{{Target: port, Mode: Mode_HOST, Protocol: Protocol_TCP}}


⚠️ Potential issue | 🟡 Minor

Handle registry ports before matching /litellm.

This suffix check inherits GetImageRepo's first-: split, so registry.example:5000/litellm:latest is reduced to registry.example before the match. Those LiteLLM images will skip the 4000-port fixup entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 232 - 239, The current fixupLLM
uses GetImageRepo which strips parts incorrectly when a registry includes a port
(e.g. registry.example:5000/litellm:latest), causing the "/litellm" suffix check
to fail; change fixupLLM to compute the image path without removing a registry
port by stripping only the image tag/digest (remove the substring after the last
':' or '@' only if it occurs after the last '/'), then run strings.HasSuffix on
that sanitized path; update references in fixupLLM to use svccfg.Image (or a
local sanitizedImage variable) for the suffix check and preserve adding the host
port using liteLLMPort if the suffix matches and svccfg.Ports is empty.
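The fix the bot describes, stripping only a trailing tag or digest so a registry host:port survives, can be sketched as follows. The function name is illustrative, not the actual helper in fixup.go; the rule is to treat ':' or '@' as a separator only when it appears after the last '/':

```go
package main

import (
	"fmt"
	"strings"
)

// stripTag removes a trailing tag or digest from an image reference without
// mangling a registry host:port. A ':' or '@' is only a tag/digest delimiter
// if it occurs after the last '/' in the reference.
func stripTag(image string) string {
	slash := strings.LastIndex(image, "/")
	if at := strings.LastIndex(image, "@"); at > slash {
		return image[:at] // digest form: repo@sha256:...
	}
	if colon := strings.LastIndex(image, ":"); colon > slash {
		return image[:colon] // tag form: repo:tag
	}
	return image // no tag or digest present
}

func main() {
	fmt.Println(stripTag("registry.example:5000/litellm:latest")) // registry.example:5000/litellm
	fmt.Println(stripTag("litellm/litellm:v1.82.3-stable.patch.3")) // litellm/litellm
	fmt.Println(stripTag("ghcr.io/org/litellm@sha256:abc"))       // ghcr.io/org/litellm
}
```

With this shape, a suffix check like strings.HasSuffix(stripTag(image), "/litellm") would match images from registries that include a port.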


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pkg/cli/compose/fixup.go (1)

437-452: ⚠️ Potential issue | 🟠 Major

Potential nil map assignment on dependency.Networks.

At line 442, dependency.Networks[modelProviderNetwork] = nil will panic if dependency.Networks is nil. The same pattern at line 466 has the same issue. The code at lines 400-406 correctly handles this case for svccfg.Networks.

🐛 Proposed fix
 	for _, dependency := range project.Services {
 		if _, ok := dependency.DependsOn[svccfg.Name]; ok {
 			if dependency.Environment == nil {
 				dependency.Environment = make(composeTypes.MappingWithEquals)
 			}
+			if dependency.Networks == nil {
+				dependency.Networks = make(map[string]*composeTypes.ServiceNetworkConfig)
+			}
 			dependency.Networks[modelProviderNetwork] = nil

And similarly for the dependency.Models block around line 466:

 			if dependency.Environment == nil {
 				dependency.Environment = make(composeTypes.MappingWithEquals)
 			}
+			if dependency.Networks == nil {
+				dependency.Networks = make(map[string]*composeTypes.ServiceNetworkConfig)
+			}
 			dependency.Networks[modelProviderNetwork] = nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 437 - 452, The code assigns into
dependency.Networks without ensuring the map is initialized (causing a panic
when nil); update the loop over project.Services that references
dependency.Networks to check if dependency.Networks == nil and if so set
dependency.Networks = make(composeTypes.MappingWithEquals) before setting
dependency.Networks[modelProviderNetwork] = nil, and apply the same nil-check +
initialization pattern used for svccfg.Networks to the other similar block that
modifies dependency.Models (the second occurrence around the later block) so
both maps are safely initialized before assignment.
🧹 Nitpick comments (3)
src/pkg/cli/subscribe.go (1)

78-85: Verbose logging on every message iteration.

This logs on every iteration of the subscribe loop, which could produce excessive output if the stream is chatty. Additionally, the log executes before the msg == nil check (line 87), so it prints even when there's nothing meaningful to process.

Consider either:

  1. Moving this logging after the msg == nil check, or
  2. Only logging when pendingServices changes from the previous iteration, or
  3. Logging at a lower level (e.g., term.Debug) since tailAndMonitor.go now also logs a waiting message.
♻️ Option: Move logging after nil check and reduce verbosity
+		if msg == nil {
+			continue
+		}
+
 		pendingServices := []string{}
 		for _, service := range services {
 			if serviceStates[service] != targetState {
 				pendingServices = append(pendingServices, service)
 			}
 		}
 
-		term.Infof("Waiting for %q to be in state %s...\n", pendingServices, targetState) // TODO: don't print in Go-routine
+		term.Debugf("Pending services waiting for state %s: %v\n", targetState, pendingServices) // TODO: don't print in Go-routine
 
-		if msg == nil {
-			continue
-		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/subscribe.go` around lines 78 - 85, The current loop logs via
term.Infof every iteration and does so before checking msg == nil, causing
excessive/irrelevant output; move the term.Infof call so it executes after the
nil-check for msg, and reduce verbosity by either using term.Debug (or a lower
log level) or only emitting when pendingServices (computed from services,
serviceStates, targetState) changes from its previous value; update the call
site where term.Infof is used (the pendingServices computation and logging
block) and ensure you track the lastPendingServices value to compare before
logging if choosing the “only when changed” approach.
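The "only when changed" option can be sketched with a fingerprint of the pending set carried between iterations. This is a standalone illustration, not the actual subscribe.go code; the helper and message format are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// pendingMessage returns a log line only when the pending-service set differs
// from the previous iteration, so a chatty subscribe loop does not repeat
// itself. It also returns the new fingerprint for the next iteration.
func pendingMessage(prev string, pending []string, targetState string) (msg, next string) {
	next = strings.Join(pending, ",")
	if next == prev {
		return "", next // unchanged: stay quiet
	}
	return fmt.Sprintf("Waiting for %q to be in state %s...", pending, targetState), next
}

func main() {
	prev := ""
	for _, pending := range [][]string{{"api", "db"}, {"api", "db"}, {"api"}} {
		var msg string
		msg, prev = pendingMessage(prev, pending, "DEPLOYMENT_COMPLETED")
		if msg != "" {
			fmt.Println(msg) // emitted only on the 1st and 3rd iterations
		}
	}
}
```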
src/pkg/clouds/gcp/compute.go (1)

14-18: Consider caching or reusing the Compute client.

A new compute.Service client is created on every call to GetInstanceGroupManagerLabels. If this method is called frequently (e.g., for each audit log entry during a deployment), this could introduce unnecessary latency and resource overhead from repeated client initialization.

This is likely acceptable for the current use case (sparse audit log entries), but if performance becomes a concern, consider passing a pre-initialized client or caching it at the Gcp struct level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/gcp/compute.go` around lines 14 - 18, The
GetInstanceGroupManagerLabels function creates a new compute.Service on every
call; to reduce repeated client initialization, add a reusable client on the Gcp
struct (e.g., a field computeSvc *compute.Service) or accept a pre-initialized
client, initialize it lazily in a method like ensureComputeService() and reuse
it in GetInstanceGroupManagerLabels, and protect lazy init with appropriate
synchronization (e.g., sync.Mutex or sync.Once) to make compute.Service creation
thread-safe.
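The lazy-init pattern the bot suggests can be sketched with sync.Once. The fakeComputeService type below stands in for *compute.Service from google.golang.org/api/compute/v1, and ensureComputeService is a hypothetical method name; only the caching pattern is the point:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeComputeService stands in for *compute.Service; only the pattern matters.
type fakeComputeService struct{}

type Gcp struct {
	once       sync.Once
	computeSvc *fakeComputeService
	initErr    error
	inits      int // counts constructions, to demonstrate reuse
}

// ensureComputeService builds the client exactly once; sync.Once makes the
// lazy initialization safe under concurrent callers.
func (g *Gcp) ensureComputeService() (*fakeComputeService, error) {
	g.once.Do(func() {
		g.inits++
		g.computeSvc = &fakeComputeService{} // real code: compute.NewService(ctx)
	})
	return g.computeSvc, g.initErr
}

func main() {
	g := &Gcp{}
	a, _ := g.ensureComputeService()
	b, _ := g.ensureComputeService()
	fmt.Println(a == b, g.inits) // → true 1
}
```

An init error captured into initErr inside the Do callback would be returned to every subsequent caller, which is the usual trade-off with sync.Once caching.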
src/pkg/cli/compose/fixup.go (1)

40-44: Consider escalating the warning when AccountInfo is needed for model services.

When AccountInfo fails and the project contains model services, the resulting empty AccountInfo will cause provider-specific environment variables (like AWS_REGION, VERTEXAI_PROJECT, VERTEXAI_LOCATION) to be omitted. This could lead to runtime failures that are difficult to diagnose since the error is only logged at debug level.

Consider logging a more visible warning when model services exist in the project and AccountInfo fails.

💡 Suggested improvement
 	accountInfo, err := provider.AccountInfo(ctx)
 	if err != nil {
 		term.Debugf("failed to get account info to fixup services: %v", err)
 		accountInfo = &client.AccountInfo{}
+		// Check if any model services exist that might need account info
+		hasModelServices := len(project.Models) > 0
+		if !hasModelServices {
+			for _, svc := range project.Services {
+				if svc.Provider != nil && svc.Provider.Type == "model" {
+					hasModelServices = true
+					break
+				}
+			}
+		}
+		if hasModelServices {
+			term.Warnf("failed to get account info; model services may not have required cloud provider configuration: %v", err)
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 40 - 44, When
provider.AccountInfo(ctx) fails, the code currently only calls term.Debugf and
falls back to an empty client.AccountInfo, which hides failures that matter when
the project contains model services; update the logic around
provider.AccountInfo (the call to provider.AccountInfo, the accountInfo variable
and the debug log) to check whether any model services are present and, if so,
log a visible warning (e.g., term.Warnf/term.Warningf) that includes the error
details and notes that provider-specific env vars (AWS_REGION, VERTEXAI_PROJECT,
VERTEXAI_LOCATION) may be omitted, while still falling back to accountInfo =
&client.AccountInfo{} to preserve current behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 411-434: The LITELLM_MASTER_KEY derivation loop can panic and
continues after a fatal inconsistency: ensure you nil-check
service.Environment["OPENAI_API_KEY"] before dereferencing (i.e., verify key !=
nil before using *key) when accumulating into openAIKey, and on detecting
multiple different OPENAI_API_KEY values (where currently term.Errorf is called)
stop further processing immediately instead of break — propagate the error or
return from the enclosing function (or set an explicit failure path) so
liteLLMMasterKey is not set from an inconsistent first value; update references
in this block (openAIKey, service.Environment["OPENAI_API_KEY"], term.Errorf,
liteLLMMasterKey, svccfg.Environment) accordingly.
- Line 398: The image tag assigned to svccfg.Image is incorrect; update the
value of svccfg.Image from "litellm/litellm:v1.82.3-stable.patch.3" to the
available tag "litellm/litellm:v1.82.3-stable.patch.2" so the code uses the
existing Docker image.

---

Outside diff comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 437-452: The code assigns into dependency.Networks without
ensuring the map is initialized (causing a panic when nil); update the loop over
project.Services that references dependency.Networks to check if
dependency.Networks == nil and if so set dependency.Networks =
make(composeTypes.MappingWithEquals) before setting
dependency.Networks[modelProviderNetwork] = nil, and apply the same nil-check +
initialization pattern used for svccfg.Networks to the other similar block that
modifies dependency.Models (the second occurrence around the later block) so
both maps are safely initialized before assignment.

---

Nitpick comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 40-44: When provider.AccountInfo(ctx) fails, the code currently
only calls term.Debugf and falls back to an empty client.AccountInfo, which
hides failures that matter when the project contains model services; update the
logic around provider.AccountInfo (the call to provider.AccountInfo, the
accountInfo variable and the debug log) to check whether any model services are
present and, if so, log a visible warning (e.g., term.Warnf/term.Warningf) that
includes the error details and notes that provider-specific env vars
(AWS_REGION, VERTEXAI_PROJECT, VERTEXAI_LOCATION) may be omitted, while still
falling back to accountInfo = &client.AccountInfo{} to preserve current
behavior.

In `@src/pkg/cli/subscribe.go`:
- Around line 78-85: The current loop logs via term.Infof every iteration and
does so before checking msg == nil, causing excessive/irrelevant output; move
the term.Infof call so it executes after the nil-check for msg, and reduce
verbosity by either using term.Debug (or a lower log level) or only emitting
when pendingServices (computed from services, serviceStates, targetState)
changes from its previous value; update the call site where term.Infof is used
(the pendingServices computation and logging block) and ensure you track the
lastPendingServices value to compare before logging if choosing the “only when
changed” approach.

In `@src/pkg/clouds/gcp/compute.go`:
- Around line 14-18: The GetInstanceGroupManagerLabels function creates a new
compute.Service on every call; to reduce repeated client initialization, add a
reusable client on the Gcp struct (e.g., a field computeSvc *compute.Service) or
accept a pre-initialized client, initialize it lazily in a method like
ensureComputeService() and reuse it in GetInstanceGroupManagerLabels, and
protect lazy init with appropriate synchronization (e.g., sync.Mutex or
sync.Once) to make compute.Service creation thread-safe.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 67f36a31-3342-44b2-a2d6-5a69183e29d3

📥 Commits

Reviewing files that changed from the base of the PR and between e614bb1 and eb7a99b.

📒 Files selected for processing (23)
  • pkgs/defang/cli.nix
  • pkgs/npm/README.md
  • src/README.md
  • src/go.mod
  • src/pkg/cli/client/byoc/gcp/byoc_test.go
  • src/pkg/cli/client/byoc/gcp/query.go
  • src/pkg/cli/client/byoc/gcp/query_test.go
  • src/pkg/cli/client/byoc/gcp/stream.go
  • src/pkg/cli/client/byoc/gcp/stream_test.go
  • src/pkg/cli/client/mock.go
  • src/pkg/cli/compose/fixup.go
  • src/pkg/cli/subscribe.go
  • src/pkg/cli/tailAndMonitor.go
  • src/pkg/clouds/aws/login.go
  • src/pkg/clouds/gcp/compute.go
  • src/pkg/clouds/gcp/login.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.warnings
✅ Files skipped from review due to trivial changes (6)
  • pkgs/npm/README.md
  • src/README.md
  • src/pkg/clouds/aws/login.go
  • pkgs/defang/cli.nix
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.warnings
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/llm/compose.yaml


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/pkg/cli/compose/fixup.go (2)

238-246: ⚠️ Potential issue | 🟡 Minor

Registry port in image names breaks /litellm suffix check.

GetImageRepo splits on the first :, so registry.example:5000/litellm:latest becomes registry.example, losing the /litellm path. Such images won't get the port 4000 fixup.

Consider splitting only on the last : that follows a / (to separate tag from path), or use a proper image reference parser.


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 238 - 246, The suffix check in
fixupLLM fails because GetImageRepo splits on the first ':' and drops the path
for registry URLs (e.g., registry:5000/...); update GetImageRepo usage or its
implementation so it preserves the repository path before checking suffix:
either change GetImageRepo to strip the image tag by splitting on the last ':'
only if that colon appears after the last '/' (so registry:port is not treated
as a tag), or replace GetImageRepo with a proper image reference parser and then
call fixupLLM using the parsed repository/name to correctly detect
strings.HasSuffix(repo, "/litellm") in the fixupLLM function.

424-446: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference risk when reading OPENAI_API_KEY.

At line 431, *key is dereferenced without checking if key is nil. In compose files, environment variables can be declared without a value (OPENAI_API_KEY: or OPENAI_API_KEY: null), resulting in a nil pointer that will panic on dereference.

Additionally, at line 433-434, after logging the error about multiple different keys, execution continues with break and uses whatever openAIKey was found first—potentially inconsistent behavior.


🐛 Proposed fix
 		if _, ok := service.DependsOn[svccfg.Name]; ok {
 			if key, ok := service.Environment["OPENAI_API_KEY"]; ok {
+				if key == nil {
+					continue // skip nil (unset) keys
+				}
 				if openAIKey == "" {
 					openAIKey = *key
 				} else if *key != openAIKey {
 					term.Errorf("multiple different OPENAI_API_KEY values found in services depending on %q", svccfg.Name)
-					break
+					return // abort with inconsistent configuration
 				}
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/fixup.go` around lines 424 - 446, The loop that reads
OPENAI_API_KEY from dependent services (involving svccfg, project.Services and
openAIKey) can dereference a nil *string (key) and may continue using an
inconsistent first-found value after detecting differing keys; fix it by
checking key != nil before using *key, skip nil/empty values, and when you
detect multiple different non-nil keys (the condition where term.Errorf is
called) clear openAIKey (e.g., set openAIKey = "") or otherwise mark it invalid
and break so you don't keep using the first value, then proceed to set
LITELLM_MASTER_KEY as before (falling back to the default when openAIKey is
empty).
🧹 Nitpick comments (1)
src/testdata/llm/compose.yaml (1)

2-10: Good addition of alt-repo test case for custom registry.

This tests the scenario of litellm images from alternate registries. Note that line 9 uses target: "4000" as a string—compose-go handles both, but using an unquoted integer would be more conventional:

target: 4000
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/testdata/llm/compose.yaml` around lines 2 - 10, The ports entry for the
alt-repo service uses a string for target ("4000"); update the ports block for
service name alt-repo so target is an integer (target: 4000) instead of a quoted
string; locate the ports mapping under the alt-repo service in
src/testdata/llm/compose.yaml and remove the quotes around 4000 to make it an
unquoted integer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/pkg/cli/compose/fixup.go`:
- Around line 238-246: The suffix check in fixupLLM fails because GetImageRepo
splits on the first ':' and drops the path for registry URLs (e.g.,
registry:5000/...); update GetImageRepo usage or its implementation so it
preserves the repository path before checking suffix: either change GetImageRepo
to strip the image tag by splitting on the last ':' only if that colon appears
after the last '/' (so registry:port is not treated as a tag), or replace
GetImageRepo with a proper image reference parser and then call fixupLLM using
the parsed repository/name to correctly detect strings.HasSuffix(repo,
"/litellm") in the fixupLLM function.
- Around line 424-446: The loop that reads OPENAI_API_KEY from dependent
services (involving svccfg, project.Services and openAIKey) can dereference a
nil *string (key) and may continue using an inconsistent first-found value after
detecting differing keys; fix it by checking key != nil before using *key, skip
nil/empty values, and when you detect multiple different non-nil keys (the
condition where term.Errorf is called) clear openAIKey (e.g., set openAIKey =
"") or otherwise mark it invalid and break so you don't keep using the first
value, then proceed to set LITELLM_MASTER_KEY as before (falling back to the
default when openAIKey is empty).

---

Nitpick comments:
In `@src/testdata/llm/compose.yaml`:
- Around line 2-10: The ports entry for the alt-repo service uses a string for
target ("4000"); update the ports block for service name alt-repo so target is
an integer (target: 4000) instead of a quoted string; locate the ports mapping
under the alt-repo service in src/testdata/llm/compose.yaml and remove the
quotes around 4000 to make it an unquoted integer.
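The corrected ports block would look like this (the `mode`/`protocol` values are assumed to mirror the existing fixture):

```yaml
services:
  alt-repo:
    ports:
      - mode: host
        target: 4000   # integer, not "4000"
        protocol: tcp
```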

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4094c22-2f2d-4ea5-a0af-a3b0af72a866

📥 Commits

Reviewing files that changed from the base of the PR and between 4013d22 and 67c00bf.

📒 Files selected for processing (10)
  • src/pkg/cli/client/mock.go
  • src/pkg/cli/compose/fixup.go
  • src/pkg/cli/subscribe.go
  • src/testdata/llm/compose.yaml
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/models/compose.yaml.fixup
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
  • src/testdata/provider/compose.yaml.warnings
✅ Files skipped from review due to trivial changes (3)
  • src/testdata/models/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.warnings
  • src/testdata/provider/compose.yaml.fixup
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/pkg/cli/subscribe.go
  • src/testdata/llm/compose.yaml.golden
  • src/testdata/llm/compose.yaml.fixup
  • src/testdata/models/compose.yaml.fixup

jordanstephens and others added 19 commits April 9, 2026 09:37
G101 is a gosec rule ID, not a standalone linter name. Using it in
//nolint directives caused golangci-lint to warn about unknown linters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE allInstancesConfig.properties.labels is a map<string,string>, not a
list of {key,value} structs. The query filters were using the list format
(labels.key="defang-service" / labels.value="...") which never matched any
audit log entries, so gce_instance_group_manager events were never returned
by Cloud Logging. Even if events had arrived, the parser was iterating over
the field as a list (GetListInStruct) which always returned nil, leaving the
computeEngineRootTriggers map empty. As a result, all gce_instance_group
addInstances events were silently dropped and WaitServiceState never
received DEPLOYMENT_COMPLETED for Compute Engine services.

Fix the query to use map-style key access:
  labels."defang-service"=~"^(svc)$"

Fix the parser to use GetValueInStruct with the label name as a path key,
replacing the 10-line list iteration with a single call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE audit logs for regionInstanceGroupManagers.patch only carry the
fields that changed (e.g. the new instance template version). The
allInstancesConfig.properties.labels — where the defang-service label
lives — is absent from the request body for every update after the
initial create. As a result, the computeEngineRootTriggers map was
never populated and all gce_instance_group addInstances events were
silently dropped, so WaitServiceState never received DEPLOYMENT_COMPLETED
for Compute Engine services.

Fix: instead of reading labels from the audit log request body, read
the instance group manager name, project, and region from the always-
present entry.Resource.Labels and call the GCE REST API to get the
live resource's allInstancesConfig.properties.labels. This mirrors the
fallback used by the server-side fabric_gcp.go implementation.

Add GetInstanceGroupManagerLabels to GcpLogsClient and implement it
using the already-present google.golang.org/api/compute/v1 dependency
(no new deps required).

Also add the missing isQuotaError helper to the gcpquota debug tool,
which was preventing the pre-commit lint check from passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PATCH requests for regionInstanceGroupManagers only carry changed fields
(e.g. a new instance template reference). When Pulumi re-deploys a CE
service, it patches the instance template without including
allInstancesConfig.properties.labels in the request body. The Cloud
Logging filter on those absent label fields never matched, so no
gce_instance_group_manager events were returned for re-deploys, leaving
computeEngineRootTriggers empty and causing all gce_instance_group
addInstances events to be silently dropped.

The parser already handles service-specific filtering by reading labels
from the live MIG resource via GetInstanceGroupManagerLabels, so the
query-level label filters are redundant and harmful. Remove them and keep
only the method name and operation.first filters, consistent with how
AddComputeEngineInstanceGroupAddInstances works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Cover the three bugs fixed in this branch:
- TestAddComputeEngineInstanceGroupInsertOrPatch: asserts the query
  contains no allInstancesConfig or defang-* label filters (guarding
  against the old list-format filters that never matched)
- TestActivityParser_GceInstanceGroupManager: table-driven tests for the
  gce_instance_group_manager parser path — happy path, API error, nil
  labels, missing defang-service label, and missing root_trigger_id
- TestActivityParser_GceInstanceGroupFlow: end-to-end test that a
  manager insert/patch entry populates the trigger map and a subsequent
  addInstances entry uses it to emit DEPLOYMENT_COMPLETED
- TestActivityParser_GceInstanceGroupDropsUnknownTrigger: events with
  an unrecognized root_trigger_id are silently dropped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of

 * Waiting for ["app" "worker" "chat" "embedding"] to be in state DEPLOYMENT_COMPLETED...
 * Waiting for ["app" "worker" "embedding"] to be in state DEPLOYMENT_COMPLETED...
 * Waiting for ["app" "worker"] to be in state DEPLOYMENT_COMPLETED...

I prefer

 * Waiting for services to finish deploying: ["app" "worker" "chat" "embedding"]
 * Waiting for services to finish deploying: ["app" "worker" "embedding"]
 * Waiting for services to finish deploying: ["app" "worker"]

message
I always find myself grepping for 'with state', which only feels
tangentially related. A better prefix is "Service update:".
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Co-authored-by: Jordan Stephens <jordan@stephens.io>
Range-iterating project.Services yields value copies, so assigning a newly
created DependsOn map to the local copy was silently dropped. Switch to
keyed iteration and write the service back whenever it was changed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
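The value-copy pitfall and its fix can be shown in isolation (hypothetical `Service` type; only the iteration pattern matters):

```go
package main

import "fmt"

type Service struct{ DependsOn map[string]string }

// brokenWire mirrors the bug: ranging over a map yields value copies,
// so assigning DependsOn on the copy is silently dropped.
func brokenWire(services map[string]Service) {
	for _, svc := range services {
		svc.DependsOn = map[string]string{"model": "service_started"}
	}
}

// fixedWire iterates by key and writes the modified copy back.
func fixedWire(services map[string]Service) {
	for name, svc := range services {
		svc.DependsOn = map[string]string{"model": "service_started"}
		services[name] = svc
	}
}

func main() {
	a := map[string]Service{"app": {}}
	brokenWire(a)
	fmt.Println(a["app"].DependsOn == nil) // true: the change was lost

	b := map[string]Service{"app": {}}
	fixedWire(b)
	fmt.Println(b["app"].DependsOn["model"]) // service_started
}
```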
jordanstephens and others added 2 commits April 9, 2026 09:37
Split makeAccessGatewayService into two focused helpers:
- configureAccessGateway: resolves the model for the target cloud provider,
  sets up the LiteLLM container (image, command, network, port), and
  derives LITELLM_MASTER_KEY
- wireDependentServices: injects URL/model/key env vars and network
  membership into every service that depends on the model service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

wip Work in progress; don't merge
