
Azure support #2027

Open
edwardrf wants to merge 12 commits into main from edw/azure

Conversation

@edwardrf
Contributor

@edwardrf edwardrf commented Apr 8, 2026

Description

Add support for Azure.

Linked Issues

Checklist

  • I have performed a self-review of my code
  • I have added appropriate tests
  • I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

  • New Features
    • Added Azure as a supported cloud provider for BYOC deployments with Azure Container Instances and Container Apps integration.
    • Added support for Azure App Configuration for managing application settings and secrets.
    • Added Azure Container Registry support for building and managing container images.
    • Enhanced provider selection with Azure environment variable detection and subscription ID configuration.

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 685f37da-37cf-460a-843d-89727b901660

📥 Commits

Reviewing files that changed from the base of the PR and between f11f37e and 2166bd5.

⛔ Files ignored due to path filters (1)
  • src/go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • flake.nix
  • pkgs/defang/cli.nix
  • src/go.mod
✅ Files skipped from review due to trivial changes (1)
  • pkgs/defang/cli.nix
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/go.mod
  • flake.nix

📝 Walkthrough

Walkthrough

This PR adds comprehensive Azure cloud provider support to the Defang CLI. It introduces Azure BYOC (Bring Your Own Cloud) implementation with container orchestration via Azure Container Instances, configuration management via App Configuration, container registry integration via Azure Container Registry, and log streaming. Supporting changes include Azure SDK dependencies, provider infrastructure updates, HTTP utilities for blob uploads, and nix flake tooling additions.

Changes

  • Nix Flake & Dependencies (flake.nix, pkgs/defang/cli.nix, src/go.mod): Added azure-cli and pulumiPackages.pulumi-go to build inputs; reordered existing packages. Updated the Go module with Azure SDK dependencies (azcore, azidentity, azappconfig, and ARM clients for AppConfig/Containers/Registry/Storage/MSI/Resources/Blob) and bumped github.com/golang-jwt/jwt/v5 to v5.3.0.
  • Provider Infrastructure (src/pkg/cli/client/provider_id.go, src/pkg/cli/client/region.go, src/pkg/cli/connect.go, src/cmd/cli/command/estimate.go): Added the ProviderAzure constant and updated the Name(), Value(), and SetValue() methods. Added RegionDefaultAzure = "westus" and region env var handling. Integrated the Azure provider into connection logic via azure.NewByocProvider. Updated provider selection precedence to check AWS → Azure → DigitalOcean → GCP.
  • Azure Base Infrastructure (src/pkg/clouds/azure/location.go, src/pkg/clouds/azure/common.go): Introduced a Location type with 80+ location constants and helper methods. Implemented an Azure base struct with credential/client factory methods (NewCreds, NewStorageAccountsClient, NewBlobContainersClient) and a token credential timeout wrapper.
  • Azure Container Instances (ACI) (src/pkg/clouds/azure/aci/common.go, run.go, blob.go, setup.go, status.go, stop.go, tail.go, *_test.go): Implemented the complete ACI lifecycle: container group creation/startup via Run, resource group/storage setup via SetUpResourceGroup/SetUpStorageAccount, managed identity configuration via SetUpManagedIdentity, status polling via GetContainerGroupStatus, stopping via Stop, blob iteration/download via IterateBlobs/DownloadBlob, and log streaming via Tail/QueryLogs/PollLogs/StreamLogs, supporting both websocket attachment and polled log queries.
  • Azure Container Apps (ACA) (src/pkg/clouds/azure/aca/common.go, tail.go): Added container app log resolution via ResolveLogTarget and log streaming via WatchLogs/StreamLogs using Server-Sent Events endpoints.
  • Azure Container Registry (ACR) (src/pkg/clouds/azure/acr/types.go, acr.go, tail.go, *_test.go): Implemented ACR task scheduling via RunTask with environment/secret encoding, run status monitoring via GetRunStatus, log URL retrieval via GetRunLogURL, run cancellation via CancelRun, and log tailing via TailRunLogs with HTTP range-request based chunked reading.
  • Azure App Configuration (src/pkg/clouds/azure/appcfg/appcfg.go): Implemented configuration store discovery/creation via SetUp and key-value management via PutSetting/DeleteSetting/ListSettings with prefix filtering and read-write key resolution.
  • Azure BYOC Provider Client (src/pkg/cli/client/byoc/azure/byoc.go, login.go): Implemented the ByocAzure provider with CD execution via CdCommand (Pulumi+ACI), deployment status via GetDeploymentStatus and GetProjectUpdate, config management via ListConfig/PutConfig/DeleteConfig, blob upload URLs via CreateUploadURL, and log tailing via QueryLogs. Supports follow mode for concurrent ACI+ACA log polling as well as snapshot mode. Stubbed unsupported operations (Delete, GetService, TearDownCD, etc.). Added a placeholder Authenticate method.
  • CLI Session & Wizard (src/pkg/session/session.go, src/pkg/stacks/wizard.go, src/pkg/stacks/selector_test.go, src/pkg/utils.go): Added Azure provider mismatch warnings via an AzureInEnv() helper. Extended the wizard to elicit an Azure Subscription ID when the Azure provider is selected. Updated stack selector test mocks to include Azure in the provider options.
  • HTTP Utilities (src/pkg/http/post.go, src/pkg/http/put.go, src/pkg/cli/compose/context.go): Introduced PostWithHeader and PutWithHeader functions for custom header injection. Updated uploadArchive to use PutWithHeader with explicit Content-Type and X-Ms-Blob-Type: BlockBlob headers; broadened the success check to accept HTTP 200/201; changed the URL transformation to a single strings.Replace for GCP prefix conversion without query parameter stripping.
  • Protocol Buffers & Miscellaneous (src/protos/io/defang/v1/fabric.proto, src/pkg/cli/client/byoc/aws/byoc.go, src/pkg/cli/client/byoc/gcp/byoc.go, src/pkg/cli/tailAndMonitor.go, src/pkg/agent/tools/estimate_test.go): Added the Provider.AZURE = 5 enum value and reformatted the Service.deprecated option. Added inline comments in the AWS/GCP BYOC files. Increased the tail cancellation delay from 2s to 6s. Updated the estimate test's expected error string to include Azure in the provider list.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as Defang CLI
    participant ACI as Azure Container<br/>Instances
    participant ACR as Azure Container<br/>Registry
    participant AppCfg as Azure App<br/>Configuration
    participant Blob as Azure Blob<br/>Storage
    participant Pulumi as Pulumi Backend<br/>(S3/Blob)

    User->>CLI: deploy --provider=azure
    CLI->>CLI: NewByocProvider(Azure)
    CLI->>AppCfg: SetUp() - discover/create store
    CLI->>ACI: SetUpResourceGroup()
    CLI->>ACI: SetUpStorageAccount()
    CLI->>ACI: SetUpManagedIdentity()
    
    CLI->>ACR: SetUpRegistry()
    User->>CLI: build/push container
    CLI->>ACR: RunTask(image, command)
    ACR->>ACR: schedule task run
    CLI->>ACR: TailRunLogs() - poll completion
    
    CLI->>Blob: CreateUploadURL() - SAS URL
    CLI->>Blob: PUT ProjectUpdate protobuf
    
    CLI->>ACI: CdCommand(env, args)
    ACI->>ACI: Run container group with CD
    ACI->>Pulumi: store state via Pulumi backend
    
    CLI->>ACI: QueryLogs() - poll for output
    ACI->>CLI: return log entries
    CLI->>User: display deployment logs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • lionello
  • jordanstephens

Poem

🐰 Azure skies descend upon the warren,
With containers, configs, and registries worn,
Where ACI hops and ACR builds with might,
The defang CLI glows azure-bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 12.20%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Azure support' directly describes the main change in the changeset: adding comprehensive Azure provider support across multiple components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=warning msg="[linters_context] running gomodguard failed: unable to read module file go.mod: current working directory must have a go.mod file: if you are not using go modules it is suggested to disable this linter"
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cmd/cli/command/estimate.go (1)

70-75: ⚠️ Potential issue | 🟡 Minor

Add Azure and DigitalOcean to providerDescription map.

The auto-selection logic now supports Azure and DigitalOcean (lines 91-94), but the providerDescription map used by the survey prompt (line 106) is missing entries for these providers. This results in empty descriptions when users select them.

🔧 Proposed fix
 var providerDescription = map[client.ProviderID]string{
 	client.ProviderDefang: "The Defang Playground is a free platform intended for testing purposes only.",
 	client.ProviderAWS:    "Deploy to AWS using the AWS_* environment variables or the AWS CLI configuration.",
 	client.ProviderDO:     "Deploy to DigitalOcean using the DIGITALOCEAN_TOKEN, SPACES_ACCESS_KEY_ID, and SPACES_SECRET_ACCESS_KEY environment variables.",
 	client.ProviderGCP:    "Deploy to Google Cloud Platform using gcloud Application Default Credentials.",
+	client.ProviderAzure:  "Deploy to Azure using the AZURE_* environment variables or Azure CLI authentication.",
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cmd/cli/command/estimate.go` around lines 70 - 75, The
providerDescription map is missing entries for Azure (and ensure DigitalOcean is
present) so the survey shows empty descriptions; update the providerDescription
variable to include a descriptive string for client.ProviderAzure (e.g., mention
Azure CLI/ARM/ADC or environment variables) and verify/adjust the
client.ProviderDO entry text for DigitalOcean so both client.ProviderAzure and
client.ProviderDO have clear descriptions used by the survey prompt.
🟡 Minor comments (7)
src/pkg/http/post.go-48-50 (1)

48-50: ⚠️ Potential issue | 🟡 Minor

Avoid forcing an empty Content-Type header.

At Line 49, Content-Type is always sent, even when contentType == "". That changes previous behavior (header omitted) and can break strict endpoints.

Proposed fix
 func PostWithContext(ctx context.Context, url, contentType string, body io.Reader) (*http.Response, error) {
-	return PostWithHeader(ctx, url, http.Header{"Content-Type": []string{contentType}}, body)
+	if contentType == "" {
+		return PostWithHeader(ctx, url, nil, body)
+	}
+	return PostWithHeader(ctx, url, http.Header{"Content-Type": []string{contentType}}, body)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/http/post.go` around lines 48 - 50, PostWithContext currently always
adds a "Content-Type" header (even when contentType == ""), which forces an
empty header value and changes previous behavior; update PostWithContext so it
only passes a header to PostWithHeader when contentType is non-empty—i.e., if
contentType == "" call PostWithHeader with a nil/empty header (or omit the
header argument) and otherwise call PostWithHeader with
http.Header{"Content-Type": []string{contentType}} to preserve prior behavior;
refer to PostWithContext and PostWithHeader to locate and implement this
conditional header construction.
src/pkg/clouds/azure/aci/setup_test.go-17-22 (1)

17-22: ⚠️ Potential issue | 🟡 Minor

Use t.Fatalf to fail fast if SetUpResourceGroup fails.

Using t.Errorf allows the test to continue to TearDown even if resource group creation failed, which will likely cause a confusing cascading failure. Since TearDown depends on a successfully created resource group, use t.Fatalf or check for prior failures.

💚 Proposed fix
 	t.Run("SetUpResourceGroup", func(t *testing.T) {
 		err := c.SetUpResourceGroup(context.Background())
 		if err != nil {
-			t.Errorf("Failed to set up resource group: %v", err)
+			t.Fatalf("Failed to set up resource group: %v", err)
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup_test.go` around lines 17 - 22, Replace the
non-fatal assertion in the test's "SetUpResourceGroup" subtest so the test fails
immediately if resource group creation fails: change the call that currently
uses t.Errorf when c.SetUpResourceGroup(ctx) returns an error to t.Fatalf (or
otherwise abort the test) to prevent subsequent TearDown from running against an
uncreated resource; update the subtest around SetUpResourceGroup and references
to c.SetUpResourceGroup/context.Background() accordingly.
src/pkg/clouds/azure/aca/common.go-152-175 (1)

152-175: ⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereferences when accessing Azure SDK response fields.

The code assumes that pointer fields like Containers[0].Name, Value[0].Name, and Value[0].Properties.Containers[0].Name are non-nil after checking slice length. Azure SDK types often have nil pointer fields even when elements exist.

🛡️ Suggested nil checks
 		// Opportunistically pick the container name from the app template.
 		if container == "" && app.Properties.Template != nil && len(app.Properties.Template.Containers) > 0 {
-			container = *app.Properties.Template.Containers[0].Name
+			if name := app.Properties.Template.Containers[0].Name; name != nil {
+				container = *name
+			}
 		}
 	}

 	if replica == "" {
 		// ...
 		if len(list.Value) == 0 {
 			return "", "", "", fmt.Errorf("no replicas found for revision %q", revision)
 		}
-		replica = *list.Value[0].Name
+		if list.Value[0].Name == nil {
+			return "", "", "", fmt.Errorf("replica name is nil for revision %q", revision)
+		}
+		replica = *list.Value[0].Name

 		// Opportunistically pick the container from the replica if still unset.
-		if container == "" && len(list.Value[0].Properties.Containers) > 0 {
-			container = *list.Value[0].Properties.Containers[0].Name
+		if container == "" && list.Value[0].Properties != nil && len(list.Value[0].Properties.Containers) > 0 {
+			if name := list.Value[0].Properties.Containers[0].Name; name != nil {
+				container = *name
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/common.go` around lines 152 - 175, The code
dereferences Azure SDK pointer fields without nil checks (e.g.,
app.Properties.Template.Containers[0].Name, list.Value[0].Name,
list.Value[0].Properties.Containers[0].Name) which can panic; update the logic
in the block that calls c.newReplicasClient() and processes replicas (and the
earlier template-based container pick) to guard every pointer access: verify
app.Properties, app.Properties.Template, the container slice length, and that
Containers[0] and its Name are non-nil before dereferencing; after ListReplicas
also check list.Value is non-nil, that list.Value[0] is non-nil, that
list.Value[0].Name is non-nil, that list.Value[0].Properties is non-nil, and
that its Containers[0] and Name are non-nil, and return a clear error if any
expected field is nil rather than dereferencing.
src/pkg/clouds/azure/aci/tail_test.go-23-28 (1)

23-28: ⚠️ Potential issue | 🟡 Minor

Uncomment or remove the resource group cleanup.

The commented-out TearDown call will leave Azure resource groups lingering when this test is eventually enabled. Either uncomment the cleanup or remove the empty t.Cleanup block entirely.

🧹 Proposed fix
 	t.Cleanup(func() {
-		// err := containerInstance.TearDown(ctx)
-		// if err != nil {
-		// 	t.Fatalf("Failed to tear down container instance: %v", err)
-		// }
+		err := containerInstance.TearDown(ctx)
+		if err != nil {
+			t.Logf("Failed to tear down container instance: %v", err)
+		}
 	})

Note: Changed t.Fatalf to t.Logf in cleanup since fatals in cleanup can mask the original test result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/tail_test.go` around lines 23 - 28, Uncomment and
run the resource group cleanup by calling containerInstance.TearDown(ctx) inside
the existing t.Cleanup closure (or remove the empty t.Cleanup if you
intentionally want no cleanup); if you restore the call, handle errors by
logging rather than fatalling in cleanup (use t.Logf to report any TearDown
error) and ensure ctx and containerInstance are in scope for the cleanup closure
so the test won't leave Azure resource groups lingering.
src/pkg/clouds/azure/aca/tail.go-54-74 (1)

54-74: ⚠️ Potential issue | 🟡 Minor

Silent error swallowing may hide configuration or authentication issues.

The poll() function silently returns on client creation errors (line 57) and paging errors (line 63). This could mask authentication failures or API issues, making debugging difficult. Consider logging these errors at debug level.

🔧 Proposed fix to add debug logging
 		poll := func() {
 			client, err := c.newContainerAppsClient()
 			if err != nil {
+				// Log at debug level to aid troubleshooting
 				return
 			}
 			pager := client.NewListByResourceGroupPager(c.ResourceGroup, nil)
 			for pager.More() {
 				page, err := pager.NextPage(ctx)
 				if err != nil {
+					// Log at debug level to aid troubleshooting
 					return
 				}

Note: You'll need to import a logging package or pass a logger to surface these errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/tail.go` around lines 54 - 74, poll currently
swallows errors from c.newContainerAppsClient() and pager.NextPage(), which
hides auth/config issues; update the poll function to log those errors at debug
level before returning (use the logger available on the receiver struct c or
add/pass one if missing), e.g., log the error and contextual info (resource
group, operation) when c.newContainerAppsClient() fails and when
pager.NextPage(ctx) returns an error, then preserve the original control flow
(return after logging) so failures are visible for debugging; update references
in the poll function around c.newContainerAppsClient,
client.NewListByResourceGroupPager and pager.NextPage accordingly.
src/pkg/clouds/azure/acr/acr.go-207-214 (1)

207-214: ⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference on props.Status.

props.Status is dereferenced at line 209 without a nil check. If the API returns a run with no status, this will panic.

🔧 Proposed fix
+	if props.Status == nil {
+		return nil, fmt.Errorf("run %s has no status", runID)
+	}
+
 	status := &RunStatus{
 		RunID:  runID,
 		Status: *props.Status,
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr.go` around lines 207 - 214, The code
dereferences props.Status when constructing a RunStatus, which can panic if
props.Status is nil; update the construction (in the block that returns
*RunStatus) to check props.Status for nil before using *props.Status (e.g., set
statusStr := "unknown" or empty string and overwrite it only if props.Status !=
nil), assign that statusStr to RunStatus.Status, and keep the existing nil check
for props.RunErrorMessage; references: RunStatus, props.Status,
props.RunErrorMessage.
src/pkg/clouds/azure/acr/acr.go-90-105 (1)

90-105: ⚠️ Potential issue | 🟡 Minor

RegistryName is set before creation is confirmed.

a.RegistryName is assigned at line 90 before BeginCreate or PollUntilDone succeeds. If registry creation fails, the struct is left in an inconsistent state with a name that doesn't exist.

🔧 Proposed fix to set name after successful creation
-	a.RegistryName = registryName
-
 	poller, err := client.BeginCreate(ctx, a.resourceGroupName, registryName, armcontainerregistry.Registry{
 		Location: a.Location.Ptr(),
 		SKU: &armcontainerregistry.SKU{
 			Name: to.Ptr(armcontainerregistry.SKUNameBasic),
 		},
 	}, nil)
 	if err != nil {
 		return fmt.Errorf("failed to create container registry: %w", err)
 	}

 	_, err = poller.PollUntilDone(ctx, nil)
 	if err != nil {
 		return fmt.Errorf("failed to poll registry creation: %w", err)
 	}

+	a.RegistryName = registryName
 	return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr.go` around lines 90 - 105, Do not assign
a.RegistryName until after successful creation: move the a.RegistryName =
registryName assignment to after client.BeginCreate and poller.PollUntilDone
complete without error so the struct isn't left in an inconsistent state;
specifically, keep the calls to client.BeginCreate(...) and
poller.PollUntilDone(ctx, nil) as-is, and only set a.RegistryName = registryName
after the second call returns nil error (use the existing variables poller and
err to gate the assignment).
🧹 Nitpick comments (15)
src/pkg/cli/compose/context.go (1)

262-264: Azure-specific header in shared upload code.

The X-Ms-Blob-Type: BlockBlob header is Azure-specific. As noted by the HACK comment, this should ideally be handled in the Azure provider rather than in shared compose code. For now, this works but may cause issues if other storage backends reject unknown headers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/compose/context.go` around lines 262 - 264, Shared compose upload
code currently sets the Azure-specific header "X-Ms-Blob-Type" on the common
header variable before calling http.PutWithHeader; remove that Azure-specific
header from the shared function (leave Content-Type intact) and instead add
logic in the Azure provider to inject "X-Ms-Blob-Type: BlockBlob" for uploads to
Azure blobs. Locate the header construction and http.PutWithHeader call (the
header variable and the call to http.PutWithHeader) to remove
header.Set("X-Ms-Blob-Type", "BlockBlob") here, and update the Azure provider
upload implementation to set that header when targeting BlockBlob storage.
Ensure other providers are unaffected and tests cover that Azure uploads include
the header while other backends do not.
src/pkg/http/put.go (1)

24-30: Consider preserving any default request headers.

The current implementation replaces the entire Header map with the caller-provided header. While http.NewRequestWithContext currently initializes an empty Header map, a more defensive approach would be to merge headers:

♻️ Optional: Merge headers instead of replacing
 func PutWithHeader(ctx context.Context, url string, header http.Header, body io.Reader) (*http.Response, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, body)
 	if err != nil {
 		return nil, err
 	}
-	req.Header = header
+	for key, values := range header {
+		for _, value := range values {
+			req.Header.Add(key, value)
+		}
+	}
 	return DefaultClient.Do(req)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/http/put.go` around lines 24 - 30, PutWithHeader currently overwrites
the request's Header map instead of merging with any existing/default headers;
change the logic so after creating the request with http.NewRequestWithContext
you iterate over the provided header (if non-nil) and copy each key/value into
req.Header using Add/Set rather than replacing the map, preserving any defaults;
keep the current return path via DefaultClient.Do(req) and ensure nil-safety
when header is nil.
src/pkg/clouds/azure/common.go (2)

53-79: Deduplicate ARM storage client-factory creation.

Line 53-79 repeats the same NewCreds() + armstorage.NewClientFactory(...) flow. A small helper would reduce repetition and keep future client additions consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/common.go` around lines 53 - 79, Create a small helper
on the Azure type to encapsulate the duplicated NewCreds() +
armstorage.NewClientFactory(...) logic (e.g., a method named
newStorageClientFactory or storageClientFactory that returns
(*armstorage.ClientFactory, error}); have Azure.NewStorageAccountsClient and
Azure.NewBlobContainersClient call that helper and return
clientFactory.NewAccountsClient() and clientFactory.NewBlobContainersClient()
respectively, propagating errors unchanged (use the same error wrapping used
today when creating the factory).

81-93: Remove commented-out constructor stub from production code.

The commented block is stale and adds noise. Prefer deleting it or tracking it in an issue/PR note.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/common.go` around lines 81 - 93, Remove the stale
commented-out constructor stub for NewRoleAssignmentsClient in the Azure type:
delete the entire commented block that references func (a Azure)
NewRoleAssignmentsClient(), armauthorization.RoleAssignmentsClient, and related
NewRoleAssignmentsClient creation logic so production code contains no
dead/commented code; if this logic needs tracking, add a separate ticket/PR note
instead of leaving commented code in src/pkg/clouds/azure/common.go.
src/pkg/cli/tailAndMonitor.go (1)

65-66: Consider documenting or making the delay configurable.

The delay before cancelling tail was tripled from 2s to 6s. While this ensures log capture, it also increases wait time for all providers. Consider:

  1. Adding a comment explaining why 6s was chosen (Azure log propagation latency?).
  2. Making this provider-specific if Azure needs more time but AWS/GCP don't.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/tailAndMonitor.go` around lines 65 - 66, The hardcoded 6s sleep
before cancelling the tail (pkg.SleepWithContext(ctx, 6*time.Second) followed by
cancelTail(errMonitoringDone)) should be documented and made configurable; add a
short comment near that call explaining why 6s was chosen (e.g., Azure log
propagation) and replace the literal with a configurable value or
provider-specific constant (e.g., tailCancelDelay or providerTailDelays map used
by pkg.SleepWithContext), wiring a flag/env/config option so Azure can use 6s
while AWS/GCP use a shorter delay, and keep cancelTail(errMonitoringDone)
behavior unchanged.
src/pkg/clouds/azure/aci/setup.go (2)

51-53: Consider resolving the TODO for storage account cleanup.

The TODO suggests uncertainty about storage account deletion. If the storage account is created within the same resource group (as SetUpStorageAccount does), it will be deleted with the resource group. However, if AZURE_STORAGE_ACCOUNT is used, the external account won't be affected—which is likely the correct behavior.

Would you like me to open an issue to track clarifying this behavior?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup.go` around lines 51 - 53, Replace the TODO
about deleting the storage account with a clear comment documenting the intended
behavior: explain that when SetUpStorageAccount creates a storage account in the
same resource group it will be removed when the resource group is deleted, while
an externally provided AZURE_STORAGE_ACCOUNT must not and will not be deleted by
this code; remove or avoid adding any deletion logic here and, if desired, add
an explicit TODO to open an issue for lifecycle policy decisions rather than
leaving ambiguous text.

108-134: Remove commented-out debug code.

This large block of commented-out code appears to be debug/experimental code. Consider removing it to improve readability, or if it's needed for future work, track it in an issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup.go` around lines 108 - 134, Remove the large
commented-out debug block containing getCurrentUserObjectID,
assignBlobDataContributor, NewStorageAccountsClient, ListKeys, and RegenerateKey
usage (and related prints and error handling around storageAccount and
c.resourceGroupName); either delete these commented lines to improve readability
or extract them into a tracked issue/PR with a short TODO comment referencing
that issue if you want to preserve the intent for future work.
src/pkg/clouds/azure/aca/common.go (1)

71-74: Consider adding timeouts to HTTP requests.

Using http.DefaultClient for REST API calls has no timeout configured, which could cause the caller to hang indefinitely if the Azure management endpoint is slow or unresponsive. Consider using a client with a timeout or leveraging the context's deadline.

♻️ Suggested improvement
+var httpClient = &http.Client{Timeout: 30 * time.Second}
+
 func (c *ContainerApp) getAuthToken(ctx context.Context, appName string) (string, error) {
 	// ...
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)

Alternatively, create a client per-request that respects the context deadline.

Also applies to: 109-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/common.go` around lines 71 - 74, The code uses
http.DefaultClient.Do(req) (see the resp, err := http.DefaultClient.Do(req)
call) which has no timeout; change to use a context-aware HTTP client or honor
the request context: ensure the request is created/associated with a context
(req = req.WithContext(ctx) or accept a context param) and replace
http.DefaultClient with an http.Client configured with a timeout or with
Transport that respects context deadlines; update the other occurrences noted
(around the second call at lines 109-112) to the same pattern so all Azure
management calls won't hang indefinitely.
src/pkg/cli/client/byoc/azure/login.go (1)

5-7: Authentication stub needs implementation.

This method silently succeeds without performing any authentication. The ctx and interactive parameters are unused. Before production use, this should integrate with Azure's authentication SDK (e.g., azidentity.NewDefaultAzureCredential or interactive browser flow).

Do you want me to open an issue to track the Azure authentication implementation?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/login.go` around lines 5 - 7, The Authenticate
method on ByocAzure currently returns nil and ignores ctx and interactive;
implement real Azure authentication by using Azure Identity (e.g., call
azidentity.NewDefaultAzureCredential or the interactive browser credential when
interactive==true) and store the resulting credentials on the ByocAzure instance
so subsequent calls can use them; ensure you propagate ctx into credential
creation and return any errors from credential initialization (update
ByocAzure.Authenticate to perform the credential creation and assignment, and
handle both default and interactive flows).
src/pkg/clouds/azure/aci/tail_test.go (1)

11-12: Test is permanently skipped.

Consider adding a comment explaining when/how this test should be run manually, or use t.Skip("reason") with an explanation instead of t.SkipNow().

📝 Suggested improvement
 func TestTail(t *testing.T) {
-	t.SkipNow() // too slow for CI
+	t.Skip("Integration test: run manually with 'go test -tags=integration' against a real Azure subscription")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/tail_test.go` around lines 11 - 12, The TestTail
unit test is unconditionally skipped via t.SkipNow(), remove the unconditional
skip and replace it with t.Skip("reason") inside TestTail so the skip message
documents why it's omitted from CI and how/when to run it manually (e.g., "too
slow for CI; run with -run TestTail-local or set env
TEST_AZURE_INTEGRATION=true"). Also add a short comment above TestTail
explaining manual invocation steps and any required environment/setup so
reviewers and devs know how to execute the test outside CI.
src/pkg/clouds/azure/aci/run_test.go (1)

22-27: Re-enable teardown in cleanup to avoid cloud resource leaks.

Lines 23-26 are commented out, so once this test is re-enabled it will leave cloud resources behind.

Proposed fix
 	t.Cleanup(func() {
-		// err := containerInstance.TearDown(ctx)
-		// if err != nil {
-		// 	t.Fatalf("Failed to tear down container instance: %v", err)
-		// }
+		if err := containerInstance.TearDown(ctx); err != nil {
+			t.Logf("cleanup: failed to tear down container instance: %v", err)
+		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/run_test.go` around lines 22 - 27, Re-enable the
teardown inside the test cleanup so the created Azure container is always
removed: inside the existing t.Cleanup(func() { ... }) closure, call
containerInstance.TearDown(ctx) and check its returned error, invoking
t.Fatalf("Failed to tear down container instance: %v", err) (or t.Errorf if you
prefer non-fatal) when non-nil; ensure ctx and containerInstance used in the
test are in scope for the cleanup closure so resources are not leaked.
src/pkg/clouds/azure/acr/acr_test.go (1)

66-75: Avoid silent cleanup failures in integration tests.

Lines 68-73 ignore cleanup errors, making leaked ACR registries hard to detect.

Proposed fix
 	t.Cleanup(func() {
 		ctx := context.Background()
 		client, err := acr.newRegistriesClient()
 		if err != nil {
+			t.Logf("cleanup: failed to create client: %v", err)
 			return
 		}
-		poller, _ := client.BeginDelete(ctx, acr.resourceGroupName, registryName, nil)
+		poller, err := client.BeginDelete(ctx, acr.resourceGroupName, registryName, nil)
+		if err != nil {
+			t.Logf("cleanup: failed to delete registry: %v", err)
+			return
+		}
 		if poller != nil {
 			_, _ = poller.PollUntilDone(ctx, nil)
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr_test.go` around lines 66 - 75, The test's
t.Cleanup closure silently ignores errors from acr.newRegistriesClient(),
client.BeginDelete and poller.PollUntilDone which can leak ACR registries;
update the cleanup to capture and surface errors (use t.Logf or
t.Errorf/t.Fatalf as appropriate): call acr.newRegistriesClient() and if it
returns an error log/fail the test, check the error from client.BeginDelete and
log/fail on non-nil error, and check the error returned by poller.PollUntilDone
and log/fail there too, referencing the acr.newRegistriesClient,
client.BeginDelete and poller.PollUntilDone calls so failures are not dropped.
src/pkg/clouds/azure/aca/tail.go (1)

128-131: Consider adding a timeout or using a client with transport-level timeouts.

Using http.DefaultClient for a streaming endpoint without timeouts could cause the goroutine to hang indefinitely if the server stops responding without closing the connection. For a streaming/follow scenario, consider at least a response header timeout.

♻️ Example: use a client with response header timeout
+var streamClient = &http.Client{
+	Transport: &http.Transport{
+		ResponseHeaderTimeout: 30 * time.Second,
+	},
+}
+
 func (c *ContainerApp) StreamLogs(...) (<-chan LogEntry, error) {
     // ...
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := streamClient.Do(req)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/tail.go` around lines 128 - 131, The code uses
http.DefaultClient.Do(req) for a long-running streaming request (resp, req)
which can hang; replace it with a dedicated http.Client that sets timeouts
(e.g., an overall Timeout or a Transport with
ResponseHeaderTimeout/IdleConnTimeout) and call client.Do(req) instead of
http.DefaultClient.Do(req). Locate the request site (the resp, err :=
http.DefaultClient.Do(req) call in the tail/streaming function) and create or
inject a client (e.g., a package-level or function-scoped client configured with
timeouts) so the streaming goroutine still closes resp.Body but the connection
cannot hang indefinitely.
src/pkg/cli/client/byoc/azure/byoc.go (1)

229-230: Security: Sensitive values exposed as plain environment variables.

The TODO comments acknowledge this, but pulumiBackendValue and PULUMI_CONFIG_PASSPHRASE contain sensitive credentials. Consider tracking this as a follow-up to use Azure Key Vault or secure environment variables.

Would you like me to open an issue to track converting these to secrets?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/byoc.go` around lines 229 - 230, The code is
exposing sensitive values pulumiBackendValue and byoc.PulumiConfigPassphrase via
plain environment variables (pulumiBackendKey and "PULUMI_CONFIG_PASSPHRASE");
update the flow to fetch these secrets from a secure store (e.g., Azure Key
Vault or your secret manager) instead of hardcoding/setting them directly, mark
them as secret when passing into Pulumi (so they are redacted), avoid logging
their values, and add a clear TODO or create a tracked issue to migrate
pulumiBackendKey/pulumiBackendValue and byoc.PulumiConfigPassphrase to the
secret store with CI/runner fallback for development credentials.
src/pkg/clouds/azure/acr/acr.go (1)

133-140: YAML construction via string concatenation is fragile.

Building YAML by concatenating strings without proper escaping could produce malformed YAML if req.Image or req.Command contains special characters (colons, quotes, newlines, etc.). Consider using a YAML library or at minimum quoting the values.

♻️ Safer approach using proper quoting
 	var yaml strings.Builder
 	yaml.WriteString("steps:\n")
-	yaml.WriteString("  - cmd: " + req.Image)
+	// Quote the image to handle special characters
+	yaml.WriteString("  - cmd: \"" + strings.ReplaceAll(req.Image, "\"", "\\\"") + "\"")
 	if len(req.Command) > 0 {
-		// Arguments after the image are passed as the command
-		yaml.WriteString(" " + strings.Join(req.Command, " "))
+		// Properly quote each argument
+		for _, arg := range req.Command {
+			yaml.WriteString(" \"" + strings.ReplaceAll(arg, "\"", "\\\"") + "\"")
+		}
 	}
 	yaml.WriteString("\n")

Alternatively, use gopkg.in/yaml.v3 (or go.yaml.in/yaml/v3) to marshal a struct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr.go` around lines 133 - 140, The current YAML is
built by concatenating into the strings.Builder (see variables yaml, and the use
of req.Image and req.Command) which risks producing invalid YAML when values
contain special characters; replace this with structured marshaling: create a
small Go struct representing the pipeline (e.g., type step struct { Cmd string
`yaml:"cmd"` } and wrapper { Steps []step `yaml:"steps"` }), populate Cmd by
joining req.Image and req.Command, then use gopkg.in/yaml.v3 (yaml.Marshal) to
produce the YAML bytes and write those instead of manual string concatenation;
ensure you remove the existing yaml.WriteString concatenation logic and handle
any marshal errors returned by yaml.Marshal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/go.mod`:
- Around line 20-31: Replace the beta Azure SDK module reference by updating the
module entry
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v3
v3.0.0-beta.2" to the stable release "v2.2.0"; locate the dependency line for
armauthorization (the module name contains armauthorization/v3) in the go.mod
and change the version string to v2.2.0, then run go mod tidy to ensure the
lockfile and module graph are updated.

In `@src/pkg/clouds/azure/aci/blob.go`:
- Around line 81-91: The DownloadBlob method on ContainerInstance uses
io.ReadAll on resp.Body which can OOM for large blobs; change it to enforce a
max read bound by wrapping resp.Body with io.LimitReader (or use io.CopyN) using
a defined constant like maxBlobDownloadBytes and return an explicit error if the
blob exceeds that limit (or change the signature to stream by returning
io.ReadCloser). Locate ContainerInstance.DownloadBlob and the call to
client.NewBlobClient(blobName).DownloadStream(ctx, nil) / resp.Body, replace
io.ReadAll(resp.Body) with a bounded read approach (or streaming return) and
ensure the function documents/handles the max size error path.
- Around line 30-35: When handling the result of accountsClient.ListKeys in the
blob storage logic, guard against empty or malformed responses before
dereferencing keys.Keys[0].Value: check that keys is non-nil, keys.Keys has
length > 0, and keys.Keys[0].Value is non-nil; if any check fails return a clear
error instead of dereferencing. Update the code around the ListKeys call (the
variables keys, err and storageKey assignment) to validate the response and
propagate a meaningful error when keys are missing or malformed.

In `@src/pkg/clouds/azure/aci/common.go`:
- Around line 22-41: NewContainerInstance currently silently allows an empty
Azure location and produces an invalid resource group name; change
NewContainerInstance to return (*ContainerInstance, error) and validate the
effective location (argument or AZURE_LOCATION) immediately—if empty return a
descriptive error (e.g., "azure location not set") instead of constructing the
object; if valid, construct the ContainerInstance, set SubscriptionID and
StorageAccount as before, call c.SetLocation(loc) to compute resourceGroupName,
and return the instance with nil error; update any callers of
NewContainerInstance to handle the error.

In `@src/pkg/clouds/azure/aci/run_test.go`:
- Around line 10-12: The test TestRun currently calls t.SkipNow()
unconditionally which prevents it from ever exercising the ACI run path; change
the unconditional skip to a conditional skip (e.g., check testing.Short() or an
integration-specific env var/flag) so the test runs in integration contexts but
is skipped during fast/unit runs/CI by default, and ensure the skip call
references the same TestRun function/t variable and includes a descriptive
message.

In `@src/pkg/clouds/azure/aci/run.go`:
- Around line 49-57: The code currently builds clone.Containers based only on
the supplied containers slice, causing an empty container list when containers
is nil/empty; modify the Run logic to fall back to the default container
definitions held on the receiver (c.ContainerGroupProps.Containers or a
predefined default) when the incoming containers slice is nil or has length 0,
then iterate over that effectiveContainers slice to populate clone.Containers;
ensure you still apply safeAppend(newProps.Command, commandArgs...) for the
first container and safeAppend(newProps.EnvironmentVariables, envVars...) for
each container so the behavior of newProps.Command and
newProps.EnvironmentVariables and creation of armcontainerinstance.Container
entries remains correct.
- Around line 70-76: The BeginCreateOrUpdate and BeginStart calls on
containerGroupClient (BeginCreateOrUpdate and BeginStart) are starting
long-running operations but not waiting for them to finish; change the code to
capture the poller returned by BeginCreateOrUpdate and call PollUntilDone (or
the SDK's equivalent) to wait for creation to complete before calling
BeginStart, and likewise capture the poller from BeginStart and PollUntilDone to
wait for the start operation to finish (follow the pattern used in setup.go for
PollUntilDone).

In `@src/pkg/clouds/azure/aci/setup.go`:
- Around line 208-216: The current code assigns Contributor and User Access
Administrator at subscription scope (subscriptionScope built from
c.SubscriptionID) via assignRole using contributorRoleID and
userAccessAdministratorRoleID; update this by documenting the security
implications in user-facing docs and in-code comments, and add an option/flag
(e.g., in the configuration struct where c.SubscriptionID is defined) to allow
resource-group-scoped role assignment instead of subscription-scoped; implement
the alternate flow by constructing a resource-group scope
("/subscriptions/"+c.SubscriptionID+"/resourceGroups/"+c.ResourceGroupName) and
calling assignRole with that scope when the flag indicates least-privilege is
desired, and keep the existing subscriptionScope path as the opt-in behavior
while updating comments near assignment calls and function assignRole usage to
reflect this change.

In `@src/pkg/clouds/azure/aci/stop.go`:
- Around line 7-14: The Stop method on ContainerInstance dereferences groupName
unconditionally which can panic; add a nil guard at the start of
ContainerInstance.Stop to return a clear error if groupName is nil, then only
dereference *groupName after that check before calling
containerGroupClient.Stop; update any error message to indicate "nil groupName"
and keep the rest of the logic unchanged.

In `@src/pkg/clouds/azure/aci/tail.go`:
- Around line 131-142: The retry loop around client.Attach in tail.go currently
uses time.Sleep which ignores context cancellation; change the retry sleep to a
cancellable wait using select with time.After and ctx.Done() so that when
errors.As detects ContainerNotFound you do: select { case
<-time.After(time.Second): continue; case <-ctx.Done(): return nil, ctx.Err() },
keeping the rest of error handling the same; update the loop around
attachResponse, err, client.Attach, resourceGroupName, groupName, and
containerName accordingly.
- Around line 50-57: The retry loop in tail.go that calls client.ListLogs
currently uses time.Sleep and ignores context cancellation; modify the loop (the
for { ... if errors.As(err, &respErr) && respErr.ErrorCode ==
"ContainerGroupDeploymentNotReady" { ... } } block around client.ListLogs) to
wait in a context-aware way (e.g., replace time.Sleep(time.Second) with a select
on ctx.Done() and time.After(time.Second)) so that if ctx is cancelled you
break/return promptly instead of continuing to retry; ensure you return the
original error or ctx.Err() as appropriate when ctx.Done() is selected.
- Around line 175-177: The code indexes logLine[0] without checking length,
which panics on empty websocket messages; before extracting stdioFd and slicing
logLine[1:], guard with a length check (e.g., if len(logLine) == 0 { continue }
or treat as stdout) so you only access logLine[0] when present; update the logic
around variables logLine, stdioFd, and the channel send ch <- LogEntry{Message:
string(logLine[1:]), Stderr: stdioFd == 2} to safely handle empty messages.
- Around line 146-150: The defer resp.Body.Close() is called before checking the
error from websocket.DefaultDialer.DialContext, which can panic if resp is nil;
move the defer so it runs only after a successful DialContext and resp is
non-nil, and when returning the error from DialContext, guard the error message
to avoid dereferencing resp (e.g., include resp.Status only when resp != nil or
include a fallback like "no response"). Locate the call to
websocket.DefaultDialer.DialContext and the surrounding error handling near
attachResponse.WebSocketURI and adjust ordering and nil checks accordingly.

In `@src/pkg/clouds/azure/aci/upload_test.go`:
- Line 29: Test code prints a full pre-signed SAS upload URL (t.Logf("Upload
URL: %s", url)) which leaks credentials; instead redact the query string before
logging. Update the test to parse the URL value (use url.Parse or equivalent),
remove or replace RawQuery (e.g., set to empty or "<redacted>") and log only the
scheme://host/path or a redacted string; locate the t.Logf call in
upload_test.go and replace it with code that sanitizes the url variable before
logging.

In `@src/pkg/clouds/azure/aci/upload.go`:
- Around line 51-56: The code dereferences keys.Keys[0].Value without checking
the result from accountsClient.ListKeys (in the upload flow that sets
storageKey), which can panic if keys.Keys is empty or the Value is nil; update
the block that calls accountsClient.ListKeys(ctx, c.resourceGroupName,
c.StorageAccount, nil) to validate that keys != nil, len(keys.Keys) > 0 and
keys.Keys[0].Value != nil before assigning storageKey, and return a clear error
if the checks fail so storageKey is only set from a safe, non-nil value.

In `@src/pkg/clouds/azure/acr/tail.go`:
- Around line 107-113: The code in tail.go reads body into text and indexes
text[len(text)-1] which panics on empty chunks; update the logic around
variables text, lines and complete to guard for empty text (e.g., if len(text)
== 0) before accessing the last byte, and treat an empty text as producing no
complete lines (skip slicing or set complete to nil/empty) so the subsequent
processing of complete and lines is safe; locate the block that declares text :=
string(body), lines := strings.Split(...), and the conditional using
text[len(text)-1] to add the guard.
- Around line 92-94: The current check that returns (offset, nil) when
resp.StatusCode is neither 200 nor 206 silently swallows server errors; update
the code in the tail logic that inspects resp.StatusCode (the block that
currently returns offset, nil for non-StatusOK/StatusPartialContent) to return a
descriptive error instead — include the resp.StatusCode and any short response
body or status text in the error (and/or log it via the existing logger) so
callers can detect and surface failed log endpoint responses rather than
treating them as a successful no-op.

---

Outside diff comments:
In `@src/cmd/cli/command/estimate.go`:
- Around line 70-75: The providerDescription map is missing entries for Azure
(and ensure DigitalOcean is present) so the survey shows empty descriptions;
update the providerDescription variable to include a descriptive string for
client.ProviderAzure (e.g., mention Azure CLI/ARM/ADC or environment variables)
and verify/adjust the client.ProviderDO entry text for DigitalOcean so both
client.ProviderAzure and client.ProviderDO have clear descriptions used by the
survey prompt.

---

Minor comments:
In `@src/pkg/clouds/azure/aca/common.go`:
- Around line 152-175: The code dereferences Azure SDK pointer fields without
nil checks (e.g., app.Properties.Template.Containers[0].Name,
list.Value[0].Name, list.Value[0].Properties.Containers[0].Name) which can
panic; update the logic in the block that calls c.newReplicasClient() and
processes replicas (and the earlier template-based container pick) to guard
every pointer access: verify app.Properties, app.Properties.Template, the
container slice length, and that Containers[0] and its Name are non-nil before
dereferencing; after ListReplicas also check list.Value is non-nil, that
list.Value[0] is non-nil, that list.Value[0].Name is non-nil, that
list.Value[0].Properties is non-nil, and that its Containers[0] and Name are
non-nil, and return a clear error if any expected field is nil rather than
dereferencing.

In `@src/pkg/clouds/azure/aca/tail.go`:
- Around line 54-74: poll currently swallows errors from
c.newContainerAppsClient() and pager.NextPage(), which hides auth/config issues;
update the poll function to log those errors at debug level before returning
(use the logger available on the receiver struct c or add/pass one if missing),
e.g., log the error and contextual info (resource group, operation) when
c.newContainerAppsClient() fails and when pager.NextPage(ctx) returns an error,
then preserve the original control flow (return after logging) so failures are
visible for debugging; update references in the poll function around
c.newContainerAppsClient, client.NewListByResourceGroupPager and pager.NextPage
accordingly.

In `@src/pkg/clouds/azure/aci/setup_test.go`:
- Around line 17-22: Replace the non-fatal assertion in the test's
"SetUpResourceGroup" subtest so the test fails immediately if resource group
creation fails: change the call that currently uses t.Errorf when
c.SetUpResourceGroup(ctx) returns an error to t.Fatalf (or otherwise abort the
test) to prevent subsequent TearDown from running against an uncreated resource;
update the subtest around SetUpResourceGroup and references to
c.SetUpResourceGroup/context.Background() accordingly.

In `@src/pkg/clouds/azure/aci/tail_test.go`:
- Around line 23-28: Uncomment and run the resource group cleanup by calling
containerInstance.TearDown(ctx) inside the existing t.Cleanup closure (or remove
the empty t.Cleanup if you intentionally want no cleanup); if you restore the
call, handle errors by logging rather than fatalling in cleanup (use t.Logf to
report any TearDown error) and ensure ctx and containerInstance are in scope for
the cleanup closure so the test won't leave Azure resource groups lingering.

In `@src/pkg/clouds/azure/acr/acr.go`:
- Around line 207-214: The code dereferences props.Status when constructing a
RunStatus, which can panic if props.Status is nil; update the construction (in
the block that returns *RunStatus) to check props.Status for nil before using
*props.Status (e.g., set statusStr := "unknown" or empty string and overwrite it
only if props.Status != nil), assign that statusStr to RunStatus.Status, and
keep the existing nil check for props.RunErrorMessage; references: RunStatus,
props.Status, props.RunErrorMessage.
- Around line 90-105: Do not assign a.RegistryName until after successful
creation: move the a.RegistryName = registryName assignment to after
client.BeginCreate and poller.PollUntilDone complete without error so the struct
isn't left in an inconsistent state; specifically, keep the calls to
client.BeginCreate(...) and poller.PollUntilDone(ctx, nil) as-is, and only set
a.RegistryName = registryName after the second call returns nil error (use the
existing variables poller and err to gate the assignment).

In `@src/pkg/http/post.go`:
- Around line 48-50: PostWithContext currently always adds a "Content-Type"
header (even when contentType == ""), which forces an empty header value and
changes previous behavior; update PostWithContext so it only passes a header to
PostWithHeader when contentType is non-empty—i.e., if contentType == "" call
PostWithHeader with a nil/empty header (or omit the header argument) and
otherwise call PostWithHeader with http.Header{"Content-Type":
[]string{contentType}} to preserve prior behavior; refer to PostWithContext and
PostWithHeader to locate and implement this conditional header construction.

---

Nitpick comments:
In `@src/pkg/cli/compose/context.go`:
- Around line 262-264: Shared compose upload code currently sets the
Azure-specific header "X-Ms-Blob-Type" on the common header variable before
calling http.PutWithHeader; remove that Azure-specific header from the shared
function (leave Content-Type intact) and instead add logic in the Azure provider
to inject "X-Ms-Blob-Type: BlockBlob" for uploads to Azure blobs. Locate the
header construction and http.PutWithHeader call (the header variable and the
call to http.PutWithHeader) to remove header.Set("X-Ms-Blob-Type", "BlockBlob")
here, and update the Azure provider upload implementation to set that header
when targeting BlockBlob storage. Ensure other providers are unaffected and
tests cover that Azure uploads include the header while other backends do not.

In `@src/pkg/cli/tailAndMonitor.go`:
- Around line 65-66: The hardcoded 6s sleep before cancelling the tail
(pkg.SleepWithContext(ctx, 6*time.Second) followed by
cancelTail(errMonitoringDone)) should be documented and made configurable; add a
short comment near that call explaining why 6s was chosen (e.g., Azure log
propagation) and replace the literal with a configurable value or
provider-specific constant (e.g., tailCancelDelay or providerTailDelays map used
by pkg.SleepWithContext), wiring a flag/env/config option so Azure can use 6s
while AWS/GCP use a shorter delay, and keep cancelTail(errMonitoringDone)
behavior unchanged.

In `@src/pkg/clouds/azure/common.go`:
- Around line 53-79: Create a small helper on the Azure type to encapsulate the
duplicated NewCreds() + armstorage.NewClientFactory(...) logic (e.g., a method
named newStorageClientFactory or storageClientFactory that returns
(*armstorage.ClientFactory, error)); have Azure.NewStorageAccountsClient and
Azure.NewBlobContainersClient call that helper and return
clientFactory.NewAccountsClient() and clientFactory.NewBlobContainersClient()
respectively, propagating errors unchanged (use the same error wrapping used
today when creating the factory).
- Around line 81-93: Remove the stale commented-out constructor stub for
NewRoleAssignmentsClient in the Azure type: delete the entire commented block
that references func (a Azure) NewRoleAssignmentsClient(),
armauthorization.RoleAssignmentsClient, and related NewRoleAssignmentsClient
creation logic so production code contains no dead/commented code; if this logic
needs tracking, add a separate ticket/PR note instead of leaving commented code
in src/pkg/clouds/azure/common.go.

In `@src/pkg/http/put.go`:
- Around line 24-30: PutWithHeader currently overwrites the request's Header map
instead of merging with any existing/default headers; change the logic so after
creating the request with http.NewRequestWithContext you iterate over the
provided header (if non-nil) and copy each key/value into req.Header using
Add/Set rather than replacing the map, preserving any defaults; keep the current
return path via DefaultClient.Do(req) and ensure nil-safety when header is nil.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9af409ca-3a77-4d21-9294-b56a7a0f273e

📥 Commits

Reviewing files that changed from the base of the PR and between 97c4704 and 4474283.

⛔ Files ignored due to path filters (2)
  • src/go.sum is excluded by !**/*.sum
  • src/protos/io/defang/v1/fabric.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (44)
  • flake.nix
  • pkgs/defang/cli.nix
  • src/cmd/cli/command/estimate.go
  • src/go.mod
  • src/pkg/agent/tools/estimate_test.go
  • src/pkg/cli/client/byoc/aws/byoc.go
  • src/pkg/cli/client/byoc/azure/byoc.go
  • src/pkg/cli/client/byoc/azure/login.go
  • src/pkg/cli/client/byoc/gcp/byoc.go
  • src/pkg/cli/client/provider_id.go
  • src/pkg/cli/client/region.go
  • src/pkg/cli/compose/context.go
  • src/pkg/cli/connect.go
  • src/pkg/cli/tailAndMonitor.go
  • src/pkg/clouds/azure/aca/common.go
  • src/pkg/clouds/azure/aca/tail.go
  • src/pkg/clouds/azure/aci/blob.go
  • src/pkg/clouds/azure/aci/common.go
  • src/pkg/clouds/azure/aci/common_test.go
  • src/pkg/clouds/azure/aci/run.go
  • src/pkg/clouds/azure/aci/run_test.go
  • src/pkg/clouds/azure/aci/setup.go
  • src/pkg/clouds/azure/aci/setup_test.go
  • src/pkg/clouds/azure/aci/status.go
  • src/pkg/clouds/azure/aci/stop.go
  • src/pkg/clouds/azure/aci/tail.go
  • src/pkg/clouds/azure/aci/tail_test.go
  • src/pkg/clouds/azure/aci/upload.go
  • src/pkg/clouds/azure/aci/upload_test.go
  • src/pkg/clouds/azure/acr/acr.go
  • src/pkg/clouds/azure/acr/acr_test.go
  • src/pkg/clouds/azure/acr/tail.go
  • src/pkg/clouds/azure/acr/types.go
  • src/pkg/clouds/azure/acr/types_test.go
  • src/pkg/clouds/azure/appcfg/appcfg.go
  • src/pkg/clouds/azure/common.go
  • src/pkg/clouds/azure/location.go
  • src/pkg/http/post.go
  • src/pkg/http/put.go
  • src/pkg/session/session.go
  • src/pkg/stacks/selector_test.go
  • src/pkg/stacks/wizard.go
  • src/pkg/utils.go
  • src/protos/io/defang/v1/fabric.proto

Comment on lines +20 to +31
github.com/Azure/azure-sdk-for-go/sdk/azcore v1.20.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1
github.com/Azure/azure-sdk-for-go/sdk/data/azappconfig v1.2.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/appconfiguration/armappconfiguration v1.1.1
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/appcontainers/armappcontainers v1.1.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v3 v3.0.0-beta.2
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerinstance/armcontainerinstance/v2 v2.4.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerregistry/armcontainerregistry/v2 v2.0.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/msi/armmsi v1.3.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources/v2 v2.1.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/storage/armstorage/v2 v2.0.0
github.com/Azure/azure-sdk-for-go/sdk/storage/azblob v1.6.4

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the latest stable version of github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization?

💡 Result:

The latest stable version is v2.2.0 for the Go module github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v2 (published Nov 27, 2023). [1]

Note: there is a newer v3 line, but it’s currently tagged as beta (e.g. v3.0.0-beta.2), so it’s not stable. [2]

You’d typically install the latest stable with:

go get github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v2@v2.2.0

Sources: [1], [2]


Replace armauthorization/v3 v3.0.0-beta.2 with the stable version v2.2.0.

The Azure SDK dependencies are appropriate for BYOC implementation, but armauthorization/v3 v3.0.0-beta.2 (line 25) is a beta version that may introduce breaking changes in future updates. A stable alternative exists at v2.2.0. Update the dependency to use the stable release instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/go.mod` around lines 20 - 31, Replace the beta Azure SDK module reference
by updating the module entry
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/authorization/armauthorization/v3
v3.0.0-beta.2" to the stable release "v2.2.0"; locate the dependency line for
armauthorization (the module name contains armauthorization/v3) in the go.mod
and change the version string to v2.2.0, then run go mod tidy to ensure the
lockfile and module graph are updated.

Comment on lines +22 to +41
func NewContainerInstance(resourceGroupPrefix string, location azure.Location) *ContainerInstance {
if location == "" {
location = azure.Location(os.Getenv("AZURE_LOCATION"))
}
ci := &ContainerInstance{
Azure: azure.Azure{
SubscriptionID: os.Getenv("AZURE_SUBSCRIPTION_ID"),
},
resourceGroupPrefix: resourceGroupPrefix,
StorageAccount: os.Getenv("DEFANG_CD_BUCKET"),
}
ci.SetLocation(location)
return ci
}

// SetLocation updates the location and recomputes the resource group name.
func (c *ContainerInstance) SetLocation(loc azure.Location) {
c.Location = loc
c.resourceGroupName = c.resourceGroupPrefix + loc.String()
}

⚠️ Potential issue | 🟠 Major

Fail fast when Azure location is unset.

If both the argument and AZURE_LOCATION are empty, the instance is created with an invalid effective location and a degraded resource-group name. Validate at construction time and return an explicit error.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/common.go` around lines 22 - 41,
NewContainerInstance currently silently allows an empty Azure location and
produces an invalid resource group name; change NewContainerInstance to return
(*ContainerInstance, error) and validate the effective location (argument or
AZURE_LOCATION) immediately—if empty return a descriptive error (e.g., "azure
location not set") instead of constructing the object; if valid, construct the
ContainerInstance, set SubscriptionID and StorageAccount as before, call
c.SetLocation(loc) to compute resourceGroupName, and return the instance with
nil error; update any callers of NewContainerInstance to handle the error.

@lionello lionello closed this Apr 8, 2026
@lionello lionello reopened this Apr 8, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/pkg/session/session.go (1)

135-138: Improve Azure warning to catch partial configuration.

Line 136 only checks whether any Azure env var is set. If one key exists but others are missing, this warning won’t trigger, and failures get deferred. Consider warning on incomplete Azure env setup too.

Suggested patch
 import (
 	"context"
 	"fmt"
+	"os"
 
 	"github.com/DefangLabs/defang/src/pkg"
@@
 	case client.ProviderAzure:
-		if env := pkg.AzureInEnv(); env == "" {
+		required := []string{
+			"AZURE_SUBSCRIPTION_ID",
+			"AZURE_TENANT_ID",
+			"AZURE_CLIENT_ID",
+			"AZURE_CLIENT_SECRET",
+		}
+		missing := make([]string, 0, len(required))
+		for _, key := range required {
+			if os.Getenv(key) == "" {
+				missing = append(missing, key)
+			}
+		}
+		if len(missing) == len(required) {
 			term.Warn("Azure provider was selected, but no Azure environment variables are set")
+		} else if len(missing) > 0 {
+			term.Warnf("Azure provider was selected, but Azure configuration is incomplete; missing: %v", missing)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/session/session.go` around lines 135 - 138, The current Azure branch
only checks pkg.AzureInEnv() == "" which misses partial configurations; update
the client.ProviderAzure case to validate that all required Azure env vars
(e.g., AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET, and any other keys
your code expects) are present and call term.Warn if any are missing or only a
subset is set. You can either change pkg.AzureInEnv() to return the list/status
of missing keys (or a boolean plus missing-key info) or add a new helper like
pkg.AzureMissingEnvKeys() and use it here; ensure the warning via term.Warn
mentions which keys are missing so partial configurations are caught early.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/pkg/session/session.go`:
- Around line 135-138: The current Azure branch only checks pkg.AzureInEnv() ==
"" which misses partial configurations; update the client.ProviderAzure case to
validate that all required Azure env vars (e.g., AZURE_CLIENT_ID,
AZURE_TENANT_ID, AZURE_CLIENT_SECRET, and any other keys your code expects) are
present and call term.Warn if any are missing or only a subset is set. You can
either change pkg.AzureInEnv() to return the list/status of missing keys (or a
boolean plus missing-key info) or add a new helper like
pkg.AzureMissingEnvKeys() and use it here; ensure the warning via term.Warn
mentions which keys are missing so partial configurations are caught early.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ddb2d57c-037b-48a8-b3ef-fcac29efdcfc

📥 Commits

Reviewing files that changed from the base of the PR and between 4474283 and 0542952.

⛔ Files ignored due to path filters (2)
  • src/go.sum is excluded by !**/*.sum
  • src/protos/io/defang/v1/fabric.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (6)
  • src/cmd/cli/command/estimate.go
  • src/go.mod
  • src/pkg/cli/client/byoc/aws/byoc.go
  • src/pkg/cli/client/byoc/gcp/byoc.go
  • src/pkg/session/session.go
  • src/protos/io/defang/v1/fabric.proto
✅ Files skipped from review due to trivial changes (2)
  • src/pkg/cli/client/byoc/gcp/byoc.go
  • src/pkg/cli/client/byoc/aws/byoc.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/cmd/cli/command/estimate.go
  • src/go.mod
  • src/protos/io/defang/v1/fabric.proto


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (6)
src/pkg/clouds/azure/acr/tail.go (2)

92-94: ⚠️ Potential issue | 🟠 Major

Surface unexpected log endpoint statuses instead of silently succeeding.

Line 92 currently treats non-200/206 responses as a no-op, which can hide real failures (e.g., expired SAS URL returning 403) and drop logs silently.

Proposed fix
 	if resp.StatusCode == http.StatusRequestedRangeNotSatisfiable {
 		return offset, nil
 	}
 	if resp.StatusCode != http.StatusPartialContent && resp.StatusCode != http.StatusOK {
-		return offset, nil
+		bodyPreview, _ := io.ReadAll(io.LimitReader(resp.Body, 1024))
+		err := fmt.Errorf("unexpected log response status: %s: %s", resp.Status, strings.TrimSpace(string(bodyPreview)))
+		if !yield("", err) {
+			return offset, errStopped
+		}
+		return offset, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/tail.go` around lines 92 - 94, The current check in
the function handling the blob log fetch (look for the resp.StatusCode check
around resp, offset in src/pkg/clouds/azure/acr/tail.go) silently returns offset
when the response is not 200 or 206; change this so unexpected statuses are
surfaced: when resp.StatusCode is not http.StatusPartialContent and not
http.StatusOK, read any response body for context and either log the status+body
via the existing logger or return a descriptive error (including the status code
and body) instead of returning offset(nil) so callers can detect and handle
failures (e.g., expired SAS/403).

107-113: ⚠️ Potential issue | 🔴 Critical

Guard empty chunks before indexing the last byte.

Line 112 can panic when text is empty (index out of range).

Proposed fix
 	text := string(body)
+	if len(text) == 0 {
+		return offset, nil
+	}
 	lines := strings.Split(text, "\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/tail.go` around lines 107 - 113, Guard against empty
body before indexing the last byte: change the logic around variables text,
lines and complete so you only access text[len(text)-1] when len(text) > 0
(e.g., if len(text) == 0 then set complete to an empty slice or lines and skip
the byte check). Update the condition that determines complete (currently using
text[len(text)-1] != '\n') to first check len(text) > 0, keeping variable names
text, lines and complete to locate and modify the code in tail.go.
src/pkg/clouds/azure/aci/blob.go (2)

23-36: ⚠️ Potential issue | 🔴 Critical

Pick a usable storage key instead of blindly dereferencing the first one.

Line 34 can panic when ListKeys returns no keys or the first entry has a nil/empty value. Iterate the returned keys and fail with a clear error when none are usable.

🛡️ Suggested change
 	keys, err := accountsClient.ListKeys(ctx, c.resourceGroupName, c.StorageAccount, nil)
 	if err != nil {
 		return nil, err
 	}
-	storageKey = *keys.Keys[0].Value
+	for _, key := range keys.Keys {
+		if key.Value != nil && *key.Value != "" {
+			storageKey = *key.Value
+			break
+		}
+	}
+	if storageKey == "" {
+		return nil, fmt.Errorf("no usable storage account keys returned for %q", c.StorageAccount)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/blob.go` around lines 23 - 36, The
newSharedKeyCredential method currently assumes ListKeys returns at least one
key and dereferences keys.Keys[0].Value which can panic; update
ContainerInstance.newSharedKeyCredential to (1) check that keys and keys.Keys
are non-nil and non-empty, (2) iterate over keys.Keys and select the first entry
whose Value is non-nil and non-empty, and (3) return a clear error from
newSharedKeyCredential if no usable key is found instead of dereferencing a
nil/empty value; reference ListKeys and keys.Keys when making these checks and
the azblob.NewSharedKeyCredential call.

81-91: ⚠️ Potential issue | 🟠 Major

Bound blob downloads instead of reading them fully into memory.

Line 91 pulls the entire remote blob into memory with no size limit. A large or unexpected object can OOM the CLI here; cap the read or stream the body to the caller instead.

💾 Suggested change
+const maxBlobDownloadBytes = 32 << 20
+
 // DownloadBlob fetches the contents of a blob by name.
 func (c *ContainerInstance) DownloadBlob(ctx context.Context, blobName string) ([]byte, error) {
 	client, err := c.newBlobContainerClient(ctx)
 	if err != nil {
 		return nil, err
@@
 	if err != nil {
 		return nil, err
 	}
 	defer resp.Body.Close()
-	return io.ReadAll(resp.Body)
+	data, err := io.ReadAll(io.LimitReader(resp.Body, maxBlobDownloadBytes+1))
+	if err != nil {
+		return nil, err
+	}
+	if len(data) > maxBlobDownloadBytes {
+		return nil, fmt.Errorf("blob %q exceeds %d bytes", blobName, maxBlobDownloadBytes)
+	}
+	return data, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/blob.go` around lines 81 - 91, DownloadBlob
currently reads the entire blob into memory (resp.Body) which can OOM; change it
to either stream the response or enforce a size cap. Two options: 1) Preferable:
change DownloadBlob signature to return (io.ReadCloser, error), create client
via newBlobContainerClient, call client.NewBlobClient(blobName).DownloadStream,
and return resp.Body directly so callers stream the content. 2) If keeping
[]byte, introduce a constant (e.g. maxBlobDownloadBytes), wrap resp.Body with
io.LimitReader and read via io.ReadAll, then error if the result equals the
limit (indicating truncation). Update callers of ContainerInstance.DownloadBlob
accordingly and ensure resp.Body is closed only when appropriate for streaming
vs capped-read behavior.
src/pkg/clouds/azure/aci/upload_test.go (1)

29-29: ⚠️ Potential issue | 🟠 Major

Redact the SAS URL before logging it.

Line 29 prints the full upload URL, including its SAS query string. That leaks write credentials into test logs.

🔒 Suggested change
-	t.Logf("Upload URL: %s", url)
+	t.Log("Upload URL generated")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/upload_test.go` at line 29, The test currently logs
the full upload URL (t.Logf("Upload URL: %s", url)) which exposes the SAS query
string; fix it by parsing the url value (use net/url.Parse), strip or replace
the RawQuery (e.g., set RawQuery = "" or replace with "REDACTED"), and log the
sanitized version (e.g., safeURL) instead of the original url so no SAS
credentials appear in test logs.
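The redaction described above can be sketched as a small helper (`redactQuery` is a hypothetical name for illustration, not a function in the PR):

```go
package main

import (
	"fmt"
	"net/url"
)

// redactQuery strips the query string (where the SAS token lives) from a URL
// before it is logged, so no write credentials land in test output.
func redactQuery(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "REDACTED"
	}
	if u.RawQuery != "" {
		u.RawQuery = "REDACTED"
	}
	return u.String()
}

func main() {
	// In the test this would be: t.Logf("Upload URL: %s", redactQuery(url))
	fmt.Println(redactQuery("https://acct.blob.core.windows.net/c/b?sv=2024&sig=secret"))
}
```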
src/pkg/clouds/azure/aci/upload.go (1)

44-56: ⚠️ Potential issue | 🔴 Critical

Select a usable storage key instead of dereferencing keys[0].

Line 55 can panic when ListKeys returns no keys or the first key has a nil/empty value. Iterate the returned keys and fail with a descriptive error when none are usable.

🛡️ Suggested change
 	keys, err := accountsClient.ListKeys(ctx, c.resourceGroupName, c.StorageAccount, nil)
 	if err != nil {
 		return "", err
 	}
-	storageKey = *keys.Keys[0].Value // or [1]?
+	for _, key := range keys.Keys {
+		if key.Value != nil && *key.Value != "" {
+			storageKey = *key.Value
+			break
+		}
+	}
+	if storageKey == "" {
+		return "", fmt.Errorf("no usable storage account keys returned for %q", c.StorageAccount)
+	}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/upload.go` around lines 44 - 56, The code currently
dereferences keys.Keys[0].Value which can panic if ListKeys returns zero keys or
a nil/empty value; update the logic in the block that obtains storageKey (around
storageKey declaration, c.NewStorageAccountsClient(), and
accountsClient.ListKeys) to iterate over keys.Keys, pick the first
non-nil/non-empty Value, assign it to storageKey, and return a descriptive error
if no usable key is found (include context like c.resourceGroupName and
c.StorageAccount in the error message).
🧹 Nitpick comments (10)
src/pkg/clouds/azure/acr/acr_test.go (1)

66-75: Log cleanup failures in TestRunTask for resource-leak diagnosability.

This cleanup path currently swallows client/delete/poll errors, making leaked integration resources harder to investigate.

Proposed fix
 	t.Cleanup(func() {
 		ctx := context.Background()
 		client, err := acr.newRegistriesClient()
 		if err != nil {
+			t.Logf("cleanup: failed to create client: %v", err)
 			return
 		}
-		poller, _ := client.BeginDelete(ctx, acr.resourceGroupName, registryName, nil)
+		poller, err := client.BeginDelete(ctx, acr.resourceGroupName, registryName, nil)
+		if err != nil {
+			t.Logf("cleanup: failed to delete registry: %v", err)
+			return
+		}
 		if poller != nil {
-			_, _ = poller.PollUntilDone(ctx, nil)
+			if _, err := poller.PollUntilDone(ctx, nil); err != nil {
+				t.Logf("cleanup: poll failed: %v", err)
+			}
 		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr_test.go` around lines 66 - 75, The cleanup
currently swallows errors; update the t.Cleanup closure to log any failures from
acr.newRegistriesClient, client.BeginDelete and poller.PollUntilDone so leaked
resources are diagnosable: call acr.newRegistriesClient() and if err != nil call
t.Logf with a clear message including the error, then when calling
client.BeginDelete(ctx, acr.resourceGroupName, registryName, nil) check its
returned error and log it, and if poller is non-nil call
poller.PollUntilDone(ctx, nil) and log any returned error as well (use
t.Logf/t.Errorf to include the error values and context).
src/pkg/http/put.go (1)

24-29: Clone the header map before attaching it to the request.

Line 29 hands the request the caller’s header object verbatim. net/http.Header is a map[string][]string and already exposes Clone(), so cloning here avoids sharing mutable header state between the caller and the in-flight request. (pkg.go.dev)

♻️ Suggested change
 func PutWithHeader(ctx context.Context, url string, header http.Header, body io.Reader) (*http.Response, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, body)
 	if err != nil {
 		return nil, err
 	}
-	req.Header = header
+	req.Header = header.Clone()
 	return DefaultClient.Do(req)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/http/put.go` around lines 24 - 29, The PutWithHeader function assigns
the caller's http.Header map directly to req.Header which can share mutable
state; update PutWithHeader to clone the provided header before attaching it
(use header.Clone()) so the request gets its own copy and the caller's header
map isn't shared with the in-flight request.
src/pkg/clouds/azure/aci/setup.go (2)

108-133: Large block of commented-out code should be removed or documented.

There's a significant block of commented-out code (lines 108-133) related to permission assignment and key regeneration. If this is work-in-progress, consider adding a TODO comment explaining the intent. Otherwise, remove dead code to improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup.go` around lines 108 - 133, The large
commented-out block handling permission assignment and key regeneration
(references: getCurrentUserObjectID, assignBlobDataContributor,
NewStorageAccountsClient, ListKeys, RegenerateKey, resourceGroupName,
storageAccount) should be removed or annotated: if this is WIP, replace the
commented code with a single TODO comment explaining the planned work and
linking any relevant issue/PR; otherwise delete the dead code to keep setup.go
clean and maintainable. Ensure any removed logic is preserved in the PR
description or an issue so it can be restored if needed.

52-53: TODO comment indicates incomplete teardown logic.

The TearDown function has a TODO about deleting the storage account. This should either be implemented or tracked as a separate issue to ensure complete cleanup of Azure resources.

Would you like me to open an issue to track the storage account cleanup implementation?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup.go` around lines 52 - 53, The TODO in TearDown
indicates the storage account isn't being cleaned up; update the TearDown method
in src/pkg/clouds/azure/aci/setup.go to either implement deletion of the Azure
storage account tied to this deployment (use the existing Azure
client/credentials used elsewhere in this file to locate and delete the storage
account resource) or, if deletion must be deferred, replace the TODO with a
clear TODO comment and create a tracked issue (include the resource name/ID
logic and steps required) and add a reference to that issue in the comment so
resource cleanup is not forgotten.
src/pkg/cli/client/byoc/azure/byoc.go (4)

461-516: Channel nil-check placement after continue may delay termination.

When a channel closes and is set to nil (lines 466, 490), the code uses continue to restart the loop. This skips the termination check at lines 512-514, meaning the loop will iterate once more before checking if both channels are nil. While not a bug, refactoring to check immediately after setting a channel to nil would be clearer.

♻️ Clearer termination check
 				case entry, ok := <-cdCh:
 					if !ok {
 						cdCh = nil
-						continue
+						if acaCh == nil {
+							return
+						}
+						continue
 					}

Apply similar pattern for acaCh.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/byoc.go` around lines 461 - 516, The loop
should terminate immediately when both channels are closed instead of relying on
the loop head after a continue; after you set cdCh = nil (in the cdCh case) and
after you set acaCh = nil (in the acaCh case) add an immediate check if cdCh ==
nil && acaCh == nil and return if so (or replace the continue with that check +
return), so the anonymous tailing function (the func returning from the caller)
exits right away when both channels are closed; apply this change for both cdCh
and acaCh handling in the yielding loop.

319-321: Status code check may be overly strict or include unexpected codes.

Azure Blob Storage "Put Blob" operation typically returns 201 Created on success. The check includes 200 which is unusual for this operation. Consider whether 201 alone is sufficient, or add a comment explaining when 200 is expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/byoc.go` around lines 319 - 321, The
status-code check in byoc.go currently treats both 200 and 201 as success
(checking resp.StatusCode), but Azure Put Blob returns 201 Created; update the
check to accept 201 only (e.g., if resp.StatusCode != 201 return error) and add
a brief comment at that check explaining that Put Blob should return 201 and why
200 is not expected (or explicitly note any special-case where 200 may be
acceptable) so future readers understand the choice; reference the
resp.StatusCode/resp.Status validation in the upload path in byoc.go when making
the change.

354-360: All errors wrapped as deployment failure may mask transient issues.

GetDeploymentStatus wraps all errors from GetContainerGroupStatus as ErrDeploymentFailed. This could mask transient network errors or rate limiting that might succeed on retry. Consider distinguishing between permanent failures (container exited with error) and transient errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/byoc.go` around lines 354 - 360,
GetDeploymentStatus currently converts any error from
driver.GetContainerGroupStatus into client.ErrDeploymentFailed which hides
transient issues; update ByocAzure.GetDeploymentStatus to distinguish transient
vs permanent failures by inspecting the returned error or status from
b.driver.GetContainerGroupStatus: if the driver returns a sentinel/transient
error (e.g., network/timeouts, rate-limit type or a driver-specific transient
error type), return that error unchanged (or wrap it with context but preserve
its type) so callers can retry, but only wrap errors as
client.ErrDeploymentFailed when the container group status explicitly indicates
a permanent deployment failure (use b.cdContainerGroup or the status value
returned); use errors.As / errors.Is to detect known transient driver errors and
document the behavior in the function comment.

233-234: Sensitive values (Pulumi backend/passphrase) passed as plain environment variables.

The Pulumi backend credentials and config passphrase are passed as plain environment variables (lines 233-234). The TODO comments acknowledge this. Consider using Azure Key Vault references or secure environment variable injection before production use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/azure/byoc.go` around lines 233 - 234, The Pulumi
backend and passphrase are being injected as plain env vars
(pulumiBackendKey/pulumiBackendValue and "PULUMI_CONFIG_PASSPHRASE" using
byoc.PulumiConfigPassphrase); replace this with secure secret retrieval instead
of hard-setting env values: fetch the backend URL and passphrase from a secure
store (e.g., Azure Key Vault or your secret manager) at runtime and populate
them as secrets (or use Key Vault references / secret provider) so they are not
exposed in process environment or source; update the initialization in byoc.go
to use the secure-secret getter for pulumiBackendValue and
byoc.PulumiConfigPassphrase and remove plain-text assignment (honor the existing
TODO to make them secret).
src/pkg/clouds/azure/aca/common.go (2)

61-64: URL path segments are not encoded.

appName, ResourceGroup, and other values are interpolated directly into URLs without URL encoding. While Azure resource names have naming restrictions that typically prevent problematic characters, using url.PathEscape would be more defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/common.go` around lines 61 - 64, The URL is built by
interpolating values directly into fmt.Sprintf, leaving path segments like
appName, c.ResourceGroup, c.SubscriptionID and apiVersion unescaped; replace
those direct interpolations with url.PathEscape(...) calls (e.g.,
url.PathEscape(appName), url.PathEscape(c.ResourceGroup),
url.PathEscape(c.SubscriptionID), url.PathEscape(apiVersion)) when constructing
the string in the URL assignment so all path segments are safely encoded before
being inserted.
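As an illustration of the escaping the prompt above asks for, here is a stdlib-only sketch; the endpoint shape and the parameter values are simplified stand-ins for the real URL built in common.go:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildAppURL assembles a management-plane URL from untrusted segments,
// escaping each path element so unexpected characters cannot alter the path.
func buildAppURL(subscriptionID, resourceGroup, appName, apiVersion string) string {
	return fmt.Sprintf(
		"https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.App/containerApps/%s?api-version=%s",
		url.PathEscape(subscriptionID),
		url.PathEscape(resourceGroup),
		url.PathEscape(appName),
		url.QueryEscape(apiVersion), // api-version sits in the query component
	)
}

func main() {
	fmt.Println(buildAppURL("sub-123", "my rg", "app/one", "2024-03-01"))
}
```

Note the split: `url.PathEscape` for path segments (a space becomes `%20`, a `/` becomes `%2F`) and `url.QueryEscape` for query values.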

71-74: http.DefaultClient has no timeout, risking indefinite hangs.

Direct REST calls using http.DefaultClient.Do have no timeout configured. If the Azure management endpoint becomes unresponsive, this could block indefinitely. Consider using a client with a configured timeout or deriving deadlines from the context.

♻️ Suggested approach
+var httpClient = &http.Client{Timeout: 30 * time.Second}
+
 func (c *ContainerApp) getAuthToken(ctx context.Context, appName string) (string, error) {
 	// ...
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)

Apply the same change to getEventStreamBase. Alternatively, use http.NewRequestWithContext combined with context deadline propagation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/common.go` around lines 71 - 74, The call to
http.DefaultClient.Do in the function that performs the Azure REST request
should not use the default client (which has no timeout); update the code in the
function (and similarly in getEventStreamBase) to create and use an http.Client
with a sensible Timeout or use http.NewRequestWithContext and pass a context
with deadline derived from the incoming ctx so the request respects
cancellation/deadlines; ensure you replace http.DefaultClient.Do(req) with
client.Do(req) (or use req = req.WithContext(ctx)) and propagate errors
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/clouds/azure/aca/common.go`:
- Around line 153-155: The code currently dereferences SDK pointers without nil
checks (e.g., app.Properties.Template.Containers[0].Name, list.Value[0].Name,
list.Value[0].Properties.Containers[0].Name); update the logic to defensively
validate each nullable field and slice before dereferencing: check that app,
app.Properties, app.Properties.Template, and Containers are non-nil and have
length>0 and that Container.Name is non-nil before using *Name; similarly check
list.Value length, list.Value[0].Properties non-nil, its Containers length, and
their Name pointers before dereferencing; if any check fails, use a safe
fallback (empty string or skip assignment) or return an explicit error.
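The defensive dereferencing described above can be sketched with stand-in types; the struct names here are simplified placeholders for the pointer-heavy SDK models, not the real armappcontainers types:

```go
package main

import "fmt"

// Minimal stand-ins for the SDK's nullable shapes (hypothetical names).
type Container struct{ Name *string }
type Template struct{ Containers []*Container }
type Properties struct{ Template *Template }
type App struct{ Properties *Properties }

// firstContainerName walks the chain defensively and returns "" when any
// link is nil or empty, instead of panicking on a nil dereference.
func firstContainerName(app *App) string {
	if app == nil || app.Properties == nil || app.Properties.Template == nil {
		return ""
	}
	cs := app.Properties.Template.Containers
	if len(cs) == 0 || cs[0] == nil || cs[0].Name == nil {
		return ""
	}
	return *cs[0].Name
}

func main() {
	name := "web"
	ok := &App{Properties: &Properties{Template: &Template{Containers: []*Container{{Name: &name}}}}}
	fmt.Println(firstContainerName(ok))    // web
	fmt.Println(firstContainerName(&App{})) // empty string, no panic
}
```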

In `@src/pkg/clouds/azure/aca/tail.go`:
- Around line 32-52: The code marks tailed apps in the known map and never
removes them, so when c.StreamLogs(...) returns the app is never retried; modify
startTailing (and the other tailing site around the similar 67-71 block) to
remove the app from known when the goroutine exits: add a sync.Mutex (or switch
known to a sync.Map) to protect concurrent access, lock around reads/writes to
known, and defer a deletion like defer func(){ mutex.Lock(); delete(known,
appName); mutex.Unlock() }() at the start of the goroutine so the app can be
retried after StreamLogs() ends (do this both on error and normal exit).
- Around line 54-64: The poll() closure currently swallows failures from
c.newContainerAppsClient() and pager.NextPage(ctx); change it to surface those
errors instead of silently returning by (1) logging the error with context
(include the error value and which call failed) and (2) propagating the failure
to the outer watch loop so the caller can react (e.g., send the error on the
existing watch/error channel or cancel the watch context). Update the code paths
around poll(), newContainerAppsClient(), NewListByResourceGroupPager, and
pager.NextPage(ctx) to ensure errors are logged and forwarded (or cause the
watcher to stop) rather than being ignored.

In `@src/pkg/clouds/azure/aci/setup.go`:
- Around line 139-143: The blob container creation can error if the container
already exists; update the setUp flow around containerClient.Create to treat the
specific "ContainerAlreadyExists" error as success: after calling
containerClient.Create, if err != nil inspect the returned error for Azure's
container-already-exists error code (e.g., "ContainerAlreadyExists" or check
azcore/armstorage error code) and only return a failure for other errors,
otherwise proceed to set c.BlobContainerName = *container.Name (or if container
is nil when existing, set c.BlobContainerName from the requested
blobContainerName) so repeated setUp runs are idempotent.

In `@src/pkg/clouds/azure/aci/upload.go`:
- Around line 32-34: The SAS token currently sets StartTime using the local
variable now (and computes expiry), which can cause 403s for immediate PUTs;
update the SAS creation in upload.go to omit StartTime (or set it to a short
time in the past, e.g., now.Add(-15*time.Minute)) so the generated SAS is valid
immediately for the upload used by upload_test; adjust the code that uses the
now and expiry variables and the StartTime field when building the SAS (the
SAS-generation block that assigns StartTime) accordingly.

In `@src/pkg/clouds/azure/acr/acr_test.go`:
- Line 101: The test currently logs the full SAS URL stored in logURL via t.Logf
which can expose credentials; change the test to avoid printing query parameters
— e.g., parse logURL, strip or mask the RawQuery before logging, or log only the
hostname/path or a fixed placeholder instead of the full URL; update the t.Logf
usage (the call referencing logURL) to output the redacted or partial URL so no
SAS token appears in CI logs.

---

Duplicate comments:
In `@src/pkg/clouds/azure/aci/blob.go`:
- Around line 23-36: The newSharedKeyCredential method currently assumes
ListKeys returns at least one key and dereferences keys.Keys[0].Value which can
panic; update ContainerInstance.newSharedKeyCredential to (1) check that keys
and keys.Keys are non-nil and non-empty, (2) iterate over keys.Keys and select
the first entry whose Value is non-nil and non-empty, and (3) return a clear
error from newSharedKeyCredential if no usable key is found instead of
dereferencing a nil/empty value; reference ListKeys and keys.Keys when making
these checks and the azblob.NewSharedKeyCredential call.
- Around line 81-91: DownloadBlob currently reads the entire blob into memory
(resp.Body) which can OOM; change it to either stream the response or enforce a
size cap. Two options: 1) Preferable: change DownloadBlob signature to return
(io.ReadCloser, error), create client via newBlobContainerClient, call
client.NewBlobClient(blobName).DownloadStream, and return resp.Body directly so
callers stream the content. 2) If keeping []byte, introduce a constant (e.g.
maxBlobDownloadBytes), wrap resp.Body with io.LimitReader and read via
io.ReadAll, then error if the result equals the limit (indicating truncation).
Update callers of ContainerInstance.DownloadBlob accordingly and ensure
resp.Body is closed only when appropriate for streaming vs capped-read behavior.

In `@src/pkg/clouds/azure/aci/upload_test.go`:
- Line 29: The test currently logs the full upload URL (t.Logf("Upload URL: %s",
url)) which exposes the SAS query string; fix it by parsing the url value (use
net/url.Parse), strip or replace the RawQuery (e.g., set RawQuery = "" or
replace with "REDACTED"), and log the sanitized version (e.g., safeURL) instead
of the original url so no SAS credentials appear in test logs.

In `@src/pkg/clouds/azure/aci/upload.go`:
- Around line 44-56: The code currently dereferences keys.Keys[0].Value which
can panic if ListKeys returns zero keys or a nil/empty value; update the logic
in the block that obtains storageKey (around storageKey declaration,
c.NewStorageAccountsClient(), and accountsClient.ListKeys) to iterate over
keys.Keys, pick the first non-nil/non-empty Value, assign it to storageKey, and
return a descriptive error if no usable key is found (include context like
c.resourceGroupName and c.StorageAccount in the error message).
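A minimal sketch of the key-selection loop described above, using a stand-in struct in place of the SDK's ListKeys result type:

```go
package main

import (
	"errors"
	"fmt"
)

// AccountKey mimics the SDK's ListKeys entry, whose Value is a *string.
type AccountKey struct{ Value *string }

// firstUsableKey returns the first non-nil, non-empty key value, or an error
// naming the account so the caller gets context instead of a panic.
func firstUsableKey(keys []AccountKey, account string) (string, error) {
	for _, k := range keys {
		if k.Value != nil && *k.Value != "" {
			return *k.Value, nil
		}
	}
	return "", errors.New("no usable storage key for account " + account)
}

func main() {
	empty, good := "", "key2value"
	key, err := firstUsableKey([]AccountKey{{Value: nil}, {Value: &empty}, {Value: &good}}, "acct")
	fmt.Println(key, err)
}
```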

In `@src/pkg/clouds/azure/acr/tail.go`:
- Around line 92-94: The current check in the function handling the blob log
fetch (look for the resp.StatusCode check around resp, offset in
src/pkg/clouds/azure/acr/tail.go) silently returns offset when the response is
not 200 or 206; change this so unexpected statuses are surfaced: when
resp.StatusCode is not http.StatusPartialContent and not http.StatusOK, read any
response body for context and either log the status+body via the existing logger
or return a descriptive error (including the status code and body) instead of
returning offset(nil) so callers can detect and handle failures (e.g., expired
SAS/403).
- Around line 107-113: Guard against empty body before indexing the last byte:
change the logic around variables text, lines and complete so you only access
text[len(text)-1] when len(text) > 0 (e.g., if len(text) == 0 then set complete
to an empty slice or lines and skip the byte check). Update the condition that
determines complete (currently using text[len(text)-1] != '\n') to first check
len(text) > 0, keeping variable names text, lines and complete to locate and
modify the code in tail.go.
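The trailing-newline handling called out above can be isolated into a small helper; `completeLines` is a hypothetical name illustrating the fix, not the code in tail.go:

```go
package main

import (
	"fmt"
	"strings"
)

// completeLines splits a log chunk into lines, returning the complete lines
// plus any trailing partial line. It is safe on empty input and drops the
// empty element that strings.Split produces after a trailing '\n'.
func completeLines(text string) (complete []string, partial string) {
	if len(text) == 0 {
		return nil, ""
	}
	lines := strings.Split(text, "\n")
	if text[len(text)-1] == '\n' {
		// Trailing '\n' yields a final "" element; drop it, nothing is partial.
		return lines[:len(lines)-1], ""
	}
	// No trailing '\n': the last element is an incomplete line.
	return lines[:len(lines)-1], lines[len(lines)-1]
}

func main() {
	c, p := completeLines("one\ntwo\nthr")
	fmt.Println(len(c), p) // 2 complete lines, partial "thr"
	c, p = completeLines("one\ntwo\n")
	fmt.Println(len(c), p) // 2 complete lines, no partial
}
```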

---

Nitpick comments:
In `@src/pkg/cli/client/byoc/azure/byoc.go`:
- Around line 461-516: The loop should terminate immediately when both channels
are closed instead of relying on the loop head after a continue; after you set
cdCh = nil (in the cdCh case) and after you set acaCh = nil (in the acaCh case)
add an immediate check if cdCh == nil && acaCh == nil and return if so (or
replace the continue with that check + return), so the anonymous tailing
function (the func returning from the caller) exits right away when both
channels are closed; apply this change for both cdCh and acaCh handling in the
yielding loop.
- Around line 319-321: The status-code check in byoc.go currently treats both
200 and 201 as success (checking resp.StatusCode), but Azure Put Blob returns
201 Created; update the check to accept 201 only (e.g., if resp.StatusCode !=
201 return error) and add a brief comment at that check explaining that Put Blob
should return 201 and why 200 is not expected (or explicitly note any
special-case where 200 may be acceptable) so future readers understand the
choice; reference the resp.StatusCode/resp.Status validation in the upload path
in byoc.go when making the change.
- Around line 354-360: GetDeploymentStatus currently converts any error from
driver.GetContainerGroupStatus into client.ErrDeploymentFailed which hides
transient issues; update ByocAzure.GetDeploymentStatus to distinguish transient
vs permanent failures by inspecting the returned error or status from
b.driver.GetContainerGroupStatus: if the driver returns a sentinel/transient
error (e.g., network/timeouts, rate-limit type or a driver-specific transient
error type), return that error unchanged (or wrap it with context but preserve
its type) so callers can retry, but only wrap errors as
client.ErrDeploymentFailed when the container group status explicitly indicates
a permanent deployment failure (use b.cdContainerGroup or the status value
returned); use errors.As / errors.Is to detect known transient driver errors and
document the behavior in the function comment.
- Around line 233-234: The Pulumi backend and passphrase are being injected as
plain env vars (pulumiBackendKey/pulumiBackendValue and
"PULUMI_CONFIG_PASSPHRASE" using byoc.PulumiConfigPassphrase); replace this with
secure secret retrieval instead of hard-setting env values: fetch the backend
URL and passphrase from a secure store (e.g., Azure Key Vault or your secret
manager) at runtime and populate them as secrets (or use Key Vault references /
secret provider) so they are not exposed in process environment or source;
update the initialization in byoc.go to use the secure-secret getter for
pulumiBackendValue and byoc.PulumiConfigPassphrase and remove plain-text
assignment (honor the existing TODO to make them secret).

In `@src/pkg/clouds/azure/aca/common.go`:
- Around line 61-64: The URL is built by interpolating values directly into
fmt.Sprintf, leaving path segments like appName, c.ResourceGroup,
c.SubscriptionID and apiVersion unescaped; replace those direct interpolations
with url.PathEscape(...) calls (e.g., url.PathEscape(appName),
url.PathEscape(c.ResourceGroup), url.PathEscape(c.SubscriptionID),
url.PathEscape(apiVersion)) when constructing the string in the URL assignment
so all path segments are safely encoded before being inserted.
- Around line 71-74: The call to http.DefaultClient.Do in the function that
performs the Azure REST request should not use the default client (which has no
timeout); update the code in the function (and similarly in getEventStreamBase)
to create and use an http.Client with a sensible Timeout or use
http.NewRequestWithContext and pass a context with deadline derived from the
incoming ctx so the request respects cancellation/deadlines; ensure you replace
http.DefaultClient.Do(req) with client.Do(req) (or use req =
req.WithContext(ctx)) and propagate errors unchanged.

In `@src/pkg/clouds/azure/aci/setup.go`:
- Around line 108-133: The large commented-out block handling permission
assignment and key regeneration (references: getCurrentUserObjectID,
assignBlobDataContributor, NewStorageAccountsClient, ListKeys, RegenerateKey,
resourceGroupName, storageAccount) should be removed or annotated: if this is
WIP, replace the commented code with a single TODO comment explaining the
planned work and linking any relevant issue/PR; otherwise delete the dead code
to keep setup.go clean and maintainable. Ensure any removed logic is preserved
in the PR description or an issue so it can be restored if needed.
- Around line 52-53: The TODO in TearDown indicates the storage account isn't
being cleaned up; update the TearDown method in
src/pkg/clouds/azure/aci/setup.go to either implement deletion of the Azure
storage account tied to this deployment (use the existing Azure
client/credentials used elsewhere in this file to locate and delete the storage
account resource) or, if deletion must be deferred, replace the TODO with a
clear TODO comment and create a tracked issue (include the resource name/ID
logic and steps required) and add a reference to that issue in the comment so
resource cleanup is not forgotten.

In `@src/pkg/clouds/azure/acr/acr_test.go`:
- Around line 66-75: The cleanup currently swallows errors; update the t.Cleanup
closure to log any failures from acr.newRegistriesClient, client.BeginDelete and
poller.PollUntilDone so leaked resources are diagnosable: call
acr.newRegistriesClient() and if err != nil call t.Logf with a clear message
including the error, then when calling client.BeginDelete(ctx,
acr.resourceGroupName, registryName, nil) check its returned error and log it,
and if poller is non-nil call poller.PollUntilDone(ctx, nil) and log any
returned error as well (use t.Logf/t.Errorf to include the error values and
context).

In `@src/pkg/http/put.go`:
- Around line 24-29: The PutWithHeader function assigns the caller's http.Header
map directly to req.Header which can share mutable state; update PutWithHeader
to clone the provided header before attaching it (use header.Clone()) so the
request gets its own copy and the caller's header map isn't shared with the
in-flight request.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36b30cca-2bda-44f9-901e-9b7b665094c4

📥 Commits

Reviewing files that changed from the base of the PR and between 0542952 and 17b1da3.

⛔ Files ignored due to path filters (2)
  • src/go.sum is excluded by !**/*.sum
  • src/protos/io/defang/v1/fabric.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (46)
  • flake.nix
  • pkgs/defang/cli.nix
  • pkgs/npm/README.md
  • src/README.md
  • src/cmd/cli/command/estimate.go
  • src/go.mod
  • src/pkg/agent/tools/estimate_test.go
  • src/pkg/cli/client/byoc/aws/byoc.go
  • src/pkg/cli/client/byoc/azure/byoc.go
  • src/pkg/cli/client/byoc/azure/login.go
  • src/pkg/cli/client/byoc/gcp/byoc.go
  • src/pkg/cli/client/provider_id.go
  • src/pkg/cli/client/region.go
  • src/pkg/cli/compose/context.go
  • src/pkg/cli/connect.go
  • src/pkg/cli/tailAndMonitor.go
  • src/pkg/clouds/azure/aca/common.go
  • src/pkg/clouds/azure/aca/tail.go
  • src/pkg/clouds/azure/aci/blob.go
  • src/pkg/clouds/azure/aci/common.go
  • src/pkg/clouds/azure/aci/common_test.go
  • src/pkg/clouds/azure/aci/run.go
  • src/pkg/clouds/azure/aci/run_test.go
  • src/pkg/clouds/azure/aci/setup.go
  • src/pkg/clouds/azure/aci/setup_test.go
  • src/pkg/clouds/azure/aci/status.go
  • src/pkg/clouds/azure/aci/stop.go
  • src/pkg/clouds/azure/aci/tail.go
  • src/pkg/clouds/azure/aci/tail_test.go
  • src/pkg/clouds/azure/aci/upload.go
  • src/pkg/clouds/azure/aci/upload_test.go
  • src/pkg/clouds/azure/acr/acr.go
  • src/pkg/clouds/azure/acr/acr_test.go
  • src/pkg/clouds/azure/acr/tail.go
  • src/pkg/clouds/azure/acr/types.go
  • src/pkg/clouds/azure/acr/types_test.go
  • src/pkg/clouds/azure/appcfg/appcfg.go
  • src/pkg/clouds/azure/common.go
  • src/pkg/clouds/azure/location.go
  • src/pkg/http/post.go
  • src/pkg/http/put.go
  • src/pkg/session/session.go
  • src/pkg/stacks/selector_test.go
  • src/pkg/stacks/wizard.go
  • src/pkg/utils.go
  • src/protos/io/defang/v1/fabric.proto
✅ Files skipped from review due to trivial changes (13)
  • src/README.md
  • pkgs/npm/README.md
  • src/pkg/cli/client/byoc/gcp/byoc.go
  • src/pkg/agent/tools/estimate_test.go
  • src/pkg/cli/tailAndMonitor.go
  • pkgs/defang/cli.nix
  • src/pkg/cli/client/byoc/aws/byoc.go
  • src/pkg/clouds/azure/acr/types_test.go
  • src/pkg/cli/client/byoc/azure/login.go
  • src/cmd/cli/command/estimate.go
  • src/pkg/clouds/azure/acr/types.go
  • src/pkg/clouds/azure/aci/status.go
  • src/pkg/clouds/azure/location.go
🚧 Files skipped from review as they are similar to previous changes (17)
  • src/pkg/session/session.go
  • src/pkg/cli/client/region.go
  • src/pkg/clouds/azure/aci/common_test.go
  • src/pkg/stacks/wizard.go
  • src/pkg/stacks/selector_test.go
  • src/go.mod
  • src/pkg/clouds/azure/aci/setup_test.go
  • src/pkg/clouds/azure/aci/stop.go
  • src/pkg/clouds/azure/aci/run_test.go
  • src/pkg/cli/client/provider_id.go
  • src/pkg/cli/connect.go
  • src/pkg/http/post.go
  • src/pkg/clouds/azure/aci/tail_test.go
  • src/pkg/cli/compose/context.go
  • src/pkg/clouds/azure/aci/run.go
  • src/pkg/clouds/azure/common.go
  • src/pkg/clouds/azure/acr/acr.go

Comment on lines +32 to +52
known := map[string]struct{}{}

startTailing := func(appName string) {
	go func() {
		appCh, err := c.StreamLogs(ctx, appName, "", "", "", true)
		if err != nil {
			select {
			case out <- ServiceLogEntry{AppName: appName, LogEntry: LogEntry{Err: err}}:
			case <-ctx.Done():
			}
			return
		}
		for entry := range appCh {
			select {
			case out <- ServiceLogEntry{AppName: appName, LogEntry: entry}:
			case <-ctx.Done():
				return
			}
		}
	}()
}

⚠️ Potential issue | 🟠 Major

Allow a tailed app to be retried after the stream exits.

Once an app name is inserted into known, it is never removed. If StreamLogs() returns because the SSE connection drops, the replica rolls, or the stream ends for any other transient reason, later polls will skip that app forever and follow mode permanently loses its logs.

Also applies to: 67-71

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/tail.go` around lines 32 - 52, The code marks tailed
apps in the known map and never removes them, so when c.StreamLogs(...) returns
the app is never retried; modify startTailing (and the other tailing site around
the similar 67-71 block) to remove the app from known when the goroutine exits:
add a sync.Mutex (or switch known to a sync.Map) to protect concurrent access,
lock around reads/writes to known, and defer a deletion like defer func(){
mutex.Lock(); delete(known, appName); mutex.Unlock() }() at the start of the
goroutine so the app can be retried after StreamLogs() ends (do this both on
error and normal exit).

Comment on lines +54 to +64
poll := func() {
	client, err := c.newContainerAppsClient()
	if err != nil {
		return
	}
	pager := client.NewListByResourceGroupPager(c.ResourceGroup, nil)
	for pager.More() {
		page, err := pager.NextPage(ctx)
		if err != nil {
			return
		}

⚠️ Potential issue | 🟠 Major

Don’t silently swallow discovery failures.

If newContainerAppsClient() or NextPage() fails here, poll() just returns and the watch loop keeps running with no indication that app discovery is broken. In follow mode that leaves the caller with an apparently healthy stream that never picks up Container App logs.

🧭 Suggested change
 		poll := func() {
 			client, err := c.newContainerAppsClient()
 			if err != nil {
+				select {
+				case out <- ServiceLogEntry{LogEntry: LogEntry{Err: fmt.Errorf("list container apps client: %w", err)}}:
+				case <-ctx.Done():
+				}
 				return
 			}
 			pager := client.NewListByResourceGroupPager(c.ResourceGroup, nil)
 			for pager.More() {
 				page, err := pager.NextPage(ctx)
 				if err != nil {
+					select {
+					case out <- ServiceLogEntry{LogEntry: LogEntry{Err: fmt.Errorf("list container apps page: %w", err)}}:
+					case <-ctx.Done():
+					}
 					return
 				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
poll := func() {
	client, err := c.newContainerAppsClient()
	if err != nil {
		return
	}
	pager := client.NewListByResourceGroupPager(c.ResourceGroup, nil)
	for pager.More() {
		page, err := pager.NextPage(ctx)
		if err != nil {
			return
		}
poll := func() {
	client, err := c.newContainerAppsClient()
	if err != nil {
		select {
		case out <- ServiceLogEntry{LogEntry: LogEntry{Err: fmt.Errorf("list container apps client: %w", err)}}:
		case <-ctx.Done():
		}
		return
	}
	pager := client.NewListByResourceGroupPager(c.ResourceGroup, nil)
	for pager.More() {
		page, err := pager.NextPage(ctx)
		if err != nil {
			select {
			case out <- ServiceLogEntry{LogEntry: LogEntry{Err: fmt.Errorf("list container apps page: %w", err)}}:
			case <-ctx.Done():
			}
			return
		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aca/tail.go` around lines 54 - 64, The poll() closure
currently swallows failures from c.newContainerAppsClient() and
pager.NextPage(ctx); change it to surface those errors instead of silently
returning by (1) logging the error with context (include the error value and
which call failed) and (2) propagating the failure to the outer watch loop so
the caller can react (e.g., send the error on the existing watch/error channel
or cancel the watch context). Update the code paths around poll(),
newContainerAppsClient(), NewListByResourceGroupPager, and pager.NextPage(ctx)
to ensure errors are logged and forwarded (or cause the watcher to stop) rather
than being ignored.

if err != nil {
	t.Fatalf("GetRunLogURL failed: %v", err)
}
t.Logf("Log URL: %s", logURL)

⚠️ Potential issue | 🟠 Major

Do not log full SAS log URLs.

Line 101 logs the full logURL, which includes the SAS token in query parameters and can leak credentials in CI logs.

Proposed fix
-		t.Logf("Log URL: %s", logURL)
+		t.Log("Log URL acquired")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
t.Logf("Log URL: %s", logURL)
t.Log("Log URL acquired")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/acr_test.go` at line 101, The test currently logs
the full SAS URL stored in logURL via t.Logf which can expose credentials;
change the test to avoid printing query parameters — e.g., parse logURL, strip
or mask the RawQuery before logging, or log only the hostname/path or a fixed
placeholder instead of the full URL; update the t.Logf usage (the call
referencing logURL) to output the redacted or partial URL so no SAS token
appears in CI logs.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
src/pkg/clouds/azure/aci/upload_test.go (1)

30-30: Minor: Non-standard Content-Type value.

"application/text" is not a registered MIME type. While Azure Blob Storage may accept it, the standard type is "text/plain".

♻️ Suggested fix
-	header := http.Header{"Content-Type": []string{"application/text"}}
+	header := http.Header{"Content-Type": []string{"text/plain"}}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/upload_test.go` at line 30, The test uses a
non-standard MIME type string in the header variable (header :=
http.Header{"Content-Type": []string{"application/text"}}) — change it to the
standard "text/plain" value; update the header declaration in
src/pkg/clouds/azure/aci/upload_test.go and any assertions or mocks in the same
test that expect "application/text" so they now expect "text/plain".
src/pkg/clouds/azure/aci/setup.go (1)

38-54: TODO: Storage account deletion not implemented in TearDown.

The TearDown function only deletes the resource group, but the storage account may have been created outside the resource group scope. The TODO on line 52-53 indicates this is a known gap.

Would you like me to help implement storage account cleanup, or open an issue to track this task?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/setup.go` around lines 38 - 54, TearDown currently
only removes the resource group but skips deleting the storage account; update
ContainerInstance.TearDown to also delete the storage account when present by
checking the instance fields that hold the storage account identity (e.g.
c.storageAccountName and any storage account resource group field), create the
storage accounts client (use the existing helper like c.newStorageAccountsClient
or add one if missing), call BeginDelete(ctx, accountName) and
PollUntilDone(ctx, nil), and return/report any errors; keep the resource-group
deletion logic and only attempt storage account deletion when the storage
account name is non-empty so we don't error for cases where no account was
created.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/clouds/azure/aci/run_test.go`:
- Around line 24-29: The test registers a t.Cleanup but the actual teardown call
to containerInstance.TearDown(ctx) is commented out, which can leave Azure
resources orphaned; either uncomment and invoke containerInstance.TearDown(ctx)
inside the t.Cleanup and handle any error by calling t.Errorf("Failed to tear
down container instance: %v", err) (not t.Fatalf), or if teardown is
intentionally skipped for debugging, add a clear comment above the t.Cleanup
explaining why it’s omitted and how/when to re-enable it; locate the cleanup
block referencing containerInstance.TearDown and update accordingly.

In `@src/pkg/clouds/azure/acr/tail_test.go`:
- Around line 59-68: The test is asserting behavior for the old buggy
Split/offset logic; after fixing tail.go to ignore the trailing empty element
from strings.Split, update the test assertions: set expectedOffset to 18 (actual
byte length) instead of int64(len(body))+1 and assert len(yielded) == 3 (three
real lines) instead of 4; change the two t.Errorf checks that compare
expectedOffset/newOffset and len(yielded) to use these new expected values so
the test matches the corrected logic in tail.go.

In `@src/pkg/clouds/azure/acr/tail.go`:
- Around line 43-54: In the terminal-status branch in the function where
status.IsTerminal() is handled, stop ignoring the return values from
readLogChunk(ctx, logURL, offset, yield); capture its (offset, err) result and
if err indicates the consumer stopped (e.g., err == errStopped or a boolean
false from yield), return immediately so you do not call yield again with the
final failure message for runID; otherwise continue building the msg and call
yield as before. Ensure you reference readLogChunk and runID in the fix and
propagate/handle the err return to avoid violating the iterator contract.
- Around line 114-127: The offset calculation overcounts when text ends with
'\n' because strings.Split creates a trailing empty element; update the logic
around lines/complete to safely handle empty text and drop the trailing empty
element when text[len(text)-1] == '\n' instead of treating it as a real line.
Concretely: check len(text)>0 before indexing, set complete = lines; if
text[len(text)-1] == '\n' && len(lines) > 0 && lines[len(lines)-1] == "" {
complete = lines[:len(lines)-1] } else if len(lines) > 0 { complete =
lines[:len(lines)-1] } so offset increments (offset += int64(len(line)+1)) and
yield(line, nil) remain correct without producing the off-by-one.

---

Nitpick comments:
In `@src/pkg/clouds/azure/aci/setup.go`:
- Around line 38-54: TearDown currently only removes the resource group but
skips deleting the storage account; update ContainerInstance.TearDown to also
delete the storage account when present by checking the instance fields that
hold the storage account identity (e.g. c.storageAccountName and any storage
account resource group field), create the storage accounts client (use the
existing helper like c.newStorageAccountsClient or add one if missing), call
BeginDelete(ctx, accountName) and PollUntilDone(ctx, nil), and return/report any
errors; keep the resource-group deletion logic and only attempt storage account
deletion when the storage account name is non-empty so we don't error for cases
where no account was created.

In `@src/pkg/clouds/azure/aci/upload_test.go`:
- Line 30: The test uses a non-standard MIME type string in the header variable
(header := http.Header{"Content-Type": []string{"application/text"}}) — change
it to the standard "text/plain" value; update the header declaration in
src/pkg/clouds/azure/aci/upload_test.go and any assertions or mocks in the same
test that expect "application/text" so they now expect "text/plain".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78fa2c15-9753-402e-b452-4a391db2709b

📥 Commits

Reviewing files that changed from the base of the PR and between 17b1da3 and f11f37e.

📒 Files selected for processing (12)
  • src/pkg/clouds/azure/aca/common.go
  • src/pkg/clouds/azure/aci/blob.go
  • src/pkg/clouds/azure/aci/run.go
  • src/pkg/clouds/azure/aci/run_test.go
  • src/pkg/clouds/azure/aci/setup.go
  • src/pkg/clouds/azure/aci/stop.go
  • src/pkg/clouds/azure/aci/tail.go
  • src/pkg/clouds/azure/aci/tail_test.go
  • src/pkg/clouds/azure/aci/upload.go
  • src/pkg/clouds/azure/aci/upload_test.go
  • src/pkg/clouds/azure/acr/tail.go
  • src/pkg/clouds/azure/acr/tail_test.go
✅ Files skipped from review due to trivial changes (1)
  • src/pkg/clouds/azure/aca/common.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/pkg/clouds/azure/aci/stop.go
  • src/pkg/clouds/azure/aci/tail_test.go
  • src/pkg/clouds/azure/aci/upload.go
  • src/pkg/clouds/azure/aci/blob.go

Comment on lines +24 to +29
t.Cleanup(func() {
// err := containerInstance.TearDown(ctx)
// if err != nil {
// t.Fatalf("Failed to tear down container instance: %v", err)
// }
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Commented-out teardown may leave orphaned Azure resources.

The cleanup callback is registered but doesn't actually tear down the resource group. This will leave Azure resources behind after test runs, potentially incurring costs.

Consider either:

  1. Uncommenting the teardown code
  2. Adding a comment explaining why teardown is intentionally skipped (e.g., for debugging, shared test resources)
🧹 Proposed fix to enable cleanup
 	t.Cleanup(func() {
-		// err := containerInstance.TearDown(ctx)
-		// if err != nil {
-		// 	t.Fatalf("Failed to tear down container instance: %v", err)
-		// }
+		if err := containerInstance.TearDown(context.Background()); err != nil {
+			t.Errorf("Failed to tear down container instance: %v", err)
+		}
 	})

Note: Use t.Errorf instead of t.Fatalf in cleanup. t.Fatalf calls FailNow, which must run on the test goroutine and may not work correctly inside cleanup functions; t.Errorf marks the test failed without stopping the goroutine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/aci/run_test.go` around lines 24 - 29, The test
registers a t.Cleanup but the actual teardown call to
containerInstance.TearDown(ctx) is commented out, which can leave Azure
resources orphaned; either uncomment and invoke containerInstance.TearDown(ctx)
inside the t.Cleanup and handle any error by calling t.Errorf("Failed to tear
down container instance: %v", err) (not t.Fatalf), or if teardown is
intentionally skipped for debugging, add a clear comment above the t.Cleanup
explaining why it’s omitted and how/when to re-enable it; locate the cleanup
block referencing containerInstance.TearDown and update accordingly.

Comment on lines +59 to +68
// strings.Split on a trailing-newline body produces one extra empty element,
// each element (including the empty one) adds len(line)+1 to the offset.
expectedOffset := int64(len(body)) + 1 // trailing empty element adds +1
if newOffset != expectedOffset {
t.Errorf("expected offset %d, got %d", expectedOffset, newOffset)
}
// 3 real lines + 1 empty string from the trailing newline split
if len(yielded) != 4 {
t.Errorf("expected 4 yielded elements, got %d: %v", len(yielded), yielded)
}

⚠️ Potential issue | 🟡 Minor

Test validates buggy offset calculation.

This test documents the off-by-one behavior flagged in tail.go. Once the implementation is fixed to exclude the trailing empty element from Split, update this test to expect:

  • expectedOffset = 18 (actual byte length)
  • len(yielded) == 3 (actual line count)
📝 Expected test changes after fix
-	// strings.Split on a trailing-newline body produces one extra empty element,
-	// each element (including the empty one) adds len(line)+1 to the offset.
-	expectedOffset := int64(len(body)) + 1 // trailing empty element adds +1
+	expectedOffset := int64(len(body))
 	if newOffset != expectedOffset {
 		t.Errorf("expected offset %d, got %d", expectedOffset, newOffset)
 	}
-	// 3 real lines + 1 empty string from the trailing newline split
-	if len(yielded) != 4 {
-		t.Errorf("expected 4 yielded elements, got %d: %v", len(yielded), yielded)
+	if len(yielded) != 3 {
+		t.Errorf("expected 3 lines, got %d: %v", len(yielded), yielded)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// strings.Split on a trailing-newline body produces one extra empty element,
// each element (including the empty one) adds len(line)+1 to the offset.
expectedOffset := int64(len(body)) + 1 // trailing empty element adds +1
if newOffset != expectedOffset {
t.Errorf("expected offset %d, got %d", expectedOffset, newOffset)
}
// 3 real lines + 1 empty string from the trailing newline split
if len(yielded) != 4 {
t.Errorf("expected 4 yielded elements, got %d: %v", len(yielded), yielded)
}
expectedOffset := int64(len(body))
if newOffset != expectedOffset {
t.Errorf("expected offset %d, got %d", expectedOffset, newOffset)
}
if len(yielded) != 3 {
t.Errorf("expected 3 lines, got %d: %v", len(yielded), yielded)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/tail_test.go` around lines 59 - 68, The test is
asserting behavior for the old buggy Split/offset logic; after fixing tail.go to
ignore the trailing empty element from strings.Split, update the test
assertions: set expectedOffset to 18 (actual byte length) instead of
int64(len(body))+1 and assert len(yielded) == 3 (three real lines) instead of 4;
change the two t.Errorf checks that compare expectedOffset/newOffset and
len(yielded) to use these new expected values so the test matches the corrected
logic in tail.go.

Comment on lines +43 to +54
if status.IsTerminal() {
// Read any remaining log content
readLogChunk(ctx, logURL, offset, yield)
if !status.IsSuccess() {
msg := string(status.Status)
if status.ErrorMessage != "" {
msg += ": " + status.ErrorMessage
}
yield("", fmt.Errorf("build %s: %s", runID, msg))
}
return
}

⚠️ Potential issue | 🟠 Major

Ignored return value may cause yield to be called after consumer stopped.

Line 45 discards both return values from readLogChunk. If the consumer stops iteration (yield returns false), readLogChunk returns errStopped, but this is ignored and the code proceeds to potentially call yield again on line 51, violating the iterator contract.

🛡️ Proposed fix to respect early termination
 		if status.IsTerminal() {
 			// Read any remaining log content
-			readLogChunk(ctx, logURL, offset, yield)
+			if _, err := readLogChunk(ctx, logURL, offset, yield); err != nil {
+				return // consumer stopped iteration
+			}
 			if !status.IsSuccess() {
 				msg := string(status.Status)
 				if status.ErrorMessage != "" {
 					msg += ": " + status.ErrorMessage
 				}
 				yield("", fmt.Errorf("build %s: %s", runID, msg))
 			}
 			return
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/tail.go` around lines 43 - 54, In the
terminal-status branch in the function where status.IsTerminal() is handled,
stop ignoring the return values from readLogChunk(ctx, logURL, offset, yield);
capture its (offset, err) result and if err indicates the consumer stopped
(e.g., err == errStopped or a boolean false from yield), return immediately so
you do not call yield again with the final failure message for runID; otherwise
continue building the msg and call yield as before. Ensure you reference
readLogChunk and runID in the fix and propagate/handle the err return to avoid
violating the iterator contract.

Comment on lines +114 to +127
lines := strings.Split(text, "\n")

// If the last byte is not a newline, the last line is incomplete — hold it back
complete := lines
if text[len(text)-1] != '\n' {
complete = lines[:len(lines)-1]
}

for _, line := range complete {
offset += int64(len(line) + 1) // +1 for the newline
if !yield(line, nil) {
return offset, errStopped
}
}

⚠️ Potential issue | 🔴 Critical

Off-by-one in offset calculation when body ends with newline.

When text ends with \n, strings.Split produces a trailing empty string (e.g., "a\nb\n" yields ["a", "b", ""]). The code includes this empty element in complete and adds +1 for a non-existent newline, causing offset to exceed actual content length by 1 byte.

For "line1\nline2\n" (12 bytes): offset becomes 13 instead of 12.

🐛 Proposed fix to exclude trailing empty element
 	lines := strings.Split(text, "\n")
 
 	// If the last byte is not a newline, the last line is incomplete — hold it back
 	complete := lines
 	if text[len(text)-1] != '\n' {
 		complete = lines[:len(lines)-1]
+	} else if len(lines) > 0 && lines[len(lines)-1] == "" {
+		// Remove trailing empty element from Split
+		complete = lines[:len(lines)-1]
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/clouds/azure/acr/tail.go` around lines 114 - 127, The offset
calculation overcounts when text ends with '\n' because strings.Split creates a
trailing empty element; update the logic around lines/complete to safely handle
empty text and drop the trailing empty element when text[len(text)-1] == '\n'
instead of treating it as a real line. Concretely: check len(text)>0 before
indexing, set complete = lines; if text[len(text)-1] == '\n' && len(lines) > 0
&& lines[len(lines)-1] == "" { complete = lines[:len(lines)-1] } else if
len(lines) > 0 { complete = lines[:len(lines)-1] } so offset increments (offset
+= int64(len(line)+1)) and yield(line, nil) remain correct without producing the
off-by-one.
