Fix Compute Engine instance group manager monitoring by jordanstephens · Pull Request #2025 · DefangLabs/defang

jordanstephens · 2026-04-07T20:43:00Z

Fixes #2019

WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services during deployment. Three related bugs in the GCP Cloud Logging integration caused all gce_instance_group addInstances events to be silently dropped, leaving computeEngineRootTriggers always empty.

Root causes:

- Wrong label filter format: allInstancesConfig.properties.labels is a map<string,string>, not a list of {key,value} structs. The Cloud Logging query was using the list format (labels.key="defang-service" / labels.value="..."), which never matched any audit log entries. The parser was similarly iterating over the field as a list (GetListInStruct), always returning nil.
- Absent labels in PATCH requests: GCE audit logs for regionInstanceGroupManagers.patch only carry the fields that changed (e.g. the new instance template reference). allInstancesConfig.properties.labels is absent from the request body for every update after the initial create, so reading labels from the log entry never worked for re-deploys.
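
To make the first root cause concrete, here is a minimal sketch of why a list-style lookup finds nothing when the field is actually a map. It uses plain decoded JSON rather than the protobuf struct helpers the PR mentions; `labelFromList`, `labelFromMap`, `sampleLabels`, and the sample payload are hypothetical stand-ins, not the project's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelFromList mimics a parser that assumes labels is a list of
// {key, value} objects. Hypothetical helper for illustration only.
func labelFromList(labels any, name string) (string, bool) {
	items, ok := labels.([]any)
	if !ok {
		return "", false // a JSON object is not a list, so nothing is found
	}
	for _, item := range items {
		kv, ok := item.(map[string]any)
		if !ok {
			continue
		}
		if kv["key"] == name {
			v, ok := kv["value"].(string)
			return v, ok
		}
	}
	return "", false
}

// labelFromMap treats labels as a string-keyed map, the shape the PR
// describes for allInstancesConfig.properties.labels.
func labelFromMap(labels any, name string) (string, bool) {
	m, ok := labels.(map[string]any)
	if !ok {
		return "", false
	}
	v, ok := m[name].(string)
	return v, ok
}

// sampleLabels decodes an illustrative (made-up) labels payload.
func sampleLabels() any {
	var body map[string]any
	_ = json.Unmarshal([]byte(`{"labels":{"defang-service":"web","defang-etag":"abc123"}}`), &body)
	return body["labels"]
}

func main() {
	labels := sampleLabels()
	_, foundAsList := labelFromList(labels, "defang-service")
	svc, foundAsMap := labelFromMap(labels, "defang-service")
	fmt.Println("list lookup found:", foundAsList)
	fmt.Println("map lookup found:", foundAsMap, "service:", svc)
}
```

The list-style lookup fails on the type assertion before it ever inspects a key, which matches the "always returning nil" behaviour described above.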

Fixes:

- Corrected the query filter to use map-style key access (labels."defang-service"=~"^(svc)$") and replaced the 10-line list iteration in the parser with a single GetValueInStruct call.
- Instead of reading labels from the audit log request body, now reads the instance group manager name, project, and region from entry.Resource.Labels (always present) and calls the GCE REST API to fetch the live resource's labels. Adds GetInstanceGroupManagerLabels to GcpLogsClient, implemented using the existing google.golang.org/api/compute/v1 dependency (no new deps).
- Removed query-level label filters entirely. They are redundant now that the parser resolves labels from the live resource, and harmful because they prevented events from being returned at all during re-deploys.

Linked Issues

Checklist

- I have performed a self-review of my code
- I have added appropriate tests
- I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

New Features
- Informational logging shows which services are still pending during state monitoring.
- Logs a deployment-complete message before waiting for services to become healthy.
Bug Fixes
- More reliable GCP instance-group label retrieval and stricter event matching to improve deployment detection.
Tests
- Added unit tests and mocks covering GCP query generation and activity parsing behavior.

coderabbitai · 2026-04-07T20:43:07Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fcd30143-4e41-4002-9897-0faab7c99353

📥 Commits

Reviewing files that changed from the base of the PR and between 35ec187 and 97cd714.

📒 Files selected for processing (2)

src/pkg/cli/client/byoc/gcp/stream.go
src/pkg/cli/client/byoc/gcp/stream_test.go

📝 Walkthrough

Walkthrough

Replace audit-log label extraction for GCP regional Instance Group Managers with live label lookups via the Compute API; add a Compute helper and client method, adjust stream parsing and query builder, add tests and mocks, bump a go.mod dependency entry, and update Nix vendor hash.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Management `src/go.mod` | Reclassify `google.golang.org/genproto/googleapis/api` from indirect to direct `require` (version unchanged). |
| GCP Compute Integration `src/pkg/clouds/gcp/compute.go` | Add `GetInstanceGroupManagerLabels(ctx, project, region, name)` to fetch labels from `RegionInstanceGroupManagers.Get(...).Do()`. |
| GCP BYOC Stream & Client `src/pkg/cli/client/byoc/gcp/stream.go`, `src/pkg/cli/client/byoc/gcp/byoc_test.go` | Extend `GcpLogsClient` interface and mock with `GetInstanceGroupManagerLabels`; stream parsing now derives identifiers from `entry.Resource.Labels` and fetches live labels instead of reading `auditLog.GetRequest()`. |
| Query Builder & Tests `src/pkg/cli/client/byoc/gcp/query.go`, `src/pkg/cli/client/byoc/gcp/query_test.go` | `AddComputeEngineInstanceGroupInsertOrPatch` simplified to only add the base `regionInstanceGroupManagers.(insert |
| BYOC Stream Tests & Mocks `src/pkg/cli/client/byoc/gcp/stream_test.go` | Add test helpers and suites covering manager label lookup success/failure, etag-mismatch skipping, multi-entry flows, and unknown-trigger drops. |
| Service Monitoring Logging `src/pkg/cli/subscribe.go`, `src/pkg/cli/tailAndMonitor.go` | Log pending services each message in `WaitServiceState`; log deployment-complete milestone after successful CD task exit. |
| Build / Packaging `pkgs/defang/cli.nix` | Update Nix `vendorHash` for Go vendoring. |

Sequence Diagram

sequenceDiagram
    actor CLI
    participant AuditLog as GCP Audit Log
    participant ComputeAPI as Compute Engine API
    participant StateTracker as Service State Tracker

    CLI->>AuditLog: Subscribe to instance group manager logs
    AuditLog-->>CLI: Deliver LogEntry (audit log)

    rect rgba(0,150,150,0.5)
    Note over CLI,AuditLog: OLD: parse labels from auditLog.GetRequest()
    CLI->>CLI: Extract labels from audit log payload
    end

    rect rgba(150,100,0,0.5)
    Note over CLI,ComputeAPI: NEW: fetch labels live from Compute API
    CLI->>ComputeAPI: GetInstanceGroupManagerLabels(project, region, name)
    ComputeAPI-->>CLI: Return labels map / error
    end

    CLI->>StateTracker: Emit DEPLOYMENT_PENDING using labels["defang-service"]
    StateTracker-->>CLI: Service state updates
    CLI->>CLI: Log pending services waiting for healthy state

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Fix Compute Engine instance group manager monitoring #2025: Makes the same GCP BYOC changes — adds GetInstanceGroupManagerLabels, switches to live label lookups, and removes query-level label filters.
Always use etag for gcp cdCommands, remove useless code #1772: Changes GCP BYOC logging/query flow and uses instance-group manager label lookup patterns.

Suggested reviewers

lionello
edwardrf

Poem

🐰 Hopped from logs to live API,
I fetched labels from sky up high.
Managers whispered names anew,
Services tracked and states came through.
Hooray — deployments hop on by! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objective of the PR, which is fixing Compute Engine instance group manager monitoring to enable proper status tracking. |
| Linked Issues check | ✅ Passed | All coding requirements from issue `#2019` are addressed: corrected Cloud Logging query format, added GetInstanceGroupManagerLabels API method, changed from reading labels from audit logs to fetching live labels via GCE REST API, and removed query-level label filters. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing Compute Engine instance group manager monitoring and enabling DEPLOYMENT_COMPLETED status tracking. Vendoring hash update is a necessary build dependency change. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=warning msg="[linters_context] running gomodguard failed: unable to read module file go.mod: current working directory must have a go.mod file: if you are not using go modules it is suggested to disable this linter"
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

lionello

Will yield to @edwardrf for approval.

src/pkg/cli/client/byoc/gcp/byoc_test.go

src/pkg/cli/subscribe.go

src/pkg/cli/tailAndMonitor.go

G101 is a gosec rule ID, not a standalone linter name. Using it in //nolint directives caused golangci-lint to warn about unknown linters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GCE allInstancesConfig.properties.labels is a map<string,string>, not a list of {key,value} structs. The query filters were using the list format (labels.key="defang-service" / labels.value="...") which never matched any audit log entries, so gce_instance_group_manager events were never returned by Cloud Logging.

Even if events had arrived, the parser was iterating over the field as a list (GetListInStruct) which always returned nil, leaving the computeEngineRootTriggers map empty. As a result, all gce_instance_group addInstances events were silently dropped and WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services.

Fix the query to use map-style key access: labels."defang-service"=~"^(svc)$"

Fix the parser to use GetValueInStruct with the label name as a path key, replacing the 10-line list iteration with a single call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GCE audit logs for regionInstanceGroupManagers.patch only carry the fields that changed (e.g. the new instance template version). The allInstancesConfig.properties.labels field, where the defang-service label lives, is absent from the request body for every update after the initial create. As a result, the computeEngineRootTriggers map was never populated and all gce_instance_group addInstances events were silently dropped, so WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services.

Fix: instead of reading labels from the audit log request body, read the instance group manager name, project, and region from the always-present entry.Resource.Labels and call the GCE REST API to get the live resource's allInstancesConfig.properties.labels. This mirrors the fallback used by the server-side fabric_gcp.go implementation.

Add GetInstanceGroupManagerLabels to GcpLogsClient and implement it using the already-present google.golang.org/api/compute/v1 dependency (no new deps required).

Also add the missing isQuotaError helper to the gcpquota debug tool, which was preventing the pre-commit lint check from passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PATCH requests for regionInstanceGroupManagers only carry changed fields (e.g. a new instance template reference). When Pulumi re-deploys a CE service, it patches the instance template without including allInstancesConfig.properties.labels in the request body. The Cloud Logging filter on those absent label fields never matched, so no gce_instance_group_manager events were returned for re-deploys, leaving computeEngineRootTriggers empty and causing all gce_instance_group addInstances events to be silently dropped.

The parser already handles service-specific filtering by reading labels from the live MIG resource via GetInstanceGroupManagerLabels, so the query-level label filters are redundant and harmful. Remove them and keep only the method name and operation.first filters, consistent with how AddComputeEngineInstanceGroupAddInstances works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…O/WZAY=

Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>

Cover the three bugs fixed in this branch:

- TestAddComputeEngineInstanceGroupInsertOrPatch: asserts the query contains no allInstancesConfig or defang-* label filters (guarding against the old list-format filters that never matched)
- TestActivityParser_GceInstanceGroupManager: table-driven tests for the gce_instance_group_manager parser path: happy path, API error, nil labels, missing defang-service label, and missing root_trigger_id
- TestActivityParser_GceInstanceGroupFlow: end-to-end test that a manager insert/patch entry populates the trigger map and a subsequent addInstances entry uses it to emit DEPLOYMENT_COMPLETED
- TestActivityParser_GceInstanceGroupDropsUnknownTrigger: events with an unrecognized root_trigger_id are silently dropped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

pkgs/defang/cli.nix

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/pkg/cli/client/byoc/gcp/stream.go (1)

602-612: ⚠️ Potential issue | 🟠 Major

Don't emit DEPLOYMENT_PENDING when root_trigger_id is absent.

If root_trigger_id is missing, this code still returns a pending update but never seeds computeEngineRootTriggers. The later gce_instance_group event is then dropped as unknown, which can leave the service stuck in pending forever.

Suggested fix

 			rootTriggerId := entry.GetLabels()["compute.googleapis.com/root_trigger_id"]
 			if rootTriggerId == "" {
 				term.Warnf("missing root_trigger_id in audit log for instance group manager %v", path.Base(auditLog.GetResourceName()))
-			} else {
-				computeEngineRootTriggers[rootTriggerId] = serviceName
+				return nil, nil
 			}
+			computeEngineRootTriggers[rootTriggerId] = serviceName
 			return []*defangv1.SubscribeResponse{{
 				Name:   serviceName,
 				State:  defangv1.ServiceState_DEPLOYMENT_PENDING,
 				Status: auditLog.GetResponse().GetFields()["status"].GetStringValue(),
 			}}, nil

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/gcp/stream.go` around lines 602 - 612, The code
currently logs a missing root_trigger_id but still returns a DEPLOYMENT_PENDING
SubscribeResponse, which creates a pending service without seeding
computeEngineRootTriggers. Change the control flow so that you only construct
and return the SubscribeResponse when rootTriggerId is non-empty (i.e., move the
return into the else branch where computeEngineRootTriggers[rootTriggerId] =
serviceName is set); when rootTriggerId is empty, simply log the warning and
return no response (nil, nil) so the event is not treated as DEPLOYMENT_PENDING.
Reference symbols: rootTriggerId, computeEngineRootTriggers, serviceName,
auditLog.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 589-598: The code accepts any MIG event with a defang-service
label but fails to scope to the deployment etag; update the logic where
GetInstanceGroupManagerLabels is used (labels map, serviceName) to also read
labels["defang-etag"] and compare it to the expected deployment etag provided to
getActivityParser (or the current etag in scope); if the etag label is missing
or does not match, skip/warn and return nil so only events matching the target
defang-etag are processed. Ensure the comparison uses the same etag value passed
into getActivityParser and include a warn message when skipping due to etag
mismatch.

In `@src/pkg/cli/subscribe.go`:
- Around line 78-86: The waiting log for pendingServices is emitted too early
and can be stale; move the term.Infof("Waiting for %q to be in state %s...\n",
pendingServices, targetState) so it runs after processing the incoming update
(after handling msg != nil) and after recomputing pendingServices from services
and serviceStates, and only emit it when pendingServices is non-empty; update
the block around the loop that builds pendingServices (and the code that checks
msg) to recompute and log afterwards to avoid noisy/stale output.

---

Outside diff comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 602-612: The code currently logs a missing root_trigger_id but
still returns a DEPLOYMENT_PENDING SubscribeResponse, which creates a pending
service without seeding computeEngineRootTriggers. Change the control flow so
that you only construct and return the SubscribeResponse when rootTriggerId is
non-empty (i.e., move the return into the else branch where
computeEngineRootTriggers[rootTriggerId] = serviceName is set); when
rootTriggerId is empty, simply log the warning and return no response (nil, nil)
so the event is not treated as DEPLOYMENT_PENDING. Reference symbols:
rootTriggerId, computeEngineRootTriggers, serviceName, auditLog.
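
The subscribe.go comment in the prompt above asks for the pending list to be recomputed after each update and logged only when it is non-empty. A minimal sketch of that ordering, assuming string state names and a hypothetical `pendingServices` helper (the real code uses the project's own state type and term logger):

```go
package main

import (
	"fmt"
	"sort"
)

// pendingServices returns the services that have not yet reached the target
// state, sorted for stable log output. Hypothetical helper for illustration.
func pendingServices(services []string, states map[string]string, target string) []string {
	var pending []string
	for _, svc := range services {
		if states[svc] != target {
			pending = append(pending, svc)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	const target = "DEPLOYMENT_COMPLETED"
	services := []string{"web", "worker"}
	states := map[string]string{}

	// Apply the incoming update first, then recompute and log, so the
	// message never reports a service that has just reached the target.
	states["web"] = target
	if pending := pendingServices(services, states, target); len(pending) > 0 {
		fmt.Printf("Waiting for %q to be in state %s...\n", pending, target)
	}
}
```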

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9f23e33-f62c-4b6f-932e-b7ce14ed3112

📥 Commits

Reviewing files that changed from the base of the PR and between 7aeaff4 and 15a4987.

📒 Files selected for processing (9)

src/go.mod
src/pkg/cli/client/byoc/gcp/byoc_test.go
src/pkg/cli/client/byoc/gcp/query.go
src/pkg/cli/client/byoc/gcp/query_test.go
src/pkg/cli/client/byoc/gcp/stream.go
src/pkg/cli/client/byoc/gcp/stream_test.go
src/pkg/cli/subscribe.go
src/pkg/cli/tailAndMonitor.go
src/pkg/clouds/gcp/compute.go

src/pkg/cli/client/byoc/gcp/stream.go

src/pkg/cli/subscribe.go

…2GCR74=

Skip gce_instance_group_manager events whose defang-etag label does not match the etag passed to getActivityParser, preventing events from other deployments from being processed. When no etag is expected, the check is skipped for backwards compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
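
The matching rule in this commit message can be sketched as a small predicate. The `etagMatches` name and signature are hypothetical; only the `defang-etag` label and the "empty expected etag skips the check" behaviour come from the commit message.

```go
package main

import "fmt"

// etagMatches reports whether an instance group manager's labels belong to
// the deployment being watched. An empty expected etag disables the check,
// mirroring the backwards-compatibility rule described above.
// Hypothetical helper for illustration.
func etagMatches(expected string, labels map[string]string) bool {
	if expected == "" {
		return true
	}
	// A missing label yields "", which never equals a non-empty expected etag.
	return labels["defang-etag"] == expected
}

func main() {
	labels := map[string]string{"defang-service": "web", "defang-etag": "abc123"}
	fmt.Println(etagMatches("abc123", labels)) // same deployment
	fmt.Println(etagMatches("zzz999", labels)) // another deployment
	fmt.Println(etagMatches("", labels))       // no etag expected
}
```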

jordanstephens requested a review from edwardrf April 7, 2026 20:44

lionello reviewed Apr 7, 2026

src/pkg/cli/client/byoc/gcp/byoc_test.go (resolved)

src/pkg/cli/subscribe.go (resolved)

src/pkg/cli/tailAndMonitor.go (outdated, resolved)

jordanstephens and others added 9 commits April 7, 2026 15:23

fix: remove invalid G101 linter name from nolint directives

8f4ce0f

G101 is a gosec rule ID, not a standalone linter name. Using it in //nolint directives caused golangci-lint to warn about unknown linters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
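
For illustration, a directive of the corrected shape names the linter (gosec) and mentions the rule ID only in the trailing explanation. The constant below is hypothetical and is not taken from the PR's diff.

```go
package main

import "fmt"

// passwordEnvVar is a hypothetical constant whose name resembles a
// credential, the kind of declaration gosec rule G101 tends to flag.
// The directive names the gosec linter; G101 appears only in the comment.
const passwordEnvVar = "DB_PASSWORD" //nolint:gosec // G101: environment variable name, not a secret

func main() {
	fmt.Println(passwordEnvVar)
}
```

Writing `//nolint:G101` instead would name something golangci-lint does not recognise as a linter, which is the warning this commit removes.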

log which services we are waiting for

e2fb500

Update Nix vendorHash to sha256-DxRBE7mugWJ2NqBiIDNazg/mb+zjZkgNjpTDJ…

e287817

…O/WZAY=

Apply suggestions from code review

49bb101

Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>

chore: go mod tidy — promote genproto/googleapis/api to direct dep

010c779

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jordanstephens force-pushed the jordan/fix-gcp-ce-monitoring branch from 2f151f3 to 010c779 Compare April 7, 2026 22:27

edwardrf approved these changes Apr 8, 2026

jordanstephens mentioned this pull request Apr 10, 2026

CLI does not track status updates for services deployed to compute engine #2019

Closed

jordanstephens marked this pull request as ready for review April 10, 2026 20:35

Merge branch 'main' into jordan/fix-gcp-ce-monitoring

ac35299

jordanstephens commented Apr 10, 2026

pkgs/defang/cli.nix (outdated, resolved)

Apply suggestion from @jordanstephens

15a4987

coderabbitai bot reviewed Apr 10, 2026

src/pkg/cli/client/byoc/gcp/stream.go (resolved)

src/pkg/cli/subscribe.go (resolved)

Update Nix vendorHash to sha256-zxQuu/RcVgA67++LuRs5xpDiq2e7gepkV8nqQ…

35ec187

…2GCR74=

lionello approved these changes Apr 10, 2026

lionello enabled auto-merge (squash) April 10, 2026 20:48

lionello merged commit e25fb82 into main Apr 10, 2026
4 of 5 checks passed

lionello deleted the jordan/fix-gcp-ce-monitoring branch April 10, 2026 20:57

lionello merged 13 commits into main from jordan/fix-gcp-ce-monitoring