
Fix Compute Engine instance group manager monitoring #2025

Merged
lionello merged 13 commits into main from jordan/fix-gcp-ce-monitoring
Apr 10, 2026

Conversation

@jordanstephens
Member

@jordanstephens jordanstephens commented Apr 7, 2026

Fixes #2019

WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services during deployment. Three related bugs in the GCP Cloud Logging integration left computeEngineRootTriggers always empty, so every gce_instance_group addInstances event was silently dropped.

Root causes:

  1. Wrong label filter format: allInstancesConfig.properties.labels is a map<string,string>, not a list of {key,value} structs. The Cloud Logging query used the list format (labels.key="defang-service" / labels.value="..."), which never matched any audit log entries. The parser likewise iterated over the field as a list (GetListInStruct), which always returned nil.
  2. Absent labels in PATCH requests: GCE audit logs for regionInstanceGroupManagers.patch only carry the fields that changed (e.g. the new instance template reference). allInstancesConfig.properties.labels is absent from the request body for every update after the initial create, so reading labels from the log entry never worked for re-deploys.
  3. Query-level label filters excluded re-deploys: because the Cloud Logging filter matched on those absent label fields, no gce_instance_group_manager events were returned for re-deploys at all.
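The root causes above can be illustrated with a minimal, standard-library-only sketch. It is not the parser from this PR: the helper names (requestLabels, labelFromList, labelFromMap) and the sample request bodies are hypothetical, and plain JSON stands in for the audit log's protobuf Struct. It shows that a list-style walk finds nothing on a map-shaped labels field, and that a PATCH body carrying only the changed fields has no labels to read in either style.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// requestLabels pulls allInstancesConfig.properties.labels out of a JSON
// request body without committing to a shape for the labels field.
// (Hypothetical helper for illustration only.)
func requestLabels(body []byte) (any, error) {
	var req struct {
		AllInstancesConfig struct {
			Properties struct {
				Labels any `json:"labels"`
			} `json:"properties"`
		} `json:"allInstancesConfig"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	return req.AllInstancesConfig.Properties.Labels, nil
}

// labelFromList mimics the old approach: it expects a list of
// {key, value} objects and finds nothing when the field has another shape.
func labelFromList(labels any, key string) string {
	items, ok := labels.([]any)
	if !ok {
		return "" // a JSON object is not a list, so the walk never starts
	}
	for _, item := range items {
		kv, ok := item.(map[string]any)
		if !ok {
			continue
		}
		if kv["key"] == key {
			v, _ := kv["value"].(string)
			return v
		}
	}
	return ""
}

// labelFromMap treats labels as an object keyed by label name.
func labelFromMap(labels any, key string) string {
	m, ok := labels.(map[string]any)
	if !ok {
		return ""
	}
	v, _ := m[key].(string)
	return v
}

func main() {
	// Sample bodies are made up; the patch body carries only a changed field.
	bodies := []struct{ name, json string }{
		{"insert", `{"allInstancesConfig":{"properties":{"labels":{"defang-service":"web"}}}}`},
		{"patch", `{"versions":[{"instanceTemplate":"hypothetical-template"}]}`},
	}
	for _, b := range bodies {
		labels, err := requestLabels([]byte(b.json))
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s: list lookup=%q map lookup=%q\n", b.name,
			labelFromList(labels, "defang-service"), labelFromMap(labels, "defang-service"))
	}
}
```

Only the map lookup on the insert body yields a service name, which is why a correct parser still could not handle re-deploys from the log entry alone.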

Fixes:

  • Corrected the query filter to use map-style key access (labels."defang-service"=~"^(svc)$") and replaced the 10-line list iteration in the parser with a single GetValueInStruct call.
  • Instead of reading labels from the audit log request body, now reads the instance group manager name, project, and region from entry.Resource.Labels (always present) and calls the GCE REST API to fetch the live resource's labels. Adds GetInstanceGroupManagerLabels to GcpLogsClient, implemented using the existing google.golang.org/api/compute/v1 dependency (no new deps).
  • Removed query-level label filters entirely — they are redundant now that the parser resolves labels from the live resource, and harmful because they prevented events from being returned at all during re-deploys.
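A rough sketch of the second fix, under stated assumptions: labelFetcher is a hypothetical, narrowed stand-in for GcpLogsClient, the return type of GetInstanceGroupManagerLabels is assumed to be a label map plus an error, and the resource label keys (project_id, location, instance_group_manager_name) are assumed rather than taken from this PR. A fake client replaces the real Compute API call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// labelFetcher is a hypothetical stand-in for the part of GcpLogsClient
// that this PR adds. The return type is an assumption.
type labelFetcher interface {
	GetInstanceGroupManagerLabels(ctx context.Context, project, region, name string) (map[string]string, error)
}

// fakeFetcher returns canned labels keyed by "project/region/name".
type fakeFetcher map[string]map[string]string

func (f fakeFetcher) GetInstanceGroupManagerLabels(_ context.Context, project, region, name string) (map[string]string, error) {
	labels, ok := f[project+"/"+region+"/"+name]
	if !ok {
		return nil, errors.New("instance group manager not found")
	}
	return labels, nil
}

// serviceForManager identifies the manager from the log entry's resource
// labels, then asks the live resource for its labels instead of trusting
// the audit log request body. The resource label keys are assumptions.
func serviceForManager(ctx context.Context, c labelFetcher, resourceLabels map[string]string) (string, error) {
	project := resourceLabels["project_id"]
	region := resourceLabels["location"]
	name := resourceLabels["instance_group_manager_name"]
	if project == "" || region == "" || name == "" {
		return "", fmt.Errorf("incomplete resource labels: %v", resourceLabels)
	}
	labels, err := c.GetInstanceGroupManagerLabels(ctx, project, region, name)
	if err != nil {
		return "", err
	}
	service := labels["defang-service"]
	if service == "" {
		return "", errors.New("missing defang-service label")
	}
	return service, nil
}

func main() {
	client := fakeFetcher{
		"my-project/us-central1/web-mig": {"defang-service": "web"},
	}
	resource := map[string]string{
		"project_id":                  "my-project",
		"location":                    "us-central1",
		"instance_group_manager_name": "web-mig",
	}
	service, err := serviceForManager(context.Background(), client, resource)
	fmt.Println(service, err)
}
```

Because the lookup depends only on fields carried by every entry, it behaves the same for the initial insert and for later patches.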

Linked Issues

Checklist

  • I have performed a self-review of my code
  • I have added appropriate tests
  • I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

  • New Features

    • Informational logging shows which services are still pending during state monitoring.
    • Logs a deployment-complete message before waiting for services to become healthy.
  • Bug Fixes

    • More reliable GCP instance-group label retrieval and stricter event matching to improve deployment detection.
  • Tests

    • Added unit tests and mocks covering GCP query generation and activity parsing behavior.

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fcd30143-4e41-4002-9897-0faab7c99353

📥 Commits

Reviewing files that changed from the base of the PR and between 35ec187 and 97cd714.

📒 Files selected for processing (2)
  • src/pkg/cli/client/byoc/gcp/stream.go
  • src/pkg/cli/client/byoc/gcp/stream_test.go

📝 Walkthrough

Walkthrough

Replaces audit-log label extraction for GCP regional Instance Group Managers with live label lookups via the Compute API. Adds a Compute helper and client method, adjusts stream parsing and the query builder, adds tests and mocks, promotes a go.mod dependency from indirect to direct (version unchanged), and updates the Nix vendor hash.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Management: src/go.mod | Reclassify google.golang.org/genproto/googleapis/api from indirect to direct require (version unchanged). |
| GCP Compute Integration: src/pkg/clouds/gcp/compute.go | Add GetInstanceGroupManagerLabels(ctx, project, region, name) to fetch labels from RegionInstanceGroupManagers.Get(...).Do(). |
| GCP BYOC Stream & Client: src/pkg/cli/client/byoc/gcp/stream.go, src/pkg/cli/client/byoc/gcp/byoc_test.go | Extend GcpLogsClient interface and mock with GetInstanceGroupManagerLabels; stream parsing now derives identifiers from entry.Resource.Labels and fetches live labels instead of reading auditLog.GetRequest(). |
| Query Builder & Tests: src/pkg/cli/client/byoc/gcp/query.go, src/pkg/cli/client/byoc/gcp/query_test.go | AddComputeEngineInstanceGroupInsertOrPatch simplified to only add the base `regionInstanceGroupManagers.(insert |
| BYOC Stream Tests & Mocks: src/pkg/cli/client/byoc/gcp/stream_test.go | Add test helpers and suites covering manager label lookup success/failure, etag-mismatch skipping, multi-entry flows, and unknown-trigger drops. |
| Service Monitoring Logging: src/pkg/cli/subscribe.go, src/pkg/cli/tailAndMonitor.go | Log pending services each message in WaitServiceState; log deployment-complete milestone after successful CD task exit. |
| Build / Packaging: pkgs/defang/cli.nix | Update Nix vendorHash for Go vendoring. |
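For readers unfamiliar with the underlying call, the lookup summarized above corresponds to the Compute Engine REST method regionInstanceGroupManagers.get. The sketch below is not the code in compute.go (which uses google.golang.org/api/compute/v1); it is a standard-library approximation in which fetchManagerLabels, the trimmed instanceGroupManager type, and the demo helper are hypothetical, authentication is omitted, and a local stub server stands in for https://compute.googleapis.com.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// instanceGroupManager keeps only the response fields this sketch reads.
type instanceGroupManager struct {
	AllInstancesConfig struct {
		Properties struct {
			Labels map[string]string `json:"labels"`
		} `json:"properties"`
	} `json:"allInstancesConfig"`
}

// fetchManagerLabels issues the REST equivalent of
// regionInstanceGroupManagers.get and returns the live labels.
// Hypothetical helper: no credentials are attached to the request.
func fetchManagerLabels(ctx context.Context, hc *http.Client, baseURL, project, region, name string) (map[string]string, error) {
	u := fmt.Sprintf("%s/compute/v1/projects/%s/regions/%s/instanceGroupManagers/%s",
		baseURL, url.PathEscape(project), url.PathEscape(region), url.PathEscape(name))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var igm instanceGroupManager
	if err := json.NewDecoder(resp.Body).Decode(&igm); err != nil {
		return nil, err
	}
	return igm.AllInstancesConfig.Properties.Labels, nil
}

// demo runs the lookup against a local stub instead of the real API.
// Project, region, and manager names are made up.
func demo() (map[string]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/compute/v1/projects/my-project/regions/us-central1/instanceGroupManagers/web-mig" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"name":"web-mig","allInstancesConfig":{"properties":{"labels":{"defang-service":"web"}}}}`)
	}))
	defer srv.Close()
	return fetchManagerLabels(context.Background(), srv.Client(), srv.URL, "my-project", "us-central1", "web-mig")
}

func main() {
	labels, err := demo()
	fmt.Println(labels, err)
}
```

Taking the base URL as a parameter is purely a testing convenience here; it lets the stub server replace the real endpoint.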

Sequence Diagram

sequenceDiagram
    actor CLI
    participant AuditLog as GCP Audit Log
    participant ComputeAPI as Compute Engine API
    participant StateTracker as Service State Tracker

    CLI->>AuditLog: Subscribe to instance group manager logs
    AuditLog-->>CLI: Deliver LogEntry (audit log)

    rect rgba(0,150,150,0.5)
    Note over CLI,AuditLog: OLD: parse labels from auditLog.GetRequest()
    CLI->>CLI: Extract labels from audit log payload
    end

    rect rgba(150,100,0,0.5)
    Note over CLI,ComputeAPI: NEW: fetch labels live from Compute API
    CLI->>ComputeAPI: GetInstanceGroupManagerLabels(project, region, name)
    ComputeAPI-->>CLI: Return labels map / error
    end

    CLI->>StateTracker: Emit DEPLOYMENT_PENDING using labels["defang-service"]
    StateTracker-->>CLI: Service state updates
    CLI->>CLI: Log pending services waiting for healthy state

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • lionello
  • edwardrf

Poem

🐰 Hopped from logs to live API,
I fetched labels from sky up high.
Managers whispered names anew,
Services tracked and states came through.
Hooray — deployments hop on by! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main objective of the PR, which is fixing Compute Engine instance group manager monitoring to enable proper status tracking. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #2019 are addressed: corrected Cloud Logging query format, added GetInstanceGroupManagerLabels API method, changed from reading labels from audit logs to fetching live labels via GCE REST API, and removed query-level label filters. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing Compute Engine instance group manager monitoring and enabling DEPLOYMENT_COMPLETED status tracking. Vendoring hash update is a necessary build dependency change. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=warning msg="[linters_context] running gomodguard failed: unable to read module file go.mod: current working directory must have a go.mod file: if you are not using go modules it is suggested to disable this linter"
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"



@jordanstephens jordanstephens requested a review from edwardrf April 7, 2026 20:44
Member

@lionello lionello left a comment


Will yield to @edwardrf for approval.

jordanstephens and others added 9 commits April 7, 2026 15:23
G101 is a gosec rule ID, not a standalone linter name. Using it in
//nolint directives caused golangci-lint to warn about unknown linters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE allInstancesConfig.properties.labels is a map<string,string>, not a
list of {key,value} structs. The query filters were using the list format
(labels.key="defang-service" / labels.value="...") which never matched any
audit log entries, so gce_instance_group_manager events were never returned
by Cloud Logging. Even if events had arrived, the parser was iterating over
the field as a list (GetListInStruct) which always returned nil, leaving the
computeEngineRootTriggers map empty. As a result, all gce_instance_group
addInstances events were silently dropped and WaitServiceState never
received DEPLOYMENT_COMPLETED for Compute Engine services.

Fix the query to use map-style key access:
  labels."defang-service"=~"^(svc)$"

Fix the parser to use GetValueInStruct with the label name as a path key,
replacing the 10-line list iteration with a single call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE audit logs for regionInstanceGroupManagers.patch only carry the
fields that changed (e.g. the new instance template version). The
allInstancesConfig.properties.labels — where the defang-service label
lives — is absent from the request body for every update after the
initial create. As a result, the computeEngineRootTriggers map was
never populated and all gce_instance_group addInstances events were
silently dropped, so WaitServiceState never received DEPLOYMENT_COMPLETED
for Compute Engine services.

Fix: instead of reading labels from the audit log request body, read
the instance group manager name, project, and region from the always-
present entry.Resource.Labels and call the GCE REST API to get the
live resource's allInstancesConfig.properties.labels. This mirrors the
fallback used by the server-side fabric_gcp.go implementation.

Add GetInstanceGroupManagerLabels to GcpLogsClient and implement it
using the already-present google.golang.org/api/compute/v1 dependency
(no new deps required).

Also add the missing isQuotaError helper to the gcpquota debug tool,
which was preventing the pre-commit lint check from passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PATCH requests for regionInstanceGroupManagers only carry changed fields
(e.g. a new instance template reference). When Pulumi re-deploys a CE
service, it patches the instance template without including
allInstancesConfig.properties.labels in the request body. The Cloud
Logging filter on those absent label fields never matched, so no
gce_instance_group_manager events were returned for re-deploys, leaving
computeEngineRootTriggers empty and causing all gce_instance_group
addInstances events to be silently dropped.

The parser already handles service-specific filtering by reading labels
from the live MIG resource via GetInstanceGroupManagerLabels, so the
query-level label filters are redundant and harmful. Remove them and keep
only the method name and operation.first filters, consistent with how
AddComputeEngineInstanceGroupAddInstances works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Cover the three bugs fixed in this branch:
- TestAddComputeEngineInstanceGroupInsertOrPatch: asserts the query
  contains no allInstancesConfig or defang-* label filters (guarding
  against the old list-format filters that never matched)
- TestActivityParser_GceInstanceGroupManager: table-driven tests for the
  gce_instance_group_manager parser path — happy path, API error, nil
  labels, missing defang-service label, and missing root_trigger_id
- TestActivityParser_GceInstanceGroupFlow: end-to-end test that a
  manager insert/patch entry populates the trigger map and a subsequent
  addInstances entry uses it to emit DEPLOYMENT_COMPLETED
- TestActivityParser_GceInstanceGroupDropsUnknownTrigger: events with
  an unrecognized root_trigger_id are silently dropped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jordanstephens jordanstephens force-pushed the jordan/fix-gcp-ce-monitoring branch from 2f151f3 to 010c779 Compare April 7, 2026 22:27
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pkg/cli/client/byoc/gcp/stream.go (1)

602-612: ⚠️ Potential issue | 🟠 Major

Don't emit DEPLOYMENT_PENDING when root_trigger_id is absent.

If root_trigger_id is missing, this code still returns a pending update but never seeds computeEngineRootTriggers. The later gce_instance_group event is then dropped as unknown, which can leave the service stuck in pending forever.

Suggested fix
 			rootTriggerId := entry.GetLabels()["compute.googleapis.com/root_trigger_id"]
 			if rootTriggerId == "" {
 				term.Warnf("missing root_trigger_id in audit log for instance group manager %v", path.Base(auditLog.GetResourceName()))
-			} else {
-				computeEngineRootTriggers[rootTriggerId] = serviceName
+				return nil, nil
 			}
+			computeEngineRootTriggers[rootTriggerId] = serviceName
 			return []*defangv1.SubscribeResponse{{
 				Name:   serviceName,
 				State:  defangv1.ServiceState_DEPLOYMENT_PENDING,
 				Status: auditLog.GetResponse().GetFields()["status"].GetStringValue(),
 			}}, nil
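The interaction this suggestion protects can be sketched in isolation. The types and method names below (tracker, onManagerEvent, onAddInstances) are hypothetical and much simpler than stream.go: a manager event seeds the root-trigger map and reports pending only when a root_trigger_id is present, and a later addInstances event completes the service only if its trigger is known.

```go
package main

import "fmt"

type state string

const (
	pending   state = "DEPLOYMENT_PENDING"
	completed state = "DEPLOYMENT_COMPLETED"
)

// update is a simplified stand-in for a SubscribeResponse.
type update struct {
	Service string
	State   state
}

// tracker is a hypothetical holder for the root_trigger_id -> service map.
type tracker struct {
	rootTriggers map[string]string
}

func newTracker() *tracker {
	return &tracker{rootTriggers: map[string]string{}}
}

// onManagerEvent handles an instance group manager insert/patch entry.
// Without a root trigger ID nothing is emitted, so no service is left
// pending without a way to complete it.
func (t *tracker) onManagerEvent(service, rootTriggerID string) *update {
	if rootTriggerID == "" {
		return nil
	}
	t.rootTriggers[rootTriggerID] = service
	return &update{Service: service, State: pending}
}

// onAddInstances handles an addInstances entry; unknown triggers are dropped.
func (t *tracker) onAddInstances(rootTriggerID string) *update {
	service, ok := t.rootTriggers[rootTriggerID]
	if !ok {
		return nil
	}
	return &update{Service: service, State: completed}
}

func main() {
	t := newTracker()
	fmt.Println(t.onManagerEvent("web", "") == nil) // missing ID: nothing emitted
	if u := t.onManagerEvent("web", "trigger-1"); u != nil {
		fmt.Println(u.Service, u.State)
	}
	if u := t.onAddInstances("trigger-1"); u != nil {
		fmt.Println(u.Service, u.State)
	}
	fmt.Println(t.onAddInstances("trigger-2") == nil) // unknown trigger: dropped
}
```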
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/client/byoc/gcp/stream.go` around lines 602 - 612, The code
currently logs a missing root_trigger_id but still returns a DEPLOYMENT_PENDING
SubscribeResponse, which creates a pending service without seeding
computeEngineRootTriggers. Change the control flow so that you only construct
and return the SubscribeResponse when rootTriggerId is non-empty (i.e., move the
return into the else branch where computeEngineRootTriggers[rootTriggerId] =
serviceName is set); when rootTriggerId is empty, simply log the warning and
return no response (nil, nil) so the event is not treated as DEPLOYMENT_PENDING.
Reference symbols: rootTriggerId, computeEngineRootTriggers, serviceName,
auditLog.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 589-598: The code accepts any MIG event with a defang-service
label but fails to scope to the deployment etag; update the logic where
GetInstanceGroupManagerLabels is used (labels map, serviceName) to also read
labels["defang-etag"] and compare it to the expected deployment etag provided to
getActivityParser (or the current etag in scope); if the etag label is missing
or does not match, skip/warn and return nil so only events matching the target
defang-etag are processed. Ensure the comparison uses the same etag value passed
into getActivityParser and include a warn message when skipping due to etag
mismatch.

In `@src/pkg/cli/subscribe.go`:
- Around line 78-86: The waiting log for pendingServices is emitted too early
and can be stale; move the term.Infof("Waiting for %q to be in state %s...\n",
pendingServices, targetState) so it runs after processing the incoming update
(after handling msg != nil) and after recomputing pendingServices from services
and serviceStates, and only emit it when pendingServices is non-empty; update
the block around the loop that builds pendingServices (and the code that checks
msg) to recompute and log afterwards to avoid noisy/stale output.
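The ordering asked for in the subscribe.go comment can be sketched as follows. This is not WaitServiceState itself: pendingServices, the string-typed states, and the canned update list are hypothetical simplifications. The point is that the pending list is recomputed after each update is applied, and the waiting message is printed only while something is still pending.

```go
package main

import "fmt"

// pendingServices returns the services that have not reached the target
// state, preserving the order of the services slice.
func pendingServices(services []string, states map[string]string, target string) []string {
	var pending []string
	for _, s := range services {
		if states[s] != target {
			pending = append(pending, s)
		}
	}
	return pending
}

func main() {
	services := []string{"web", "worker"}
	states := map[string]string{}
	target := "DEPLOYMENT_COMPLETED"

	// Canned updates stand in for messages arriving on the subscription.
	updates := []struct{ name, state string }{
		{"web", "DEPLOYMENT_PENDING"},
		{"web", "DEPLOYMENT_COMPLETED"},
		{"worker", "DEPLOYMENT_COMPLETED"},
	}
	for _, u := range updates {
		states[u.name] = u.state // apply the update first
		// Recompute afterwards so the log reflects the current state,
		// and stay quiet once nothing is pending.
		if pending := pendingServices(services, states, target); len(pending) > 0 {
			fmt.Printf("Waiting for %q to be in state %s...\n", pending, target)
		}
	}
}
```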

---

Outside diff comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 602-612: The code currently logs a missing root_trigger_id but
still returns a DEPLOYMENT_PENDING SubscribeResponse, which creates a pending
service without seeding computeEngineRootTriggers. Change the control flow so
that you only construct and return the SubscribeResponse when rootTriggerId is
non-empty (i.e., move the return into the else branch where
computeEngineRootTriggers[rootTriggerId] = serviceName is set); when
rootTriggerId is empty, simply log the warning and return no response (nil, nil)
so the event is not treated as DEPLOYMENT_PENDING. Reference symbols:
rootTriggerId, computeEngineRootTriggers, serviceName, auditLog.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9f23e33-f62c-4b6f-932e-b7ce14ed3112

📥 Commits

Reviewing files that changed from the base of the PR and between 7aeaff4 and 15a4987.

📒 Files selected for processing (9)
  • src/go.mod
  • src/pkg/cli/client/byoc/gcp/byoc_test.go
  • src/pkg/cli/client/byoc/gcp/query.go
  • src/pkg/cli/client/byoc/gcp/query_test.go
  • src/pkg/cli/client/byoc/gcp/stream.go
  • src/pkg/cli/client/byoc/gcp/stream_test.go
  • src/pkg/cli/subscribe.go
  • src/pkg/cli/tailAndMonitor.go
  • src/pkg/clouds/gcp/compute.go

@lionello lionello enabled auto-merge (squash) April 10, 2026 20:48
Skip gce_instance_group_manager events whose defang-etag label does not
match the etag passed to getActivityParser, preventing events from other
deployments from being processed. When no etag is expected the check is
skipped for backwards compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
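The etag scoping described in this commit message reduces to a small predicate. The function name etagMatches is hypothetical and the real check lives inside the activity parser, but the rule is the one stated: an empty expected etag disables the check, and otherwise the defang-etag label must be present and equal.

```go
package main

import "fmt"

// etagMatches reports whether an event's labels belong to the deployment
// being watched. Hypothetical helper name.
func etagMatches(expected string, labels map[string]string) bool {
	if expected == "" {
		return true // no etag expected: skip the check for backwards compatibility
	}
	// A missing label reads as "", which never equals a non-empty expected etag.
	return labels["defang-etag"] == expected
}

func main() {
	labels := map[string]string{"defang-service": "web", "defang-etag": "abc123"}
	fmt.Println(etagMatches("abc123", labels))
	fmt.Println(etagMatches("other", labels))
	fmt.Println(etagMatches("", labels))
	fmt.Println(etagMatches("abc123", map[string]string{"defang-service": "web"}))
}
```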
@lionello lionello merged commit e25fb82 into main Apr 10, 2026
4 of 5 checks passed
@lionello lionello deleted the jordan/fix-gcp-ce-monitoring branch April 10, 2026 20:57

Development

Successfully merging this pull request may close these issues.

CLI does not track status updates for services deployed to compute engine

3 participants