Fix Compute Engine instance group manager monitoring#2025
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughReplace audit-log label extraction for GCP regional Instance Group Managers with live label lookups via the Compute API; add a Compute helper and client method, adjust stream parsing and query builder, add tests and mocks, bump a go.mod dependency entry, and update Nix vendor hash. Changes
Sequence DiagramsequenceDiagram
actor CLI
participant AuditLog as GCP Audit Log
participant ComputeAPI as Compute Engine API
participant StateTracker as Service State Tracker
CLI->>AuditLog: Subscribe to instance group manager logs
AuditLog-->>CLI: Deliver LogEntry (audit log)
rect rgba(0,150,150,0.5)
Note over CLI,AuditLog: OLD: parse labels from auditLog.GetRequest()
CLI->>CLI: Extract labels from audit log payload
end
rect rgba(150,100,0,0.5)
Note over CLI,ComputeAPI: NEW: fetch labels live from Compute API
CLI->>ComputeAPI: GetInstanceGroupManagerLabels(project, region, name)
ComputeAPI-->>CLI: Return labels map / error
end
CLI->>StateTracker: Emit DEPLOYMENT_PENDING using labels["defang-service"]
StateTracker-->>CLI: Service state updates
CLI->>CLI: Log pending services waiting for healthy state
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=warning msg="[linters_context] running gomodguard failed: unable to read module file go.mod: current working directory must have a go.mod file: if you are not using go modules it is suggested to disable this linter" Comment |
G101 is a gosec rule ID, not a standalone linter name. Using it in //nolint directives caused golangci-lint to warn about unknown linters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE allInstancesConfig.properties.labels is a map<string,string>, not a
list of {key,value} structs. The query filters were using the list format
(labels.key="defang-service" / labels.value="...") which never matched any
audit log entries, so gce_instance_group_manager events were never returned
by Cloud Logging. Even if events had arrived, the parser was iterating over
the field as a list (GetListInStruct) which always returned nil, leaving the
computeEngineRootTriggers map empty. As a result, all gce_instance_group
addInstances events were silently dropped and WaitServiceState never
received DEPLOYMENT_COMPLETED for Compute Engine services.
Fix the query to use map-style key access:
labels."defang-service"=~"^(svc)$"
Fix the parser to use GetValueInStruct with the label name as a path key,
replacing the 10-line list iteration with a single call.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCE audit logs for regionInstanceGroupManagers.patch only carry the fields that changed (e.g. the new instance template version). The allInstancesConfig.properties.labels — where the defang-service label lives — is absent from the request body for every update after the initial create. As a result, the computeEngineRootTriggers map was never populated and all gce_instance_group addInstances events were silently dropped, so WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services. Fix: instead of reading labels from the audit log request body, read the instance group manager name, project, and region from the always- present entry.Resource.Labels and call the GCE REST API to get the live resource's allInstancesConfig.properties.labels. This mirrors the fallback used by the server-side fabric_gcp.go implementation. Add GetInstanceGroupManagerLabels to GcpLogsClient and implement it using the already-present google.golang.org/api/compute/v1 dependency (no new deps required). Also add the missing isQuotaError helper to the gcpquota debug tool, which was preventing the pre-commit lint check from passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PATCH requests for regionInstanceGroupManagers only carry changed fields (e.g. a new instance template reference). When Pulumi re-deploys a CE service, it patches the instance template without including allInstancesConfig.properties.labels in the request body. The Cloud Logging filter on those absent label fields never matched, so no gce_instance_group_manager events were returned for re-deploys, leaving computeEngineRootTriggers empty and causing all gce_instance_group addInstances events to be silently dropped. The parser already handles service-specific filtering by reading labels from the live MIG resource via GetInstanceGroupManagerLabels, so the query-level label filters are redundant and harmful. Remove them and keep only the method name and operation.first filters, consistent with how AddComputeEngineInstanceGroupAddInstances works. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Lio李歐 <lionello@users.noreply.github.com>
Cover the three bugs fixed in this branch: - TestAddComputeEngineInstanceGroupInsertOrPatch: asserts the query contains no allInstancesConfig or defang-* label filters (guarding against the old list-format filters that never matched) - TestActivityParser_GceInstanceGroupManager: table-driven tests for the gce_instance_group_manager parser path — happy path, API error, nil labels, missing defang-service label, and missing root_trigger_id - TestActivityParser_GceInstanceGroupFlow: end-to-end test that a manager insert/patch entry populates the trigger map and a subsequent addInstances entry uses it to emit DEPLOYMENT_COMPLETED - TestActivityParser_GceInstanceGroupDropsUnknownTrigger: events with an unrecognized root_trigger_id are silently dropped Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2f151f3 to
010c779
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pkg/cli/client/byoc/gcp/stream.go (1)
602-612:⚠️ Potential issue | 🟠 MajorDon't emit
DEPLOYMENT_PENDINGwhenroot_trigger_idis absent.If
root_trigger_idis missing, this code still returns a pending update but never seedscomputeEngineRootTriggers. The latergce_instance_groupevent is then dropped as unknown, which can leave the service stuck in pending forever.Suggested fix
rootTriggerId := entry.GetLabels()["compute.googleapis.com/root_trigger_id"] if rootTriggerId == "" { term.Warnf("missing root_trigger_id in audit log for instance group manager %v", path.Base(auditLog.GetResourceName())) - } else { - computeEngineRootTriggers[rootTriggerId] = serviceName + return nil, nil } + computeEngineRootTriggers[rootTriggerId] = serviceName return []*defangv1.SubscribeResponse{{ Name: serviceName, State: defangv1.ServiceState_DEPLOYMENT_PENDING, Status: auditLog.GetResponse().GetFields()["status"].GetStringValue(), }}, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pkg/cli/client/byoc/gcp/stream.go` around lines 602 - 612, The code currently logs a missing root_trigger_id but still returns a DEPLOYMENT_PENDING SubscribeResponse, which creates a pending service without seeding computeEngineRootTriggers. Change the control flow so that you only construct and return the SubscribeResponse when rootTriggerId is non-empty (i.e., move the return into the else branch where computeEngineRootTriggers[rootTriggerId] = serviceName is set); when rootTriggerId is empty, simply log the warning and return no response (nil, nil) so the event is not treated as DEPLOYMENT_PENDING. Reference symbols: rootTriggerId, computeEngineRootTriggers, serviceName, auditLog.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 589-598: The code accepts any MIG event with a defang-service
label but fails to scope to the deployment etag; update the logic where
GetInstanceGroupManagerLabels is used (labels map, serviceName) to also read
labels["defang-etag"] and compare it to the expected deployment etag provided to
getActivityParser (or the current etag in scope); if the etag label is missing
or does not match, skip/warn and return nil so only events matching the target
defang-etag are processed. Ensure the comparison uses the same etag value passed
into getActivityParser and include a warn message when skipping due to etag
mismatch.
In `@src/pkg/cli/subscribe.go`:
- Around line 78-86: The waiting log for pendingServices is emitted too early
and can be stale; move the term.Infof("Waiting for %q to be in state %s...\n",
pendingServices, targetState) so it runs after processing the incoming update
(after handling msg != nil) and after recomputing pendingServices from services
and serviceStates, and only emit it when pendingServices is non-empty; update
the block around the loop that builds pendingServices (and the code that checks
msg) to recompute and log afterwards to avoid noisy/stale output.
---
Outside diff comments:
In `@src/pkg/cli/client/byoc/gcp/stream.go`:
- Around line 602-612: The code currently logs a missing root_trigger_id but
still returns a DEPLOYMENT_PENDING SubscribeResponse, which creates a pending
service without seeding computeEngineRootTriggers. Change the control flow so
that you only construct and return the SubscribeResponse when rootTriggerId is
non-empty (i.e., move the return into the else branch where
computeEngineRootTriggers[rootTriggerId] = serviceName is set); when
rootTriggerId is empty, simply log the warning and return no response (nil, nil)
so the event is not treated as DEPLOYMENT_PENDING. Reference symbols:
rootTriggerId, computeEngineRootTriggers, serviceName, auditLog.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f9f23e33-f62c-4b6f-932e-b7ce14ed3112
📒 Files selected for processing (9)
src/go.modsrc/pkg/cli/client/byoc/gcp/byoc_test.gosrc/pkg/cli/client/byoc/gcp/query.gosrc/pkg/cli/client/byoc/gcp/query_test.gosrc/pkg/cli/client/byoc/gcp/stream.gosrc/pkg/cli/client/byoc/gcp/stream_test.gosrc/pkg/cli/subscribe.gosrc/pkg/cli/tailAndMonitor.gosrc/pkg/clouds/gcp/compute.go
Skip gce_instance_group_manager events whose defang-etag label does not match the etag passed to getActivityParser, preventing events from other deployments from being processed. When no etag is expected the check is skipped for backwards compatibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #2019
WaitServiceState never received DEPLOYMENT_COMPLETED for Compute Engine services during deployment. Three related bugs in the GCP Cloud Logging integration caused all gce_instance_group addInstances events to be silently dropped, leaving computeEngineRootTriggers always empty.
Root causes:
Fixes:
Linked Issues
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Tests