Gracefully handle GCP logging quota exhaustion #2024
Conversation
…an error
The GCP TailLogEntries streaming API can return a TailLogEntriesResponse with
zero entries. This happens for heartbeat messages (sent periodically to keep
the stream alive) and for suppression-info responses (sent when entries are
being rate-limited or sampled on the server side). The response proto has a
dedicated SuppressionInfo field for this purpose.
Previously, gcpLoggingTailer.Next() treated any empty-entries response as a
hard error ("no log entries found"). Because this is not a transient error,
WaitServiceState would not retry — the subscribe stream would terminate and
the deployment monitor would fail mid-deployment with a spurious error
unrelated to the actual deployment outcome.
The fix returns nil, nil from Next() on an empty response. The Follow() loop
in stream.go now skips nil entries and continues waiting, matching the
expected streaming behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
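The skip-on-empty behavior described in this commit can be sketched as below. The `tailer`, `logEntry`, and `follow` names are illustrative stand-ins, not the actual `gcpLoggingTailer`/`Follow` types from the codebase; the point is only the contract that an empty response yields `(nil, nil)` and the read loop continues instead of failing.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// logEntry and tailer are hypothetical stand-ins for the real GCP types.
type logEntry struct{ msg string }

type tailer struct {
	responses [][]*logEntry // each element models one TailLogEntriesResponse
	i         int
}

// next returns (nil, nil) for an empty response (heartbeat or suppression
// info) instead of treating it as a hard error.
func (t *tailer) next() (*logEntry, error) {
	if t.i >= len(t.responses) {
		return nil, io.EOF
	}
	entries := t.responses[t.i]
	t.i++
	if len(entries) == 0 {
		return nil, nil // keep the stream alive; caller should continue
	}
	return entries[0], nil
}

// follow skips nil entries and keeps waiting, mirroring the fixed loop.
func follow(t *tailer) ([]string, error) {
	var out []string
	for {
		e, err := t.next()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		if e == nil {
			continue // empty response: heartbeat or suppression info
		}
		out = append(out, e.msg)
	}
}

func main() {
	t := &tailer{responses: [][]*logEntry{
		{{msg: "a"}},
		{}, // heartbeat-style empty response
		{{msg: "b"}},
	}}
	msgs, _ := follow(t)
	fmt.Println(msgs) // prints [a b]
}
```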
GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.

Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Walkthrough
Skip nil/empty GCP tail responses during live tailing and treat gRPC ResourceExhausted as a transient error that triggers the existing reconnect/backoff path; add unit tests and remove a per-tick debug log.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Stream as ServerStream
participant Tailer as gcpLoggingTailer
participant Parser as parseAndFilter
participant Backoff as Retry/Backoff
Client->>Stream: Follow(ctx)
Stream->>Tailer: Next()
Note right of Tailer: May return (entry, nil), (nil, nil), or (nil, err)
Tailer-->>Stream: (nil, nil)
Stream->>Stream: continue (skip nil entry)
Tailer-->>Stream: (entryA, nil)
Stream->>Parser: parseAndFilter(entryA)
Parser-->>Stream: messageA
Stream-->>Client: emit messageA
Tailer-->>Stream: (nil, ResourceExhausted error)
Stream->>Backoff: DelayBeforeRetry(ctx)
Backoff-->>Stream: resume retry / re-subscribe
Tailer-->>Stream: (entryB, nil)
Stream->>Parser: parseAndFilter(entryB)
Parser-->>Client: emit messageB
🧹 Nitpick comments (2)
src/pkg/cli/waitForCdTaskExit.go (1)
22-22: Remove the commented debug statement instead of leaving it in place. This is dead/commented code in a hot polling loop and adds noise. Prefer deleting it entirely.
src/pkg/cli/subscribe.go (1)
56-63: Consider handling raw gRPC status errors for the warning message.
The current code checks `connect.CodeOf(err) == connect.CodeResourceExhausted` to display the quota warning. However, if the error is a raw gRPC status error (not wrapped by connect-go), `connect.CodeOf()` may return `CodeUnknown` and the warning won't display.
The retry behavior is still correct because `isTransientError()` has a fallback path using `status.FromError()`. This is just a UX consideration: the warning message might not always appear even when `ResourceExhausted` triggers a retry.
Optional: Check both connect and gRPC status codes
```diff
 if isTransientError(err) {
-	if connect.CodeOf(err) == connect.CodeResourceExhausted {
+	isResourceExhausted := connect.CodeOf(err) == connect.CodeResourceExhausted
+	if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
+		isResourceExhausted = true
+	}
+	if isResourceExhausted {
 		term.Warn("GCP logging quota exceeded; will retry subscribe stream after backoff.")
 	} else {
 		term.Debugf("WaitServiceState: transient error, reconnecting subscribe stream: %v", err)
 	}
```

This would require adding imports for `google.golang.org/grpc/codes` and `google.golang.org/grpc/status`.
📒 Files selected for processing (9)
- src/pkg/cli/client/byoc/gcp/stream.go
- src/pkg/cli/client/byoc/gcp/stream_test.go
- src/pkg/cli/subscribe.go
- src/pkg/cli/subscribe_test.go
- src/pkg/cli/tail.go
- src/pkg/cli/tail_test.go
- src/pkg/cli/waitForCdTaskExit.go
- src/pkg/clouds/gcp/logging.go
- src/pkg/clouds/gcp/logging_test.go
```diff
 // GCP grpc transient errors
 if st, ok := status.FromError(err); ok {
-	transientCodes := []codes.Code{codes.Unavailable, codes.Internal}
+	transientCodes := []codes.Code{codes.Unavailable, codes.Internal, codes.ResourceExhausted}
```
ResourceExhausted is only transient for some resources, i.e. log tails/queries can be retried. So I wonder if we should simply add the special case outside of the call to isTransientError.
isTransientError is exclusively used in log monitoring code. I think it's appropriately placed, so that `WaitServiceState`, `receiveLogs`, and `WaitForCdTaskExit` can take advantage of it.
Would you prefer to rename it to something like isTransientLoggingError?
```diff
 t.cache = resp.GetEntries()
 if len(t.cache) == 0 {
-	return nil, errors.New("no log entries found")
+	// GCP may send empty responses (heartbeats, suppression info); return nil
```
Description
GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.
Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.
Linked Issues
Might address #2001 and #1339