
Gracefully handle GCP logging quota exhaustion #2024

Merged
jordanstephens merged 4 commits into main from jordan/fix-gcp-logging-rate-limit
Apr 7, 2026

Conversation

@jordanstephens
Member

@jordanstephens jordanstephens commented Apr 7, 2026

Description

GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.

Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.

Linked Issues

Might address #2001 and #1339

Checklist

  • I have performed a self-review of my code
  • I have added appropriate tests
  • I have updated the Defang CLI docs and/or README to reflect my changes, if necessary

Summary by CodeRabbit

  • Bug Fixes

    • Log tailing now skips empty/heartbeat responses and treats exhausted quota as a distinct transient condition with a warning and retry behavior.
    • Expanded transient-error classification for more reliable reconnects.
    • Removed noisy per‑poll debug output.
  • Tests

    • Added tests covering nil/empty tail responses, quota/resource-exhausted retry behavior, and transient-error detection.

jordanstephens and others added 2 commits April 7, 2026 13:28
…an error

The GCP TailLogEntries streaming API can return a TailLogEntriesResponse with
zero entries. This happens for heartbeat messages (sent periodically to keep
the stream alive) and for suppression-info responses (sent when entries are
being rate-limited or sampled on the server side). The response proto has a
dedicated SuppressionInfo field for this purpose.

Previously, gcpLoggingTailer.Next() treated any empty-entries response as a
hard error ("no log entries found"). Because this is not a transient error,
WaitServiceState would not retry — the subscribe stream would terminate and
the deployment monitor would fail mid-deployment with a spurious error
unrelated to the actual deployment outcome.

The fix returns nil, nil from Next() on an empty response. The Follow() loop
in stream.go now skips nil entries and continues waiting, matching the
expected streaming behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.

Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jordanstephens jordanstephens requested a review from lionello as a code owner April 7, 2026 20:37
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a96381e6-7369-4ff0-bf27-d0726b9d160e

📥 Commits

Reviewing files that changed from the base of the PR and between e792e72 and 32c6379.

📒 Files selected for processing (1)
  • src/pkg/cli/subscribe.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/pkg/cli/subscribe.go

📝 Walkthrough

Walkthrough

Skip nil/empty GCP tail responses during live tailing and treat gRPC ResourceExhausted as a transient error that triggers the existing reconnect/backoff path; add unit tests and remove a per-tick debug log.

Changes

Cohort / File(s) Summary
GCP Logging Tailer
src/pkg/clouds/gcp/logging.go, src/pkg/clouds/gcp/logging_test.go
gcpLoggingTailer.Next now returns (nil, nil) for empty TailLogEntriesResponse instead of an error; added tests for empty response and multi-entry behavior.
ServerStream follow loop
src/pkg/cli/client/byoc/gcp/stream.go, src/pkg/cli/client/byoc/gcp/stream_test.go
ServerStream[T].Follow skips nil log entries (continue) rather than passing them to parsing/filtering; added test ensuring nil entries are skipped and valid entries are yielded in order.
Transient error classification & subscribe
src/pkg/cli/tail.go, src/pkg/cli/tail_test.go, src/pkg/cli/subscribe.go, src/pkg/cli/subscribe_test.go
Added codes.ResourceExhausted to errors treated as transient (reconnect/backoff path); WaitServiceState logs quota exhaustion as a warning; tests updated/added for Connect and gRPC status code handling.
Logging cleanup
src/pkg/cli/waitForCdTaskExit.go
Removed unused import and per-tick debug logging from CD task exit polling loop.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Stream as ServerStream
    participant Tailer as gcpLoggingTailer
    participant Parser as parseAndFilter
    participant Backoff as Retry/Backoff

    Client->>Stream: Follow(ctx)
    Stream->>Tailer: Next()
    Note right of Tailer: May return (entry, nil), (nil, nil), or (nil, err)
    Tailer-->>Stream: (nil, nil)
    Stream->>Stream: continue (skip nil entry)
    Tailer-->>Stream: (entryA, nil)
    Stream->>Parser: parseAndFilter(entryA)
    Parser-->>Stream: messageA
    Stream-->>Client: emit messageA
    Tailer-->>Stream: (nil, ResourceExhausted error)
    Stream->>Backoff: DelayBeforeRetry(ctx)
    Backoff-->>Stream: resume retry / re-subscribe
    Tailer-->>Stream: (entryB, nil)
    Stream->>Parser: parseAndFilter(entryB)
    Parser-->>Client: emit messageB

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • edwardrf

Poem

🐰 I nibble through streams both quiet and loud,
Skipping empty beats in the server crowd.
When quotas grumble, I pause, then try,
Hop back to logs and watch errors fly. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title 'Gracefully handle GCP logging quota exhaustion' clearly and specifically summarizes the main change: adding graceful handling for GCP logging ResourceExhausted quota errors.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=warning msg="[linters_context] running gomodguard failed: unable to read module file go.mod: current working directory must have a go.mod file: if you are not using go modules it is suggested to disable this linter"
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"



Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/pkg/cli/waitForCdTaskExit.go (1)

22-22: Remove the commented debug statement instead of leaving it in place.

This is dead/commented code in a hot polling loop and adds noise. Prefer deleting it entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/waitForCdTaskExit.go` at line 22, Remove the commented debug
statement containing term.Debugf("Polled CD task status: done=%v, err=%v", done,
err) from the hot polling loop in waitForCdTaskExit.go; simply delete that
commented line (do not leave it as a comment), keeping the surrounding loop
logic and behavior unchanged.
src/pkg/cli/subscribe.go (1)

56-63: Consider handling raw gRPC status errors for the warning message.

The current code checks connect.CodeOf(err) == connect.CodeResourceExhausted to display the quota warning. However, if the error is a raw gRPC status error (not wrapped by connect-go), connect.CodeOf() may return CodeUnknown and the warning won't display.

The retry behavior is still correct because isTransientError() has a fallback path using status.FromError(). This is just a UX consideration—the warning message might not always appear even when ResourceExhausted triggers a retry.

♻️ Optional: Check both connect and gRPC status codes
 			if isTransientError(err) {
-				if connect.CodeOf(err) == connect.CodeResourceExhausted {
+				isResourceExhausted := connect.CodeOf(err) == connect.CodeResourceExhausted
+				if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
+					isResourceExhausted = true
+				}
+				if isResourceExhausted {
 					term.Warn("GCP logging quota exceeded; will retry subscribe stream after backoff.")
 				} else {
 					term.Debugf("WaitServiceState: transient error, reconnecting subscribe stream: %v", err)
 				}

This would require adding imports for "google.golang.org/grpc/codes" and "google.golang.org/grpc/status".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pkg/cli/subscribe.go` around lines 56 - 63, The quota warning is missed
when err is a raw gRPC status because the code only checks connect.CodeOf(err);
update the transient-error branch in WaitServiceState (the block using
isTransientError, connect.CodeOf and connect.CodeResourceExhausted) to also
inspect a gRPC status code fallback (use status.FromError(err) and compare
against codes.ResourceExhausted) and emit term.Warn when either check indicates
ResourceExhausted; add the needed imports for "google.golang.org/grpc/status"
and "google.golang.org/grpc/codes".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 11efb7f4-285f-4d57-b145-959604104cb1

📥 Commits

Reviewing files that changed from the base of the PR and between b856ef0 and e792e72.

📒 Files selected for processing (9)
  • src/pkg/cli/client/byoc/gcp/stream.go
  • src/pkg/cli/client/byoc/gcp/stream_test.go
  • src/pkg/cli/subscribe.go
  • src/pkg/cli/subscribe_test.go
  • src/pkg/cli/tail.go
  • src/pkg/cli/tail_test.go
  • src/pkg/cli/waitForCdTaskExit.go
  • src/pkg/clouds/gcp/logging.go
  • src/pkg/clouds/gcp/logging_test.go

@jordanstephens jordanstephens requested a review from edwardrf April 7, 2026 20:44
// GCP grpc transient errors
if st, ok := status.FromError(err); ok {
-	transientCodes := []codes.Code{codes.Unavailable, codes.Internal}
+	transientCodes := []codes.Code{codes.Unavailable, codes.Internal, codes.ResourceExhausted}
Member


ResourceExhausted is only transient for some resources, i.e. log tails/queries can be retried. So I wonder if we should simply add the special case outside of the call to isTransientError.

Member Author

@jordanstephens jordanstephens Apr 7, 2026


isTransientError is exclusively used in log monitoring code. I think it's appropriately placed, so that WaitServiceState, receiveLogs, and WaitForCdTaskExit can take advantage of it.

Would you prefer to rename it to something like isTransientLoggingError?

t.cache = resp.GetEntries()
if len(t.cache) == 0 {
-	return nil, errors.New("no log entries found")
+	// GCP may send empty responses (heartbeats, suppression info); return nil
Member


This code would be improved by #1951

@jordanstephens jordanstephens merged commit 97c4704 into main Apr 7, 2026
14 checks passed
@jordanstephens jordanstephens deleted the jordan/fix-gcp-logging-rate-limit branch April 7, 2026 22:20