Gracefully handle GCP logging quota exhaustion #2024
Conversation
…an error
The GCP TailLogEntries streaming API can return a TailLogEntriesResponse with
zero entries. This happens for heartbeat messages (sent periodically to keep
the stream alive) and for suppression-info responses (sent when entries are
being rate-limited or sampled on the server side). The response proto has a
dedicated SuppressionInfo field for this purpose.
Previously, gcpLoggingTailer.Next() treated any empty-entries response as a
hard error ("no log entries found"). Because this is not a transient error,
WaitServiceState would not retry — the subscribe stream would terminate and
the deployment monitor would fail mid-deployment with a spurious error
unrelated to the actual deployment outcome.
The fix returns nil, nil from Next() on an empty response. The Follow() loop
in stream.go now skips nil entries and continues waiting, matching the
expected streaming behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
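The skip-on-empty behavior described in this commit can be sketched as below. The `tailer`, `logEntry`, and `follow` names are illustrative stand-ins, not the actual `gcpLoggingTailer`/`Follow` types from the codebase; the point is only the contract that an empty response yields `(nil, nil)` and the read loop continues instead of failing.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// logEntry and tailer are hypothetical stand-ins for the real GCP types.
type logEntry struct{ msg string }

type tailer struct {
	responses [][]*logEntry // each element models one TailLogEntriesResponse
	i         int
}

// next returns (nil, nil) for an empty response (heartbeat or suppression
// info) instead of treating it as a hard error.
func (t *tailer) next() (*logEntry, error) {
	if t.i >= len(t.responses) {
		return nil, io.EOF
	}
	entries := t.responses[t.i]
	t.i++
	if len(entries) == 0 {
		return nil, nil // keep the stream alive; caller should continue
	}
	return entries[0], nil
}

// follow skips nil entries and keeps waiting, mirroring the fixed loop.
func follow(t *tailer) ([]string, error) {
	var out []string
	for {
		e, err := t.next()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		if e == nil {
			continue // empty response: heartbeat or suppression info
		}
		out = append(out, e.msg)
	}
}

func main() {
	t := &tailer{responses: [][]*logEntry{
		{{msg: "a"}},
		{}, // heartbeat-style empty response
		{{msg: "b"}},
	}}
	msgs, _ := follow(t)
	fmt.Println(msgs) // prints [a b]
}
```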
GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.

Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Walkthrough
Skip nil/empty GCP tail responses during live tailing and treat gRPC ResourceExhausted as a transient error that triggers the existing reconnect/backoff path; add unit tests and remove a per-tick debug log.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Stream as ServerStream
participant Tailer as gcpLoggingTailer
participant Parser as parseAndFilter
participant Backoff as Retry/Backoff
Client->>Stream: Follow(ctx)
Stream->>Tailer: Next()
Note right of Tailer: May return (entry, nil), (nil, nil), or (nil, err)
Tailer-->>Stream: (nil, nil)
Stream->>Stream: continue (skip nil entry)
Tailer-->>Stream: (entryA, nil)
Stream->>Parser: parseAndFilter(entryA)
Parser-->>Stream: messageA
Stream-->>Client: emit messageA
Tailer-->>Stream: (nil, ResourceExhausted error)
Stream->>Backoff: DelayBeforeRetry(ctx)
Backoff-->>Stream: resume retry / re-subscribe
Tailer-->>Stream: (entryB, nil)
Stream->>Parser: parseAndFilter(entryB)
Parser-->>Client: emit messageB
🧹 Nitpick comments (2)
src/pkg/cli/waitForCdTaskExit.go (1)
22-22: Remove the commented debug statement instead of leaving it in place. This is dead/commented code in a hot polling loop and adds noise. Prefer deleting it entirely.
src/pkg/cli/subscribe.go (1)
56-63: Consider handling raw gRPC status errors for the warning message.
The current code checks `connect.CodeOf(err) == connect.CodeResourceExhausted` to display the quota warning. However, if the error is a raw gRPC status error (not wrapped by connect-go), `connect.CodeOf()` may return `CodeUnknown` and the warning won't display.
The retry behavior is still correct because `isTransientError()` has a fallback path using `status.FromError()`. This is just a UX consideration: the warning message might not always appear even when `ResourceExhausted` triggers a retry.
Optional: Check both connect and gRPC status codes
```diff
 if isTransientError(err) {
-	if connect.CodeOf(err) == connect.CodeResourceExhausted {
+	isResourceExhausted := connect.CodeOf(err) == connect.CodeResourceExhausted
+	if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
+		isResourceExhausted = true
+	}
+	if isResourceExhausted {
 		term.Warn("GCP logging quota exceeded; will retry subscribe stream after backoff.")
 	} else {
 		term.Debugf("WaitServiceState: transient error, reconnecting subscribe stream: %v", err)
 	}
```

This would require adding imports for `google.golang.org/grpc/codes` and `google.golang.org/grpc/status`.
📒 Files selected for processing (9)
- src/pkg/cli/client/byoc/gcp/stream.go
- src/pkg/cli/client/byoc/gcp/stream_test.go
- src/pkg/cli/subscribe.go
- src/pkg/cli/subscribe_test.go
- src/pkg/cli/tail.go
- src/pkg/cli/tail_test.go
- src/pkg/cli/waitForCdTaskExit.go
- src/pkg/clouds/gcp/logging.go
- src/pkg/clouds/gcp/logging_test.go
```diff
 // GCP grpc transient errors
 if st, ok := status.FromError(err); ok {
-	transientCodes := []codes.Code{codes.Unavailable, codes.Internal}
+	transientCodes := []codes.Code{codes.Unavailable, codes.Internal, codes.ResourceExhausted}
```
ResourceExhausted is only transient for some resources, i.e. log tails/queries can be retried. So I wonder if we should simply add the special case outside of the call to isTransientError.
isTransientError is exclusively used in log monitoring code. I think it's appropriately placed, so that `WaitServiceState`, `receiveLogs`, and `WaitForCdTaskExit` can take advantage of it.
Would you prefer to rename it to something like isTransientLoggingError?
```diff
 t.cache = resp.GetEntries()
 if len(t.cache) == 0 {
-	return nil, errors.New("no log entries found")
+	// GCP may send empty responses (heartbeats, suppression info); return nil
```
Description
GCP Cloud Logging enforces a ReadRequestsPerMinutePerProject quota (120/min)
that covers both ListLogEntries and TailLogEntries.Recv calls. During an active
deployment with multiple services generating logs, the two concurrent streams
(subscribe + log tail) can exhaust this quota, causing the deployment monitor
to fail with ResourceExhausted.
Add codes.ResourceExhausted to isTransientError so both the log tail
(receiveLogs) and the subscribe stream (WaitServiceState) automatically retry
with exponential backoff instead of surfacing a fatal error. The RetryDelayer
backs off up to 1 minute, which aligns with the quota window reset.
Linked Issues
Might address #2001 and #1339