Conversation
@hzhaop (Author) commented Oct 24, 2025


This commit addresses two issues related to gRPC connection stability and recovery.

1.  **Half-open connections:** In unstable network environments, the agent could encounter half-open TCP connections where the server-side connection is terminated, but the client-side remains. This would cause the send-queue to grow indefinitely without automatic recovery. To resolve this, this change introduces gRPC keepalive probes. The agent will now send keepalive pings to the collector, ensuring that dead connections are detected and pruned in a timely manner. Two new configuration parameters, `collector.grpc_keepalive_time` and `collector.grpc_keepalive_timeout`, have been added to control this behavior.

2.  **Reconnect logic:** The existing reconnection logic did not immediately re-establish a connection if the same backend instance was selected during a reconnect attempt. This could lead to a delay of up to an hour before the connection was re-established. The logic has been updated to ensure that the channel is always shut down and recreated, forcing an immediate reconnection attempt regardless of which backend is selected.
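A rough sketch of how the two points above could be expressed with gRPC Java's public channel-builder API. This is illustrative only: the address and class/constant names are placeholders, and the real wiring in this PR lives in GRPCChannel.java and GRPCChannelManager.java.

import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class KeepaliveChannelSketch {

    // Stand-ins for collector.grpc_keepalive_time / collector.grpc_keepalive_timeout.
    static final long KEEPALIVE_TIME_SECONDS = 120;
    static final long KEEPALIVE_TIMEOUT_SECONDS = 30;

    // Point 1: enable keepalive pings so half-open connections are detected and pruned.
    static ManagedChannel buildChannel(String host, int port) {
        return ManagedChannelBuilder.forAddress(host, port)
                .usePlaintext()
                .keepAliveTime(KEEPALIVE_TIME_SECONDS, TimeUnit.SECONDS)       // interval between pings
                .keepAliveTimeout(KEEPALIVE_TIMEOUT_SECONDS, TimeUnit.SECONDS) // how long to wait for the ping ack
                .build();
    }

    public static void main(String[] args) {
        ManagedChannel channel = buildChannel("collector-host", 11800); // placeholder address

        // Point 2: on a detected failure, always shut the old channel down and build a
        // fresh one instead of reusing it, so a reconnection attempt happens immediately.
        channel.shutdownNow();
        ManagedChannel rebuilt = buildChannel("collector-host", 11800);
        rebuilt.shutdownNow(); // clean up; this sketch never actually connects
    }
}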
@wu-sheng (Member)

I am confused about this. In our test, the agent reconnected quickly and automatically when the server rebooted.
Why did your side take so long?
If nothing changed, there is no point in creating a new channel.

@hzhaop (Author) commented Oct 24, 2025

The scenario you mentioned, where the agent quickly reconnects after a server reboot, typically occurs when the server shuts down cleanly, allowing TCP connections to terminate properly.

However, the problem we encountered primarily arises in unstable network environments, leading to TCP connections entering a half-open state. In such situations:

  1. The server-side connection is terminated, but the client still believes the connection is alive. This causes the client's send-Q to continuously accumulate data, and the agent remains unaware that the connection has become invalid, thus not triggering an automatic reconnection.

  2. The role of gRPC keepalive: The purpose of introducing gRPC keepalive is precisely to actively detect these half-open connections. By periodically sending heartbeats, the agent can promptly discover connections that are actually dead but still perceived as alive by the client, thereby forcing their closure and initiating the reconnection process.

Regarding your point, "If nothing changed, there is no point to create a new channel":

  • Change in connection state: Even if the target backend address remains unchanged, the internal state of the previous connection is corrupted due to its half-open status. In this scenario, simply reusing the old channel is ineffective as it cannot recover.

  • Necessity of forced reconnection: We observed that after keepalive detected a connection failure, if the agent subsequently selected the same backend, the original reconnection logic would not immediately force the establishment of a new channel. Instead, it would wait for a long period (approximately one hour) before attempting to reconnect. Therefore, modifying the reconnection logic to ensure that, upon detecting a connection failure, the old channel is forcibly closed and a new channel is established, regardless of whether the same backend is selected, is crucial for ensuring timely connection recovery and preventing prolonged service interruptions.

@wu-sheng (Member) commented Nov 2, 2025

If you want to deal with a half-open status, you should check the ConnectivityState.TRANSIENT_FAILURE status, and if it lasts too long, you could try to establish a new one.

@wu-sheng (Member) commented Nov 2, 2025

But if the server is back online, why does the connection stay in this half-open status? It neither times out nor gets re-established. What actually happened? This does not seem like an expected state.

1. Restore original reconnection logic with TRANSIENT_FAILURE monitoring:
   - Keep original behavior: only force reconnect when different server selected
   - When same server selected, rely on gRPC auto-reconnect mechanism
   - Add TRANSIENT_FAILURE state monitoring to detect prolonged failures
   - Force rebuild channel if either reconnectCount or transientFailureCount exceeds threshold
   - Add keepAliveWithoutCalls(true) to detect half-open connections

2. Fix race condition between reportError() and run():
   - Wrap all state changes (reconnect flag + notifications) in synchronized blocks
   - Prevents reconnect flag and listener status from becoming inconsistent
   - Fixes production issue where reconnect=false but listeners in DISCONNECT state

3. Additional improvements:
   - Adjust keepalive default from 60s to 120s (reduces overhead by 50%)
   - Add getState() method to GRPCChannel for state monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@hzhaop (Author) commented Nov 6, 2025

Thanks for the feedback @wu-sheng! I've revised the implementation based on your suggestions:

Changes Made

1. Restored original reconnection logic + Added TRANSIENT_FAILURE monitoring

I agree with your point - the original logic is correct: only force reconnect when selecting a different server. When the same server is selected, we should rely on gRPC's auto-reconnect.

What I've added:

  • Monitor ConnectivityState.TRANSIENT_FAILURE state continuously
  • If channel stays in TRANSIENT_FAILURE for too long, force rebuild channel
  • Dual safeguard: force reconnect if either reconnectCount OR transientFailureCount exceeds threshold

This solves the half-open connection issue without disrupting gRPC's auto-reconnect mechanism.

Key code:

if (index == selectedIdx) {
    reconnectCount++;
    boolean forceReconnect = reconnectCount > threshold
                          || transientFailureCount > threshold;
    if (forceReconnect) {
        createNewChannel(...);  // Force rebuild
    } else if (managedChannel.isConnected(false)) {
        markAsConnected();  // Let gRPC auto-reconnect work
    }
}
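
For reference, a minimal sketch of how transientFailureCount in the snippet above could be driven from the channel state. This is an illustration of the approach, not the PR's exact code, and the helper name is made up:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

// Called once per run() cycle; returns the updated count of consecutive cycles
// spent in TRANSIENT_FAILURE (illustrative helper, not the PR's exact code).
static int updateTransientFailureCount(ManagedChannel channel, int transientFailureCount) {
    ConnectivityState state = channel.getState(false); // false = don't trigger a connect attempt
    if (state == ConnectivityState.TRANSIENT_FAILURE) {
        return transientFailureCount + 1;              // still failing in this cycle
    }
    return 0;                                          // any other state clears the counter
}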

Also added keepAliveWithoutCalls(true) to detect half-open connections.

2. Fixed race condition between reportError() and run()

Issue from heap dump analysis:
My production heap dump showed reconnect = false while listeners were in DISCONNECT state, proving the race condition exists. This causes the SkyWalking agent to permanently stop uploading data.

Fix:
Wrapped all state changes in synchronized(statusLock) to make flag updates and notifications atomic.

synchronized (statusLock) {
    reconnect = true;
    notify(GRPCChannelStatus.DISCONNECT);
}
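
For context, a simplified sketch of both sides of the race sharing the same lock. The field and method names follow the snippets in this thread, but the class structure is condensed and this is not the PR's exact code:

// Both the error path and the "looks healthy" path flip the flag and notify
// listeners under the same lock, so they can never be observed out of sync.
private final Object statusLock = new Object();
private volatile boolean reconnect = true;

public void reportError(Throwable t) {
    if (isNetworkError(t)) {
        synchronized (statusLock) {
            reconnect = true;
            notify(GRPCChannelStatus.DISCONNECT);  // flag and notification change together
        }
    }
}

private void markAsConnected() {  // called from run() once the channel looks healthy
    synchronized (statusLock) {
        reconnect = false;
        notify(GRPCChannelStatus.CONNECTED);
    }
}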

Additional Changes

  • Adjusted keepalive default from 60s to 120s (reduces overhead)
  • Added getState() method to GRPCChannel for monitoring

Let me know if this approach looks good!

@hzhaop (Author) commented Nov 6, 2025

But if the server is back online, why does the connection stay in this half-open status? It neither times out nor gets re-established. What actually happened? This does not seem like an expected state.

I checked the code and found that SkyWalking DOES set a deadline for RPC calls, but there's still a critical difference:

RPC Deadline vs Keepalive Timeout

1. SkyWalking's RPC Configuration

From TraceSegmentServiceClient.java:92-93:

StreamObserver<SegmentObject> upstreamSegmentStreamObserver = serviceStub.withDeadlineAfter(
    Config.Collector.GRPC_UPSTREAM_TIMEOUT, TimeUnit.SECONDS  // Default: 30 seconds
).collect(...)

From Config.java:212:

public static int GRPC_UPSTREAM_TIMEOUT = 30;  // 30 seconds default

So RPC calls DO have a 30-second timeout!

2. But Why Keepalive Still Needed?

Here's the ROOT CAUSE - RPC deadline timeout doesn't trigger reconnection!

The Critical Code in GRPCChannelManager.isNetworkError() (lines 282-291):

private boolean isNetworkError(Throwable throwable) {
    if (throwable instanceof StatusRuntimeException) {
        StatusRuntimeException statusRuntimeException = (StatusRuntimeException) throwable;
        return statusEquals(
            statusRuntimeException.getStatus(),
            Status.UNAVAILABLE,        // ✅ Keepalive generates this
            Status.PERMISSION_DENIED,
            Status.UNAUTHENTICATED,
            Status.RESOURCE_EXHAUSTED,
            Status.UNKNOWN
            // ❌ But NOT Status.DEADLINE_EXCEEDED (RPC timeout generates this)
        );
    }
    return false;
}

Without keepAliveWithoutCalls(true):

T0:     Last RPC completes successfully
T0-T120: No RPC calls (idle period)
        → No data sent, no way to detect connection is broken
        → TCP thinks connection is still alive (half-open)
T120:   New RPC call triggered
        → withDeadlineAfter(30s) starts counting
        → Data goes to TCP send-queue
        → Waits for response...
T150:   30 second deadline expires
        → RPC call fails with Status.DEADLINE_EXCEEDED
        → isNetworkError() returns FALSE ❌
        → reportError() NOT called
        → reconnect stays FALSE
        → Half-open connection persists
T180:   Another RPC triggered
        → Same cycle repeats...
        → Infinite loop!

With keepAliveWithoutCalls(true):

T0:     Last RPC completes
T120:   Keepalive PING sent (even during idle)
        → No PONG response
T150:   Keepalive timeout (30s)
        → Throws: Status.UNAVAILABLE: Keepalive failed ✅
        → isNetworkError() returns TRUE
        → reportError() called → reconnect = true
        → State becomes TRANSIENT_FAILURE
        → transientFailureCount++
T180:   Next run() cycle
        → reconnect = true, forces channel rebuild
        → Breaks the half-open connection!

Evidence from production logs:

2025-10-31 09:34:40.081 ERROR JVMMetricsSender : send JVM metrics to Collector fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
  UNAVAILABLE: Keepalive failed. The connection is likely gone

3. The Real Problem

The issue is what happens BETWEEN RPC calls:

| Scenario | Without keepalive | With keepalive |
| --- | --- | --- |
| Idle period (no RPC) | No detection mechanism | PING detects failure |
| Half-open connection | Persists undetected | Detected within 150s |
| Next RPC call | Goes to broken connection | Blocked by TRANSIENT_FAILURE state |
| Result | Deadline expires, but reconnect logic fails because isConnected(false) returns true | Connection state is accurate |

4. Why My Production Showed Growing Send-Queue

In my production environment, the send-queue growth happened because:

  1. RPC deadline is 30 seconds
  2. But during idle periods (no RPC), half-open connection wasn't detected
  3. When burst of RPC calls came, they all queued up
  4. Each call would timeout after 30s, but new calls kept coming
  5. The cycle of "timeout → reconnect attempt → isConnected(true) → markAsConnected()" kept the broken connection in place

Keepalive solves this by detecting the broken connection BEFORE the next RPC burst arrives.

So the answer is: RPC deadline handles per-call timeout, but keepalive handles connection health monitoring during idle periods. Both are needed!
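
Putting the two mechanisms side by side in one sketch (illustrative only; the address is a placeholder, and the per-call deadline is shown as a comment because the real stubs are generated SkyWalking classes):

import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class DeadlineVsKeepaliveSketch {
    public static void main(String[] args) {
        // Connection health during idle periods: keepalive pings, even with no RPC in flight.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("collector-host", 11800)   // placeholder address
                .usePlaintext()
                .keepAliveTime(120, TimeUnit.SECONDS)
                .keepAliveTimeout(30, TimeUnit.SECONDS)
                .keepAliveWithoutCalls(true)           // ping even while the channel is idle
                .build();

        // Per-call timeout: each RPC would additionally go through a stub configured with
        //   stub.withDeadlineAfter(30, TimeUnit.SECONDS)
        // (as TraceSegmentServiceClient already does). A slow call then fails with
        // DEADLINE_EXCEEDED, while a dead connection surfaces as UNAVAILABLE.

        channel.shutdownNow(); // clean up; this sketch never actually connects
    }
}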


Additional Discussion Point

Should DEADLINE_EXCEEDED be treated as a network error?

Currently, isNetworkError() doesn't include Status.DEADLINE_EXCEEDED, which means RPC timeouts don't trigger reconnection. This is actually correct behavior in most cases:

  • DEADLINE_EXCEEDED = application-level timeout (request took too long)

    • Could be due to slow server processing
    • Not necessarily a connection problem
    • Example: Complex query times out after 30s
  • UNAVAILABLE = transport-level failure (connection broken)

    • Definitely a network/connection problem
    • Should always trigger reconnection
    • Example: Keepalive PING fails

However, in the specific case of half-open connections, DEADLINE_EXCEEDED becomes misleading because:

  1. The request didn't actually reach the server (connection is broken)
  2. But it's reported as a timeout (not a connection error)
  3. This prevents proper reconnection logic from kicking in

My current fix solves this by:

  • Adding keepAliveWithoutCalls(true) to detect half-open connections proactively
  • Using TRANSIENT_FAILURE state monitoring as a backup safeguard
  • This way we don't need to change the semantic meaning of DEADLINE_EXCEEDED

Alternative approach (for consideration):
Add Status.DEADLINE_EXCEEDED to isNetworkError(), but this might cause false positives where legitimate slow requests trigger unnecessary reconnections.
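
For concreteness, the alternative would amount to a one-line change to the status list quoted earlier. This is only a sketch of that option, not something this PR does:

// Hypothetical variant of isNetworkError() -- NOT part of this PR.
return statusEquals(
    statusRuntimeException.getStatus(),
    Status.UNAVAILABLE,
    Status.PERMISSION_DENIED,
    Status.UNAUTHENTICATED,
    Status.RESOURCE_EXHAUSTED,
    Status.UNKNOWN,
    Status.DEADLINE_EXCEEDED   // the extra status under discussion; risks false positives on slow calls
);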

What do you think? Should we keep the current approach or also add DEADLINE_EXCEEDED to network errors?

@wu-sheng requested a review from Copilot on November 7, 2025 00:33
Copilot AI left a comment

Pull Request Overview

This PR adds gRPC keepalive configuration support and improves channel state monitoring to better handle connection failures. The changes introduce configurable keepalive ping intervals and timeouts, while also refactoring the reconnection logic to detect and handle prolonged TRANSIENT_FAILURE states.

Key changes:

  • Added gRPC keepalive configuration parameters (time and timeout) with defaults of 120s and 30s
  • Introduced channel state monitoring to track TRANSIENT_FAILURE states
  • Refactored reconnection logic to use helper methods and improved state management

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| agent.config | Added commented-out keepalive configuration properties with documentation |
| Config.java | Added GRPC_KEEPALIVE_TIME and GRPC_KEEPALIVE_TIMEOUT configuration fields |
| GRPCChannel.java | Implemented keepalive configuration and exposed getState() method |
| GRPCChannelManager.java | Refactored reconnection logic, added state monitoring, and improved thread safety |


    createNewChannel(ipAndPort[0], Integer.parseInt(ipAndPort[1]));
} else {
    // Same server, increment reconnectCount and check state
    reconnectCount++;
Copilot AI Nov 7, 2025

The reconnectCount variable is incremented here without synchronization (line 154), but is also reset in markAsConnected() which uses statusLock (line 248). This creates inconsistent synchronization. The increment should also be protected by statusLock to prevent race conditions between the increment and reset operations.

Suggested change:
- reconnectCount++;
+ synchronized (statusLock) {
+     reconnectCount++;
+ }

@hzhaop (Author) replied:

The reconnectCount modifications are safe because run() is executed by a single-threaded ScheduledExecutor, ensuring no concurrent access occurs.
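
To illustrate why that is thread-safe, here is a minimal sketch of the scheduling pattern; the interval and the manager reference are placeholders, not the PR's exact values:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A single-threaded scheduler never overlaps executions of run(), so fields that
// are only touched inside run(), such as reconnectCount, need no extra locking.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleWithFixedDelay(manager::run, 0, 5, TimeUnit.SECONDS); // 5s interval is a placeholder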

/**
* The interval in seconds to send a keepalive ping to the backend.
* If this is less than or equal to 0, the keepalive is disabled.
*
Copilot AI Nov 7, 2025

[nitpick] The GRPC_KEEPALIVE_TIME field is declared as long but the configuration comment in agent.config (line 105) uses it with TimeUnit.SECONDS in GRPCChannel.java (line 45). The configuration validation at line 44 in GRPCChannel checks if > 0, but according to gRPC documentation, keepalive time values below a certain threshold (typically 10 seconds) may be rejected by the server. Consider adding a comment documenting the minimum safe value or adding validation.

Suggested change:
  *
+ * <p>
+ * <b>Note:</b> The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.

@hzhaop (Author) replied:

[nitpick] The GRPC_KEEPALIVE_TIME field is declared as long but the configuration comment in agent.config (line 105) uses it with TimeUnit.SECONDS in GRPCChannel.java (line 45). The configuration validation at line 44 in GRPCChannel checks if > 0, but according to gRPC documentation, keepalive time values below a certain threshold (typically 10 seconds) may be rejected by the server. Consider adding a comment documenting the minimum safe value or adding validation.

Thanks for the suggestion! I've added the documentation for the minimum safe keepalive time value in the following locations:

  1. Config.java (line 218): Added a note in the Javadoc for GRPC_KEEPALIVE_TIME:

    * <p>
    * <b>Note:</b> The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.
  2. agent.config (line 105): Added a comment in the configuration file:

    # Note: The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.

I opted for documentation rather than validation because:

  • The default value (120 seconds) is already well above the minimum threshold
  • Users who explicitly configure a lower value may have specific use cases or testing scenarios

Let me know if you think validation should still be added!
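
If validation were preferred over documentation, a hypothetical guard could look like the following; the clamping behavior and the exact field path are assumptions for illustration, not code from this PR:

// Hypothetical startup check -- NOT in the PR (documentation was chosen instead).
long keepaliveTime = Config.Collector.GRPC_KEEPALIVE_TIME;
if (keepaliveTime > 0 && keepaliveTime < 10) {
    // Values below roughly 10s may be rejected by the gRPC server's keepalive enforcement policy.
    keepaliveTime = 10; // clamp to the minimum safe value
}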

@wu-sheng added the bug and enhancement labels on Nov 7, 2025
@wu-sheng self-requested a review on November 7, 2025 00:44
@wu-sheng (Member) commented Nov 7, 2025

A key question before checking the code: as you said, TRANSIENT_FAILURE means the next call goes to a broken connection, and gRPC returns this state. Why not treat this as a network error directly?
You created another method and configuration to verify this status, which also triggers reconnection like other network issues. It seems no different to me.
Did I miss anything?

Changes:
- Remove transientFailureCount mechanism as TRANSIENT_FAILURE already triggers UNAVAILABLE exceptions handled by reportError()
- Remove checkChannelStateAndTriggerReconnectIfNeeded() method to simplify logic
- Rename markAsConnected() to notifyConnected() for better clarity on method responsibility
- Only reset reconnectCount in createNewChannel() after actual channel rebuild to handle half-open connections
- Remove unnecessary else branch in run() method logging
- Add documentation about minimum safe keepalive time (10 seconds) in Config.java
- Remove unused stableConnectionCount field

Key improvement: The reconnectCount will continue to accumulate even when isConnected() returns false positives, ensuring forced channel rebuild after threshold is exceeded. This solves the issue where connections could remain in half-open state for extended periods.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@hzhaop force-pushed the feat/grpc-keepalive-new branch from 8d1e883 to c1b8ab4 on November 10, 2025 08:18
…nfig

Add documentation noting that the minimum safe keepalive time value is 10 seconds,
as values below this threshold may be rejected by the gRPC server according to gRPC policies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@hzhaop (Author) commented Nov 10, 2025

A key question before checking the code: as you said, TRANSIENT_FAILURE means the next call goes to a broken connection, and gRPC returns this state. Why not treat this as a network error directly? You created another method and configuration to verify this status, which also triggers reconnection like other network issues. It seems no different to me. Did I miss anything?

You're absolutely right! After reviewing the code more carefully, I realized that monitoring TRANSIENT_FAILURE state is indeed redundant.

Here's why:

  1. When the channel is in TRANSIENT_FAILURE state, any RPC call will immediately throw an UNAVAILABLE exception
  2. UNAVAILABLE is already in the isNetworkError() list, which triggers reportError() and reconnection
  3. So the TRANSIENT_FAILURE monitoring provides no additional benefit

I've removed the transientFailureCount mechanism and checkChannelStateAndTriggerReconnectIfNeeded() method in the latest commits.

The Real Problem and Solution

After analyzing production logs, I found the actual issue was with the original reconnection logic:

Original code problem:

} else if (managedChannel.isConnected(++reconnectCount > 5)) {
    reconnectCount = 0;  // Reset counter when isConnected() returns true
    reconnect = false;
}

When isConnected() returns true (even if it's a false positive due to a half-open connection), the counter gets reset. This means:

  • If the connection is in a half-open state, isConnected() may return true
  • reconnectCount gets reset to 0 every time
  • The counter never reaches the threshold
  • Channel never gets rebuilt, even though the connection is actually broken

Current solution:

if (reconnectCount > Config.Agent.FORCE_RECONNECTION_PERIOD) {
    // Force rebuild channel
    createNewChannel(...);
} else if (managedChannel.isConnected(false)) {
    // Trust the connection but DON'T reset reconnectCount
    notifyConnected();
}

Now reconnectCount only resets in createNewChannel() after actual channel rebuild. This ensures:

  • Even if isConnected() returns false positives, the counter keeps accumulating
  • After threshold is exceeded, channel is forcibly rebuilt
  • Half-open connections are resolved within predictable timeframe (threshold × check_interval)

This matches the original intent of FORCE_RECONNECTION_PERIOD configuration.
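
A simplified sketch of what that looks like in practice; the helper and field names are illustrative rather than the PR's exact code:

// reconnectCount is reset here and nowhere else, so even if isConnected() keeps
// returning false positives on a half-open connection, the counter still climbs
// until the threshold forces this rebuild.
private void createNewChannel(String host, int port) {
    if (managedChannel != null) {
        managedChannel.shutdownNow();          // drop the old, possibly half-open channel
    }
    managedChannel = buildChannel(host, port); // hypothetical helper wrapping the gRPC channel builder
    reconnectCount = 0;                        // reset only after an actual rebuild
    notifyConnected();                         // flag + listener notification under statusLock
}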
