fix(grpc): Add keepalive and fix reconnect issue #777
base: main
Conversation
This commit addresses two issues related to gRPC connection stability and recovery.

1. **Half-open connections:** In unstable network environments, the agent could encounter half-open TCP connections where the server-side connection is terminated but the client side remains open. This caused the send queue to grow indefinitely without automatic recovery. To resolve this, this change introduces gRPC keepalive probes. The agent now sends keepalive pings to the collector, ensuring that dead connections are detected and pruned in a timely manner. Two new configuration parameters, `collector.grpc_keepalive_time` and `collector.grpc_keepalive_timeout`, have been added to control this behavior.
2. **Reconnect logic:** The existing reconnection logic did not immediately re-establish a connection if the same backend instance was selected during a reconnect attempt. This could delay re-establishing the connection by up to an hour. The logic has been updated so that the channel is always shut down and recreated, forcing an immediate reconnection attempt regardless of which backend is selected.
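For context, a minimal sketch of how client-side keepalive of this kind is typically wired with the standard io.grpc `ManagedChannelBuilder` API (illustrative only; the class and constant names below are not from this PR, and the actual change lives in GRPCChannel.java):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class KeepaliveChannelSketch {
    // Hypothetical constants mirroring the new parameters described above:
    // collector.grpc_keepalive_time and collector.grpc_keepalive_timeout (both in seconds).
    private static final long KEEPALIVE_TIME_SECONDS = 120;
    private static final long KEEPALIVE_TIMEOUT_SECONDS = 30;

    public static ManagedChannel build(String host, int port) {
        ManagedChannelBuilder<?> builder = ManagedChannelBuilder.forAddress(host, port)
                                                                .usePlaintext();
        if (KEEPALIVE_TIME_SECONDS > 0) {
            builder.keepAliveTime(KEEPALIVE_TIME_SECONDS, TimeUnit.SECONDS)       // ping after this much inactivity
                   .keepAliveTimeout(KEEPALIVE_TIMEOUT_SECONDS, TimeUnit.SECONDS) // declare the connection dead if no ack
                   .keepAliveWithoutCalls(true);                                  // ping even when no RPC is in flight
        }
        return builder.build();
    }
}
```

In gRPC terms, `keepAliveWithoutCalls(true)` is what makes the ping fire during idle periods, which is exactly the half-open scenario described above.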
I am confused about this. In our test, the agent reconnected quickly and automatically when the server rebooted.
The scenario you mentioned, where the agent quickly reconnects after a server reboot, typically occurs when the server shuts down cleanly, allowing TCP connections to terminate properly. However, the problem we encountered primarily arises in unstable network environments, where TCP connections enter a half-open state. In such situations: …

Regarding your point, "If nothing changed, there is no point to create a new channel": …
If you want to deal with a half-open status, you should check …
But if the server is back online, why does the connection stay in this half-open status? Neither timed out nor established? What actually happened? This does not seem like an expected status.
1. Restore original reconnection logic with TRANSIENT_FAILURE monitoring:
   - Keep original behavior: only force reconnect when a different server is selected
   - When the same server is selected, rely on gRPC's auto-reconnect mechanism
   - Add TRANSIENT_FAILURE state monitoring to detect prolonged failures
   - Force rebuild the channel if either reconnectCount or transientFailureCount exceeds the threshold
   - Add keepAliveWithoutCalls(true) to detect half-open connections
2. Fix race condition between reportError() and run():
   - Wrap all state changes (reconnect flag + notifications) in synchronized blocks
   - Prevents the reconnect flag and listener status from becoming inconsistent
   - Fixes a production issue where reconnect=false but listeners were in DISCONNECT state
3. Additional improvements:
   - Adjust keepalive default from 60s to 120s (reduces overhead by 50%)
   - Add getState() method to GRPCChannel for state monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Thanks for the feedback @wu-sheng! I've revised the implementation based on your suggestions:

**Changes Made**

**1. Restored original reconnection logic + added TRANSIENT_FAILURE monitoring**

I agree with your point: the original logic is correct, and we should only force reconnect when selecting a different server. When the same server is selected, we should rely on gRPC's auto-reconnect. What I've added:

This solves the half-open connection issue without disrupting gRPC's auto-reconnect mechanism. Key code:

```java
if (index == selectedIdx) {
    reconnectCount++;
    boolean forceReconnect = reconnectCount > threshold
        || transientFailureCount > threshold;
    if (forceReconnect) {
        createNewChannel(...); // Force rebuild
    } else if (managedChannel.isConnected(false)) {
        markAsConnected(); // Let gRPC auto-reconnect work
    }
}
```

Also added …

**2. Fixed race condition between reportError() and run()**

Issue from heap dump analysis: …

Fix:

```java
synchronized (statusLock) {
    reconnect = true;
    notify(GRPCChannelStatus.DISCONNECT);
}
```

**Additional Changes**

…

Let me know if this approach looks good!
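To illustrate the race this fix is meant to close, here is a simplified, hypothetical model of the reportError()/run() interaction; the field and method names are schematic and not copied from GRPCChannelManager:

```java
// Simplified model of the race described above: without the shared lock, run()
// could observe reconnect == false while listeners were still being switched
// to DISCONNECT. Names are illustrative only.
class ChannelStateSketch {
    private final Object statusLock = new Object();
    private volatile boolean reconnect = true;

    // Called from RPC callbacks when a network error is observed.
    void reportError() {
        synchronized (statusLock) {
            // Flag change and listener notification happen atomically.
            reconnect = true;
            notifyListeners("DISCONNECT");
        }
    }

    // Periodic reconnection check.
    void run() {
        synchronized (statusLock) {
            if (reconnect) {
                // ... rebuild or verify the channel ...
                reconnect = false;
                notifyListeners("CONNECTED");
            }
        }
    }

    private void notifyListeners(String status) {
        // Placeholder for GRPCChannelListener notifications.
    }
}
```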
I checked the code and found that SkyWalking DOES set a deadline for RPC calls, but there's still a critical difference:

**RPC Deadline vs Keepalive Timeout**

**1. SkyWalking's RPC Configuration**

From the agent code:

```java
StreamObserver<SegmentObject> upstreamSegmentStreamObserver = serviceStub.withDeadlineAfter(
    Config.Collector.GRPC_UPSTREAM_TIMEOUT, TimeUnit.SECONDS // Default: 30 seconds
).collect(...)
```

From the config:

```java
public static int GRPC_UPSTREAM_TIMEOUT = 30; // 30 seconds default
```

So RPC calls DO have a 30-second timeout!

**2. But Why Is Keepalive Still Needed?**

Here's the ROOT CAUSE: an RPC deadline timeout doesn't trigger reconnection! The critical code (in GRPCChannelManager):

```java
private boolean isNetworkError(Throwable throwable) {
    if (throwable instanceof StatusRuntimeException) {
        StatusRuntimeException statusRuntimeException = (StatusRuntimeException) throwable;
        return statusEquals(
            statusRuntimeException.getStatus(),
            Status.UNAVAILABLE, // ✅ Keepalive generates this
            Status.PERMISSION_DENIED,
            Status.UNAUTHENTICATED,
            Status.RESOURCE_EXHAUSTED,
            Status.UNKNOWN
            // ❌ But NOT Status.DEADLINE_EXCEEDED (RPC timeout generates this)
        );
    }
    return false;
}
```

Evidence from production logs: …

**3. The Real Problem**

The issue is what happens BETWEEN RPC calls: …

**4. Why My Production Showed a Growing Send-Queue**

In my production environment, the send-queue growth happened because: …

Keepalive solves this by detecting the broken connection BEFORE the next RPC burst arrives. So the answer is: the RPC deadline handles per-call timeouts, while keepalive handles connection health monitoring during idle periods. Both are needed!

**Additional Discussion Point**

Should DEADLINE_EXCEEDED be treated as a network error? Currently, …

However, in the specific case of half-open connections, DEADLINE_EXCEEDED becomes misleading because: …

My current fix solves this by: …

Alternative approach (for consideration): …

What do you think? Should we keep the current approach or also add DEADLINE_EXCEEDED to network errors?
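For comparison, the alternative floated above would roughly amount to also counting DEADLINE_EXCEEDED as a network error. A hedged sketch only: the real isNetworkError() uses SkyWalking's statusEquals(...) helper rather than a Set, and the class and constant names here are invented for illustration:

```java
import java.util.EnumSet;
import java.util.Set;

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class NetworkErrorSketch {
    // Sketch of the alternative under discussion: also treat DEADLINE_EXCEEDED
    // as a network error. Note the trade-off: every slow RPC, not just a
    // half-open connection, would then trigger the reconnect path.
    private static final Set<Status.Code> NETWORK_ERROR_CODES = EnumSet.of(
        Status.Code.UNAVAILABLE,
        Status.Code.PERMISSION_DENIED,
        Status.Code.UNAUTHENTICATED,
        Status.Code.RESOURCE_EXHAUSTED,
        Status.Code.UNKNOWN,
        Status.Code.DEADLINE_EXCEEDED // the proposed addition
    );

    static boolean isNetworkError(Throwable throwable) {
        if (throwable instanceof StatusRuntimeException) {
            Status status = ((StatusRuntimeException) throwable).getStatus();
            return NETWORK_ERROR_CODES.contains(status.getCode());
        }
        return false;
    }
}
```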
Pull Request Overview
This PR adds gRPC keepalive configuration support and improves channel state monitoring to better handle connection failures. The changes introduce configurable keepalive ping intervals and timeouts, while also refactoring the reconnection logic to detect and handle prolonged TRANSIENT_FAILURE states.
Key changes:
- Added gRPC keepalive configuration parameters (time and timeout) with defaults of 120s and 30s
- Introduced channel state monitoring to track TRANSIENT_FAILURE states
- Refactored reconnection logic to use helper methods and improved state management
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| agent.config | Added commented-out keepalive configuration properties with documentation |
| Config.java | Added GRPC_KEEPALIVE_TIME and GRPC_KEEPALIVE_TIMEOUT configuration fields |
| GRPCChannel.java | Implemented keepalive configuration and exposed getState() method |
| GRPCChannelManager.java | Refactored reconnection logic, added state monitoring, and improved thread safety |
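Based on the parameter names and defaults described in this PR (120s and 30s), the commented-out agent.config entries would presumably look roughly like the following; the exact wording and placeholder syntax in the file may differ:

```
# The interval in seconds to send a keepalive ping to the backend.
# A value less than or equal to 0 disables keepalive. Default: 120.
# collector.grpc_keepalive_time=120

# How long in seconds to wait for the keepalive ping acknowledgement before
# the connection is considered dead. Default: 30.
# collector.grpc_keepalive_timeout=30
```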
Code under review:

```java
    createNewChannel(ipAndPort[0], Integer.parseInt(ipAndPort[1]));
} else {
    // Same server, increment reconnectCount and check state
    reconnectCount++;
```
Copilot AI commented on Nov 7, 2025:
The reconnectCount variable is incremented here without synchronization (line 154), but is also reset in markAsConnected() which uses statusLock (line 248). This creates inconsistent synchronization. The increment should also be protected by statusLock to prevent race conditions between the increment and reset operations.
Suggested change:

```java
synchronized (statusLock) {
    reconnectCount++;
}
```
The reconnectCount modifications are safe because run() is executed by a single-threaded ScheduledExecutor, ensuring no concurrent access occurs.
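For readers less familiar with the threading model being referenced, here is a hypothetical sketch of why a single-threaded scheduler makes the unsynchronized increment safe; the names and the 5-second period are illustrative, not taken from GRPCChannelManager:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SingleThreadedRunSketch implements Runnable {
    // A single-threaded scheduler guarantees run() invocations never overlap,
    // so a counter mutated only inside run() has no concurrent writers.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    private int reconnectCount = 0; // touched only from run()

    void start() {
        scheduler.scheduleAtFixedRate(this, 0, 5, TimeUnit.SECONDS);
    }

    @Override
    public void run() {
        reconnectCount++; // safe without statusLock under this threading model
        // ... channel checking / reconnection logic ...
    }
}
```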
Code under review:

```java
/**
 * The interval in seconds to send a keepalive ping to the backend.
 * If this is less than or equal to 0, the keepalive is disabled.
 *
```
Copilot AI commented on Nov 7, 2025:
[nitpick] The GRPC_KEEPALIVE_TIME field is declared as long but the configuration comment in agent.config (line 105) uses it with TimeUnit.SECONDS in GRPCChannel.java (line 45). The configuration validation at line 44 in GRPCChannel checks if > 0, but according to gRPC documentation, keepalive time values below a certain threshold (typically 10 seconds) may be rejected by the server. Consider adding a comment documenting the minimum safe value or adding validation.
Suggested change:

```java
 *
 * <p>
 * <b>Note:</b> The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.
```
Thanks for the suggestion! I've added the documentation for the minimum safe keepalive time value in the following locations:
- Config.java (line 218): Added a note in the Javadoc for GRPC_KEEPALIVE_TIME: `* <p> * <b>Note:</b> The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.`
- agent.config (line 105): Added a comment in the configuration file: `# Note: The minimum safe value is 10 seconds. Values below this may be rejected by the gRPC server.`
I opted for documentation rather than validation because:
- The default value (120 seconds) is already well above the minimum threshold
- Users who explicitly configure a lower value may have specific use cases or testing scenarios
Let me know if you think validation should still be added!
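If validation were ever preferred over documentation, a minimal sketch might clamp the configured value before applying it. This is a hypothetical helper for illustration, not something in this PR:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannelBuilder;

final class KeepaliveValidationSketch {
    // Per the review note above, keepalive intervals below roughly 10 seconds
    // may be rejected by the server, so clamp the configured value upward.
    private static final long MIN_SAFE_KEEPALIVE_TIME_SECONDS = 10;

    static void applyKeepalive(ManagedChannelBuilder<?> builder,
                               long configuredSeconds,
                               long timeoutSeconds) {
        if (configuredSeconds <= 0) {
            return; // keepalive disabled
        }
        long effective = Math.max(configuredSeconds, MIN_SAFE_KEEPALIVE_TIME_SECONDS);
        builder.keepAliveTime(effective, TimeUnit.SECONDS)
               .keepAliveTimeout(timeoutSeconds, TimeUnit.SECONDS)
               .keepAliveWithoutCalls(true);
    }
}
```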
A key question before checking the code: as you said, …
Changes:
- Remove transientFailureCount mechanism, as TRANSIENT_FAILURE already triggers UNAVAILABLE exceptions handled by reportError()
- Remove checkChannelStateAndTriggerReconnectIfNeeded() method to simplify logic
- Rename markAsConnected() to notifyConnected() for better clarity on method responsibility
- Only reset reconnectCount in createNewChannel() after an actual channel rebuild, to handle half-open connections
- Remove unnecessary else branch in run() method logging
- Add documentation about minimum safe keepalive time (10 seconds) in Config.java
- Remove unused stableConnectionCount field

Key improvement: reconnectCount will continue to accumulate even when isConnected() returns false positives, ensuring a forced channel rebuild after the threshold is exceeded. This solves the issue where connections could remain in a half-open state for extended periods.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Force-pushed from 8d1e883 to c1b8ab4.
…nfig

Add documentation noting that the minimum safe keepalive time value is 10 seconds, as values below this threshold may be rejected by the gRPC server according to gRPC policies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
You're absolutely right! After reviewing the code more carefully, I realized that monitoring the TRANSIENT_FAILURE state is indeed redundant. Here's why: …

I've removed the transientFailureCount mechanism.

**The Real Problem and Solution**

After analyzing production logs, I found the actual issue was with the original reconnection logic.

Original code problem:

```java
} else if (managedChannel.isConnected(++reconnectCount > 5)) {
    reconnectCount = 0; // Reset counter when isConnected() returns true
    reconnect = false;
}
```

When isConnected() returned a false positive on a half-open connection, reconnectCount was reset to 0, so the forced rebuild never triggered.

Current solution:

```java
if (reconnectCount > Config.Agent.FORCE_RECONNECTION_PERIOD) {
    // Force rebuild channel
    createNewChannel(...);
} else if (managedChannel.isConnected(false)) {
    // Trust the connection but DON'T reset reconnectCount
    notifyConnected();
}
```

Now reconnectCount keeps accumulating even if isConnected() returns false positives, so the channel is force-rebuilt once the threshold is exceeded.

This matches the original intent of FORCE_RECONNECTION_PERIOD.