-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-29431 Update the 'ExcludeDNs' information with the cause in RS UI #7126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ed04830
to
eef3bb3
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
eef3bb3
to
0ee6fe7
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates how RegionServer displays excluded datanodes by including the reason (cause) for exclusion along with existing information, which helps in diagnosing issues related to network errors or slow performance.
- Updates the formatting of excluded datanode information to include both timestamp and cause.
- Refactors literal strings to use a dedicated enum (ExcludeCause) for consistent error cause messages.
- Adapts usage in multiple modules (region server, async fs monitor, output helper) to follow the new cause inclusion design.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperImpl.java | Updates the formatting of the excluded DN details by including timestamp and cause using the new Pair object. |
hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/monitor/StreamSlowMonitor.java | Refactors usage to replace string literal with the ExcludeCause enum for slow packet ack. |
hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/monitor/ExcludeDatanodeManager.java | Updates the excludeDNsCache to store a Pair (cause, timestamp) and introduces the ExcludeCause enum. |
hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutputHelper.java | Adjusts exclusion logic to use the ExcludeCause enum for connect errors. |
Comments suppressed due to low confidence (1)
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperImpl.java:438
- Consider using named accessor methods rather than getFirst() and getSecond() (or adding inline comments) so that it is immediately clear which value represents the cause and which is the timestamp. Also, verify that displaying the timestamp before the cause aligns with UI expectations.
.collect(Collectors.toList());
@@ -78,7 +79,7 @@ public ExcludeDatanodeManager(Configuration conf) { | |||
public boolean tryAddExcludeDN(DatanodeInfo datanodeInfo, String cause) { | |||
boolean alreadyMarkedSlow = getExcludeDNs().containsKey(datanodeInfo); | |||
if (!alreadyMarkedSlow) { | |||
excludeDNsCache.put(datanodeInfo, EnvironmentEdgeManager.currentTime()); | |||
excludeDNsCache.put(datanodeInfo, new Pair(cause, EnvironmentEdgeManager.currentTime())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider replacing the generic Pair with a dedicated value type (e.g., a small class or record) that provides self-documenting field names to improve long-term code clarity and maintainability.
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will generate a compile warning? At least use new Pair<>(cause, time)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9
Thanks for looking at this changes.
I have addressed your comment pls check.
@@ -78,7 +79,7 @@ public ExcludeDatanodeManager(Configuration conf) { | |||
public boolean tryAddExcludeDN(DatanodeInfo datanodeInfo, String cause) { | |||
boolean alreadyMarkedSlow = getExcludeDNs().containsKey(datanodeInfo); | |||
if (!alreadyMarkedSlow) { | |||
excludeDNsCache.put(datanodeInfo, EnvironmentEdgeManager.currentTime()); | |||
excludeDNsCache.put(datanodeInfo, new Pair(cause, EnvironmentEdgeManager.currentTime())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will generate a compile warning? At least use new Pair<>(cause, time)
?
return excludeDatanodeManager.getExcludeDNs().entrySet().stream() | ||
.map(e -> e.getKey().toString() + ", " + e.getValue()).collect(Collectors.toList()); | ||
return excludeDatanodeManager.getExcludeDNs().entrySet().stream().map(e -> e.getKey().toString() | ||
+ " - " + e.getValue().getSecond() + " - " + e.getValue().getFirst()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old code we use ',' as separator, but in the new code we change to all use '-'? Could you please give some examples about how it looks like in the new implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In RS Wals page, for WAL Exclude DNs It will display like
DatanodeInfoStorage[<ip:port,DatanodeStrorageID,DISK] - TimeStamp - cause
example:
DatanodeInfoStorage[127.0.0.1:50074,DS-fd74b292-e398-4206-9253-99734b557ae2,DISK] - 1724043945082 - connect error
DatanodeInfoStorage[127.0.0.2:50074,DS-fd74b292-e398-4206-9253-99734b557ae4,DISK] - 1724043945678 - slow packet ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -78,7 +79,7 @@ public ExcludeDatanodeManager(Configuration conf) { | |||
public boolean tryAddExcludeDN(DatanodeInfo datanodeInfo, String cause) { | |||
boolean alreadyMarkedSlow = getExcludeDNs().containsKey(datanodeInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does cache support computeIfAbsent like methods? If so we do not need to access it twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache doesNot support computeIfAbsent method.
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
RS may exclude DNs due to network issues or slow performance. Display the excluded DNs along with their causes in the UI.