HBASE-29431 Update the 'ExcludeDNs' information with the cause in RS UI #7126

Open: wants to merge 2 commits into master

srinireddy2020 (Contributor)

The RegionServer (RS) may exclude DataNodes (DNs) because of network issues or slow performance. This PR displays the excluded DNs, together with the cause of each exclusion, in the RS UI.

Apache9 requested a review from Copilot on July 1, 2025 07:28

Copilot AI left a comment

Pull Request Overview

This PR updates how the RegionServer displays excluded DataNodes: each entry now includes the reason (cause) for the exclusion alongside the existing information, which helps diagnose issues such as network errors or slow performance.

  • Updates the formatting of excluded datanode information to include both timestamp and cause.
  • Refactors literal strings to use a dedicated enum (ExcludeCause) for consistent error cause messages.
  • Adapts usage in multiple modules (region server, async fs monitor, output helper) to follow the new cause inclusion design.
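The enum refactoring described above could look roughly like the following minimal sketch. The constant names (`CONNECT_ERROR`, `SLOW_PACKET_ACK`) and the `getCause()` accessor are hypothetical; only the display strings "connect error" and "slow packet ack" come from this thread. Because `tryAddExcludeDN` takes the cause as a `String`, the sketch maps each constant to its display text.

```java
// Hypothetical sketch of an ExcludeCause enum: constant names and the accessor
// are assumptions, not copied from the PR. The strings match the causes shown
// on the RS WALs page in this thread.
enum ExcludeCause {
  CONNECT_ERROR("connect error"),
  SLOW_PACKET_ACK("slow packet ack");

  private final String cause;

  ExcludeCause(String cause) {
    this.cause = cause;
  }

  /** Human-readable cause displayed in the RS UI. */
  String getCause() {
    return cause;
  }
}

class ExcludeCauseDemo {
  public static void main(String[] args) {
    // Call sites would pass ExcludeCause.X.getCause() instead of repeating a
    // string literal, e.g. tryAddExcludeDN(dn, ExcludeCause.SLOW_PACKET_ACK.getCause()).
    for (ExcludeCause c : ExcludeCause.values()) {
      System.out.println(c.name() + " -> " + c.getCause());
    }
  }
}
```

Centralizing the strings this way keeps the UI text consistent across the output helper and the slow-stream monitor.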

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperImpl.java | Updates the formatting of the excluded DN details by including timestamp and cause using the new Pair object. |
| hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/monitor/StreamSlowMonitor.java | Refactors usage to replace string literal with the ExcludeCause enum for slow packet ack. |
| hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/monitor/ExcludeDatanodeManager.java | Updates the excludeDNsCache to store a Pair (cause, timestamp) and introduces the ExcludeCause enum. |
| hbase-asyncfs/src/main/java/org/apache/hadoop/hbase/io/asyncfs/FanOutOneBlockAsyncDFSOutputHelper.java | Adjusts exclusion logic to use the ExcludeCause enum for connect errors. |
Comments suppressed due to low confidence (1)

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperImpl.java:438

  • Consider using named accessor methods rather than getFirst() and getSecond() (or adding inline comments) so that it is immediately clear which value represents the cause and which is the timestamp. Also, verify that displaying the timestamp before the cause aligns with UI expectations.
      .collect(Collectors.toList());

@@ -78,7 +79,7 @@ public ExcludeDatanodeManager(Configuration conf) {
   public boolean tryAddExcludeDN(DatanodeInfo datanodeInfo, String cause) {
     boolean alreadyMarkedSlow = getExcludeDNs().containsKey(datanodeInfo);
     if (!alreadyMarkedSlow) {
-      excludeDNsCache.put(datanodeInfo, EnvironmentEdgeManager.currentTime());
+      excludeDNsCache.put(datanodeInfo, new Pair(cause, EnvironmentEdgeManager.currentTime()));

Copilot AI Jul 1, 2025

[nitpick] Consider replacing the generic Pair with a dedicated value type (e.g., a small class or record) that provides self-documenting field names to improve long-term code clarity and maintainability.
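A minimal sketch of the suggested dedicated value type, assuming a Java record is acceptable (the CI runs on JDK 17, which supports records). The name `ExcludeInfo` and its field names are hypothetical, not taken from the PR.

```java
import java.util.Objects;

// Hypothetical value type replacing Pair<String, Long>; record and field names are assumptions.
record ExcludeInfo(String cause, long excludedAtMillis) {
  ExcludeInfo {
    // Compact constructor: reject a missing cause so the UI never shows "null".
    Objects.requireNonNull(cause, "cause");
  }
}

class ExcludeInfoDemo {
  public static void main(String[] args) {
    ExcludeInfo info = new ExcludeInfo("slow packet ack", 1724043945678L);
    // Named accessors make it clear which value is which,
    // unlike Pair.getFirst() / Pair.getSecond().
    System.out.println(info.excludedAtMillis() + " - " + info.cause());
  }
}
```

A record also provides `equals`, `hashCode`, and `toString` for free, so it can be stored in a cache and logged without extra code.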

Contributor

I think this will generate a compile warning? At least use new Pair<>(cause, time)?
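To illustrate the reviewer's point: with a generic class, `new Pair(cause, time)` uses the raw type, so javac flags unchecked operations when the result is passed where a `Pair<String, Long>` is expected (the details appear with `-Xlint:unchecked`). The diamond form `new Pair<>(cause, time)` lets the compiler infer the type arguments. The `Pair` below is a minimal stand-in for illustration, not HBase's own `org.apache.hadoop.hbase.util.Pair`.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal generic pair, a stand-in for illustration only.
final class Pair<A, B> {
  private final A first;
  private final B second;

  Pair(A first, B second) {
    this.first = first;
    this.second = second;
  }

  A getFirst() { return first; }

  B getSecond() { return second; }
}

class DiamondDemo {
  public static void main(String[] args) {
    Map<String, Pair<String, Long>> excluded = new HashMap<>();
    String cause = "connect error";
    long time = 1724043945082L;

    // Raw type: compiles, but javac reports unchecked operations.
    // excluded.put("dn1", new Pair(cause, time));

    // Diamond: <String, Long> is inferred from the target type, no warning.
    excluded.put("dn1", new Pair<>(cause, time));

    Pair<String, Long> p = excluded.get("dn1");
    System.out.println(p.getSecond() + " - " + p.getFirst());
  }
}
```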

Contributor Author

@Apache9
Thanks for looking at these changes.

I have addressed your comment; please check.

-    return excludeDatanodeManager.getExcludeDNs().entrySet().stream()
-      .map(e -> e.getKey().toString() + ", " + e.getValue()).collect(Collectors.toList());
+    return excludeDatanodeManager.getExcludeDNs().entrySet().stream().map(e -> e.getKey().toString()
+      + " - " + e.getValue().getSecond() + " - " + e.getValue().getFirst())
Contributor

In the old code we used ',' as the separator, but the new code uses '-' throughout? Could you please give some examples of what it looks like with the new implementation?

Contributor Author

srinireddy2020, Jul 10, 2025

On the RS WALs page, each entry under WAL Exclude DNs will be displayed as:

DatanodeInfoStorage[<ip:port>,<DatanodeStorageID>,DISK] - <timestamp> - <cause>

Examples:
DatanodeInfoStorage[127.0.0.1:50074,DS-fd74b292-e398-4206-9253-99734b557ae2,DISK] - 1724043945082 - connect error

DatanodeInfoStorage[127.0.0.2:50074,DS-fd74b292-e398-4206-9253-99734b557ae4,DISK] - 1724043945678 - slow packet ack
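The display format above (`<datanode> - <timestamp> - <cause>`) can be reproduced with a short stream pipeline. In this sketch the datanode is represented by its `toString()` text and the value is a hypothetical `ExcludeInfo` record; in the PR itself the value is a `Pair` whose `getFirst()` is the cause and `getSecond()` is the timestamp.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical value type; the PR stores Pair<cause, timestamp> instead.
record ExcludeInfo(String cause, long excludedAtMillis) {}

class ExcludeDNsFormatter {
  /** Formats each entry as "<datanode> - <timestamp> - <cause>", as on the RS WALs page. */
  static List<String> format(Map<String, ExcludeInfo> excluded) {
    return excluded.entrySet().stream()
        .map(e -> e.getKey() + " - " + e.getValue().excludedAtMillis()
            + " - " + e.getValue().cause())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order so the printed output is deterministic.
    Map<String, ExcludeInfo> excluded = new LinkedHashMap<>();
    excluded.put("DatanodeInfoStorage[127.0.0.1:50074,DS-fd74b292-e398-4206-9253-99734b557ae2,DISK]",
        new ExcludeInfo("connect error", 1724043945082L));
    excluded.put("DatanodeInfoStorage[127.0.0.2:50074,DS-fd74b292-e398-4206-9253-99734b557ae4,DISK]",
        new ExcludeInfo("slow packet ack", 1724043945678L));
    format(excluded).forEach(System.out::println);
  }
}
```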

Contributor Author

[image]

@@ -78,7 +79,7 @@ public ExcludeDatanodeManager(Configuration conf) {
public boolean tryAddExcludeDN(DatanodeInfo datanodeInfo, String cause) {
boolean alreadyMarkedSlow = getExcludeDNs().containsKey(datanodeInfo);
Contributor

Does the cache support computeIfAbsent-like methods? If so, we do not need to access it twice.

Contributor Author

The cache does not support a computeIfAbsent method.
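For comparison only: on a plain `java.util.concurrent.ConcurrentMap` (used here as a stand-in; it is not the cache type used by `ExcludeDatanodeManager`), `putIfAbsent` performs the check and the insert in one atomic call, which is the single-access pattern the reviewer asked about. The class and method names below are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the exclude-DN cache, keyed by the datanode's string form.
class ExcludeRegistry {
  record Entry(String cause, long excludedAtMillis) {}

  private final ConcurrentMap<String, Entry> excluded = new ConcurrentHashMap<>();

  /**
   * Returns true if the datanode was newly excluded, false if it was already present.
   * putIfAbsent checks and inserts atomically, so the map is accessed only once.
   */
  boolean tryAdd(String datanode, String cause, long nowMillis) {
    return excluded.putIfAbsent(datanode, new Entry(cause, nowMillis)) == null;
  }

  /** Immutable copy of the current exclusions. */
  Map<String, Entry> snapshot() {
    return Map.copyOf(excluded);
  }

  public static void main(String[] args) {
    ExcludeRegistry r = new ExcludeRegistry();
    System.out.println(r.tryAdd("dn1", "connect error", 1L));   // true
    System.out.println(r.tryAdd("dn1", "slow packet ack", 2L)); // false, first cause kept
    System.out.println(r.snapshot().get("dn1").cause());        // connect error
  }
}
```

One trade-off of `putIfAbsent` is that an `Entry` is allocated even when the datanode is already excluded; `computeIfAbsent` avoids that allocation at the cost of a lambda.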

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 🆗 | reexec | 0m 27s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 0m 12s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 3m 8s | | master passed |
| +1 💚 | compile | 3m 33s | | master passed |
| +1 💚 | checkstyle | 0m 44s | | master passed |
| +1 💚 | spotbugs | 1m 53s | | master passed |
| +1 💚 | spotless | 0m 47s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 11s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 2m 54s | | the patch passed |
| +1 💚 | compile | 3m 28s | | the patch passed |
| +1 💚 | javac | 3m 28s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 43s | | the patch passed |
| +1 💚 | spotbugs | 2m 4s | | the patch passed |
| +1 💚 | hadoopcheck | 11m 11s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 42s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 17s | | The patch does not generate ASF License warnings. |
| | | 39m 51s | | |
| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7126/4/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #7126 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 95dc46929eac 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / dbd8f65 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 85 (vs. ulimit of 30000) |
| modules | C: hbase-asyncfs hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7126/4/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

srinireddy2020 requested a review from Apache9 on July 10, 2025 18:34

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 🆗 | reexec | 0m 28s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 0m 10s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 3m 16s | | master passed |
| +1 💚 | compile | 1m 14s | | master passed |
| +1 💚 | javadoc | 0m 40s | | master passed |
| +1 💚 | shadedjars | 6m 8s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 12s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 3m 10s | | the patch passed |
| +1 💚 | compile | 1m 14s | | the patch passed |
| +1 💚 | javac | 1m 14s | | the patch passed |
| +1 💚 | javadoc | 0m 39s | | the patch passed |
| +1 💚 | shadedjars | 6m 7s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 1m 4s | | hbase-asyncfs in the patch passed. |
| -1 ❌ | unit | 272m 43s | /patch-unit-hbase-server.txt | hbase-server in the patch failed. |
| | | 302m 58s | | |
| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7126/4/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #7126 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux eac6fedd1eb5 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / dbd8f65 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7126/4/testReport/ |
| Max. process+thread count | 4429 (vs. ulimit of 30000) |
| modules | C: hbase-asyncfs hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7126/4/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

Labels: None yet
Projects: None yet
3 participants