[CELEBORN-1831] Add ratis commitIndex metrics #3063

Closed
wants to merge 10 commits into from

Contributor

@zaynt4606 zaynt4606 commented Jan 13, 2025

What changes were proposed in this pull request?

Add two metrics: the Raft commitIndex of each master, and the difference between the largest and smallest commitIndex across masters (maxCommitIndex - minCommitIndex).
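
A minimal sketch of the second metric, assuming each master's commit index has already been collected into a sequence. How the PR gathers the per-master values is not shown in this description, so `CommitIndexDiffSketch` and `commitIndexDiff` are hypothetical names used only for illustration.

```scala
// Hypothetical helper, not Celeborn code: computes maxCommitIndex - minCommitIndex.
object CommitIndexDiffSketch {
  // Returns 0 when no indexes are known, so the gauge stays well defined.
  def commitIndexDiff(commitIndexes: Seq[Long]): Long =
    if (commitIndexes.isEmpty) 0L else commitIndexes.max - commitIndexes.min
}
```

For example, `CommitIndexDiffSketch.commitIndexDiff(Seq(100L, 98L, 100L))` evaluates to 2. A value that stays near zero suggests the masters are keeping up with each other.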

Why are the changes needed?

To observe the metadata synchronization of the raft cluster.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Cluster test.
image

@zaynt4606 zaynt4606 marked this pull request as draft January 13, 2025 07:20
@SteNicholas
Member

@zaynt4606, is it better to introduce Ratis metrics to cover the HA metrics?

@zaynt4606
Contributor Author

> @zaynt4606, is it better to introduce Ratis metrics to cover the HA metrics?

Are there Ratis metrics that already exist? I want to add each master's commitIndex as a metric so that the metadata synchronization of the Raft cluster can be observed in the master panels, like this.
image

@zaynt4606 zaynt4606 changed the title [CELEBORN-1831] Add master ha commitIndex metrics [CELEBORN-1831] Add ratis commitIndex metrics Jan 13, 2025
@zaynt4606 zaynt4606 marked this pull request as ready for review January 13, 2025 12:45
@@ -60,6 +60,10 @@ object MasterSource {

   val OFFER_SLOTS_TIME = "OfferSlotsTime"

+  val MASTER_COMMIT_INDEX = "MasterCommitIndex"
+
+  val MASTER_COMMIT_INDEX_DIFF = "MasterCommitIndexDiff"
Member

@turboFei turboFei Jan 14, 2025

RatisCommitIndex and RatisCommitIndexDiff are more straightforward.

Contributor Author

Done~

@@ -285,6 +286,10 @@ private[celeborn] class Master(
     statusSystem.decommissionWorkers.size()
   }

+  masterSource.addGauge(MasterSource.MASTER_COMMIT_INDEX) { () => getMasterRaftCommitIndex._1 }
+
+  masterSource.addGauge(MasterSource.MASTER_COMMIT_INDEX_DIFF) { () => getMasterRaftCommitIndex._2 }
Member

Only call addGauge if haEnabled is true.

Contributor Author

Yes, it has been updated.
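
A rough sketch of what guarding the registration on the HA flag could look like. `GaugeRegistry` is a hypothetical stand-in for the real metrics source, and the gauge name is only a placeholder taken from the naming suggestion earlier in this review; the actual Celeborn code is not shown here.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a metrics source; not Celeborn's MasterSource.
final class GaugeRegistry {
  private val gauges = mutable.LinkedHashMap.empty[String, () => Long]

  // Registers a named gauge whose value is read lazily on each scrape.
  def addGauge(name: String)(read: () => Long): Unit = gauges.update(name, read)

  def names: Set[String] = gauges.keySet.toSet

  def read(name: String): Option[Long] = gauges.get(name).map(_.apply())
}

object HaGaugeSketch {
  // Placeholder metric name used for illustration only.
  val RatisCommitIndex = "RatisCommitIndex"

  // Registers the gauge only when HA is enabled, so non-HA masters expose nothing.
  def register(registry: GaugeRegistry, haEnabled: Boolean)(index: () => Long): Unit =
    if (haEnabled) registry.addGauge(RatisCommitIndex)(index)
}
```

With this shape, a master running without HA never exposes the gauge at all, rather than reporting a constant 0 that could be mistaken for a real index.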

Member

@turboFei turboFei left a comment

LGTM

assets/grafana/celeborn-dashboard.json (two outdated review threads, resolved)
@SteNicholas
Member

@zaynt4606, Ratis supports metrics; see https://github.com/apache/ratis/blob/master/ratis-docs/src/site/markdown/metrics.md.

@zaynt4606
Contributor Author

zaynt4606 commented Jan 16, 2025

> @zaynt4606, Ratis supports metrics; see https://github.com/apache/ratis/blob/master/ratis-docs/src/site/markdown/metrics.md.

I replaced commitIndex with applyCompletedIndex from Ratis.

codecov bot commented Jan 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.54%. Comparing base (61c90e3) to head (5f21d95).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3063      +/-   ##
==========================================
+ Coverage   32.52%   32.54%   +0.02%     
==========================================
  Files         336      336              
  Lines       20053    20055       +2     
  Branches     1796     1796              
==========================================
+ Hits         6520     6524       +4     
+ Misses      13168    13167       -1     
+ Partials      365      364       -1     


Comment on lines 1487 to 1501
private def getRatisApplyCompletedIndex: Long = {
  if (conf.haEnabled) {
    val ratisServer = statusSystem.asInstanceOf[HAMasterMetaManager].getRatisServer
    if (ratisServer != null) {
      val stateMachine = ratisServer.getMasterStateMachine
      val lastAppliedIndex = stateMachine.getLastAppliedTermIndex.getIndex
      lastAppliedIndex
    } else {
      0
    }
  } else {
    0
  }
}

Contributor

Suggested change
- private def getRatisApplyCompletedIndex: Long = {
-   if (conf.haEnabled) {
-     val ratisServer = statusSystem.asInstanceOf[HAMasterMetaManager].getRatisServer
-     if (ratisServer != null) {
-       val stateMachine = ratisServer.getMasterStateMachine
-       val lastAppliedIndex = stateMachine.getLastAppliedTermIndex.getIndex
-       lastAppliedIndex
-     } else {
-       0
-     }
-   } else {
-     0
-   }
- }
+ private def getRatisApplyCompletedIndex: Long = {
+   if (conf.haEnabled) {
+     val ratisServer = statusSystem.asInstanceOf[HAMasterMetaManager].getRatisServer
+     if (ratisServer != null) {
+       ratisServer.getMasterStateMachine.getLastAppliedTermIndex.getIndex
+     }
+   }
+   0
+ }
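
One caveat with the flattened form suggested above: in Scala, the value of an `if` without an `else` is discarded when it is not the last expression of the block, so a trailing `0` becomes the method result on every path. The sketch below illustrates this with hypothetical stub types (`StateMachineStub`, `RatisServerStub`) rather than the real Celeborn or Ratis classes, and shows one possible null-safe alternative using `Option`. It is an illustration only, not the code that was merged.

```scala
// Hypothetical stand-ins for the Ratis server and state machine; not real Celeborn classes.
final case class StateMachineStub(lastAppliedIndex: Long)
final case class RatisServerStub(stateMachine: StateMachineStub)

object ApplyIndexSketch {
  // Flattened form: the nested `if` has no `else`, so its value is thrown away
  // and the trailing 0 is always returned (the compiler may warn about this).
  def flattened(haEnabled: Boolean, server: RatisServerStub): Long = {
    if (haEnabled) {
      if (server != null) {
        server.stateMachine.lastAppliedIndex
      }
    }
    0
  }

  // Alternative: wrap the possibly-null server in Option and fall back to 0.
  def viaOption(haEnabled: Boolean, server: RatisServerStub): Long =
    if (haEnabled) Option(server).map(_.stateMachine.lastAppliedIndex).getOrElse(0L)
    else 0L
}
```

Keeping the explicit `else` branches, as in the original method, is an equally valid way to avoid the problem.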

Contributor Author

Thanks! Done

Contributor

@RexXiong RexXiong left a comment

LGTM

@RexXiong RexXiong closed this in ac0d335 Jan 17, 2025
@RexXiong
Contributor

Thanks, merged to main (v0.6.0).
