-
Notifications
You must be signed in to change notification settings - Fork 756
Add nexthop group usage check in route_check #4137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add nexthop group usage check in route_check #4137
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a nexthop group usage monitoring feature to the route_check script to prevent resource exhaustion issues. The implementation retrieves CRM (Critical Resource Monitoring) statistics from the COUNTERS_DB and alerts when nexthop group usage exceeds 80%, helping detect similar issues that previously caused traffic impact due to excessive nexthop group consumption.
Key Changes:
- Added CRM-based nexthop group usage monitoring with an 80% threshold check
- Integrated the check into the existing route validation flow in
check_routes_for_namespace() - Added support for COUNTERS_DB access to retrieve CRM statistics
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@xwjiang-ms how this trigger an alert? By syslog err? |
|
could you also check the UT failures? |
Co-authored-by: Copilot <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Copilot <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Copilot <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
@StormLiangMS the result will be collected to results array and return -1 if results has any contents |
|
@xwjiang-ms could you check the PR failures? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw run |
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
Previously, due to a bug in Broadcom SAI, the system incorrectly created next-hop groups with a size of 16.
This resulted in excessive next-hop group consumption, which eventually limited the number of available ECMP routes and caused traffic impact.
To prevent similar issues from going unnoticed, I added a next-hop group usage check to route_check.
If the next-hop group usage exceeds 80%, the script will report an error.
How I did it
Get CRM stats from counters DB, and get nexthop usage, then compare the usage with threshold, report error if usage exceeded threshold.
How to verify it
Verified on lab device.
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)