@xwjiang-ms
Contributor

What I did

Previously, due to a bug in Broadcom SAI, the system incorrectly created next-hop groups with a size of 16.
This resulted in excessive next-hop group consumption, which eventually limited the number of available ECMP routes and caused traffic impact.
To prevent similar issues from going unnoticed, I added a next-hop group usage check to route_check.
If the next-hop group usage exceeds 80%, the script will report an error.

How I did it

Read the CRM stats from COUNTERS_DB, compute the next-hop group usage from them, and compare it against the threshold. If the usage exceeds the threshold, report an error.
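As a rough illustration of the comparison described here, the following is a minimal sketch and not the code from this PR. It assumes the CRM stats have already been read from COUNTERS_DB into a plain dict of strings, and the field names `crm_stats_nexthop_group_used` and `crm_stats_nexthop_group_available` are assumptions about the CRM schema. The helper names are hypothetical.

```python
# Hypothetical sketch: names below are illustrative, not taken from route_check.
NHG_USAGE_THRESHOLD = 0.8  # the 80% threshold stated in the PR description


def nexthop_group_usage(crm_stats):
    """Return used / (used + available), or None if no capacity is reported."""
    # Field names are assumed; counters are stored as strings in the DB.
    used = int(crm_stats.get("crm_stats_nexthop_group_used", 0))
    available = int(crm_stats.get("crm_stats_nexthop_group_available", 0))
    total = used + available
    if total == 0:
        return None  # stats missing or not yet populated
    return used / total


def check_nexthop_group_usage(crm_stats, threshold=NHG_USAGE_THRESHOLD):
    """Return an error message if usage exceeds the threshold, else None."""
    usage = nexthop_group_usage(crm_stats)
    if usage is None or usage <= threshold:
        return None
    return "nexthop group usage {:.1%} exceeds threshold {:.0%}".format(
        usage, threshold
    )
```

Returning `None` when no capacity is reported is a design choice of this sketch: it avoids a false alarm (and a division by zero) on platforms where the counters are absent.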

How to verify it

Verified on a lab device.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Copilot AI review requested due to automatic review settings November 27, 2025 03:47
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot finished reviewing on behalf of xwjiang-ms November 27, 2025 03:50
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a nexthop group usage monitoring feature to the route_check script to prevent resource exhaustion issues. The implementation retrieves CRM (Critical Resource Monitoring) statistics from the COUNTERS_DB and alerts when nexthop group usage exceeds 80%, helping detect similar issues that previously caused traffic impact due to excessive nexthop group consumption.

Key Changes:

  • Added CRM-based nexthop group usage monitoring with an 80% threshold check
  • Integrated the check into the existing route validation flow in check_routes_for_namespace()
  • Added support for COUNTERS_DB access to retrieve CRM statistics

@StormLiangMS
Contributor

@xwjiang-ms how does this trigger an alert? Through a syslog error?

@StormLiangMS
Contributor

Could you also check the unit test (UT) failures?

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xwjiang-ms
Contributor Author

@xwjiang-ms how does this trigger an alert? Through a syslog error?

@StormLiangMS The result is collected into the results array, and the script returns -1 if results has any contents.
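The reporting path described in this reply could look roughly like the sketch below. It is an assumption-laden illustration, not the script's actual structure: `run_checks` and the check callables are hypothetical names, and only the "collect failures, return -1 if any" behavior comes from the comment above.

```python
# Hypothetical sketch of the "collect into results, return -1" pattern.
def run_checks(checks):
    """Run each check; return (-1, results) if any check reported a failure."""
    results = []
    for name, check in checks:
        failure = check()  # each check returns None on success
        if failure:
            results.append({name: failure})
    # A non-empty results list is what turns into the -1 return code.
    return (-1 if results else 0), results


# Example usage with stand-in checks:
ret, results = run_checks([
    ("routes", lambda: None),
    ("nexthop_group_usage", lambda: "usage 90.0% exceeds threshold 80%"),
])
```

In this sketch the caller decides how to surface the failure (for example by logging `results`), which matches the reply's point that the alert comes from the non-empty results and the -1 return value.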

@StormLiangMS
Contributor

@xwjiang-ms could you check the PR failures?

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xwjiang-ms
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
