diff --git a/doc/ecmp/fine_grained_next_hop_hld.md b/doc/ecmp/fine_grained_next_hop_hld.md
index 389fe8b40b8..c90b88dfafd 100644
--- a/doc/ecmp/fine_grained_next_hop_hld.md
+++ b/doc/ecmp/fine_grained_next_hop_hld.md
@@ -43,6 +43,7 @@
 | 1.3 | 10/23/2020 | Anish Narsian | Interface nh oper state handler |
 | 1.4 | 12/21/2020 | Anish Narsian | Match Mode changes |
 | 1.5 | 09/16/2024 | Ashutosh Agrawal/Manas Mandal | Added prefix-based match mode |
+| 1.6 | 11/24/2025 | Anish Narsian | VNET_ROUTE_TUNNEL consistent hashing |
 
 # About this Manual
 This document provides the high level design for the Fine Grained ECMP feature implementation in SONiC
@@ -90,6 +91,9 @@ Phase #1
 Phase #2
 - CLI commands to configure Fine Grained ECMP
 
+Phase #3:
+- Ability to enable consistent hashing for Vxlan tunnel next hops (https://github.com/sonic-net/SONiC/pull/2099/files)
+
 ## 1.2 Orchagent requirements
 ### FgNhg orchagent:
  - Should be able to create Fine Grained Next-hop groups
@@ -99,6 +103,9 @@ Phase #2
 ### Route orchagent:
  - Should be able to redirect route and next-hop modifications to fgNhg orchagent for prefixes or next-hops which have a Fine Grained definition
 
+### Vnet orchagent:
+ - Should be able to redirect vxlan tunnel nexthop group creation and modification to fgNhg orchagent for prefixes which require Fine Grained behavior. More details: https://github.com/sonic-net/SONiC/pull/2099/files
+
 ## 1.3 CLI requirements
 - User should be able to add/delete/view Fine Grained Next-hop groups
 - User should be able to view the configured state of fine grained groups
@@ -169,7 +176,7 @@ Please refer to the [schema](https://github.com/sonic-net/sonic-swss/blob/master
 Following new table will be added to State DB. Unless otherwise stated, the attributes are mandatory.
 FG_ROUTE_TABLE is used for some of the show commands associated with this feature as well as for warm boot support.
 ```
-FG_ROUTE_TABLE|{{IPv4 OR IPv6 prefix}}:
+FG_ROUTE_TABLE|{{VRF/VNET-name}}|{{IPv4 OR IPv6 prefix}}:
     "0": {{next-hop-key}}
     "1": {{next-hop-key}}
     ...
@@ -180,7 +187,7 @@ FG_ROUTE_TABLE|{{IPv4 OR IPv6 prefix}}:
 ### 2.2.1 StateDB Schemas
 ```
 ; Defines schema for FG ROUTE TABLE state db attributes
-key = FG_ROUTE_TABLE|{{IPv4 OR IPv6 prefix}} ; Prefix associated with this route
+key = FG_ROUTE_TABLE|{{VRF/VNET-name}}|{{IPv4 OR IPv6 prefix}} ; VNET/VRF and prefix associated with this route
 ; field = value
 INDEX = next-hop-key ; index in hash bucket associated with the next-hop-key(IP addr,if alias)
 ```
@@ -320,6 +327,8 @@ Following orchagents shall be modified. Flow diagrams are captured in a later se
 The overall data flow diagram is captured in Section 3 for all TABLE updates.
 Refer to section 4 for detailed information about redistribution performed during runtime scenarios.
 
+### vnetorch
+This is the swss orchestrator which receives VNET_ROUTE_TUNNEL_TABLE entries, optionally with a request to configure consistent hashing. vnetorch checks whether consistent_hashing_buckets is set in the kv pairs and, if so, calls fgnhgorch to create an internal FgNhgEntry along with the SAI nexthop group and group members. vnetorch also associates the fine grained ecmp nexthop group with the route. More details: https://github.com/sonic-net/SONiC/pull/2099/files
 ## 2.5 SAI
 The below table represents main SAI attributes which shall be used for Fine Grained ECMP
 
@@ -350,6 +359,7 @@ The below table represents main SAI attributes which shall be used for Fine Grai
 - A guideline for the hash bucket size is to define a bucket size which will allow equal distribution of traffic regardless of the number of next-hops which are active. For example with 2 Firewall sets, each set containing 3 firewall members: each set can have equal redistribution by finding the lowest common multiple of 3 next-hops which is 3x2x1(this is equivalent to us saying that if there were 3 or 2 or 1 next-hop active, we could distribute the traffic equally amongst the next-hops). With 2 such sets we get a total of 3x2x1 + 3x2x1 = 12 hash buckets.
 - fgnhgorch is an observer for SUBJECT_TYPE_PORT_OPER_STATE_CHANGE events, these events are used in conjunction with the IP to interface mapping(INTERFACE attribute of the FG NHG member table), to trigger next-hop withdrawal/addition depending on which interface's operational state transitioned to down/up. The next-hop withdrawal/addition is performed per consistent and layered hashing rules. The INTERFACE attribute is optional, so this functionality is activated based on user configuration.
 - There are 2 match_modes supported for Fine Grained ECMP. A nexthop-based match mode implies that all prefixes that have next-hop IPs as a subset of the FG_NHG_MEMBER nh IPs defined by the user, will get Fine Grained ECMP behavior. If a route has next-hops which don't have an equivalent FG_NHG_MEMBER, then the route will get regular ECMP/next-hop behavior. A route-based match mode implies that only those prefixes which have FG_NHG_PREFIX defined will get Fine Grained ECMP behavior. The example configuration section has examples of both config types.
+- Details for VNET_ROUTE_TUNNEL with fine grained ecmp can be found in https://github.com/sonic-net/SONiC/pull/2099/files
 
 # 5 Example configuration
 
@@ -559,3 +569,4 @@ Test details:
 - Test both IPv4 and IPv6 above
 - The above test is configured via config_db entries directly, a further test mode to configure Fine Grained ECMP via minigraph will be present and tested
 - Test warm reboot to ensure there is no traffic disruption and ECMP groups are correctly applied post warm boot
+
diff --git a/doc/vxlan/Consistent ECMP for Vxlan tunnel.md b/doc/vxlan/Consistent ECMP for Vxlan tunnel.md
new file mode 100644
index 00000000000..70d63a1802e
--- /dev/null
+++ b/doc/vxlan/Consistent ECMP for Vxlan tunnel.md
@@ -0,0 +1,198 @@
+# Consistent ECMP for Vxlan Tunnels
+
+# Table of Contents
+
+- [Revision](#revision)
+- [Scope](#scope)
+- [Overview](#1-overview)
+- [Schema Changes](#2-schema-changes)
+  - [Config and APP DB](#21-config-and-app-db)
+  - [STATE DB](#22-state-db)
+  - [CLI](#23-cli)
+  - [YANG model](#24-yang-model)
+- [Programming Flow](#3-programming-flow)
+- [SWSS orchagent design](#4-swss-orchagent-design)
+- [Test Plan](#5-test-plan)
+
+
+# Revision
+
+| Rev |    Date     |       Author       | Change Description                |
+|:---:|:-----------:|:------------------:|-----------------------------------|
+| 1.0 | 11/04/2025  |   Anish Narsian    | Added Consistent hashing support  |
+
+
+# Scope
+
+This document describes an enhancement to VXLAN tunnel endpoint ECMP that adds support for consistent hashing towards a group of tunnel endpoints that are nexthops for a given tunnel route. This is an extension to the existing VNET Vxlan support as defined in the [Vxlan HLD](https://github.com/sonic-net/SONiC/blob/master/doc/vxlan/Vxlan_hld.md)
+
+
+# Abbreviations
+
+| Abbreviation             | Meaning               |
+|--------------------------|-----------------------|
+| NH                       | Next hop              |
+| NHG                      | Next hop Group        |
+| NHGM                     | Next hop Group Member |
+| FG                       | Fine Grained          |
+| ECMP                     | Equal Cost MultiPath  |
+
+# 1 Overview
+The details for enabling consistent hashing for Vxlan tunnel routes (VNET_ROUTE_TUNNEL) are discussed in this document.
+
+###### Use-case:
+Vxlan tunnel routes can contain a list of endpoints (next-hops) so that overlay traffic can be routed to multiple underlay endpoints. When there are multiple endpoints, ECMP is used to select the nexthop towards which the traffic is encapsulated and sent out. This is primarily used in scenarios where throughput needs to be scaled beyond what a single vxlan endpoint is capable of. When these endpoints hold flow state, an endpoint modification (next-hop addition/removal) results in most flows being rehashed and sent to a different endpoint than the one they were originally going to, causing connection restarts whenever an endpoint modification is performed. To limit connection restarts during endpoint/next hop modifications, we will enable consistent hashing for tunnel nexthops.
+
+###### Scale:
+| Component                | Expected value                             |
+|--------------------------|--------------------------------------------|
+| NHG size                 | 512 - 2048 next hop group members (NHGMs)  |
+
+# 2 Schema Changes
+
+## 2.1 Config and APP DB
+
+We modify Config DB's **VNET_ROUTE_TUNNEL** and correspondingly APP_DB's **VNET_ROUTE_TUNNEL_TABLE** to support consistent hashing. The schema can be found below:
+
+The following new field has been added to the **VNET_ROUTE_TUNNEL_TABLE**:
+ - consistent_hashing_buckets
+
+```
+
+VNET_ROUTE_TUNNEL_TABLE:{{vnet_name}}:{{prefix}}
+    "endpoint": {{ip_address1},{ip_address2},...}
+    "endpoint_monitor": {{ip_address1},{ip_address2},...} (OPTIONAL)
+    "mac_address": {{mac_address1},{mac_address2},...} (OPTIONAL)
+    "monitoring": {{"custom"}} (OPTIONAL)
+    "vni": {{vni1},{vni2},...} (OPTIONAL)
+    "weight": {{w1},{w2},...} (OPTIONAL)
+    "profile": {{profile_name}} (OPTIONAL)
+    "primary": {{ip_address1}, {ip_address2}} (OPTIONAL)
+    "profile": {{profile_name}} (OPTIONAL)
+    "adv_prefix": {{prefix}} (OPTIONAL)
+    "rx_monitor_timer": {time in milliseconds} (OPTIONAL)
+    "tx_monitor_timer": {time in milliseconds} (OPTIONAL)
+    "check_directly_connected": {{true|false}} (OPTIONAL)
+    "consistent_hashing_buckets": {{bucket_size}} (OPTIONAL) -> newly introduced
+```
+
+
+```
+consistent_hashing_buckets = DIGITS ; if specified, consistent hashing will be used for nexthops of the vnet route tunnel. The bucket size should be determined by the caller based on the number of nexthops and the redundancy factor, which together define how many bucket entries each nexthop receives (Optional)
+```
+
+## 2.2 STATE DB
+
+The existing Fine grained ecmp state DB table will be modified to store a VRF/VNET name, so that the same prefix can exist in multiple VRFs/VNETs without key collisions.
+
+```
+FG_ROUTE_TABLE|{{VRF/VNET-name}}|{{IPv4 OR IPv6 prefix}}:
+    "0": {{next-hop-key}}
+    "1": {{next-hop-key}}
+    ...
+    "{{hash_bucket_size -1}}": {{next-hop-key}}
+```
+
+## 2.3 CLI
+*CLI command enhancement to be able to see consistent hashing buckets for a particular VRF/VNET and prefix:*
+
+```
+show fgnhg hash-view
+show fgnhg active-hops
+```
+
+*CLI output format: show fgnhg hash-view*
+```
++-----------+-----------------+--------------------+----------------+
+| VNET/VRF  | FG_NHG_PREFIX   | Next Hop           | Hash buckets   |
++===========+=================+====================+================+
+```
+
+*CLI output format: show fgnhg active-hops*
+```
++-----------+-----------------+--------------------+
+| VNET/VRF  | FG_NHG_PREFIX   | Active Next Hops   |
++===========+=================+====================+
+```
+
+## 2.4 YANG Model
+The following enhancements will be made to the VNET_ROUTE_TUNNEL YANG model: endpoint, mac_address and vni are converted into comma separated lists of string type, and consistent_hashing_buckets is added:
+
+```
+    container VNET_ROUTE_TUNNEL {
+        description "ConfigDB VNET_ROUTE_TUNNEL table";
+
+        list VNET_ROUTE_TUNNEL_LIST {
+            key "vnet_name prefix";
+            leaf vnet_name {
+                description "VNET name";
+                type leafref {
+                    path "/svnet:sonic-vnet/svnet:VNET/svnet:VNET_LIST/svnet:name";
+                }
+            }
+
+            leaf prefix {
+                description "IPv4 prefix in CIDR format";
+                type stypes:sonic-ip4-prefix;
+            }
+
+            leaf endpoint {
+                description "Comma separated list of endpoint/next hop tunnel IPs if multiple nexthops, or a single IP address";
+                type string;
+                mandatory true;
+            }
+            leaf mac_address {
+                description "Comma separated list of inner dest mac in encapsulated packet if there are multiple nexthops/endpoints, or a single mac address";
+                type string;
+            }
+            leaf vni {
+                description "Comma separated list of VNIs if there are multiple nexthops/endpoints, or a single VNI for the route/nh";
+                type string;
+            }
+            leaf consistent_hashing_buckets {
+                description "Number of consistent hashing buckets to use, if consistent hashing is desired";
+                type uint16;
+            }
+        }
+    }
+```
+
+# 3 Programming flow
+*E2E creation flow for VNET_ROUTE_TUNNEL with consistent hashing*
+![](../../images/vxlan_hld/CreateTunnelConsistentHashing.png)
+
+*E2E flow for updating tunnel endpoints list with consistent hashing*
+![](../../images/vxlan_hld/UpdateTunnelConsistentHashing.png)
+
+# 4 SWSS orchagent design
+1. vnetorch will receive a call to create a VNET_ROUTE_TUNNEL_TABLE entry
+2. vnetorch will check if consistent_hashing_buckets is set and, if so, call fgnhgorch to create an internal FgNhgEntry with the following parameters:
+2.a FGMatchMode will be PREFIX_BASED
+2.b max_next_hops = configured_bucket_size = consistent_hashing_buckets
+2.c The prefix for Fine grained behavior = prefix of the VNET_ROUTE_TUNNEL_TABLE entry
+3. Next, vnetorch will call fgnhgorch to do the nexthop group creation with consistent hashing
+4. For subsequent next-hop changes, vnetorch will continue calling fgnhgorch to handle the nexthop changes
+5. At the time of VNET_ROUTE_TUNNEL_TABLE entry deletion, the nexthop and the internal FgNhgEntry will be deleted/cleaned up
+6. When a VNET_ROUTE_TUNNEL_TABLE entry is modified to add "consistent_hashing_buckets" to an existing tunnel route, a transition from non fine grained to fine grained ecmp must occur; when "consistent_hashing_buckets" is removed, a transition from fine grained to non fine grained ecmp occurs. Both transitions result in a SAI route update with the new nexthop group/nexthop, followed by deletion of any leftover stale nexthop groups.
+
+
+# 5 Test Plan
+The following testing is planned for this feature:
+- SWSS unit tests via virtual switch testing
+- Data Plane tests via pytest + PTF
+
+
+## SWSS unit tests:
+1. Add VNET_ROUTE_TUNNEL_TABLE with consistent_hashing_buckets, and check that SAI objects are created for next-hop group and next-hop group member as fine grained ecmp
+2. Remove an endpoint in VNET_ROUTE_TUNNEL_TABLE, and ensure that only the next-hop group members associated with the removed endpoint are reassigned to another nexthop tunnel, and that the hash buckets remain balanced
+3. Add an endpoint in VNET_ROUTE_TUNNEL_TABLE, and ensure that only about (total hash buckets / total endpoints) buckets are impacted as a result of the change
+4. Remove the consistent_hashing_buckets parameter from VNET_ROUTE_TUNNEL_TABLE, and ensure that the fine grained next-hop group is cleaned up and a regular next-hop group is created, with the route pointing to the regular next-hop group
+5. Add the consistent_hashing_buckets parameter to VNET_ROUTE_TUNNEL_TABLE, and ensure that a fine grained next-hop group is created and the original regular next-hop group is cleaned up, with the route pointing to the fine grained next-hop group
+
+## Dataplane tests:
+1. Do a base setup with VXLAN_TUNNEL, VNET, and an interface bound to the vnet
+2. Add VNET_ROUTE_TUNNEL_TABLE with consistent_hashing_buckets, with 10 endpoints
+3. Send 1000 unique flows and check that the resultant packets going out of the DUT contain varying outer dst IPs; record the flow to outer dst IP mapping
+4. Modify VNET_ROUTE_TUNNEL_TABLE to remove 1 endpoint IP, check that the only flows impacted in the 1000 unique flow to outer dst IP mapping are the ones associated with the withdrawn endpoint
+5. Modify VNET_ROUTE_TUNNEL_TABLE to add 1 endpoint IP, check that only a small % of flows, i.e. <10%, are impacted by this endpoint addition.
+6. Validate that in all cases the flow distribution per endpoint is roughly equal
diff --git a/images/vxlan_hld/CreateTunnelConsistentHashing.png b/images/vxlan_hld/CreateTunnelConsistentHashing.png
new file mode 100644
index 00000000000..509cd8d8b9e
Binary files /dev/null and b/images/vxlan_hld/CreateTunnelConsistentHashing.png differ
diff --git a/images/vxlan_hld/UpdateTunnelConsistentHashing.png b/images/vxlan_hld/UpdateTunnelConsistentHashing.png
new file mode 100644
index 00000000000..119629599d9
Binary files /dev/null and b/images/vxlan_hld/UpdateTunnelConsistentHashing.png differ
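
The expectations in dataplane test steps 4 to 6 of the new Vxlan document can be sanity-checked offline with a small model of a fixed-size bucket table. The sketch below is a hypothetical Python model, not the fgnhgorch implementation: the function names (`build_buckets`, `remove_endpoint`, `add_endpoint`, `changed`) are illustrative. The rebalancing policy is also an assumption made for this example: freed buckets go to the least-loaded endpoint, and a new endpoint takes buckets from the most-loaded endpoints. It uses 10 endpoints and an assumed bucket count of 120.

```python
from collections import Counter


def build_buckets(endpoints, bucket_count):
    """Assign buckets to endpoints round-robin for an even initial spread."""
    return [endpoints[i % len(endpoints)] for i in range(bucket_count)]


def remove_endpoint(buckets, gone):
    """Reassign only the buckets owned by `gone`, least-loaded endpoint first."""
    load = Counter(ep for ep in buckets if ep != gone)
    if not load:
        raise ValueError("cannot remove the last endpoint")
    result = list(buckets)
    for i, ep in enumerate(result):
        if ep == gone:
            target = min(load, key=lambda e: (load[e], e))
            result[i] = target
            load[target] += 1
    return result


def add_endpoint(buckets, new):
    """Move bucket_count // (endpoint_count + 1) buckets to `new` (assumed not present yet)."""
    load = Counter(buckets)
    share = len(buckets) // (len(load) + 1)
    result = list(buckets)
    for _ in range(share):
        # Take one bucket from the currently most-loaded endpoint.
        donor = max(load, key=lambda e: (load[e], e))
        result[result.index(donor)] = new
        load[donor] -= 1
    return result


def changed(before, after):
    """Count bucket positions whose owner differs between two tables."""
    return sum(1 for b, a in zip(before, after) if b != a)


endpoints = [f"10.0.0.{i}" for i in range(1, 11)]  # 10 endpoints, as in step 2
buckets = build_buckets(endpoints, 120)            # 120 is an assumed bucket count

after_remove = remove_endpoint(buckets, "10.0.0.3")
after_add = add_endpoint(buckets, "10.0.0.11")

print("buckets changed on removal:", changed(buckets, after_remove))
print("buckets changed on addition:", changed(buckets, after_add))
print("load spread after removal:", sorted(Counter(after_remove).values()))
```

In this model a removal touches only the buckets that pointed at the withdrawn endpoint, and an addition touches roughly one share of the table, which is the behaviour steps 4 and 5 check for on real traffic. The actual bucket selection inside fgnhgorch may differ from the policy assumed here.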