Commit e6390c3

TAM config model for inband-telemetry and drop-monitoring
1 parent 29db0ee commit e6390c3

3 files changed (+219, -0 lines)

doc/tam/tam_config_flow_dm.png

doc/tam/tam_config_flow_int.png

doc/tam/tam_hld.md

# SONiC Telemetry and Monitoring (TAM) High Level Design

### Rev 0.1

## Table of Contents

## 1. Revision

Rev | Date | Author | Change Description
----|------|--------|-------------------
|v0.1|2025-10-09|Senthil Krishnamurthy|Initial version of TAM HLD

## 2. Scope

This document describes the high level design of Telemetry and Monitoring (TAM) in SONiC.

## 3. Definitions/Abbreviations

Definitions/Abbreviation|Description
------------------------|-----------
TAM| Telemetry and Monitoring
SAI| Switch Abstraction Interface
IPFIX| IP Flow Information Export
VRF| Virtual Routing and Forwarding
INT| In-band Network Telemetry
IFA| In-band Flow Analyzer

## 4. Overview

TAM provides a framework to observe and report network events and telemetry by exporting messages to a collector. Telemetry provides per-device information about network traffic, such as congestion, latency, and port buffer utilization. Monitoring provides insights into packet drops and flow statistics.

### 4.1 Telemetry

In-band Network Telemetry (INT) is a variant of telemetry that provides insight into the path packets take through the network, including per-hop latency, queue depth, and other metrics. In-band Flow Analyzer (IFA) is the mechanism INT uses to embed telemetry data directly in the packet. In an IFA-aware network, each hop performs one of three roles:

1. **Initiator Node**: Initiates the IFA packet and operates in one of two modes:
   - **Inband Mode**: The IFA header and the first IFA metadata, carrying telemetry data, are added to the data packet.
   - **Probe Mode**: Incoming data packets are sampled, and the first IFA metadata is added to the sampled packet.
2. **Transit Node**: Matches on IFA packets and adds new IFA metadata to the packet.
3. **Terminator Node**: For inband IFA packets, strips the IFA header and metadata, forwards the data packet to the next hop, and exports the metadata to a collector. Probe IFA packets are terminated and mirrored to a collector.

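The three roles above can be illustrated with a toy model of how the metadata stack grows and is consumed along a path. This is a minimal Python simulation, not the IFA 2.0 wire format: `IfaPacket`, `hop_metadata`, `initiate`, `initiate_probe`, `transit`, and `terminate` are hypothetical names, the IFA header is modeled as a boolean flag, and the per-hop fields are chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class IfaPacket:
    """Toy packet (hypothetical model): payload plus an IFA metadata stack."""
    payload: bytes
    ifa: bool = False    # models the presence of the IFA header
    probe: bool = False  # True for a sampled probe copy
    metadata: List[Dict[str, int]] = field(default_factory=list)


def hop_metadata(device_id: int, latency_ns: int, queue_depth: int) -> Dict[str, int]:
    # Illustrative per-hop telemetry; the real metadata layout is defined by IFA 2.0.
    return {"device_id": device_id, "latency_ns": latency_ns, "queue_depth": queue_depth}


def initiate(pkt: IfaPacket, md: Dict[str, int]) -> IfaPacket:
    # Inband mode: add the IFA header and the first metadata to the data packet itself.
    pkt.ifa = True
    pkt.metadata.append(md)
    return pkt


def initiate_probe(pkt: IfaPacket, md: Dict[str, int]) -> Tuple[IfaPacket, IfaPacket]:
    # Probe mode: the data packet is left untouched; a sampled copy carries the metadata.
    probe = IfaPacket(payload=pkt.payload, ifa=True, probe=True, metadata=[md])
    return pkt, probe


def transit(pkt: IfaPacket, md: Dict[str, int]) -> IfaPacket:
    # Transit: only packets that already carry IFA get another metadata entry.
    if pkt.ifa:
        pkt.metadata.append(md)
    return pkt


def terminate(pkt: IfaPacket) -> Tuple[Optional[IfaPacket], List[Dict[str, int]]]:
    # Terminator: return (packet to forward or None, metadata to export to the collector).
    exported = list(pkt.metadata)
    if pkt.probe:
        return None, exported  # probe copies are consumed, not forwarded
    return IfaPacket(payload=pkt.payload), exported  # inband: forward stripped data packet
```

For example, a packet initiated at device 1 and passed through transit device 2 reaches the terminator with two metadata entries; the forwarded packet carries none of them.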
### 4.2 Monitoring

TAM supports the Drop-Monitor (DM) feature, which provides visibility into packet drops in the dataplane. Dropped packets are reported to a collector along with metadata that includes the drop reason and the packet payload, up to a maximum length. DM can operate in two modes:

1. **Stateless DM**: Drop reasons are exported per packet, for every drop.
2. **Stateful DM**: Drop reasons are exported per flow, at periodic intervals.

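The difference between the two DM modes is mainly where aggregation happens. The sketch below is a hypothetical Python model: `DropEvent`, `stateless_report`, and `StatefulDM` are illustrative names, `MAX_PAYLOAD = 128` is an assumed truncation length (the real maximum is platform specific), and time is passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

MAX_PAYLOAD = 128  # assumption: payload truncation length; real value is platform specific

FlowKey = Tuple[str, str, int, int, int]  # (src_ip, dst_ip, ip_protocol, l4_src, l4_dst)


@dataclass(frozen=True)
class DropEvent:
    flow: FlowKey
    reason: str
    payload: bytes


def stateless_report(ev: DropEvent) -> dict:
    # Stateless DM: one report per dropped packet with the reason and truncated payload.
    return {"flow": ev.flow, "reason": ev.reason, "payload": ev.payload[:MAX_PAYLOAD]}


class StatefulDM:
    """Stateful DM (phase 2): count drops per (flow, reason), export once per interval."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self.next_export = interval_s
        self.counts: Dict[Tuple[FlowKey, str], int] = defaultdict(int)

    def on_drop(self, ev: DropEvent, now: float) -> List[dict]:
        self.counts[(ev.flow, ev.reason)] += 1
        return self.poll(now)

    def poll(self, now: float) -> List[dict]:
        # Nothing is exported until the interval elapses; then all counters are flushed.
        if now < self.next_export:
            return []
        records = [{"flow": f, "reason": r, "count": c} for (f, r), c in self.counts.items()]
        self.counts.clear()
        self.next_export = now + self.interval_s
        return records
```

With a 10-second interval, two drops of the same flow at t=1 and t=2 produce no output until `poll(10)`, which returns a single record with a count of 2.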
## 5. Requirements

TAM will be implemented in multiple phases.

### 5.1 Phase 1

- Support Initiator and Transit roles for INT/IFA
- Support IFA version 2.0 for the IFA header and metadata inserted in the packet
- Support the drop-monitor (DM) telemetry type with IPFIX reporting
- Support stateless drop monitoring
- Collector endpoints can be either IPv4 or IPv6
- Collectors can be reached via the default or mgmt VRF (the mgmt VRF must be enabled to use it)
- Flow groups can be defined to select packets for IFA and DM
- A TAM session can be bound to one or more flow groups
- A TAM session can be bound to one or more collectors

### 5.2 Phase 2

- Support stateful drop-monitoring
- Support Terminator role for INT/IFA
- Support packet sampling
- Support additional telemetry types (e.g., flow-statistics)
- Support additional report types (e.g., gRPC)

## 6. Module Design

### 6.1 Overall design

- The management framework writes TAM tables to CONFIG_DB using the YANG model
- TamOrch (new) in orchagent subscribes to the CONFIG_DB TAM_* tables and programs SAI TAM objects
- Syncd/SAI implement the TAM object model and program the dataplane
- Telemetry data is exported to collectors from front-panel ports without being punted to the CPU

### 6.2 Configuration and control flow

The SWSS container is enhanced with a new component, TamOrch, that processes TAM configuration and control.

#### 6.2.1 Inband Telemetry

The following figure shows the configuration and control flows for TAM INT using IFAv2:

![tam_config_flow_int](tam_config_flow_int.png)

1) Administrator configures TAM device attributes (CONFIG_DB), with IFA enabled
2) tamorch uses aclorch to create an ACL table and an ACL rule to match on IP_PROTO 253 (IFA)
3) tamorch creates SAI TAM_REPORT and TAM_INT objects
4) The ACL table is bound to the switch

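The INT steps above can be modeled as the ordered list of operations tamorch would issue when the `TAM|device` entry is written. This Python sketch is only a model: `plan_int_config`, the action labels, and the ACL table name `TAM_IFA` are hypothetical, and the real tamorch is C++ code calling aclorch and the SAI TAM APIs. The one protocol detail, IP protocol 253 for IFA, comes from step 2.

```python
from typing import Any, Dict, List, Tuple

IFA_IP_PROTO = 253  # IP protocol matched for IFA traffic (step 2)

Action = Tuple[str, Dict[str, Any]]


def plan_int_config(device: Dict[str, Any]) -> List[Action]:
    """Ordered actions for a TAM|device entry (hypothetical model of tamorch)."""
    if str(device.get("ifa", "false")).lower() != "true":
        return []  # IFA disabled: nothing to program
    device_id = int(device["device-id"])
    enterprise_id = int(device["enterprise-id"])
    return [
        # Step 2: ACL table and rule matching IFA packets, created through aclorch.
        ("create_acl_table", {"name": "TAM_IFA"}),
        ("create_acl_rule", {"table": "TAM_IFA", "match": {"ip_protocol": IFA_IP_PROTO}}),
        # Step 3: SAI TAM report and INT objects.
        ("create_tam_report", {"type": "ifa"}),
        ("create_tam_int", {"device_id": device_id, "enterprise_id": enterprise_id}),
        # Step 4: bind the ACL table at switch level.
        ("bind_acl_table", {"table": "TAM_IFA", "point": "switch"}),
    ]
```

Keeping the plan separate from execution makes the ordering easy to unit-test without a switch.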
#### 6.2.2 Drop Monitoring

The following figure shows the configuration and control flows for TAM Drop Monitoring:

![tam_config_flow_dm](tam_config_flow_dm.png)

1) Administrator configures TAM collectors, flow-groups, and sessions (CONFIG_DB)
2) If the collector configuration specifies a VRF, tamorch resolves the next hop for the collector destination IP in that VRF
3) tamorch creates SAI TAM_COLLECTOR and TAM_TRANSPORT objects using the resolved next hop
4) tamorch uses aclorch to create an ACL table and an ACL rule with match conditions from the flow-group rules
5) The ACL table is bound to the list of front-panel ports specified in the flow-group
6) tamorch processes the TAM_SESSION_TABLE to create SAI TAM objects (report, event_action, event, tam)

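The drop-monitoring steps above imply a dependency order: a session can only be programmed once its collectors resolve and its flow group exists. The Python sketch below models that ordering under stated assumptions: `plan_dm_session` and the `resolve_nexthop` callback (standing in for a route lookup in the collector's VRF) are hypothetical, rules are read from the `TAM_FLOW_GROUP_RULE` table described in 8.2, one ACL rule per flow-group rule is assumed, and the action labels are only loosely named after the SAI TAM create functions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]
# Hypothetical callback: (vrf, dst_ip) -> next-hop IP, or None if unresolved.
Resolver = Callable[[str, str], Optional[str]]


def plan_dm_session(session: str, cfg: Dict[str, Dict[str, dict]],
                    resolve_nexthop: Resolver) -> List[Action]:
    """Return ordered (operation, object) pairs for one TAM_SESSION entry.

    Raises LookupError if a referenced collector, flow group, or route is missing,
    so a caller could retry the session later.
    """
    sess = cfg["TAM_SESSION"][session]
    actions: List[Action] = []
    for name in sess["collector"]:
        col = cfg["TAM_COLLECTOR"].get(name)
        if col is None:
            raise LookupError(f"collector {name} not found")
        # Step 2: resolve the next hop for dst_ip in the collector's VRF (default if unset).
        nexthop = resolve_nexthop(col.get("vrf", "default"), col["dst_ip"])
        if nexthop is None:
            raise LookupError(f"no route to collector {name} ({col['dst_ip']})")
        # Step 3: transport first, since in the SAI TAM model a collector references a transport.
        actions.append(("create_tam_transport", name))
        actions.append(("create_tam_collector", f"{name} via {nexthop}"))
    fg = sess["flow_group"]
    if fg not in cfg["TAM_FLOW_GROUP"]:
        raise LookupError(f"flow group {fg} not found")
    # Step 4: one ACL table for the flow group, one ACL rule per flow-group rule (assumed).
    actions.append(("create_acl_table", fg))
    for key in sorted(cfg.get("TAM_FLOW_GROUP_RULE", {})):
        if key.split("|", 1)[0] == fg:
            actions.append(("create_acl_rule", key))
    # Step 5: bind the ACL table to each port listed in the flow group.
    for port in cfg["TAM_FLOW_GROUP"][fg]["ports"]:
        actions.append(("bind_acl_table", f"{fg}:{port}"))
    # Step 6: remaining session objects in dependency order.
    for op in ("create_tam_report", "create_tam_event_action", "create_tam_event", "create_tam"):
        actions.append((op, session))
    return actions
```

Raising on a missing dependency, rather than partially programming the session, keeps the model simple; a real orch would typically leave the entry pending and retry.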
### 6.3 SWSS and syncd changes

- New tamorch: consumes the CONFIG_DB TAM tables, maps them to SAI TAM objects, and maintains reference counts and object lifecycles
- syncd/SAI: no changes

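Reference counting is what lets tamorch refuse or defer deleting an object that a session still uses. The Python class below is a hypothetical sketch of that bookkeeping, not tamorch's actual data structures: `TamRefTracker` and its methods are illustrative, object IDs are placeholder strings, and keys reuse the CONFIG_DB `TABLE|name` form.

```python
from collections import Counter
from typing import Dict, Iterable


class TamRefTracker:
    """Hypothetical sketch of tamorch-style reference counting for TAM objects."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}  # CONFIG_DB key -> SAI object id placeholder
        self.refs: Counter = Counter()     # CONFIG_DB key -> number of referencing sessions

    def add_object(self, name: str, oid: str) -> None:
        self.objects[name] = oid

    def bind_session(self, deps: Iterable[str]) -> None:
        deps = list(deps)
        missing = [d for d in deps if d not in self.objects]
        if missing:
            # A session cannot be programmed until everything it references exists.
            raise KeyError(f"unresolved dependencies: {missing}")
        for d in deps:
            self.refs[d] += 1

    def unbind_session(self, deps: Iterable[str]) -> None:
        for d in deps:
            self.refs[d] -= 1
            if self.refs[d] <= 0:
                del self.refs[d]

    def remove_object(self, name: str) -> bool:
        # Deletion is refused while any session still references the object.
        if self.refs[name] > 0:
            return False
        self.objects.pop(name, None)
        return True
```

Deleting a collector while `s-drop` still uses it returns `False`; once the session is unbound, the same call succeeds.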
## 8. Configuration and Management

### 8.1 CONFIG_DB

Configure IFA by creating the TAM device entry. `device-id` is a 28-bit value, and `ifa` is a boolean stored as the string `"true"` or `"false"`:

```json
{
  "TAM": {
    "device": {
      "device-id": 1234,
      "enterprise-id": 1234,
      "ifa": "true"
    }
  }
}
```

Configure drop monitoring by creating flow groups, flow-group rules, collectors, and sessions:

```json
{
  "TAM_FLOW_GROUP": {
    "fg-1": {
      "aging_interval": 60,
      "ports": ["Ethernet0", "PortChannel10"]
    }
  },
  "TAM_FLOW_GROUP_RULE": {
    "fg-1|rule1": {
      "src_ip_prefix": "0.0.0.0/0",
      "dst_ip_prefix": "10.0.0.0/8",
      "ip_protocol": 6,
      "l4_dst_port": 443
    }
  },
  "TAM_COLLECTOR": {
    "c1": {
      "dst_ip": "192.0.2.10",
      "dst_port": 4739,
      "dscp_value": 32,
      "vrf": "vrf_blue"
    }
  },
  "TAM_SESSION": {
    "s-drop": {
      "type": "drop-monitor",
      "report_type": "ipfix",
      "flow_group": "fg-1",
      "collector": ["c1"]
    }
  }
}
```

Configure an sFlow TAM session:

```json
{
  "TAM_SESSION": {
    "s-sflow": {
      "type": "sflow",
      "report_type": "ipfix",
      "collector": ["c1"]
    }
  }
}
```

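To see what the JSON configuration above becomes in CONFIG_DB, the sketch below flattens it into Redis-style `TABLE|key` hashes. It loosely follows the convention SONiC's `ConfigDBConnector` uses, as we understand it: `|` separates table and key, scalar values are stored as strings, list values are stored under a field name suffixed with `@` as a comma-joined string, and an entry with no fields gets a `NULL` placeholder. `to_config_db` is a hypothetical helper, not part of SONiC, and lowercasing booleans is a choice made here to match the `"true"|"false"` schema values.

```python
from typing import Any, Dict


def to_config_db(config: Dict[str, Dict[str, Dict[str, Any]]]) -> Dict[str, Dict[str, str]]:
    """Flatten {TABLE: {key: {field: value}}} into {"TABLE|key": {field: "string"}}."""
    out: Dict[str, Dict[str, str]] = {}
    for table, entries in config.items():
        for key, fields in entries.items():
            raw: Dict[str, str] = {}
            for name, value in fields.items():
                if isinstance(value, list):
                    # Lists are stored under "<field>@" as one comma-joined string.
                    raw[name + "@"] = ",".join(str(v) for v in value)
                elif isinstance(value, bool):
                    # Sketch choice: booleans become "true"/"false", as the schema expects.
                    raw[name] = "true" if value else "false"
                else:
                    raw[name] = str(value)
            # An entry with no fields is stored with a placeholder field.
            out[f"{table}|{key}"] = raw or {"NULL": "NULL"}
    return out
```

The resulting keys, such as `TAM_FLOW_GROUP_RULE|fg-1|rule1`, line up with the key formats listed in 8.2.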
### 8.2 DB and Schema changes

Derived from sonic-tam.yang. New ConfigDB tables (names map 1:1 to YANG containers/lists):

ConfigDB TAM (device-level):
- Key: `TAM|device`
- Fields:
  - device-id: uint32 (1..134217727)
  - enterprise-id: uint32 (1..134217727)
  - ifa: "true"|"false"

ConfigDB TAM_FLOW_GROUP:
- Key: `TAM_FLOW_GROUP|<name>`
- Fields:
  - aging_interval: uint32 (seconds)
  - ports: [list of interface names or PortChannel names]

ConfigDB TAM_FLOW_GROUP_RULE:
- Key: `TAM_FLOW_GROUP_RULE|<name>|<rule>`
- Fields:
  - src_ip_prefix: IPv4/IPv6 prefix (mandatory)
  - dst_ip_prefix: IPv4/IPv6 prefix (mandatory)
  - l4_src_port: uint16 (optional)
  - l4_dst_port: uint16 (optional)
  - ip_protocol: uint8 (1..143) (optional)

ConfigDB TAM_COLLECTOR:
- Key: `TAM_COLLECTOR|<name>`
- Fields:
  - src_ip: IPv4/IPv6 (optional)
  - dst_ip: IPv4/IPv6 (mandatory)
  - dst_port: uint16 (mandatory)
  - dscp_value: uint8 (default 32)
  - vrf: "default"|"mgmt"|<VRF name>; mgmt requires MGMT_VRF enabled

ConfigDB TAM_SESSION:
- Key: `TAM_SESSION|<name>`
- Fields:
  - type: "drop-monitor" (mandatory)
  - report_type: "ipfix" (default ipfix)
  - flow_group: <TAM_FLOW_GROUP name> (mandatory)
  - collector: [list of TAM_COLLECTOR names] (min 1)

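The field constraints above translate directly into validation checks, which is also what unit test 1 in 12.1 asks for. The Python functions below are a hypothetical sketch, not the YANG validation SONiC performs: they take "typed" field dicts (scalars as strings or ints, lists as Python lists), take ranges and required fields from the schema above, and take the MGMT_VRF state as a plain argument. The DSCP check is deliberately stricter than `uint8` because DSCP is a 6-bit field.

```python
import ipaddress
from typing import Any, Dict, List, Set

MAX_ID = 134217727  # upper bound for device-id / enterprise-id from the schema above


def _in_range(value: Any, lo: int, hi: int) -> bool:
    """True if value parses as an integer within [lo, hi]."""
    try:
        return lo <= int(value) <= hi
    except (TypeError, ValueError):
        return False


def validate_device(f: Dict[str, Any]) -> List[str]:
    errs = []
    for name in ("device-id", "enterprise-id"):
        if name in f and not _in_range(f[name], 1, MAX_ID):
            errs.append(f"{name} out of range")
    if f.get("ifa", "false") not in ("true", "false"):
        errs.append("ifa must be 'true' or 'false'")
    return errs


def validate_rule(f: Dict[str, Any]) -> List[str]:
    errs = []
    for name in ("src_ip_prefix", "dst_ip_prefix"):  # both mandatory
        try:
            ipaddress.ip_network(f[name], strict=False)
        except (KeyError, ValueError):
            errs.append(f"{name} missing or invalid")
    for name in ("l4_src_port", "l4_dst_port"):  # optional uint16
        if name in f and not _in_range(f[name], 0, 65535):
            errs.append(f"{name} out of range")
    if "ip_protocol" in f and not _in_range(f["ip_protocol"], 1, 143):
        errs.append("ip_protocol out of range")
    return errs


def validate_collector(f: Dict[str, Any], mgmt_vrf_enabled: bool) -> List[str]:
    errs = []
    for name, required in (("src_ip", False), ("dst_ip", True)):
        if name not in f:
            if required:
                errs.append(f"{name} missing")
            continue
        try:
            ipaddress.ip_address(f[name])
        except ValueError:
            errs.append(f"{name} invalid")
    if not _in_range(f.get("dst_port"), 0, 65535):
        errs.append("dst_port missing or out of range")
    # DSCP is a 6-bit field, so this check is stricter than the uint8 type in the schema.
    if not _in_range(f.get("dscp_value", 32), 0, 63):
        errs.append("dscp_value out of range")
    if f.get("vrf") == "mgmt" and not mgmt_vrf_enabled:
        errs.append("vrf 'mgmt' requires MGMT_VRF to be enabled")
    return errs


def validate_session(f: Dict[str, Any], flow_groups: Set[str], collectors: Set[str]) -> List[str]:
    errs = []
    if f.get("type") != "drop-monitor":
        errs.append("type must be 'drop-monitor'")
    if f.get("report_type", "ipfix") != "ipfix":
        errs.append("report_type must be 'ipfix'")
    if f.get("flow_group") not in flow_groups:
        errs.append("flow_group missing or unknown")
    names = f.get("collector", [])
    if not names or any(c not in collectors for c in names):
        errs.append("collector list empty or has unknown names")
    return errs
```

Returning a list of messages instead of raising on the first error lets a CLI or test report every problem in an entry at once.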
## 9. Warmboot and Fastboot Impact

- No additional sleeps are added to the boot-critical path. TAM objects are created only after their dependencies are up, and the service can be delayed until SYSTEM_READY is up. There is no impact when the feature is disabled or unused.

## 10. Memory Consumption

- Minimal control-plane state in orchagent (object maps). No growth when feature disabled.

## 11. Restrictions/Limitations

- Requires platform/SAI support for TAM drop monitoring and IPFIX export; otherwise the feature remains inoperative (capability=false)
- Exact limits (number of flow groups, rules, and collectors) depend on the platform
- mgmt VRF usage requires MGMT_VRF to be enabled

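The capability restriction above (and unit test 4 in 12.1) means configuration is accepted but not programmed when the platform lacks TAM support. The Python dataclass below is a hypothetical model of that gating: `TamOrchModel` and its fields are illustrative, and `tam_supported` stands in for whatever platform capability query the implementation uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TamOrchModel:
    """Hypothetical model of capability gating; not the real tamorch."""
    tam_supported: bool  # stand-in for the platform/SAI capability query result
    pending: Dict[str, dict] = field(default_factory=dict)
    programmed: List[str] = field(default_factory=list)

    def on_config_set(self, key: str, fields: dict) -> bool:
        # CONFIG_DB writes are always accepted and remembered...
        self.pending[key] = fields
        if not self.tam_supported:
            # ...but nothing is programmed into SAI when the capability is false.
            return False
        self.programmed.append(key)
        return True
```

Keeping the entry in `pending` even when unsupported means the configuration survives in CONFIG_DB and simply has no dataplane effect, matching "feature remains inoperative".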
## 12. Testing Requirements

### 12.1 Unit tests (one-liners)

1) Validate CONFIG_DB for each table and field
2) Validate reference checks (ports, VRF, flow_group, collector)
3) Validate that tamorch creates/updates/deletes SAI TAM objects per CONFIG_DB changes
4) Capability gating: with capability=false, CONFIG_DB writes do not program SAI

### 12.2 System tests

1) Configure a flow group, rule, collector, and session; verify that IPFIX is exported to the collector
2) Verify VRF selection (default vs mgmt) and DSCP marking
3) Verify rule match scoping and port/PortChannel membership
4) Reboot/warm-reboot and verify that export resumes with the configuration preserved

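For system test 1, a test collector needs at least to recognize a well-formed IPFIX message. The parser below is a minimal sketch based on the IPFIX message format in RFC 7011: a 16-byte header (version 10, total length, export time, sequence number, observation domain ID) followed by sets, each starting with a 4-byte set ID and set length. `parse_ipfix` and `IpfixHeader` are illustrative names; decoding template and data records is out of scope here.

```python
import struct
from typing import List, NamedTuple, Tuple


class IpfixHeader(NamedTuple):
    version: int
    length: int
    export_time: int
    sequence: int
    domain_id: int


def parse_ipfix(msg: bytes) -> Tuple[IpfixHeader, List[Tuple[int, int]]]:
    """Parse the RFC 7011 message header and list (set_id, set_length) for each set."""
    if len(msg) < 16:
        raise ValueError("shorter than the IPFIX message header")
    hdr = IpfixHeader(*struct.unpack("!HHIII", msg[:16]))
    if hdr.version != 10:
        raise ValueError(f"not IPFIX (version {hdr.version})")
    if hdr.length != len(msg):
        raise ValueError("length field does not match message size")
    sets = []
    off = 16
    while off < hdr.length:
        if off + 4 > hdr.length:
            raise ValueError("truncated set header")
        set_id, set_len = struct.unpack("!HH", msg[off:off + 4])
        if set_len < 4 or off + set_len > hdr.length:
            raise ValueError("bad set length")
        # Set ID 2 = template set, 3 = options template set, >= 256 = data set.
        sets.append((set_id, set_len))
        off += set_len
    return hdr, sets
```

Assuming UDP transport, a test collector could bind a UDP socket to the `dst_port` configured in TAM_COLLECTOR (4739 in the example) and pass each received datagram to `parse_ipfix`; checking that sequence numbers advance across messages is a cheap way to detect loss.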
## 13. Open/Action items

- Finalize CLI commands and Command-Reference.md updates aligned to YANG
- Document per-platform capability and limits
