Describe the bug
network-observer CPU utilization is unreasonably high under load
The network-observer has a collector package that listens for vanflow flow records, reconciles those records with other collector state (i.e. source and destination process records), and keeps an in-memory cache of these flow records for a configurable TTL (default 15m). Given this design, memory utilization should scale proportionally to (request rate + connection rate) × (flow TTL), since that product determines the total number of records held in memory. CPU utilization should scale proportionally to only (request rate + connection rate), since the rate of new flow records determines how much time the collector spends reconciling. Instead, the CPU utilization can far exceed the CPU footprint of the skupper-router it is attached to (as an arbitrary reference point), and it appears to scale the way memory utilization should: with the total number of flows in memory.
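For a rough sense of scale (illustrative numbers, not from a measured run): at a combined 1,000 new flows per second and the default 15m TTL, the cache holds on the order of 1,000/s × 900s = 900,000 live records, so any per-reconcile cost that touches every record in memory is multiplied by nearly a million.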
How To Reproduce
Run the network-observer against a VAN with a consistently high rate of flows: that is, with an application creating many connections per second or making many HTTP requests.
Observe that memory and CPU utilization are both very high after many flow records have accumulated.
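Any steady source of new connections or HTTP requests across the VAN will do. As a minimal sketch (the target URL and request rate are placeholders, not part of any existing test setup), a small Go loop like the following generates a constant stream of short-lived HTTP flows:

```go
// loadgen.go: a minimal sketch of a flow generator for reproducing the issue.
// The target URL is hypothetical; point it at any service exposed over the VAN.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	const target = "http://my-van-service:8080/" // placeholder service behind skupper
	ticker := time.NewTicker(time.Millisecond)   // aim for up to ~1000 requests/s
	defer ticker.Stop()
	for range ticker.C {
		resp, err := http.Get(target)
		if err != nil {
			log.Println("request failed:", err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
}
```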
RCA
This is an oversight in the way the reconcile tasks work. The high CPU utilization comes from time spent copying the full set of flow record state on every reconcile task, and then from the garbage collector cleaning up those copies. For example, see skupper/cmd/network-observer/internal/collector/connections.go, lines 549 to 554 at commit 668268c.
That is the routine that purges app flow records that have exceeded their TTL. It calls the method keyedLRUCache.Items(), which makes a full copy of the app flow state data structures (of type []appState). It then iterates that slice in least-recently-used order, queuing app flows for eviction until it reaches a flow record that has been updated since the cutoff time, and breaks. It never needs the remaining majority of the collection it just copied; the slice simply goes out of scope and is eventually garbage collected.
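To make the pattern concrete, here is a minimal, self-contained sketch. The keyedLRUCache, appState, and Items names follow the issue text, while OldestFirst and the internals are hypothetical; the real code in connections.go differs in detail. It shows both the copy-per-pass purge described above and a copy-free alternative that walks the cache in LRU order and stops early:

```go
// Sketch of the TTL purge pattern described above and a copy-free alternative.
// Names mirror the issue text where possible; internals are illustrative only.
package main

import (
	"container/list"
	"fmt"
	"time"
)

type appState struct {
	Key         string
	LastUpdated time.Time
}

type keyedLRUCache struct {
	order *list.List // least recently updated entries at the front
	byKey map[string]*list.Element
}

func newCache() *keyedLRUCache {
	return &keyedLRUCache{order: list.New(), byKey: make(map[string]*list.Element)}
}

// Update inserts or refreshes a record, moving it to the most-recent end.
func (c *keyedLRUCache) Update(s appState) {
	if e, ok := c.byKey[s.Key]; ok {
		e.Value = s
		c.order.MoveToBack(e)
		return
	}
	c.byKey[s.Key] = c.order.PushBack(s)
}

// Items returns a full copy of the cache contents in LRU order. This is the
// call the purge routine uses today: O(N) allocation on every reconcile pass,
// even when only a handful of records have expired.
func (c *keyedLRUCache) Items() []appState {
	out := make([]appState, 0, c.order.Len())
	for e := c.order.Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(appState))
	}
	return out
}

// OldestFirst is the copy-free alternative: visit entries in LRU order and
// stop as soon as the callback returns false, so the purge loop can bail out
// at the first record newer than the cutoff without copying anything.
func (c *keyedLRUCache) OldestFirst(visit func(appState) bool) {
	for e := c.order.Front(); e != nil; e = e.Next() {
		if !visit(e.Value.(appState)) {
			return
		}
	}
}

func main() {
	cache := newCache()
	now := time.Now()
	for i := 0; i < 5; i++ {
		cache.Update(appState{
			Key:         fmt.Sprintf("flow-%d", i),
			LastUpdated: now.Add(time.Duration(i-3) * time.Minute),
		})
	}
	cutoff := now.Add(-time.Minute)

	// Current pattern: copy everything, then use only the expired prefix.
	var expired []string
	for _, item := range cache.Items() {
		if item.LastUpdated.After(cutoff) {
			break // the rest of the (already copied) slice is never used
		}
		expired = append(expired, item.Key)
	}
	fmt.Println("expired via Items copy:", expired)

	// Copy-free pattern: same result, no O(N) allocation per pass.
	expired = expired[:0]
	cache.OldestFirst(func(item appState) bool {
		if item.LastUpdated.After(cutoff) {
			return false
		}
		expired = append(expired, item.Key)
		return true
	})
	fmt.Println("expired via OldestFirst:", expired)
}
```

Because the purge almost always needs only the expired prefix of the collection, a copy-free iteration in this direction would turn an O(total flows in memory) reconcile pass into an O(expired flows) one, restoring the expected CPU scaling.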