
network-observer CPU utilization unreasonably high under load #1942

Open
c-kruse opened this issue Feb 4, 2025 · 1 comment · May be fixed by #1940
c-kruse commented Feb 4, 2025

Describe the bug
network-observer CPU utilization is unreasonably high under load

The network-observer has a collector package that listens for vanflow flow records, reconciles them with other collector state (e.g. source and destination process records), and keeps an in-memory cache of these flow records for a configurable TTL (default 15m). Because of this, we should expect memory utilization to scale proportionally to (request rate + connection rate) * (flow TTL), as that governs the total number of records in memory. CPU should scale proportionally to only (request rate + connection rate), since the rate of new flow records determines how much time the collector spends reconciling. Instead, CPU utilization can far exceed the CPU footprint of the skupper-router it is attached to (as an arbitrary reference point), and it appears to scale the way memory utilization should: with the total number of flows in memory.
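For a rough sense of scale, the expected steady-state record count from the relationship above can be sketched as (the function name and example rates here are illustrative, not from the collector):

```go
package main

import "fmt"

// expectedRecords estimates steady-state flow records held in memory:
// records scale with (request rate + connection rate) * TTL.
func expectedRecords(requestsPerSec, connsPerSec, ttlSeconds float64) float64 {
	return (requestsPerSec + connsPerSec) * ttlSeconds
}

func main() {
	// e.g. 100 req/s + 20 conns/s with the default 15m (900s) TTL
	fmt.Println(expectedRecords(100, 20, 900)) // 108000 records resident
}
```

At that scale, any per-reconcile operation that touches every resident record (rather than just the new ones) turns the memory-shaped curve into a CPU-shaped one.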

How To Reproduce
Run the network-observer against a VAN with a consistently high rate of flows: that is, with an application creating many connections per second or making many HTTP requests.

Observe that memory and CPU utilization are very high after many flow records have accumulated.

RCA
This is an oversight in the way the reconcile tasks work. The high CPU utilization comes from the large amount of time spent copying the full set of flow record state for each reconcile task, and then from garbage collection cleaning up those copies. For example, see the section below.

```go
flowStates := c.appFlows.Items()
for i := len(flowStates) - 1; i >= 0; i-- {
	state := flowStates[i]
	if !state.LastSeen.Before(time.Now().Add(-1 * c.ttl)) {
		break
	}
	// ... queue state for eviction (elided) ...
}
```
This is the routine that purges app flow records that have exceeded their TTL. It calls the method keyedLRUCache.Items(), which makes a full copy of the app flow state data structures of type []appState. It then iterates that slice in least-recently-used order, queuing app flows for eviction until it reaches a flow record that has been updated since the cutoff time, at which point it breaks. It does not need the remaining majority of the collection that was copied; the slice simply goes out of scope and is eventually garbage collected.

c-kruse self-assigned this Feb 4, 2025
c-kruse linked a pull request Feb 4, 2025 that will close this issue

c-kruse commented Feb 4, 2025

CPU profile from the network-observer. Note the large portion of time spent in GC-related tasks and in reconcileAppFlow calling Items().

[Image: CPU profile]
