
network-observer CPU utilization unreasonably high under load #1942

Open
c-kruse opened this issue Feb 4, 2025 · 1 comment · May be fixed by #1940
c-kruse commented Feb 4, 2025

Describe the bug
network-observer CPU utilization is unreasonably high under load

The network-observer has a collector package that listens for vanflow flow records, reconciles them with other collector state (e.g. source and destination process records), and keeps an in-memory cache of these flow records for a configurable TTL (default 15m). Because of this, we should expect memory utilization to scale proportionally to (request rate + connection rate) * (flow TTL), as that governs the total number of records in memory. CPU should scale proportionally to only (request rate + connection rate), since the rate of new flow records determines how much time the collector spends reconciling. Instead, CPU utilization can far exceed the CPU footprint of the skupper-router it is attached to (as an arbitrary reference point), and it appears to scale the way memory utilization should: with the total number of flows in memory.
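For a rough sense of scale, the expected steady-state record count from the relationship above can be sketched as (the function name and example rates here are illustrative, not from the collector):

```go
package main

import "fmt"

// expectedRecords estimates steady-state flow records held in memory:
// records scale with (request rate + connection rate) * TTL.
func expectedRecords(requestsPerSec, connsPerSec, ttlSeconds float64) float64 {
	return (requestsPerSec + connsPerSec) * ttlSeconds
}

func main() {
	// e.g. 100 req/s + 20 conns/s with the default 15m (900s) TTL
	fmt.Println(expectedRecords(100, 20, 900)) // 108000 records resident
}
```

At that scale, any per-reconcile operation that touches every resident record (rather than just the new ones) turns the memory-shaped curve into a CPU-shaped one.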

How To Reproduce
Run the network-observer against a VAN with a consistently high rate of flows: that is, with an application creating many connections per second or making many HTTP requests.

Observe that memory and CPU utilization are very high after many flow records have accumulated.

RCA
This is an oversight in the way the reconcile tasks work. The high CPU utilization comes from the large amount of time spent copying the full set of flow record state for each reconcile task, and then from garbage collection cleaning up those copies. For example, see the section below.

```go
flowStates := c.appFlows.Items()
for i := len(flowStates) - 1; i >= 0; i-- {
	state := flowStates[i]
	if !state.LastSeen.Before(time.Now().Add(-1 * c.ttl)) {
		break
	}
	// ... queue state for eviction (elided) ...
}
```
This is the routine that purges app flow records that have exceeded their TTL. It calls the method keyedLRUCache.Items(), which makes a full copy of the app flow state data structures of type []appState. It then iterates that slice in least-recently-used order, queuing app flows for eviction until it reaches a flow record that has been updated since the cutoff time, at which point it breaks. It does not need the remaining majority of the collection that was copied; the slice simply goes out of scope and is eventually garbage collected.

c-kruse self-assigned this Feb 4, 2025
c-kruse linked a pull request Feb 4, 2025 that will close this issue

c-kruse commented Feb 4, 2025

CPU profile from the network-observer. Note the large portion of time spent in GC-related tasks and in reconcileAppFlow calling Items().

[Image: CPU profile]
