-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[usm] add regular and raw tracepoints /sched_process_exit #33943
base: main
Are you sure you want to change the base?
Conversation
Uncompressed package size comparisonComparison with ancestor Diff per package
Decision |
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv aws.create-vm --pipeline-id=57146343 --os-family=ubuntu Note: This applies to commit 32f36b1 |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: c714131 Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | quality_gate_logs | % cpu utilization | +0.70 | [-2.18, +3.58] | 1 | Logs |
➖ | file_tree | memory utilization | +0.57 | [+0.51, +0.63] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.22 | [+0.15, +0.28] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.21 | [-0.26, +0.68] | 1 | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.19 | [-0.67, +1.05] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.13 | [+0.08, +0.18] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.03 | [-0.80, +0.85] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.01 | [-0.02, +0.04] | 1 | Logs bounds checks dashboard |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.64, +0.66] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.30, +0.30] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | +0.00 | [-0.63, +0.63] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.02] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | -0.00 | [-0.79, +0.78] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.01 | [-0.81, +0.79] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | -0.01 | [-0.82, +0.81] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.10 | [-0.87, +0.67] | 1 | Logs |
Bounds Checks: ✅ Passed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Static quality checks ✅Please find below the results from static quality gates Successful checksInfo
|
1f1269f
to
6cf2d27
Compare
7eb24b6
to
ef1ce42
Compare
ef1ce42
to
3832d0c
Compare
…epoint__sched_process_exit()
…xit' if raw tracepoint is supported
Static quality checks ❌Please find below the results from static quality gates Error
Gate failure full details
Successful checksInfo
|
pkg/network/usm/ebpf_ssl.go
Outdated
keysToDelete = append(keysToDelete, key) | ||
} | ||
} | ||
_, err = emap.BatchDelete(keysToDelete, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BatchDeleteAPI is supported from kernel 5.6
How did you ensure this code runs only when it can?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a back and forth here
Original comment asking - why don't you use batch-delete-api when possible
This comment highlighting you didn't handle kernels older than 5.6
and now you got back to deletion on key at a time
I missing the clarity of the reasoning why you didn't use batch-delete-api when it is possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The system-probe calls this function only if kernel is <4.17, in that case batch deletion is not supported. Unit test may call it on any kernel, but unit tests are not so time critical. Am I missing something?
@@ -208,10 +210,11 @@ func (w *Watcher) handleLibraryOpen(lib LibPath) { | |||
} | |||
|
|||
// Start consuming shared-library events | |||
func (w *Watcher) Start() { | |||
func (w *Watcher) Start(mapsCleaner func(map[uint32]struct{})) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why isn't it part of the constructor?
passing variables to store in the struct, should pass in the c'tor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ssl program constructor newSSLProgramProtocolFactory
initializes watcher before sslProgram
is allocated. Moving map cleaner to constructor would require refactoring newSSLProgramProtocolFactory
, is it worth it?
pkg/network/usm/ebpf_ssl_test.go
Outdated
if err != nil { | ||
require.NoError(t, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err != nil { | |
require.NoError(t, err) | |
} | |
require.NoError(t, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl_test.go
Outdated
t.Logf("pid '%d' was found in the map '%s'", pid, mapInfo.Name) | ||
ebpftest.DumpMapsTestHelper(t, monitor.DumpMaps, mapInfo.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we fail here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl_test.go
Outdated
require.Eventually(t, func() bool { | ||
return findKeyInMap(m, key) | ||
}, 1*time.Second, 100*time.Millisecond) | ||
if t.Failed() { | ||
t.Logf("pid '%d' not found in the map '%s'", pid, mapInfo.Name) | ||
ebpftest.DumpMapsTestHelper(t, monitor.DumpMaps, mapInfo.Name) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you tried getting a failure in lines 154-156? If so, did you see the log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, replaced require
with assert
pkg/network/usm/ebpf_ssl_test.go
Outdated
key := uint64(pid)<<32 | uint64(pid) | ||
value := make([]byte, m.ValueSize()) | ||
|
||
err := m.Put(&key, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you use unsafe.Pointer on the variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
// setup monitor | ||
cfg := utils.NewUSMEmptyConfig() | ||
cfg.EnableNativeTLSMonitoring = true | ||
cfg.EnableUSMEventStream = false | ||
|
||
if !usmconfig.TLSSupported(cfg) { | ||
t.Skip("SSL maps cleaner not supported for this platform") | ||
} | ||
|
||
monitor := setupUSMTLSMonitor(t, cfg, reInitEventConsumer) | ||
require.NotNil(t, monitor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need the monitor? can't you just use the watcher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
monitor is necessary to dump maps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it's not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you only need an ebpf manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot use watcher because it uses separate program manager and loads programs from shared-libraries/probes.h
(generated shared-libraries.c
) and runs appropriate manager while all SSL maps are located in usm.c
module and are not accessible from shared-libraries.c
pkg/network/usm/ebpf_ssl_test.go
Outdated
// find maps by names | ||
maps := getMaps(t, monitor, sslPidKeyMaps) | ||
require.Equal(t, len(maps), 6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// find maps by names | |
maps := getMaps(t, monitor, sslPidKeyMaps) | |
require.Equal(t, len(maps), 6) | |
// find maps by names | |
maps := getMaps(t, monitor, sslPidKeyMaps) | |
require.Equal(t, len(maps), len(sslPidKeyMaps)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl_test.go
Outdated
"github.com/DataDog/datadog-agent/pkg/ebpf/ebpftest" | ||
"github.com/cilium/ebpf" | ||
"github.com/stretchr/testify/require" | ||
|
||
"github.com/DataDog/datadog-agent/pkg/network/protocols/http/testutil" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"github.com/DataDog/datadog-agent/pkg/ebpf/ebpftest" | |
"github.com/cilium/ebpf" | |
"github.com/stretchr/testify/require" | |
"github.com/DataDog/datadog-agent/pkg/network/protocols/http/testutil" | |
"github.com/cilium/ebpf" | |
"github.com/stretchr/testify/require" | |
"github.com/DataDog/datadog-agent/pkg/ebpf/ebpftest" | |
"github.com/DataDog/datadog-agent/pkg/network/protocols/http/testutil" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl.go
Outdated
@@ -38,6 +38,7 @@ import ( | |||
"github.com/DataDog/datadog-agent/pkg/util/log" | |||
"github.com/DataDog/datadog-agent/pkg/util/safeelf" | |||
ddsync "github.com/DataDog/datadog-agent/pkg/util/sync" | |||
manager "github.com/DataDog/ebpf-manager" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move back to line 23
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be part of Datadog section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved back
pkg/network/usm/ebpf_ssl.go
Outdated
@@ -487,7 +503,7 @@ func (o *sslProgram) Name() string { | |||
} | |||
|
|||
// ConfigureOptions changes map attributes to the given options. | |||
func (o *sslProgram) ConfigureOptions(_ *manager.Manager, options *manager.Options) { | |||
func (o *sslProgram) ConfigureOptions(m *manager.Manager, options *manager.Options) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to get m
as you already have that in o.ebpfManager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl.go
Outdated
@@ -496,11 +512,17 @@ func (o *sslProgram) ConfigureOptions(_ *manager.Manager, options *manager.Optio | |||
MaxEntries: o.cfg.MaxTrackedConnections, | |||
EditorFlag: manager.EditMaxEntries, | |||
} | |||
o.addProcessExitProbe(m, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to get m
as you already have that in o.ebpfManager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl.go
Outdated
if features.HaveProgramType(ebpf.RawTracepoint) != nil { | ||
o.watcher.Start(o.cleanupDeadPids) | ||
} else { | ||
o.watcher.Start(nil) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if features.HaveProgramType(ebpf.RawTracepoint) != nil { | |
o.watcher.Start(o.cleanupDeadPids) | |
} else { | |
o.watcher.Start(nil) | |
} | |
cleanerCB := nil | |
if features.HaveProgramType(ebpf.RawTracepoint) != nil { | |
cleanerCB = o.cleanupDeadPids | |
} | |
o.watcher.Start(cleanerCB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl.go
Outdated
|
||
// cleanupDeadPids clears maps of terminated processes. | ||
func (o *sslProgram) cleanupDeadPids(alivePIDs map[uint32]struct{}) { | ||
if o.ebpfManager != nil && features.HaveProgramType(ebpf.RawTracepoint) != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are calling cleanupDeadPids
, you already know features.HaveProgramType(ebpf.RawTracepoint) != nil
and o.ebpfManager == nil
can never happen
so the condition is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/network/usm/ebpf_ssl.go
Outdated
for _, mapName := range sslPidKeyMaps { | ||
err := deleteDeadPidsInMap(o.ebpfManager, mapName, alivePIDs) | ||
if err != nil { | ||
log.Debugf("SSL maps %q cleanup error: %v", mapName, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Debugf("SSL maps %q cleanup error: %v", mapName, err) | |
log.Debugf("SSL map %q cleanup error: %v", mapName, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
// deleteDeadPidsInMap finds a map by name and deletes dead processes. | ||
// enters when raw tracepoint is not supported, kernel < 4.17 | ||
func deleteDeadPidsInMap(manager *manager.Manager, mapName string, alivePIDs map[uint32]struct{}) error { | ||
emap, _, err := manager.GetMap(mapName) | ||
if err != nil { | ||
return fmt.Errorf("dead process cleaner failed to get map: %q error: %w", mapName, err) | ||
} | ||
|
||
var keysToDelete []uint64 | ||
var key uint64 | ||
value := make([]byte, emap.ValueSize()) | ||
iter := emap.Iterate() | ||
|
||
for iter.Next(unsafe.Pointer(&key), unsafe.Pointer(&value)) { | ||
pid := uint32(key >> 32) | ||
if _, exists := alivePIDs[pid]; !exists { | ||
keysToDelete = append(keysToDelete, key) | ||
} | ||
} | ||
for _, k := range keysToDelete { | ||
_ = emap.Delete(&k) | ||
} | ||
|
||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the performance impact of this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested a modified branch with periodic cleaning for all kernel versions and included the performance results in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The codebase has two modes -
- Running on kernel 4.17 or later - should be covered by the load test & dogfooding
- Running on kernels 4.14-4.16 - how did you verify the performance looks good here?
What does this PR do?
Adds eBPF programs for
tracepoint/sched/sched_process_exit
(kernel < 4.17) andraw_tracepoint/sched_process_exit (kernel>=4.17)
These programs clean terminated processes from
ssl_read_args
andssl_read_ex_args
maps.Motivation
Significant amount of
E2BIG
errors was observed on staging clusters in January, 2025This error spike aligns with
mongodb
pod restarts, which were triggered by Kubernetes due to health check failures.The issue was reproduced on a local VM by stressing a simple SSL server with frequent connections. Terminating the server mid-operation led to stale entries accumulating in the relevant kernel maps.
Describe how you validated your changes
This code was tested on the staging cluster
oddish-c
, which frequently exhibitedE2BIG
errors for thessl_read_args
map. After deploying the fix, no errors occurred for 16 hours. See this metrics.Passed CI unit tests and e2e tests.

USM SSL load tests dashboard
shows slightly lower CPU and memory with
new code
compared tomaster
:The custom agent was deployed on staging cluster oddish-c between Feb 19, 2025 3pm and Feb 20, 8am, USM internal metrics in this dashboard.
Possible Drawbacks / Trade-offs
The system-probe runs periodic sync job each 30 seconds to detect loaded SSL shared ibraries.
This job also additionally will clean possible stale entries in SSL-related kernel maps on kernel < 4.17
Additional Notes
The codebase has two modes:
To test mode (2) I created a testing branch from the current one that emulates a lower kernel version and does periodic cleanup on any Linux version. Results of SSL load tests are available in this dashboard.
There is no significant difference in CPU and memory usage compared to the main branch.