217 changes: 217 additions & 0 deletions 1346-design.md
@@ -0,0 +1,217 @@
**Training Mode:**
During the training phase the system collects per-client metrics for each new event and aggregates global statistics required for subsequent z-score calculation in defence mode. For each client only the current maximum number of connections/requests is stored. Global statistics include:

- number of samples ("n", i.e. number of active clients),
- sum of values ("sum"),
- sum of squared values ("sumsq").

At the end of the training phase the mean and standard deviation are computed and later used for z-score calculation during anomaly detection.

Several approaches for online variance calculation were evaluated, including Welford’s algorithm and the sum/sumsq method.

The classical Welford algorithm was found to be unsuitable for this workload. In its original form Welford assumes an append-only stream of samples, where each new observation increases the total sample count. In our case, however, "n" represents the number of clients rather than the number of events. For each client we continuously update the current maximum number of connections or requests. Therefore, when a client metric changes, the previous value must first be removed from the statistics and only then the new value can be added. This requires a modified reversible version of Welford’s algorithm, which significantly complicates the implementation.

In addition, kernel-space constraints prohibit floating-point arithmetic, requiring the use of fixed-point integer arithmetic instead. While Welford’s algorithm is known for its excellent numerical stability with floating-point arithmetic, its fixed-point implementation introduces truncation errors during repeated division operations. In workloads where metric values remain relatively small and close to each other (e.g. connection/request maxima), these rounding errors accumulate over time and may lead to noticeable precision degradation.

Benchmarking (see `benchmark_training` folder) also demonstrated that the modified fixed-point Welford implementation is significantly slower than the alternative approach due to additional arithmetic operations, divisions, and the need to perform both removal and insertion for each update.

| Benchmark | Time | CPU | Iterations |
|---|---|---|---|
| BM_welford_fixed_point | 11.0 ns | 11.0 ns | 62528259 |
| BM_sum_sumsq | 3.44 ns | 3.43 ns | 208387347 |

As a result, the implementation uses the sum of values / sum of squares method (sum/sumsq method). This approach maintains:

- the sum of all values,
- the sum of squared values,
- the total number of clients.

The variance is then computed using the standard relation:

$$
\mathrm{Var}(X) = E[X^2] - E[X]^2
$$

This method is generally considered less numerically stable than Welford’s algorithm because subtracting two large close values may lead to catastrophic cancellation and precision loss. However, this issue primarily affects workloads with very large numbers and extremely small variance.

For the considered workload, where client metrics are bounded and remain relatively small, the sum/sumsq approach provides sufficient numerical accuracy while being substantially simpler and faster. It also maps naturally to the mutable per-client update model used by the system and avoids the complexity of reversible online variance algorithms.
(Note that accurate and stable tracking of memory and CPU consumption in streaming or long-running workloads may still require Welford’s algorithm.)
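As a rough illustration of the end-of-training step, the sketch below derives a scaled mean and standard deviation from `n`, `sum` and `sumsq` using only integer arithmetic. The helper names (`isqrt_u64`, `stats_finalize`) and the choice to keep the variance scaled by 2^(2 * SCALE_SHIFT) before taking the root are assumptions made for this example, not the actual kernel code; it is written for user space with `<stdint.h>` types.

```C
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT	10	/* fixed-point scale, as in the document */

/* Integer square root (floor), computed bit by bit without floating point. */
static uint64_t
isqrt_u64(uint64_t v)
{
	uint64_t res = 0, bit = 1ULL << 62;

	while (bit > v)
		bit >>= 2;
	while (bit) {
		if (v >= res + bit) {
			v -= res + bit;
			res = (res >> 1) + bit;
		} else {
			res >>= 1;
		}
		bit >>= 2;
	}
	return res;
}

/*
 * Hypothetical helper: derive mean and std from the accumulators.
 * Both results are scaled by 2^SCALE_SHIFT. The shifts assume that
 * sum and sumsq are small enough not to overflow 64 bits.
 */
static void
stats_finalize(uint64_t n, uint64_t sum, uint64_t sumsq,
	       uint64_t *mean, uint64_t *std)
{
	uint64_t ex2, mean2, var;

	assert(n > 0);
	*mean = (sum << SCALE_SHIFT) / n;
	/* E[X^2] and E[X]^2, both scaled by 2^(2 * SCALE_SHIFT). */
	ex2 = (sumsq << (2 * SCALE_SHIFT)) / n;
	mean2 = *mean * *mean;
	/* Truncation can push the difference slightly below zero. */
	var = ex2 > mean2 ? ex2 - mean2 : 0;
	/* The root halves the exponent, so std is scaled by 2^SCALE_SHIFT. */
	*std = isqrt_u64(var);
}
```

For the sample maxima (100, 105, 103, 98, 110) this gives a scaled mean of 105676 (about 103.2) and a scaled standard deviation of 4286 (about 4.19, against an exact value of about 4.17); the difference comes from truncating the mean before squaring it.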

**Defence Mode**
Each new observation is evaluated using `z = ((x - mean) << SCALE_SHIFT) / std`, where `SCALE_SHIFT = 10` is the fixed-point scaling factor used for integer arithmetic. Kernel code avoids floating-point operations, so all fractional calculations (e.g. mean, variance, z-score) are performed using scaled integers. If `z > configured_threshold`, the event is considered anomalous: the request or connection is rejected, the connection is dropped with TCP RST, and the client is optionally blocked by IP.
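A minimal user-space sketch of this check follows. It assumes that `mean`, `std` and the threshold are all scaled by 2^SCALE_SHIFT (for example, a threshold of 3.0 is passed as `3 << SCALE_SHIFT`), so the raw observation is scaled before the subtraction; the function name `is_anomalous` and the clamping of a zero `std` to the smallest fixed-point step are illustrative assumptions.

```C
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCALE_SHIFT	10

/*
 * Assumption: @mean, @std and @threshold are scaled by 2^SCALE_SHIFT,
 * @x is a raw (unscaled) counter value.
 */
static bool
is_anomalous(uint64_t x, int64_t mean, int64_t std, int64_t threshold)
{
	int64_t diff, z;

	assert(std >= 0);
	diff = (int64_t)(x << SCALE_SHIFT) - mean;
	/* Values at or below the mean are never anomalous here. */
	if (diff <= 0)
		return false;
	/* Avoid division by zero when all training samples were equal. */
	if (std == 0)
		std = 1;
	z = (diff << SCALE_SHIFT) / std;
	return z > threshold;
}
```

With a scaled mean of 105676 and a scaled standard deviation of 4286, a threshold of `3 << SCALE_SHIFT` accepts 110 concurrent connections and rejects 120.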

**Disabled Mode**
Internal state used during transitions. It ensures safe updates of shared data (via RCU synchronization). I also think it is better to expose this state as a configurable mode, not only as an internal state, to avoid any unnecessary calculations (for example, when the administrator doesn't need this security feature at all).

**Connection Count Tracking**
In the `TfwClient` structure we additionally store `unsigned int conn_max`, `int conn_curr` and `unsigned int conn_training_epoch`. We don't need any lock here, because all these fields are updated under the private `ra->lock` in frang.
We use the newly implemented function `tfw_client_training_adjust_conn_num` for both training and defence modes.

**Training mode**
`conn_curr` is incremented/decremented.
Track the maximum number of concurrent connections (`conn_max`). When the maximum increases, compute `delta1 = new_max - old_max` and `delta2 = new_max² - old_max²` and use these values to update `sum` and `sumsq`.
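The update described above can be sketched as follows. `TrainStats`, `Client`, `conn_opened` and `conn_closed` are hypothetical stand-ins for the real structures and functions, locking is omitted (the document relies on `ra->lock`), and `n` is bumped only the first time a client contributes a sample.

```C
#include <stdint.h>

/* Hypothetical global accumulators: n clients, sum and sum of squares. */
typedef struct {
	uint64_t n;
	uint64_t sum;
	uint64_t sumsq;
} TrainStats;

/* Subset of the per-client fields named in the document. */
typedef struct {
	unsigned int conn_max;
	int conn_curr;
} Client;

/* A connection was opened: update the client and, if needed, the stats. */
static void
conn_opened(TrainStats *s, Client *c)
{
	uint64_t old_max = c->conn_max, new_max;

	c->conn_curr++;
	if (c->conn_curr <= 0 || (uint64_t)c->conn_curr <= old_max)
		return;

	new_max = (uint64_t)c->conn_curr;
	/* The client contributes its first sample. */
	if (!old_max)
		s->n++;
	/* Replace old_max by new_max in both sums in O(1). */
	s->sum += new_max - old_max;			/* delta1 */
	s->sumsq += new_max * new_max - old_max * old_max;	/* delta2 */
	c->conn_max = (unsigned int)new_max;
}

/* A connection was closed: the maximum and the global stats stay as is. */
static void
conn_closed(Client *c)
{
	c->conn_curr--;
}
```

Because only the maximum is replaced, no removal step is needed and the global sums always equal the sums over the current per-client maxima.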
Contributor Author

What are sum and sumsq? If they are the mean and standard deviation, then the
computation is wrong for Welford:

n += 1
delta  = new_max - mean
mean  += delta / n
delta2 = new_max - mean
M2    += delta * delta2

Contributor

delta = curr - old_max;
s->sum += delta (per cpu)
n (not always +1), because we increment it only when the client is new!
total_sum = percpu_counter_sum(&s->sum);
s->mean = (total_sum << SCALE_SHIFT) / num_clients;

"sum" and "sumsq" are accumulated values used for online calculation of the mean and standard deviation without storing the full history of samples.

The algorithm keeps:

- "sum = Σx"
- "sumsq = Σx²"

which allows computing:

- mean: `sum / n`
- variance: `sumsq / n - (sum / n)²`

This is a classic streaming statistics approach commonly referred to as the “sum/squared-sum” method or “naive variance algorithm” (Wikipedia, “Algorithms for calculating variance”).
This approach is efficient because:

- O(1) update cost,
- no historical samples must be stored,
- naturally supports per-CPU counters and lockless aggregation,
- very cheap for hot-path telemetry.
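The per-CPU point can be illustrated with a small user-space sketch: each CPU adds deltas to its own accumulator and a slow path folds them together. The array, the `NR_CPUS_DEMO` constant and the function names are assumptions for the example; in the kernel the same role would be played by per-CPU counters such as the `percpu_counter` used in the snippet above.

```C
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_DEMO	4	/* hypothetical CPU count for the example */

typedef struct {
	int64_t sum;
	int64_t sumsq;
} PcpuAcc;

static PcpuAcc pcpu_acc[NR_CPUS_DEMO];

/* Hot path: touch only the accumulator that belongs to the current CPU. */
static void
acc_update(unsigned int cpu, int64_t old_max, int64_t new_max)
{
	assert(cpu < NR_CPUS_DEMO);
	pcpu_acc[cpu].sum += new_max - old_max;
	pcpu_acc[cpu].sumsq += new_max * new_max - old_max * old_max;
}

/* Slow path: fold all per-CPU parts into the global totals. */
static void
acc_merge(int64_t *sum, int64_t *sumsq)
{
	unsigned int cpu;

	*sum = 0;
	*sumsq = 0;
	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++) {
		*sum += pcpu_acc[cpu].sum;
		*sumsq += pcpu_acc[cpu].sumsq;
	}
}
```

Since the deltas are additive, it does not matter on which CPU a given client's update lands: the merged totals are the same as with a single shared accumulator.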

However, the algorithm may become numerically unstable when:

- the mean is very large,
- variance is very small,
- or values are extremely close relative to the magnitude of the mean.

For example, the following dataset (1000000000, 1000000001, 999999999, 1000000000) may become problematic. In this case the mean is ≈ 1,000,000,000 and the standard deviation is ≈ 1.
The variance computation subtracts two extremely large, nearly identical numbers, which can cause catastrophic cancellation and precision loss. In contrast, a connection telemetry workload such as (100, 105, 103, 98, 110) is generally safe because:

- values are integer-based,
- the variance is reasonably large relative to the mean,
- the dynamic range is moderate.

According to our investigation (described at the beginning of the document), for connections and requests this simplest algorithm is better in terms of both performance and accuracy.
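The effect can be reproduced in user space with IEEE 754 doubles. The sketch below is only an illustration of the cancellation problem (the kernel code uses scaled integers, not doubles); `var_naive` and `var_two_pass` are names chosen for the example, and the two-pass variant serves as a reference.

```C
#include <stddef.h>

/* Naive (sum/sumsq) population variance. */
static double
var_naive(const double *x, size_t n)
{
	double sum = 0, sumsq = 0, mean;
	size_t i;

	for (i = 0; i < n; i++) {
		sum += x[i];
		sumsq += x[i] * x[i];
	}
	mean = sum / n;
	/* Two large, nearly equal terms are subtracted here. */
	return sumsq / n - mean * mean;
}

/* Two-pass reference: compute the mean first, then squared deviations. */
static double
var_two_pass(const double *x, size_t n)
{
	double sum = 0, acc = 0, mean;
	size_t i;

	for (i = 0; i < n; i++)
		sum += x[i];
	mean = sum / n;
	for (i = 0; i < n; i++)
		acc += (x[i] - mean) * (x[i] - mean);
	return acc / n;
}
```

For the first dataset the true population variance is 0.5; the two-pass variant returns it, while the naive variant does not, because the squares (about 10^18) no longer fit into the 53-bit mantissa of a double. For (100, 105, 103, 98, 110) both variants agree on about 17.36.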

**Defence mode**
Track `conn_curr` on each newly opened connection. Calculate `z = ((conn_curr - mean) << SCALE_SHIFT) / std`; if `z > threshold`, reject the connection and block the client by IP if necessary.

**Epoch handling**
Each connection is tagged with a training epoch (we add a new field to `tempesta_sock` and save the epoch in it), and we also add `conn_training_epoch` to the `TfwClient` structure. We need epoch handling to zero the history from the previous training and to prevent mixing old and new training data. When we call `tfw_client_training_adjust_conn_num` (the function for both training and defence modes), we first check `if (delta < 0 && *training_epoch < g_training_epoch)` and return immediately if the condition is true (`delta < 0` means that the connection is being dropped, and the epoch check means that it belongs to a previous epoch). If `delta > 0`, we set `*training_epoch = g_training_epoch` for the newly established connection (a connection being opened always belongs to the new epoch if training is enabled). In training mode we also check `if (cli->conn_training_epoch < g_training_epoch)` to zero all client training data (`conn_curr` and `conn_max`).
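The epoch checks above can be put together in a small user-space sketch. The `Client` type, the `adjust_conn_num` name, the boolean return value and the order of the checks are assumptions for illustration; updating the global `sum`/`sumsq` and locking are left out.

```C
#include <stdbool.h>

/* Subset of the TfwClient fields named in the document. */
typedef struct {
	unsigned int conn_max;
	int conn_curr;
	unsigned int conn_training_epoch;
} Client;

static unsigned int g_training_epoch = 1;

/*
 * @training_epoch points to the epoch tag stored in the connection.
 * Returns false when the event belongs to a previous epoch and is ignored.
 */
static bool
adjust_conn_num(Client *cli, unsigned int *training_epoch, int delta)
{
	/* A connection opened in a previous epoch is being closed. */
	if (delta < 0 && *training_epoch < g_training_epoch)
		return false;

	/* First event of this client in the new epoch: drop old history. */
	if (cli->conn_training_epoch < g_training_epoch) {
		cli->conn_curr = 0;
		cli->conn_max = 0;
		cli->conn_training_epoch = g_training_epoch;
	}

	/* A newly opened connection always belongs to the current epoch. */
	if (delta > 0)
		*training_epoch = g_training_epoch;

	cli->conn_curr += delta;
	if (cli->conn_curr > 0 && (unsigned int)cli->conn_curr > cli->conn_max)
		cli->conn_max = (unsigned int)cli->conn_curr;
	return true;
}
```

Without the first check, a close event for a connection counted in the old epoch would decrement a counter that was already reset, driving `conn_curr` below the real number of connections opened in the new epoch.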

Contributor Author

We already account memory for TfwClient and we should also implement the z-score
logic for memory as well. The reason is that it could be hard to set the memory limit
or it can be too permissive. Also this is a transition to #488 for adaptive client
classification. This should work so that the current client_mem must never be reached, but
depending on the values computed during training mode, a client can be blocked on
smaller memory consumption.

**Request Count Tracking (Non-idempotent)**
We implement the `TfwTrainingStat` structure to track all training events except connections.
```C
/*
 * @max   - maximum observed value of the tracked metric within the
 *          current training epoch (e.g. peak number of in-flight
 *          non-idempotent requests);
 * @curr  - current value of the tracked metric;
 * @lock  - spinlock for serialized reset of @max and @curr when a
 *          new training epoch starts;
 * @epoch - training epoch identifier. Compared against the global
 *          @g_training_epoch to detect epoch change and trigger
 *          reinitialization of @max and @curr.
 */
typedef struct {
	atomic64_t	max;
	atomic64_t	curr;
	spinlock_t	lock;
	unsigned int	epoch;
} TfwTrainingStat;
Contributor Author

With atomics and a spin lock in TfwTrainingStat I assume this data structure is going
to be frequently updated from all CPUs, so we're going to get hard contention and
false sharing. We either need to account statistics per-CPU and later merge them, or
process TfwClient per-CPU using the message bus. It's not necessary to implement the
last one (if we decide to go this way), but it must be well designed at this moment,
and the code must be developed according to the future changes so that it need not be
rewritten later.

Contributor @EvgeniiMekhanik, May 11, 2026

The spin lock is used only for protection while the training epoch changes.

```
We use the newly implemented function `tfw_client_training_adjust_req_num` for both training and defence modes.
Contributor Author

And what does the function do?


**Training mode**
Track `curr`, the current number of in-flight non-idempotent requests. Increment `curr` in `tfw_http_req_enlist` and decrement it in `tfw_http_req_nip_delist`. Also track `max`, the maximum number of in-flight non-idempotent requests per client. When `max` increases, update the global training stats the same way as for connections (`delta1 = new_max - old_max` and `delta2 = new_max² - old_max²`).
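One way to track `curr` and `max` without a lock on the hot path is a compare-and-swap loop on `max`. The sketch below uses C11 `<stdatomic.h>` in user space as a stand-in for the kernel's `atomic64_t`; the `Stat` type and the `stat_inc`/`stat_dec` helpers are hypothetical, and the returned old and new maxima are what the caller would turn into `delta1` and `delta2`.

```C
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-in for the @max and @curr fields of TfwTrainingStat. */
typedef struct {
	_Atomic int64_t max;
	_Atomic int64_t curr;
} Stat;

/*
 * Account one more in-flight request. Returns true and fills @old_max and
 * @new_max when this call raised the maximum.
 */
static bool
stat_inc(Stat *st, int64_t *old_max, int64_t *new_max)
{
	int64_t curr = atomic_fetch_add(&st->curr, 1) + 1;
	int64_t max = atomic_load(&st->max);

	while (curr > max) {
		/* On failure @max is reloaded with the value seen in memory. */
		if (atomic_compare_exchange_weak(&st->max, &max, curr)) {
			*old_max = max;
			*new_max = curr;
			return true;
		}
	}
	return false;
}

/* Account a request leaving the queue; the maximum is left untouched. */
static void
stat_dec(Stat *st)
{
	atomic_fetch_sub(&st->curr, 1);
}
```

Only the caller that actually wins the compare-and-swap reports a new maximum, so each increase of `max` is folded into the global sums exactly once.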

**Defence mode**
Change the return type of `tfw_http_req_enlist` from `void` to `bool`. Call `tfw_client_training_adjust_req_num` on each new non-idempotent request, calculate the z-score, and return false if `z > threshold`. `tfw_http_req_enlist` is called from `tfw_http_req_fwd` and `tfw_http_req_fwd_resched`; these functions now return T_BLOCK if `tfw_http_req_enlist` fails.
Contributor Author

No, we compute z-score only in training mode. In defence (protection) mode we only
compare the computed value with the current number of non-idempotent requests in flight.
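This idea can be sketched as precomputing a single limit when training ends, so that the hot path is reduced to one integer comparison. The function names and the assumption that `mean`, `std` and the threshold are all scaled by 2^SCALE_SHIFT are mine, not taken from the patch.

```C
#include <stdbool.h>
#include <stdint.h>

#define SCALE_SHIFT	10

/*
 * Computed once at the end of training:
 *   limit = floor(mean + threshold * std), as an unscaled counter value.
 * All three arguments are assumed to be scaled by 2^SCALE_SHIFT.
 */
static uint64_t
training_limit(uint64_t mean, uint64_t std, uint64_t threshold)
{
	/* threshold * std is scaled twice, so shift it back once. */
	uint64_t scaled = mean + ((threshold * std) >> SCALE_SHIFT);

	return scaled >> SCALE_SHIFT;
}

/* Hot path in protection mode: a single comparison, no division. */
static bool
over_limit(uint64_t curr, uint64_t limit)
{
	return curr > limit;
}
```

With a scaled mean of 105676, a scaled standard deviation of 4286 and a threshold of `3 << SCALE_SHIFT`, the limit is 115, so the 116th concurrent request is the first one rejected.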

Contributor

And this is an important note. It implies that in training mode we can only compute local values and merge them when we have finished processing the current client. We can use per-CPU counters. But in defence mode we can collect and sum all per-CPU counters at the beginning of processing the client and cache the result to compare with the z-score, even for each request.

Contributor

As I understand it, we calculate mean and std at the end of training mode (when switching to defence mode).
Then for each new value we calculate `*z_score = ((s64)(val << SCALE_SHIFT) - s->mean) / s->std;`
and compare it with the configured threshold.

Callers of `tfw_http_req_fwd` and `tfw_http_req_fwd_resched` send a 403 error response, drop the client connection with TCP RST and block the client by IP if these functions return T_BLOCK.

**Epoch handling**
Each request is tagged with `training_epoch` to prevent mixing old and new training data (we add a new field to the `request` structure and save the epoch in it). When a request is removed from the server connection queue, we don't update the `curr` field if the request belongs to a previous epoch. (A request added to the server connection queue always belongs to the new epoch if training is enabled.)
Contributor Author

Why do we need this? IIUC this is for the case when net.tempesta.training is
changed many times, i.e. there are many transitions between training and protection
modes (maybe with disabled as well). It seems this is a sophistication just to avoid
starting training from absolute zero and to use requests in flight instead. Probably, this is not
so big a win to justify the sophistication, at least in the first implementation.


**Current method and alternatives**
The same problems and alternatives as for connections.

**CPU Tracking**
Contributor Author

If we have cheap and precise enough nanoseconds time, then the current proposal should work. Meantime, I want to propose an alternate or additional change to block malicious users by CPU consumption.

Rework http_body_chunk_cnt and http_header_chunk_cnt limits as it's hard to unify the values for many-headers long messages and short-headers short-body messages.

Instead we need to detect artificially lowered chunk sizes for HTTP/1 and DATA and CONTINUATION frames for HTTP/2.

We can do this with learning average DATA and CONTINUATION frame size in HTTP/2 and/or data chunk (skb-carried, not HTTP chunk) for both HTTP/1 and HTTP/2.

We should account the average (for training and protection modes) ONLY for multi-chunk messages. I.e. if a message has zero or 1 CONTINUATION or DATA frames, then we do not compute the average for it.

We learn and analyze the average chunk size, where a chunk is a CONTINUATION or DATA frame for HTTP/2 or an skb data chunk in HTTP/1 (not an HTTP chunk). It is essentially total_size / chunks_number, where total_size is the total body or headers size.

The average chunk size is about a kilobyte, maybe several kilobytes (with GRE), and we need to catch extremely small chunk sizes. Not only that (it's probably OK to have several occasional small chunks), but also the case when a client sends a lot of small chunks. I.e. I propose to learn and detect the product N / average_chunk_size * chunks_number - this feature should deviate strongly between normal and attacking connections.

In comparison with the current http_body_chunk_cnt and http_header_chunk_cnt limits:

  1. this scheme normally handles small messages consisting of 1 small chunk
  2. normally handles large messages consisting of many large chunks
  3. triggers on large messages consisting of many small chunks
  4. these parameters are learnt from traffic and we don't need to specify the hard-to-define limits
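A possible shape of this feature is sketched below. The comment leaves `N` undefined, so here it is treated as a plain scaling constant (`CHUNK_SCORE_N`, a hypothetical name and value); the function name and the reordering of the arithmetic to reduce integer truncation are also assumptions.

```C
#include <stdint.h>

#define CHUNK_SCORE_N	1024	/* hypothetical scaling constant "N" */

/*
 * Score of one message: N / average_chunk_size * chunks_number, where
 * average_chunk_size = total_size / chunks_number.
 * Messages with fewer than two chunks are not scored and return 0.
 */
static uint64_t
chunk_score(uint64_t total_size, uint64_t chunks_number)
{
	uint64_t avg;

	if (chunks_number < 2 || !total_size)
		return 0;

	avg = total_size / chunks_number;
	if (!avg)
		avg = 1;
	/* Multiply first so that small N / avg ratios are not lost. */
	return CHUNK_SCORE_N * chunks_number / avg;
}
```

For a 64 KB body this yields 64 when it arrives in 64 chunks of 1 KB and 262144 when it arrives in 4096 chunks of 16 bytes, which is the kind of gap between normal and attacking traffic the proposal relies on.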

Contributor

Why nanoseconds and not CPU cycles?

Contributor Author

Discussed on today's call that this solution would be only for HTTP/2 framing attacks, not generic CPU attacks in the sense of #488 (e.g. think about ReDoS or parser-specific attacks).

In addition to `TfwTrainingStat`, implement the following structure and a per-CPU array of these structures.
```C
/**
* Exponential moving average (EMA) tracker for per-CPU time usage.
Contributor Author

Good idea, I think EMA should work well here

*
* The structure is used to accumulate execution time deltas and maintain
* a smoothed estimate (EMA) of CPU consumption.
*
* @last_ts - timestamp of the last update (in ns). Used to compute
Contributor Author

Isn't it expensive to get time with ns accuracy?

Contributor

I will try to change it to jiffies and check whether that is OK.

* time delta between consecutive measurements;
* @ema - current exponential moving average of CPU usage;
* @pending_cpu - accumulated raw CPU time (in ns) since the last EMA
* update. This value is periodically folded into @ema;
*/
typedef struct {
	u64	last_ts;
	s64	ema;
	u64	pending_cpu;
} TfwCpuEma;
Contributor Author

This is per TfwClient, right?

Contributor

Yes

```
Save the time at the beginning of a SoftIRQ shot and check CPU usage at the end of the SoftIRQ shot (to prevent the performance regression we would get by doing it on each request).
Contributor Author

Is this begin_time?

In one softirq shot we process many requests - how can we apply begin_time to all of them?

I think begin_time should be the time of receiving an skb. We can save the time somewhere (e.g. in a static per-cpu variable) - when we get an skb we do not know the client. But we need to call tfw_client_update_cpu_ema() not only on forwarding an HTTP message, but also on error responses. At all these calls we should know the socket and TfwClient.

Contributor

I think that we can do it same as we do it for client_mem. We save begin_time at the beginning of ss_tcp_process_data and check at the connection_recv_finish callback. For client mem we do it to prevent performance degradation, I think for CPU we can do the same.

Contributor

Socket is known in ss_tcp_process_data, we can get client from sk_user_data (connection)->client same as we do for client mem.

Contributor

Responses are a different case: we can do it in the process_resp function, the same way as we do for client_mem.


**Training mode**
Calculate `delta_cpu = now - begin_time;`, update CPU ema.
```C
/**
* Update per-client CPU usage EMA.
* @cpu_ema: per-CPU EMA state for the client.
* @delta_cpu: CPU time consumed since the last measurement (in ns).
*
* Accumulates raw CPU time in @pending_cpu and periodically folds it
* into the exponential moving average (@ema).
*
* The update is performed only if enough time (@min_time_to_adjust)
* has passed since the previous update to avoid excessive noise and
* high-frequency recalculations.
*
* The function:
* - computes elapsed time (@dt);
* - converts accumulated CPU time into normalized usage value;
* - applies time-based decay (older history loses weight);
* - updates EMA using a combination of decay and smoothing factor.
Contributor Author

The 2 bullets above are just the idea of EMA, right?

Contributor

yes

*/
static void
tfw_client_update_cpu_ema(TfwCpuEma *cpu_ema, u64 delta_cpu)
{
	u64 now = ktime_get_ns();
Contributor Author

OK, this one looks fast enough

Contributor

> OK, this one looks fast enough

This is fast enough only for one call per SoftIRQ shot. Calling this function often will cause performance degradation.

	u64 dt = now - cpu_ema->last_ts;
	u64 usage, decay, total_cpu = 0;
	static const u64 time_to_forget_ns = 100000000;
	static const u64 min_time_to_adjust = 1000;
	static const unsigned int ema_alpha_shift = 4;

	cpu_ema->pending_cpu += delta_cpu;
	if (unlikely(dt < min_time_to_adjust))
		return;

	cpu_ema->last_ts = now;
	swap(cpu_ema->pending_cpu, total_cpu);
	usage = (total_cpu << SCALE_SHIFT) / dt;
	decay = (dt << SCALE_SHIFT) / time_to_forget_ns;
Contributor Author

What is SCALE_SHIFT?

Contributor

It is for more accurate calculation. There is no floating-point arithmetic in the kernel, so we use SCALE_SHIFT as a fixed-point scale.

Contributor Author

Depending on SCALE_SHIFT we are likely to always get 0 here, since we enter this code every 1000 ns and divide by 0.1 s.


	if (decay > (1 << SCALE_SHIFT))
		decay = 1 << SCALE_SHIFT;
	cpu_ema->ema = cpu_ema->ema *
		       ((1 << SCALE_SHIFT) - decay) >> SCALE_SHIFT;
	cpu_ema->ema += ((s64)usage - (s64)cpu_ema->ema) >> ema_alpha_shift;
Contributor Author

In the original EMA the old EMA value and the new observation are summed with coefficients alpha and 1 - alpha, but in your case decay is independent of alpha - why?

}
Contributor Author

It is always easy to get the math wrong, so I strongly propose starting from a unit test showing the algorithm's behavior on different data, see for example t/unit/user_space/percentiles.c

```
Pass `delta = new_ema - prev_ema` to `tfw_client_training_adjust_cpu_num`, which does the same as `tfw_client_training_adjust_req_num`.
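To make the behavior of the update routine easier to inspect, here is a user-space port of the proposed code that takes the current time as a parameter instead of calling `ktime_get_ns()`. The names `CpuEma`, `ema_decay` and `cpu_ema_update` are chosen for this sketch; the constants are copied from the proposal, and the arithmetic keeps the same right shifts (which assume an arithmetic shift for negative values, as on GCC and Clang).

```C
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT	10

typedef struct {
	uint64_t last_ts;
	int64_t ema;
	uint64_t pending_cpu;
} CpuEma;

static const uint64_t time_to_forget_ns = 100000000;
static const uint64_t min_time_to_adjust = 1000;
static const unsigned int ema_alpha_shift = 4;

/* Scaled share of history to forget after @dt ns, capped at 1.0. */
static uint64_t
ema_decay(uint64_t dt)
{
	uint64_t decay = (dt << SCALE_SHIFT) / time_to_forget_ns;

	return decay > (1ULL << SCALE_SHIFT) ? 1ULL << SCALE_SHIFT : decay;
}

/* Port of the proposed update with @now passed in by the caller. */
static void
cpu_ema_update(CpuEma *e, uint64_t now, uint64_t delta_cpu)
{
	uint64_t dt, usage, total_cpu;

	assert(now >= e->last_ts);
	dt = now - e->last_ts;
	e->pending_cpu += delta_cpu;
	if (dt < min_time_to_adjust)
		return;

	e->last_ts = now;
	total_cpu = e->pending_cpu;
	e->pending_cpu = 0;
	/* CPU time per wall-clock time, scaled by 2^SCALE_SHIFT. */
	usage = (total_cpu << SCALE_SHIFT) / dt;
	/* Time-based decay of the old value. */
	e->ema = (e->ema * (int64_t)((1ULL << SCALE_SHIFT) - ema_decay(dt)))
		 >> SCALE_SHIFT;
	/* Smoothing step with alpha = 1 / 2^ema_alpha_shift. */
	e->ema += ((int64_t)usage - e->ema) >> ema_alpha_shift;
}
```

With these constants `ema_decay()` is 0 for any interval shorter than 97657 ns, so updates arriving every few microseconds apply no time-based decay at all; this is the truncation effect raised in the review comment about `SCALE_SHIFT`.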

**Defence mode**
In defence mode use `delta_ema` on each SoftIRQ shot to calculate `z = (delta_ema - mean) / std` and if calculated `z > threshold` reject connection with TCP RST and block client by IP if necessary.
Contributor Author

Say we process requests for 1K clients in one SoftIRQ shot; then all of them will use the same begin_time and a different now timestamp - the last client processed has the largest CPU time and the first one the lowest. This is a computation bug.

Contributor

All requests belong to the same client (we process only one socket during ss_tcp_process_data). Yes, it is not accurate, but we do the same for client_mem to prevent performance degradation.


**Current method and alternatives**

**Alternatives**
1. Use raw CPU time
Contributor Author

What does 'raw CPU time' mean exactly?

Contributor @EvgeniiMekhanik, May 19, 2026

Do not use EMA; use the time difference directly.

* ✔ simple
* ✔ accurate
* ❌ very noisy
* ❌ strong peaks
* ❌ bad normalization
2. Sliding window average (store CPU usage for the last N ms)
Contributor Author

And how it's better/worse than EMA?

3. Use `ema` directly. Currently we measure change, not level (constant high CPU → delta ≈ 0 → no detection).
Contributor Author

This is unclear

Contributor

I mean that we pass the EMA delta, so we measure the change in CPU usage.


3 changes: 3 additions & 0 deletions benchmark_training/gen.sh
@@ -0,0 +1,3 @@
#! /bin/bash

g++ mybenchmark.cc -std=c++11 -isystem benchmark/include -Iebtree -Iheap -Lbenchmark/build/src -lbenchmark -o mybenchmark