Skip to content

Commit

Permalink
Importing all the files for the alpha release
Browse files Browse the repository at this point in the history
  • Loading branch information
inteltheo committed Jun 27, 2022
1 parent dc75d53 commit 6949b46
Show file tree
Hide file tree
Showing 30 changed files with 3,054 additions and 27 deletions.
8 changes: 8 additions & 0 deletions Contributors.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# This is the list of TCP-INT's contributors.

Bimmy Pujari <[email protected]>
Grzegorz Jereczek <[email protected]>
Jeongkeun Lee <[email protected]> contributed to design/algorithms
Simon Wass <[email protected]>
Theo Jepsen <[email protected]>
...
426 changes: 399 additions & 27 deletions LICENSE

Large diffs are not rendered by default.

263 changes: 263 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,263 @@
# TCP-INT: Lightweight In-band Network Telemetry for TCP

In-band Network Telemetry (INT) provides visibility into the state of the network and can be used
for monitoring and debugging. However, existing INT implementations do not make telemetry available
to end-hosts. TCP-INT addresses this shortcoming by delivering network telemetry directly to the
end-host TCP stack, enabling end-hosts to correlate local TCP state with the state of the network
fabric.

TCP-INT is implemented in the TCP header as a new TCP option with three fields:

- INTval: the link utilization (or queue depth if utilization is 100%).
- HopID: the ID of most congested switch (the packet’s TTL at the switch).
- SWLat: the sum of latencies experienced at each hop.

Each field has a corresponding echo-reply field (ecr) for the receiver to echo the telemetry back to
the sender.

### TCP-INT option format

```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+---------------+---------------+
| Kind = 0x72 | Length = 6 | INTval | INTecr |
+---------------+---------------+---------------+----------------
| HopID=IP.TTL | HopIDecr | SWLat (3B) ...
----------------+---------------+--------------------------------
... | SWLatEcr (3B) |
----------------+-----------------------------------------------+
```

### Workflow
- A process joins the `tcp-int` cgroup to indicate that its flows should be monitored by TCP-INT.
- When a packet is sent, the TCP-INT eBPF adds a TCP-INT header option to the TCP header.
- Upon receiving a packet with a TCP-INT header option, the switch updates the fields:
- pkt.INTval = switch.INTval if switch.INTval > pkt.INTval
- pkt.HopID = IP.TTL if switch.INTval > pkt.INTval
- pkt.SWLat += latency through this switch
- The eBPF receives the telemetry and (possibly) sends it to user-space for consumption.
- When the ACK is sent, the eBPF sets the ecr fields to the latest INT received on the send path.


## Repo Organization

The TCP-INT repository is organized as follows:
```
code:
┣ src/bpf: eBPF module
┣ src/tools: tool for loading, managing and monitoring the module
┣ tcp-int-exporter: user-space agent to export INT to gRPC server
┗ scripts: helper scripts
```

### Switch Code

The switch-side P4 and control plane implementations are provided in a special release of Intel P4 Studio (aka Tofino SDE). To obtain access to the SDE, contact your Intel Sales Representative. Academic researchers can apply for access through the [Intel Connectivity Research Program (ICRP)](https://www.intel.com/content/www/us/en/products/network-io/programmable-ethernet-switch/connectivity-education-hub/research-program.html).


## Getting Started

### Prerequisites

* Linux Kernel 5.10.x or later

### Tested on

* Linux 5.11.0-17-generic and Linux 5.13.0-28-generic
* Ubuntu 21.04 and Ubuntu 21.10

### Setup steps for all end-hosts (clients and servers)

Examples in this README assume that you are running as the 'root' user. If not, then most commands will likely need privilege escalation using `sudo`.

#### 1. Install TCP-INT Dependencies

TCP-INT requires the following packages to be installed:
* build-essential
* clang
* llvm
* gcc-multilib
* libelf-dev
* linux-tools
* pkg-config
* libbpf 0.7 (See LibBPF setup below)
```
# For Debian based distributions
apt install -y build-essential clang llvm gcc-multilib libelf-dev linux-tools-$(uname -r) pkg-config
```

#### 2. Download and install Libbpf

More detailed installation instructions and dependencies can be found on the Libbpf Github page:<br>
https://github.com/libbpf/libbpf/tree/v0.7.0
```
wget https://github.com/libbpf/libbpf/archive/refs/tags/v0.7.0.tar.gz
tar xzf v0.7.0.tar.gz -C /opt
cd /opt/libbpf-0.7.0/src
make
LIBDIR=/lib/x86_64-linux-gnu make install
```
Note: User might need pkg-config to build libbpf.

#### 3. Copy TCP-INT source to end-hosts

Copy this source code repository to each end-host where you intend to use TCP-INT. For the purpose of these instructions we will assume it is extracted to `/opt/tcp-int`.

#### 4. Optional: update vmlinux.h file for the current kernel version

BPF programs need to hook into data structures in memory of the current running kernel, to do this they need an up-to-date header file containing information about the running kernel.

* Additional packages may be required to install the bpftool application

```
cd /opt/tcp-int/include/
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux-$(uname -r).h
rm vmlinux.h
ln -s vmlinux-$(uname -r).h vmlinux.h
```

#### 5. Compile and deploy TCP-INT on the end-hosts

```
# Compile and install TCP-INT
cd /opt/tcp-int/src
make clean && make install
```
#### 6. Verify / set-up cgroups v2

TCP-INT requires cgroups v2 in order to work.
* The default 'load' command will attempt to create the following cgroup directory: `/sys/fs/cgroup/cgroup.tcp-int`
* The default expects /sys/fs/cgroup to already exist as a cgroup2 mount.
* An alternative path can be specified when loading with the `-c` option.
* For example: `tcp_int load -c /sys/fs/cgroup/unified/my-tcp-int.cgroup`

```
# Check if cgroups v2 are already mounted correctly
mount -l | grep cgroup2
# Example output 1: Ubuntu 21.04: Indicates that cgroups v2 is already mounted in a subdirectory of cgroup.
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
# Example output 2: Debian 11 (and ubuntu 21.10): Indicates that cgroups v2 is mounted at the top level cgroup directory.
# In this case you can simply create the 'unified' subdirectory or define a custom tcp-int cgroup path using the -c option.
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
```

### Loading TCP-INT and running test applications

#### 1. Load TCP-INT on the end-hosts

This step should be run on all end-hosts each time that node is rebooted. This loads the TCP-INT BPF program.

```
# First, attempt to unload TCP-INT in case it was not left in a clean state
/usr/local/lib/bpf/tcp-int/tcp_int unload
# Now load the TCP-INT BPF program
/usr/local/lib/bpf/tcp-int/tcp_int load
```

#### 2. Disable TCP Timestamps (optional)

TCP timestamps may need to be disabled on BOTH the client and server end-hosts, depending on the current version of TCP-INT support on the switches. If this is necessary, run the following command on ALL end-host nodes:
```
# Disable TCP timestamps
sysctl -w net.ipv4.tcp_timestamps=0
```

#### 3. Server-side setup

The role of the server in TCP-INT test applications is to echo back the altered TCP-INT values to the client in acknowledgement (ACK) messages.

```
# Start bash session in tcp-int cgroup
bash
echo $$ >> /sys/fs/cgroup/cgroup.tcp-int/cgroup.procs
# Start server
pkill iperf3
iperf3 -s -p 5001
```

#### 4. Client-side setup

The role of the client is to initialise the TCP-INT TCP option and also to receive and process the echoed TCP-INT replies coming back from the server.

* Setting up a test stream with TCP-INT TCP Option initialised
```
# Open a shell terminal on the client node in the tcp-int cgroup
bash
echo $$ >> /sys/fs/cgroup/cgroup.tcp-int/cgroup.procs
# Generate a tcp stream using TCP-INT-eBPF
iperf3 -c <iPerf SERVER-IP> -p 5001 -w 20M -t 120
```

* Open a second console to control and observer TCP-INT
```
# Optionally disable TCP-INT echo on the client to reduce overheads
/usr/local/lib/bpf/tcp-int/tcp_int ecr-disable
# Start full tracing
/usr/local/lib/bpf/tcp-int/tcp_int trace
```
* Alternatively, display TCP-INT histograms showing in-network link utilisation and queue depth
```
# Optionally disable TCP-INT echo on the client to reduce overheads
/usr/local/lib/bpf/tcp-int/tcp_int ecr-disable
# Show TCP-INT histograms
watch -t -p -n 1 timeout -s SIGINT 1 "/usr/local/lib/bpf/tcp-int/tcp_int hist-int-perid | column"
# The tcp_int help text lists other modes of operation
/usr/local/lib/bpf/tcp-int/tcp_int help
```

## Running applications locally with network namespaces

This was tested an Ubuntu 21.10 VM with kernel `5.13.0-28-generic`.

#### 1. Load TCP-INT on the host
```
sudo /usr/local/lib/bpf/tcp-int/tcp_int load
```

#### 2. Setup the network namespaces
```
./code/scripts/namespaces.sh
```

#### 3. Start the application in the namespaces
* Open a shell for the server, add it to the tcp-int cgroup, and launch iperf3 in ns1:
```
bash
echo $$ | sudo tee -a /sys/fs/cgroup/cgroup.tcp-int/cgroup.procs
sudo ip netns exec ns1 iperf3 -s -p 5001
```

* Open another shell for the client and start the iperf3 client in ns2:
```
bash
echo $$ | sudo tee -a /sys/fs/cgroup/cgroup.tcp-int/cgroup.procs
sudo ip netns exec ns2 iperf3 -c 10.0.0.1 -p 5001 -t 120
```

#### 4. Add a `tc` pedit filter to mimic the switch
Modify TCP packets (except for SYN/FIN) sent from ns2 to port 5001: set the tcp-int option fields (intval=5, id=6).
This `tc` filter uses hard-coded offsets that **assume TCP timestamps are enabled**.
```
sudo ip netns exec ns2 tc qdisc replace dev v2 root handle 1: htb
sudo ip netns exec ns2 tc filter add dev v2 parent 1: protocol ip flower ip_proto tcp tcp_flags 0x0/0x7 dst_port 5001 action pedit munge offset 54 u8 set 5 continue
sudo ip netns exec ns2 tc filter add dev v2 parent 1: protocol ip flower ip_proto tcp tcp_flags 0x0/0x7 dst_port 5001 action pedit munge offset 56 u8 set 6 continue
```

#### 5. Monitor with tcp-int trace
```
sudo /usr/local/lib/bpf/tcp-int/tcp_int trace
```

#### 6. Remove the `tc` pedit filter
```
sudo ip netns exec ns2 tc filter del dev v2 parent 1:
```
17 changes: 17 additions & 0 deletions changelog.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Changes to TCP-INT in version 0.1.0-alpha

## New features

* eBPF inserts TCP-INT option header into TCP packets being sent
* eBPF parses TCP-INT option header from received packets, and sends metrics to user-space for consumption
* Only programs (PIDs) in the tcp-int cgroup will have their TCP flows tagged by TCP-INT
* tcp_int user-space tool loads and configures eBPF
* User-space tcp_int tool prints trace of INT as it arrives
* The exporter receives INT events from the eBPF module and sends them as gRPC messages to a collector server
* Script and instructions for trying TCP-INT on a single server with network namespaces
* Script to plot timeseries of INT from trace generated by tcp_int tool

## Limitations

* Fine-grained flow filtering: currently, all TCP flows belonging to a PID in the tcp-int cgroup will be tagged. In addition, it would be useful to tag flows based on some filters (e.g., destination port, destination address, socket options).
* Testing: the tests in this release only exercise the basic functionality.
6 changes: 6 additions & 0 deletions code/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
*.o
*__pycache__
src/tools/tcp_int
src/bpf/tcp_int.skel.h
include/vmlinux*
tcp-int-exporter/src/tcp_int_exporter/tcp_int_exporter
37 changes: 37 additions & 0 deletions code/include/bits.bpf.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
/* Copyright (c) 2021-2022 Intel Corporation */
/* SPDX-License-Identifier: BSD-2-Clause */

#ifndef __BITS_BPF_H
#define __BITS_BPF_H

static __always_inline u64 log2(u32 v)
{
u32 shift, r;

r = (v > 0xFFFF) << 4;
v >>= r;
shift = (v > 0xFF) << 3;
v >>= shift;
r |= shift;
shift = (v > 0xF) << 2;
v >>= shift;
r |= shift;
shift = (v > 0x3) << 1;
v >>= shift;
r |= shift;
r |= (v >> 1);

return r;
}

static __always_inline u64 log2l(u64 v)
{
u32 hi = v >> 32;

if (hi)
return log2(hi) + 32;
else
return log2(v);
}

#endif /* __BITS_BPF_H */
42 changes: 42 additions & 0 deletions code/include/tcp_int_common.bpf.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
/* Copyright (C) 2022 Intel Corporation */
/* SPDX-License-Identifier: GPL-2.0-only */

#ifndef __TCP_INT_COMMON_BPF_H
#define __TCP_INT_COMMON_BPF_H

#include <bpf/bpf_helpers.h>
#include "tcp_int_opt.h"

#ifdef TCP_INT_DEBUG
#define tcp_int_printk(...) bpf_printk(__VA_ARGS__)
#else
#define tcp_int_printk(...)
#endif /* TCP_INT_DEBUG */

/* TCP INT state definition */
struct tcp_int_state {
bool pending_ecr; /* Indicates pending echo request */
tcp_int_val intvalecr; /* INT value to be echoed back (network order) */
tcp_int_id idecr; /* ID to be echoed back (network order) */
__u32 qdepth; /* Queue depth in data path */
tcp_int_lat swlatecr; /* Sum of switch latencies on data path */
};

/* Attaches INT state to socket */
struct {
__uint(type, BPF_MAP_TYPE_SK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct tcp_int_state);
} map_tcp_int_state SEC(".maps");

static inline struct tcp_int_state *tcp_int_get_state(struct bpf_sock *sk)
{
if (!sk)
return NULL;

return bpf_sk_storage_get(&map_tcp_int_state, sk, NULL,
BPF_SK_STORAGE_GET_F_CREATE);
}

#endif /* __TCP_INT_COMMON_BPF_H */
Loading

0 comments on commit 6949b46

Please sign in to comment.