importing all files for v0.3.0 release
srwass committed Sep 30, 2022
1 parent 6949b46 commit c372d3c
Showing 24 changed files with 596 additions and 114 deletions.
8 changes: 4 additions & 4 deletions LICENSE
@@ -1,25 +1,25 @@
All files in this project are:

Copyright (c) 2021 Intel Corporation or Copyright 2021 Intel Corporation.
Copyright (c) <years> Intel Corporation or Copyright 2021 Intel Corporation.
All rights reserved.

The files containing Copyright 2021 Intel Corporation and the
The files containing Copyright <years> Intel Corporation and the
following SPDX-License-Identifier line in a comment near the beginning:

SPDX-License-Identifier: Apache-2.0

are distributed under the Apache license shown below.


The files containing Copyright (c) 2021 Intel Corporation and the
The files containing Copyright (c) <years> Intel Corporation and the
following SPDX-License-Identifier line in a comment near the beginning:

SPDX-License-Identifier: GPL-2.0-only

are distributed under the GNU General Public License, version 2.


The files containing Copyright (c) 2021 Intel Corporation and the
The files containing Copyright (c) <years> Intel Corporation and the
following SPDX-License-Identifier line in a comment near the beginning:

SPDX-License-Identifier: BSD-2-Clause
38 changes: 29 additions & 9 deletions README.md
@@ -10,7 +10,7 @@ TCP-INT is implemented in the TCP header as a new TCP option with three fields:

- INTval: the link utilization (or queue depth if utilization is 100%).
- HopID: the ID of the most congested switch (the packet’s TTL at the switch).
- SWLat: the sum of latencies experienced at each hop.
- HopLat: the sum of latencies experienced across all hops.

Each field has a corresponding echo-reply field (ecr) for the receiver to echo the telemetry back to
the sender.
@@ -20,13 +20,13 @@ the sender.
```
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+---------------+---------------+
| Kind = 0x72 | Length = 6 | INTval | INTecr |
+---------------+---------------+---------------+----------------
| HopID=IP.TTL | HopIDecr | SWLat (3B) ...
----------------+---------------+--------------------------------
... | SWLatEcr (3B) |
----------------+-----------------------------------------------+
+---------------+---------------+-------+-------+---------------+
| Kind = 0x72 | Length = 12 |TagFreq|LinkSpd| INTval |
+---------------+---------------+-------+-------+---------------+
| HopID=IP.TTL | HopLat (3B) |
+---------------+-------+-------+-------------------------------+
| INTEcr |LnkSEcr| HIDEcr| HopLatEcr (2B) |
----------------+-------+-------+-------------------------------+
```
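
A minimal sketch of decoding the 12-byte layout from the diagram, reading the buffer byte by byte so that no compiler-specific bitfield layout is involved. It assumes the 4-bit fields are packed upper nibble first (TagFreq and LnkSEcr in the upper nibble), as the diagram's bit order suggests. The struct and function names are hypothetical and are not taken from the TCP-INT sources.

```c
#include <stddef.h>
#include <stdint.h>

#define TCP_INT_OPT_KIND 0x72 /* Kind value from the diagram */
#define TCP_INT_OPT_LEN  12   /* Length value from the diagram */

/* Hypothetical host-side view of the option; not the project's own struct. */
struct tcp_int_opt_view {
    uint8_t  tagfreq;   /* 4 bits on the wire */
    uint8_t  linkspd;   /* 4 bits on the wire */
    uint8_t  intval;
    uint8_t  hopid;
    uint32_t hoplat;    /* 24 bits on the wire */
    uint8_t  intecr;
    uint8_t  lnksecr;   /* 4 bits on the wire */
    uint8_t  hidecr;    /* 4 bits on the wire */
    uint16_t hoplatecr; /* 16 bits on the wire */
};

/* Returns 0 on success, -1 if the buffer does not hold a TCP-INT option. */
static int tcp_int_opt_parse(const uint8_t *p, size_t len,
                             struct tcp_int_opt_view *out)
{
    if (len < TCP_INT_OPT_LEN || p[0] != TCP_INT_OPT_KIND ||
        p[1] != TCP_INT_OPT_LEN)
        return -1;

    out->tagfreq = p[2] >> 4;   /* assumption: upper nibble */
    out->linkspd = p[2] & 0x0f; /* assumption: lower nibble */
    out->intval = p[3];
    out->hopid = p[4];
    /* Multi-byte fields are read in network (big-endian) order. */
    out->hoplat = ((uint32_t)p[5] << 16) | ((uint32_t)p[6] << 8) | p[7];
    out->intecr = p[8];
    out->lnksecr = p[9] >> 4;
    out->hidecr = p[9] & 0x0f;
    out->hoplatecr = (uint16_t)(((uint16_t)p[10] << 8) | p[11]);
    return 0;
}
```

Parsing from a byte array keeps the sketch independent of host endianness; the real eBPF code may well use packed structs instead.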

### Workflow
@@ -35,7 +35,7 @@ the sender.
- Upon receiving a packet with a TCP-INT header option, the switch updates the fields:
- pkt.INTval = switch.INTval if switch.INTval > pkt.INTval
- pkt.HopID = IP.TTL if switch.INTval > pkt.INTval
- pkt.SWLat += latency through this switch
- pkt.HopLat += latency through this switch
- The eBPF receives the telemetry and (possibly) sends it to user-space for consumption.
- When the ACK is sent, the eBPF sets the ecr fields to the latest INT received on the send path.

@@ -144,6 +144,26 @@ cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
```

## Tagging policy
The TCP-INT eBPF inserts a TCP header option in outgoing packets. The switch can "tag" the packet by setting the INT fields in the header option. If the switch were to tag every packet, it could interfere with generic receive offload (GRO) in the Linux TCP stack, possibly hurting performance. To avoid this interference, the switch can tag *some* of the packets -- enough to provide fresh telemetry, without hurting performance.

### Tagging frequency
Instead of tagging every packet, the switch can tag packets at a given frequency. For example, if the tagging frequency is 16, one in every 16 packets is tagged on average. The switch implements this logic probabilistically, so it's possible -- but unlikely -- for two successive packets to be tagged; the specified tagging frequency holds only in *expectation*.

The sender (eBPF) controls the tagging frequency by setting the `tagfreqkey` field in the TCP header option. The switch uses `tagfreqkey` to decide whether to tag the packet.
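
As a rough illustration of probabilistic tagging, the sketch below tags a packet when a uniformly random draw is a multiple of the tagging frequency, which gives a tagging probability of 1/frequency. The README does not describe how `tagfreqkey` maps to a frequency or how the switch draws its random number, so the function takes both as plain arguments; the function name and the treatment of frequencies 0 and 1 as "tag every packet" are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether to tag one packet.
 * rnd:  a uniformly random value supplied by the caller (hypothetical;
 *       a real switch would use its own random source).
 * freq: the tagging frequency, e.g. 16 for "one in 16 on average". */
static bool tcp_int_should_tag(uint32_t rnd, uint32_t freq)
{
    if (freq <= 1)
        return true; /* assumption: 0 or 1 means tag every packet */
    return (rnd % freq) == 0;
}
```

Since each packet is decided independently, two successive packets can both be tagged, but over many packets the tagged fraction approaches 1/`freq`.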

### Default (static) tagging policy
By default, the tagging policy is to tag packets at a fixed tagging frequency. This tagging frequency is configured through the switch control plane. The sender requests the default tagging frequency by setting `tagfreqkey` to `TCP_INT_TAGFREQKEY_SWITCH_DEFAULT`.

### Dynamic tagging policy
Depending on network conditions and INT freshness requirements, the sender can dynamically change the tagging frequency. The current dynamic tagging policy identifies three states of a flow:

- `APPLIMITED`: the application isn't calling send() fast enough, so the in-flight packets are far below the CWND.
- `CONGESTED`: there's congestion in the network fabric.
- `UNCONGESTED`: there's no fabric congestion and the flow isn't application limited.

The thresholds for distinguishing these states, as well as the tagging frequency to use for each state, can be configured in `include/tcp_int_common.h`. The dynamic policy can be enabled through the `TCP_INT_ENABLE_DYNAMIC_TAGGING` define.
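
One way the three states could map to a `tagfreqkey` is sketched below. The constants are copied from `include/tcp_int_common.h` in this commit, but the classifier itself is hypothetical: the app-limited test (in-flight packets below half of CWND), the use of queue depth as the congestion signal, and the order of the checks are illustrative assumptions rather than the project's actual logic.

```c
#include <stdint.h>

/* Values copied from include/tcp_int_common.h in this commit. */
#define TCP_INT_CONG_QDEPTH_THRESH     10000
#define TCP_INT_TAGFREQKEY_APPLIMITED  0
#define TCP_INT_TAGFREQKEY_CONGESTED   4
#define TCP_INT_TAGFREQKEY_UNCONGESTED 7

/* Hypothetical classifier that picks the tagfreqkey for the next packet. */
static uint8_t tcp_int_pick_tagfreqkey(uint32_t packets_in_flight,
                                       uint32_t cwnd, uint32_t qdepth)
{
    /* Assumption: "far below the CWND" is taken to mean under half of it. */
    if (packets_in_flight < cwnd / 2)
        return TCP_INT_TAGFREQKEY_APPLIMITED;

    /* Assumption: echoed queue depth above the threshold means congestion. */
    if (qdepth > TCP_INT_CONG_QDEPTH_THRESH)
        return TCP_INT_TAGFREQKEY_CONGESTED;

    return TCP_INT_TAGFREQKEY_UNCONGESTED;
}
```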

### Loading TCP-INT and running test applications

#### 1. Load TCP-INT on the end-hosts
28 changes: 27 additions & 1 deletion changelog.md
@@ -1,4 +1,30 @@
# Changes to TCP-INT in version 0.1.0-alpha
# Changes to TCP-INT in version 0.3.0-alpha

## New features

* Dynamic tagging policy: the eBPF specifies the tagging frequency to get timely updates without hurting performance
* New header option fields: TagFreqKey and LinkSpeed
* Compact encoding of Ecr fields: HopLatEcr and IDEcr
* Performance evaluation scripts that configure the system and run workloads with various parameters
* Histograms of sent and received skb length
* Makefile configures the number of CPUs at compile time (instead of manually editing the header file)

## Optimizations

* Removed compiler warnings
* Avoid multiple eBPF map lookups by caching result

## Maintenance

* Renamed "SwitchID" to "HopID"
* Renamed "iratio" to "tagging frequency"

## Limitations

* The new 4-bit IDEcr encoding represents at most 15 hops on the path


# Changes to TCP-INT in version 0.2.0-alpha

## New features

12 changes: 6 additions & 6 deletions code/include/tcp_int_common.bpf.h
@@ -15,11 +15,11 @@

/* TCP INT state definition */
struct tcp_int_state {
bool pending_ecr; /* Indicates pending echo request */
tcp_int_val intvalecr; /* INT value to be echoed back (network order) */
tcp_int_id idecr; /* ID to be echoed back (network order) */
__u32 qdepth; /* Queue depth in data path */
tcp_int_lat swlatecr; /* Sum of switch latencies on data path */
bool pending_ecr; /* Indicates pending echo request */
tcp_int_val intvalecr; /* INT value to be echoed back (network order) */
tcp_int_id idecr; /* ID to be echoed back (network order) */
__u32 qdepth; /* Queue depth in data path */
tcp_int_latecr hoplatecr; /* Sum of hop latencies on data path */
};

/* Attaches INT state to socket */
@@ -39,4 +39,4 @@ static inline struct tcp_int_state *tcp_int_get_state(struct bpf_sock *sk)
BPF_SK_STORAGE_GET_F_CREATE);
}

#endif /* __TCP_INT_COMMON_BPF_H */
#endif /* __TCP_INT_COMMON_BPF_H */
57 changes: 50 additions & 7 deletions code/include/tcp_int_common.h
@@ -6,15 +6,53 @@

#include <tcp_int_opt.h>

#define TCP_INT_ENABLE_DYNAMIC_TAGGING 0
#define TCP_INT_CONG_QDEPTH_THRESH 10000
#define TCP_INT_TAGFREQKEY_SWITCH_DEFAULT 0xf
#define TCP_INT_TAGFREQKEY_APPLIMITED 0
#define TCP_INT_TAGFREQKEY_CONGESTED 4
#define TCP_INT_TAGFREQKEY_UNCONGESTED 7

#define TCP_INT_BYTES_IN_KBYTE 1024
#define TCP_INT_HIST_MAX_SLOTS 256
#define TCP_INT_MAX_PERID_HISTS (TCP_INT_TTL_INIT + 1)
#define TCP_INT_MAX_UTIL_PERCENT 100
#define TCP_INT_MAX_UTIL_SCALED 0x7f
#define TCP_INT_MIN_QDEPTH_SCALED 0x80
#define TCP_INT_MAX_CGROUP_PATH_LEN 128

#define tcp_int_swlat_to_us(x) ((x) * ((1 << TCP_INT_SWLAT_BITSHIFT) / 1000.0))
#define TCP_INT_SKBLEN_BITSHIFT 8

/* HopLat is the upper 24 bits of a 32-bit unsigned that represents the sum of
* hop latencies in ns. On the switch, HopLat is shifted up to perform
* saturating addition, and then shifted back down before sending it to the next
* hop/host.
*
* HopLatEcr is a 16-bit encoding (compression) of the 24-bit HopLat. If HopLat
* overflows 15 bits, the tcp_int_hoplat_to_hoplatecr() macro shifts HopLat down
* 8 bits and stores it in HopLatEcr with MSB set to 1, indicating that
* HopLatEcr contains the shifted HopLat.
*
* N.B. These macros assume host order. The caller should convert the argument
* to host order before using this macro.
*/
#define tcp_int_hoplatecr_to_ns(x) \
(((x)&0x8000) ? ((__u32)(x) << (TCP_INT_HLAT_BITSHIFT * 2)) \
: ((__u32)(x) << TCP_INT_HLAT_BITSHIFT))
#define tcp_int_hoplat_to_hoplatecr(x) \
(((x)&0xff8000) ? (((x) >> TCP_INT_HLAT_BITSHIFT) | 0x8000) : (x))
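
A small worked example of the encoding direction, using the two macros as they appear in this diff. The only change is that `__u32` is spelled `uint32_t` so the snippet builds outside the kernel headers; the wrapper function name is hypothetical. Values that fit in 15 bits pass through unchanged, while larger values lose their low 8 bits and gain the MSB flag.

```c
#include <stdint.h>

#define TCP_INT_HLAT_BITSHIFT 8

/* Macros copied from the diff, with __u32 replaced by uint32_t. */
#define tcp_int_hoplatecr_to_ns(x) \
    (((x)&0x8000) ? ((uint32_t)(x) << (TCP_INT_HLAT_BITSHIFT * 2)) \
                  : ((uint32_t)(x) << TCP_INT_HLAT_BITSHIFT))
#define tcp_int_hoplat_to_hoplatecr(x) \
    (((x)&0xff8000) ? (((x) >> TCP_INT_HLAT_BITSHIFT) | 0x8000) : (x))

/* Hypothetical wrapper: 24-bit HopLat (host order) -> 16-bit HopLatEcr. */
static uint16_t hoplat_compress(uint32_t hoplat)
{
    return (uint16_t)tcp_int_hoplat_to_hoplatecr(hoplat);
}

/* Worked values:
 *   0x001234 fits in 15 bits   -> stored as-is:            0x1234
 *   0x012345 overflows 15 bits -> (0x012345 >> 8) | 0x8000 = 0x8123
 * An unflagged HopLatEcr converts back to ns with a single shift:
 *   0x1234 << 8 = 0x123400 ns
 */
```

Dropping the low 8 bits trades precision for range: a flagged value is only accurate to 256 HopLat units.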

/* Id is an 8-bit field that identifies the most congested hop. On the switch, Id
* is set to the packet's current TTL value. Thus, the Id decreases as the
* packet traverses the hops.
*
* IdEcr is a 4-bit field that also identifies the congested hop, but in
* ascending order, starting from 1 for the first hop. 0 indicates uninitialized
* Ecr data.
*
* N.B. Because IdEcr is 4 bits (and 0 indicates uninitialized), it cannot be
* used for paths longer than 15 hops.
*/
#define tcp_int_id_to_idecr(x) (TCP_INT_TTL_INIT - (x) + 1)
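
To make the TTL-to-hop-index conversion concrete, the sketch below applies the macro with `TCP_INT_TTL_INIT` set to 64, the value in `tcp_int_opt.h`. The range-checked helper is hypothetical, and returning 0 (the "uninitialized" value) for hops that do not fit in 4 bits is an assumption about how a caller might handle the 15-hop limit.

```c
#include <stdint.h>

#define TCP_INT_TTL_INIT 64 /* value from tcp_int_opt.h */
#define tcp_int_id_to_idecr(x) (TCP_INT_TTL_INIT - (x) + 1)

/* Hypothetical helper: returns the 4-bit IdEcr for a given Id (TTL), or 0
 * when the hop index falls outside 1..15 and cannot be represented. */
static uint8_t id_to_idecr_checked(uint8_t id)
{
    int idecr = tcp_int_id_to_idecr(id);
    return (idecr >= 1 && idecr <= 15) ? (uint8_t)idecr : 0;
}

/* Worked values:
 *   Id = 64 (first hop)      -> IdEcr = 64 - 64 + 1 = 1
 *   Id = 63 (second hop)     -> IdEcr = 2
 *   Id = 50 (fifteenth hop)  -> IdEcr = 15, the largest 4-bit value
 *   Id = 49 (sixteenth hop)  -> macro yields 16, which needs 5 bits
 */
```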

struct tcp_int_event {
__u64 ts_us;
@@ -38,19 +76,24 @@ struct tcp_int_event {
__u32 mss;
__u32 lost_out;
tcp_int_val intval;
tcp_int_id sid;
__u32 swlat;
__u32 return_swlat;
tcp_int_id hid;
__u32 hoplat;
__u32 return_hoplat;
__u32 total_retrans;
__u32 segs_out;
__u64 bytes_acked;
} __attribute__((packed));
;

enum tcp_int_hist_type {
TCP_INT_HIST_TYPE_SRTT = 0,
TCP_INT_HIST_TYPE_CWND,
TCP_INT_HIST_TYPE_SID,
TCP_INT_HIST_TYPE_HID,
TCP_INT_HIST_TYPE_UTIL,
TCP_INT_HIST_TYPE_QDEPTH,
TCP_INT_HIST_TYPE_SWLAT,
TCP_INT_HIST_TYPE_HLAT,
TCP_INT_HIST_TYPE_RXSKBLEN,
TCP_INT_HIST_TYPE_TXSKBLEN,
TCP_INT_HIST_TYPE_MAX
};

4 changes: 3 additions & 1 deletion code/include/tcp_int_opt.h
@@ -12,6 +12,7 @@ struct uint24 {
typedef __u8 tcp_int_val;
typedef __u8 tcp_int_id;
typedef struct uint24 tcp_int_lat;
typedef __u16 tcp_int_latecr;

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define be24tohl(x) (bpf_ntohl((x) << 8))
@@ -21,9 +22,10 @@ typedef struct uint24 tcp_int_lat;

#define TCP_INT_UTIL_BITSHIFT 3
#define TCP_INT_QDEPTH_BITSHIFT 13
#define TCP_INT_SWLAT_BITSHIFT 8
#define TCP_INT_HLAT_BITSHIFT 8
#define TCP_INT_MAX_UTIL_SCALED 0x7f
#define TCP_INT_MIN_QDEPTH_SCALED 0x80
#define TCP_INT_TTL_INIT 64
#define TCP_INT_MAX_SKBLEN 65536

#endif /* __TCP_INT_OPT_H */
2 changes: 1 addition & 1 deletion code/scripts/clang_style_format.sh
@@ -1,6 +1,6 @@
#!/bin/bash

# Copyright 2021-2022 Intel Corporation
# Copyright 2022 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Automatically reindent all .c and .h source files in this project.
79 changes: 79 additions & 0 deletions code/scripts/collect_client_stats.sh
@@ -0,0 +1,79 @@
#!/bin/bash

# Copyright 2022 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

show_usage() {
1>&2 echo "usage: `basename $0` <interface> <time_in_seconds> <server_ip> <directory> <tcpdump_yes | tcpdump_no> [ extra args added to iperf command ]"
}

if [ $# -lt 5 ]
then
show_usage
exit 1
fi

DEV=$1
shift
TIME=$1
shift
DEST_IP=$1
shift
DIR=$1
shift
TCPDUMP=$1
shift

ss_pid=0
hist_pid=0

mkdir -p $DIR

cp /dev/null $DIR/ifconfig_client.log

date >> $DIR/ifconfig_client.log

echo "Running ifconfig"
ifconfig $DEV >> $DIR/ifconfig_client.log
bpftool prog show > $DIR/start_stats_client.txt

echo "Start top command to run every 2 seconds"
top -b -d 2 > $DIR/top_client.log &

if [[ $TCPDUMP == tcpdump_yes ]]; then
tcpdump -i $DEV -s 128 -w $DIR/results_client.pcap &
fi

echo "Collect ss stats"
/opt/tcp-int/scripts/run_ss.sh $DEST_IP > $DIR/ss_results_client.txt &
ss_pid=$!

echo "Start collecting tx histograms"
/usr/local/lib/bpf/tcp-int/tcp_int hist-txpktlen > $DIR/tx_hist_client.log &
hist_pid=$!

/opt/tcp-int/scripts/tcp-int-run numactl -N netdev:$DEV -m netdev:$DEV iperf -c $DEST_IP -p 5001 -N -i 5 -e -t $TIME -P 1 -Z dctcp $* > $DIR/iperf_client.log

bpftool prog show > $DIR/end_stats_client.txt
date >> $DIR/ifconfig_client.log
ifconfig $DEV >> $DIR/ifconfig_client.log

pkill top

if [[ $TCPDUMP == tcpdump_yes ]]; then
pkill tcpdump
fi

if [[ ${hist_pid} -ne 0 ]]; then
kill -s SIGINT ${hist_pid}
hist_pid=0
fi

if [[ ${ss_pid} -ne 0 ]]; then
kill -9 ${ss_pid}
ss_pid=0
fi

echo "Perf test finished"

exit 0