The following kernels and distros have been tested:
- Ubuntu 18.04 LTS (Kernel 4.15.0-20-generic)
- Ubuntu 22.04 LTS (Kernel 5.15.0-72-generic)
- Ubuntu 22.04 LTS (Kernel 5.15.0-100-generic)
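You can check which kernel you are running before building, for example:

```bash
uname -r
```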
- Prepare: clone these repos to your home directory (or anywhere you like) and install the libaio dependency.
```bash
git clone [email protected]:RC4ML/qdma_driver.git
git clone --recursive [email protected]:RC4ML/rc4ml_qdma.git
sudo apt-get install libaio1 libaio-dev
```
- Compile:
```bash
cd ~/qdma_driver
make modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
make apps modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
```
- Install apps and header files:
```bash
sudo make install-apps modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
```
- Install the kernel module:
```bash
sudo insmod src/qdma-pf.ko
```
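To confirm the module actually loaded, you can check the module list and the kernel log, for example:

```bash
lsmod | grep qdma
dmesg | tail
```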
If the kernel module fails to load with an "Invalid module format" error, consider reinstalling your Linux kernel headers with the following commands:
```bash
sudo apt update && sudo apt upgrade
sudo apt remove --purge linux-headers-*
sudo apt autoremove && sudo apt autoclean
sudo apt install linux-headers-generic
```
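After the headers are updated (or after any kernel upgrade), rebuild the driver against the new headers before loading it again; a minimal sketch, assuming the driver Makefile provides the usual clean target:

```bash
cd ~/qdma_driver
make clean
make modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
sudo insmod src/qdma-pf.ko
```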
Note: `nvcc` is required in the system PATH.
Important: if `nvcc` is not in the system PATH, do not install it from apt directly. Instead, do as follows:
- Run the following commands:
```bash
cd /usr/local
ls
```
- You should see folders like `cuda-XX.X`, where `XX.X` is the version of the CUDA toolkit.
- Run `nvidia-smi` to see the driver's CUDA version for the GPU.
- Choose a suitable CUDA toolkit version. The CUDA versions of the toolkit and the driver do not have to be identical.
- Run the following commands, replacing `XX.X` with the toolkit version you chose:
```bash
export PATH=/usr/local/cuda-XX.X/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-XX.X/lib64:$LD_LIBRARY_PATH
```
- Run `nvcc -V`; if it prints the version of the CUDA toolkit, you can go on to the next step.
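These exports only apply to the current shell. If you want them to persist across sessions, one option is to append them to your `~/.bashrc` (again replacing `XX.X` with your toolkit version):

```bash
echo 'export PATH=/usr/local/cuda-XX.X/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-XX.X/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```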
Then build and install rc4ml_qdma:
```bash
cd ~/rc4ml_qdma
mkdir build
cd build
cmake ..
sudo make install
```
There are five binary files you can use:
Corresponds to `QDMATop.scala`.
- `h2c_benchmark()` will test the host-to-card channel
- `c2h_benchmark()` will test the card-to-host channel
- `benchmark_bridge_write()` will test the AXI bridge channel
Corresponds to `QDMARandomTop.scala`, which benchmarks random access performance (1 GB memory).
- `h2c_benchmark_random()` will test the host-to-card channel
- `c2h_benchmark_random()` will test the card-to-host channel
- `concurrent_random()` will test concurrent performance. If you want single-direction performance (such as H2C), set the other direction's factor (`c2h_factor`) to 2 so that C2H is always running while H2C is measured.
Corresponds to `QDMALatencyTop.scala`, which benchmarks the DMA channel's host-to-card and card-to-host latency (1 GB memory).
- `h2c_benchmark_latency()` will test the host-to-card channel
- `c2h_benchmark_latency()` will test the card-to-host channel
- `concurrent_latency()` will test concurrent performance; this is not fully implemented, and latency increases by around 1 us when fully loaded.
Corresponds to `AXILBenchmarkTop.scala`, which benchmarks the AXIL read latency under various workloads.
- `startFpgaH2C()` will initialize the host-to-card channel with a simple throughput benchmark
- `startFpgaC2H()` will initialize the card-to-host channel with a simple throughput benchmark
- `axilReadBenchmark()` will test the AXI-Lite read latency
This benchmarks MMIO performance. Any bitstream with the QDMABlackBox IP is suitable for it.
- `benchmark_bridge_write()` will test MMIO write performance in either UC mode or WC mode. Note that the mode is set by adding PAT records on x86 platforms, and the Linux kernel does not provide an interface to remove PAT records, so a reboot is required after switching modes.
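If you want to inspect which PAT entries are currently installed (to check which mode the BAR is mapped in), one place to look, assuming debugfs is mounted and your kernel exposes it, is the x86 PAT memtype list:

```bash
sudo cat /sys/kernel/debug/x86/pat_memtype_list
```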
- **Attention!** Before you run these binaries, you must program the FPGA and reboot the host. After every reboot you need to redo the insmod step (i.e., `sudo insmod src/qdma-pf.ko`), and the following instructions need to be executed before you run the binaries.
```bash
sudo su
echo 1024 > /sys/bus/pci/devices/0000:1a:00.0/qdma/qmax
dma-ctl qdma1a000 q add idx 0 mode st dir bi
dma-ctl qdma1a000 q start idx 0 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 1 mode st dir bi
dma-ctl qdma1a000 q start idx 1 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 2 mode st dir bi
dma-ctl qdma1a000 q start idx 2 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 3 mode st dir bi
dma-ctl qdma1a000 q start idx 3 dir bi desc_bypass_en pfetch_bypass_en
```
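Since the four queues are configured identically, the q add / q start steps can also be scripted; a minimal sketch using the same device name (`qdma1a000`) and flags as above:

```bash
#!/bin/bash
# Assumes the qdma-pf module is loaded and qmax has already been set as shown above.
DEV=qdma1a000
for idx in 0 1 2 3; do
    dma-ctl $DEV q add idx $idx mode st dir bi
    dma-ctl $DEV q start idx $idx dir bi desc_bypass_en pfetch_bypass_en
done
```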
- **Attention!** If you hit errors like "bash: echo: write error: Invalid argument" while executing these instructions, check whether the same instruction is already present in /etc/rc.local and has been auto-run on boot.
- **Attention!** For other errors, see [here](https://www.notion.so/rc4mlzju/QDMA-d0778b6595e440ae9c87ed7bc76873b3).
Run the binaries that correspond to the bitstream programmed in the FPGA.
There are some useful commands (provided by the Xilinx QDMA Linux Kernel Driver) in cmd.txt.
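For example, to check that the device and its queues are visible to the driver, dma-ctl can list them (the exact syntax may vary by driver version; see cmd.txt):

```bash
dma-ctl dev list
dma-ctl qdma1a000 q list 0 4
```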
Testbed: amax2, U280 board.
This benchmarks the AXI-Lite read latency while the DMA channel is busy.
| AXI-Lite read (yes/no) | AXIL latency (us) | QDMA read (GB/s) | QDMA write (GB/s) |
|---|---|---|---|
| no | / | 12.79 | 12.99 |
| yes | 0.88 | 0 | 0 |
| yes | 0.95 | 12.79 | 0 |
| yes | 1.47 | 0 | 12.98 |
| yes | 2.95 | 4.8 | 12.9 |

(512 * 4's average)
Host memory size = 1 GB, total cmds = 256*1024. This benchmarks the random access throughput; the tables below report throughput in GB/s for different numbers of queues (Qs).
| Packet size (Bytes) | Qs = 1 | 2 | 4 | 8 | 16 | Note |
|---|---|---|---|---|---|---|
| 64 | 2.07 | 1.97 | 1.96 | 2.05 | 2.03 | OPS ~ 32M |
| 128 | 3.93 | 3.96 | 3.98 | 3.79 | 3.98 | OPS ~ 32M |
| 256 | 7.24 | 7.27 | 7.85 | 7.52 | 7.58 | OPS ~ 29M |
| Packet size (Bytes) | Qs = 1 | 2 | 4 | 8 | Note |
|---|---|---|---|---|---|
| 64 | 4.97 | 4.97 | 4.97 | 4.97 | OPS ~ 80M |
| 128 | 9.85 | 9.93 | 9.93 | 9.93 | OPS ~ 79M |
| 256 | 11.92 | 11.92 | 11.92 | 11.92 | OPS ~ 48M |
| Packet size (Bytes) | Qs=1 H2C | Qs=1 C2H | Qs=2 H2C | Qs=2 C2H | Qs=4 H2C | Qs=4 C2H | Qs=8 H2C | Qs=8 C2H | Note |
|---|---|---|---|---|---|---|---|---|---|
| 64 | 1.64 | 1.83 | 1.77 | 1.83 | 1.79 | 1.8 | 1.64 | 1.84 | OPS ~ 29M |
| 128 | 3.42 | 3.52 | 3.31 | 3.35 | 3.36 | 3.44 | 3.52 | 3.5 | OPS ~ 28M |
| 256 | 6.56 | 6.63 | 6.37 | 6.45 | 6.64 | 6.4 | 6.57 | 6.47 | OPS ~ 26M |
| 512 | 6.66 | 12.35 | x | x | x | x | x | x | x |
| 1024 | 10.68 | 12.12 | x | x | x | x | x | x | x |
| 2048 | 10.88 | x | x | x | x | x | x | x | x |
| 4096 | 11.39 | x | x | x | x | x | x | x | x |
Host memory = 1 GB, total cmds = 256*1024. This benchmarks the DMA read/write latency.
`Wait cycles` is the minimum interval between issuing two cmds, so it caps the maximum OPS.
`Latency CMD` is measured from when the cmd is issued until the AXI bridge returns.
`Latency DATA` is measured from when the last beat of data is issued until the AXI bridge returns.
*This latency can sometimes reach thousands of us, because write latency uses the bridge channel for the reply; a single thread can issue around 8M bridge writes per second, and when OPS exceeds this the latency increases a lot.
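For reference, the OPS limit column follows directly from the wait cycles at the 250 MHz user clock: for example, `Wait cycles = 50` allows one cmd every 50 / 250 MHz = 200 ns, i.e. an OPS limit of about 5M.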
| Packet Size | Wait cycles | OPS limit | Throughput (Mops) | Throughput (GB/s) | Latency (us) |
|---|---|---|---|---|---|
| 64B | 50 | 5M | 4.6 | 0.3 | 1.0 |
| | 25 | 10M | 8.8 | 0.6 | 1.1 |
| | 12 | 20M | 17.0 | 1.1 | 1.0 |
| | 6 | 40M | 29.8 | 1.9 | 1.0 |
| | 0 | N/A | 36.2 | 2.3 | 2.5 |
| 4KB | 100 | 2.5M | 2.3 | 9.3 | 1.4 |
| | 90 | 2.8M | 2.6 | 10.4 | 1.5 |
| | 85 | 2.9M | 2.7 | 11.0 | 1.5 |
| | 80 | 3.1M | 2.9 | 11.6 | 1.6 |
| | 75 | 3.3M | 3.1 | 12.4 | 1.7 |
| | 70 | 3.6M | 3.2 | 12.8 | 11.2 |
| | 50 | 5M | 3.2 | 12.8 | 11.3 |
| Packet Size | Wait cycles | OPS limit | Throughput (Mops) | Throughput (GB/s) | Latency CMD (us) | Latency DATA (us) |
|---|---|---|---|---|---|---|
| 64B | 50 | 5M | 4.6 | 0.29 | 1.3 | 1.3 |
| | 25 | 10M | 8.8 | 0.55 | 1.7* | 1.7* |
| 4KB | 100 | 2.5M | 2.3 | 9.35 | 1.4 | 1.2 |
| | 90 | 2.8M | 2.6 | 10.37 | 1.3 | 1 |
| | 85 | 2.9M | 2.7 | 10.96 | 1.5 | 1.2 |
| | 80 | 3.1M | 2.9 | 11.63 | 1.6 | 1.3 |
| | 75 | 3.3M | 3.1 | 12.39 | 1.6 | 1.3 |
| | 70 | 3.6M | 3.2 | 12.8 | 6.8 | 3.8 |
| | 50 | 5M | 3.2 | 12.82 | 6.6 | 3.6 |
- When running with more than 8 Qs, it always fails: the QDMA C2H data port's ready signal goes low after receiving several beats of data.
- Even with 8 or fewer Qs it can sometimes fail; try reprogramming the FPGA. In my experience, a single Q has the best chance of passing.
- Tested situation: packet size 1K / 32K (the latter is split into multiple packets).
- I have tried fetching a tag index for each queue using dma-ctl; it did not help.
- All statistics are calculated assuming a 250 MHz user clock (so if your speed tops out at 10.6 GB/s, you have probably built the design with a 300 MHz user clock).
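For example, if the design actually runs at a 300 MHz user clock while the statistics still assume 250 MHz, the reported throughput is scaled by 250/300, so a reported 10.6 GB/s corresponds to roughly 10.6 * 300 / 250 ≈ 12.8 GB/s of real throughput.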