The following kernels and distros have been tested:
- Ubuntu 18.04 LTS (Kernel 4.15.0-20-generic)
- Ubuntu 22.04 LTS (Kernel 5.15.0-72-generic)
- Ubuntu 22.04 LTS (Kernel 5.15.0-100-generic)
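You can check which kernel you are running before building, for example:

```bash
uname -r
```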
- Prepare: clone these repos to your home directory (or anywhere you like) and install the libaio dependency.
```bash
git clone [email protected]:RC4ML/qdma_driver.git
git clone --recursive [email protected]:RC4ML/rc4ml_qdma.git
sudo apt-get install libaio1 libaio-dev
```
- Compile:
```bash
cd ~/qdma_driver
make modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
make apps modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
```
- Install apps and header files:
```bash
sudo make install-apps modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
```
- Install the kernel module:
```bash
sudo insmod src/qdma-pf.ko
```
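To confirm the module actually loaded, you can check the module list and the kernel log, for example:

```bash
lsmod | grep qdma
dmesg | tail
```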
If the kernel module fails to load with an "Invalid module format" error, consider reinstalling your Linux kernel headers with the following commands:
```bash
sudo apt update && sudo apt upgrade
sudo apt remove --purge linux-headers-*
sudo apt autoremove && sudo apt autoclean
sudo apt install linux-headers-generic
```
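After the headers are updated (or after any kernel upgrade), rebuild the driver against the new headers before loading it again; a minimal sketch, assuming the driver Makefile provides the usual clean target:

```bash
cd ~/qdma_driver
make clean
make modulesymfile=/usr/src/linux-headers-$(uname -r)/Module.symvers
sudo insmod src/qdma-pf.ko
```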
Note: `nvcc` is required in the system PATH.
Important: if `nvcc` is not in the system PATH, do not install it from apt directly. Instead, do as follows:
- Run the following commands:
```bash
cd /usr/local
ls
```
- You should see folders like `cuda-XX.X`, where `XX.X` is the version of the CUDA toolkit.
- Run `nvidia-smi` to see the driver's CUDA version for the GPU.
- Choose a suitable CUDA toolkit version. The CUDA versions of the toolkit and the driver do not have to be identical.
- Run the following commands, replacing `XX.X` with the toolkit version you chose:
```bash
export PATH=/usr/local/cuda-XX.X/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-XX.X/lib64:$LD_LIBRARY_PATH
```
- Run `nvcc -V`; if it prints the version of the CUDA toolkit, you can go on to the next step.
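These exports only apply to the current shell. If you want them to persist across sessions, one option is to append them to your `~/.bashrc` (again replacing `XX.X` with your toolkit version):

```bash
echo 'export PATH=/usr/local/cuda-XX.X/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-XX.X/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```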
Then build and install rc4ml_qdma:
```bash
cd ~/rc4ml_qdma
mkdir build
cd build
cmake ..
sudo make install
```
There are five binary files you can use:
Corresponds to `QDMATop.scala`.
- `h2c_benchmark()` will test the host-to-card channel
- `c2h_benchmark()` will test the card-to-host channel
- `benchmark_bridge_write()` will test the AXI bridge channel
Corresponds to `QDMARandomTop.scala`, which benchmarks random access performance (1 GB memory).
- `h2c_benchmark_random()` will test the host-to-card channel
- `c2h_benchmark_random()` will test the card-to-host channel
- `concurrent_random()` will test concurrent performance. If you want single-direction performance (such as H2C), set the other direction's factor (`c2h_factor`) to 2 so that C2H is always running while H2C is measured.
Corresponds to `QDMALatencyTop.scala`, which benchmarks the DMA channel's host-to-card and card-to-host latency (1 GB memory).
- `h2c_benchmark_latency()` will test the host-to-card channel
- `c2h_benchmark_latency()` will test the card-to-host channel
- `concurrent_latency()` will test concurrent performance; this is not fully implemented, and latency increases by around 1 us when fully loaded.
Corresponds to `AXILBenchmarkTop.scala`, which benchmarks the AXIL read latency under various workloads.
- `startFpgaH2C()` will initialize the host-to-card channel with a simple throughput benchmark
- `startFpgaC2H()` will initialize the card-to-host channel with a simple throughput benchmark
- `axilReadBenchmark()` will test the AXI-Lite read latency
This benchmarks MMIO performance. Any bitstream with the QDMABlackBox IP is suitable for it.
- `benchmark_bridge_write()` will test MMIO write performance in either UC mode or WC mode. Note that the mode is set by adding PAT records on x86 platforms, and the Linux kernel does not provide an interface to remove PAT records, so a reboot is required after switching modes.
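If you want to inspect which PAT entries are currently installed (to check which mode the BAR is mapped in), one place to look, assuming debugfs is mounted and your kernel exposes it, is the x86 PAT memtype list:

```bash
sudo cat /sys/kernel/debug/x86/pat_memtype_list
```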
- **Attention!** Before you run these binaries, you must program the FPGA and reboot the host. After every reboot you need to redo the insmod step (i.e., `sudo insmod src/qdma-pf.ko`), and the following instructions need to be executed before you run the binaries.
```bash
sudo su
echo 1024 > /sys/bus/pci/devices/0000:1a:00.0/qdma/qmax
dma-ctl qdma1a000 q add idx 0 mode st dir bi
dma-ctl qdma1a000 q start idx 0 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 1 mode st dir bi
dma-ctl qdma1a000 q start idx 1 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 2 mode st dir bi
dma-ctl qdma1a000 q start idx 2 dir bi desc_bypass_en pfetch_bypass_en
dma-ctl qdma1a000 q add idx 3 mode st dir bi
dma-ctl qdma1a000 q start idx 3 dir bi desc_bypass_en pfetch_bypass_en
```
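Since the four queues are configured identically, the q add / q start steps can also be scripted; a minimal sketch using the same device name (`qdma1a000`) and flags as above:

```bash
#!/bin/bash
# Assumes the qdma-pf module is loaded and qmax has already been set as shown above.
DEV=qdma1a000
for idx in 0 1 2 3; do
    dma-ctl $DEV q add idx $idx mode st dir bi
    dma-ctl $DEV q start idx $idx dir bi desc_bypass_en pfetch_bypass_en
done
```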
- **Attention!** If you hit errors like "bash: echo: write error: Invalid argument" while executing these instructions, check whether the same instruction is already present in /etc/rc.local and has been auto-run on boot.
- **Attention!** For other errors, see [here](https://www.notion.so/rc4mlzju/QDMA-d0778b6595e440ae9c87ed7bc76873b3).
Run the binaries that correspond to the bitstream programmed in the FPGA.
There are some useful commands (provided by the Xilinx QDMA Linux Kernel Driver) in cmd.txt.
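For example, to check that the device and its queues are visible to the driver, dma-ctl can list them (the exact syntax may vary by driver version; see cmd.txt):

```bash
dma-ctl dev list
dma-ctl qdma1a000 q list 0 4
```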
Testbed: amax2, U280 board.
This benchmarks the AXI-Lite read latency while the DMA channel is busy.
| AXI-Lite read (yes/no) | AXIL latency (us) | QDMA read (GB/s) | QDMA write (GB/s) |
|---|---|---|---|
| no | / | 12.79 | 12.99 |
| yes | 0.88 | 0 | 0 |
| yes | 0.95 | 12.79 | 0 |
| yes | 1.47 | 0 | 12.98 |
| yes | 2.95 | 4.8 | 12.9 |

(512 * 4's average)
Host memory size = 1 GB, total cmds = 256*1024. This benchmarks the random access throughput; the tables below report throughput in GB/s for different numbers of queues (Qs).
| Packet size (Bytes) | Qs = 1 | 2 | 4 | 8 | 16 | Note |
|---|---|---|---|---|---|---|
| 64 | 2.07 | 1.97 | 1.96 | 2.05 | 2.03 | OPS ~ 32M |
| 128 | 3.93 | 3.96 | 3.98 | 3.79 | 3.98 | OPS ~ 32M |
| 256 | 7.24 | 7.27 | 7.85 | 7.52 | 7.58 | OPS ~ 29M |
| Packet size (Bytes) | Qs = 1 | 2 | 4 | 8 | Note |
|---|---|---|---|---|---|
| 64 | 4.97 | 4.97 | 4.97 | 4.97 | OPS ~ 80M |
| 128 | 9.85 | 9.93 | 9.93 | 9.93 | OPS ~ 79M |
| 256 | 11.92 | 11.92 | 11.92 | 11.92 | OPS ~ 48M |
| Packet size (Bytes) | Qs=1 H2C | Qs=1 C2H | Qs=2 H2C | Qs=2 C2H | Qs=4 H2C | Qs=4 C2H | Qs=8 H2C | Qs=8 C2H | Note |
|---|---|---|---|---|---|---|---|---|---|
| 64 | 1.64 | 1.83 | 1.77 | 1.83 | 1.79 | 1.8 | 1.64 | 1.84 | OPS ~ 29M |
| 128 | 3.42 | 3.52 | 3.31 | 3.35 | 3.36 | 3.44 | 3.52 | 3.5 | OPS ~ 28M |
| 256 | 6.56 | 6.63 | 6.37 | 6.45 | 6.64 | 6.4 | 6.57 | 6.47 | OPS ~ 26M |
| 512 | 6.66 | 12.35 | x | x | x | x | x | x | x |
| 1024 | 10.68 | 12.12 | x | x | x | x | x | x | x |
| 2048 | 10.88 | x | x | x | x | x | x | x | x |
| 4096 | 11.39 | x | x | x | x | x | x | x | x |
Host memory = 1 GB, total cmds = 256*1024. This benchmarks the DMA read/write latency.
`Wait cycles` is the minimum interval between issuing two cmds, so it caps the maximum OPS.
`Latency CMD` is measured from when the cmd is issued until the AXI bridge returns.
`Latency DATA` is measured from when the last beat of data is issued until the AXI bridge returns.
*This latency can sometimes reach thousands of us, because write latency uses the bridge channel for the reply; a single thread can issue around 8M bridge writes per second, and when OPS exceeds this the latency increases a lot.
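For reference, the OPS limit column follows directly from the wait cycles at the 250 MHz user clock: for example, `Wait cycles = 50` allows one cmd every 50 / 250 MHz = 200 ns, i.e. an OPS limit of about 5M.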
| Packet Size | Wait cycles | OPS limit | Throughput (Mops) | Throughput (GB/s) | Latency (us) |
|---|---|---|---|---|---|
| 64B | 50 | 5M | 4.6 | 0.3 | 1.0 |
| | 25 | 10M | 8.8 | 0.6 | 1.1 |
| | 12 | 20M | 17.0 | 1.1 | 1.0 |
| | 6 | 40M | 29.8 | 1.9 | 1.0 |
| | 0 | N/A | 36.2 | 2.3 | 2.5 |
| 4KB | 100 | 2.5M | 2.3 | 9.3 | 1.4 |
| | 90 | 2.8M | 2.6 | 10.4 | 1.5 |
| | 85 | 2.9M | 2.7 | 11.0 | 1.5 |
| | 80 | 3.1M | 2.9 | 11.6 | 1.6 |
| | 75 | 3.3M | 3.1 | 12.4 | 1.7 |
| | 70 | 3.6M | 3.2 | 12.8 | 11.2 |
| | 50 | 5M | 3.2 | 12.8 | 11.3 |
| Packet Size | Wait cycles | OPS limit | Throughput (Mops) | Throughput (GB/s) | Latency CMD (us) | Latency DATA (us) |
|---|---|---|---|---|---|---|
| 64B | 50 | 5M | 4.6 | 0.29 | 1.3 | 1.3 |
| | 25 | 10M | 8.8 | 0.55 | 1.7* | 1.7* |
| 4KB | 100 | 2.5M | 2.3 | 9.35 | 1.4 | 1.2 |
| | 90 | 2.8M | 2.6 | 10.37 | 1.3 | 1 |
| | 85 | 2.9M | 2.7 | 10.96 | 1.5 | 1.2 |
| | 80 | 3.1M | 2.9 | 11.63 | 1.6 | 1.3 |
| | 75 | 3.3M | 3.1 | 12.39 | 1.6 | 1.3 |
| | 70 | 3.6M | 3.2 | 12.8 | 6.8 | 3.8 |
| | 50 | 5M | 3.2 | 12.82 | 6.6 | 3.6 |
- When running with more than 8 Qs, it always fails: the QDMA C2H data port's ready signal goes low after receiving several beats of data.
- Even with 8 or fewer Qs it can sometimes fail; try reprogramming the FPGA. In my experience, a single Q has the best chance of passing.
- Tested situation: packet size 1K / 32K (the latter is split into multiple packets).
- I have tried fetching a tag index for each queue using dma-ctl; it did not help.
- All statistics are calculated assuming a 250 MHz user clock (so if your speed tops out at 10.6 GB/s, you have probably built the design with a 300 MHz user clock).
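For example, if the design actually runs at a 300 MHz user clock while the statistics still assume 250 MHz, the reported throughput is scaled by 250/300, so a reported 10.6 GB/s corresponds to roughly 10.6 * 300 / 250 ≈ 12.8 GB/s of real throughput.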