[toc]
This sample demonstrates how to enable Multi-Process Service (MPS) on Drive Orin to allow concurrent execution of CUDA kernels from different CUDA contexts on the Orin iGPU.
Reference: https://www.nvidia.cn/content/dam/en-zz/zh_cn/assets/webinars/31oct2019c/20191031_MPS_davidwu.pdf
This MPS sample is built on top of an application that already supports running the following tasks in parallel within one CUDA context:
- Multiple TensorRT GPU inference instances
- Multiple TensorRT DLA inference instances
- Multiple CUDA kernels
The package includes two "*.diff" patch files:
- `add_client_server.diff`: adds two applications, mps_server and mps_client, without MPS enabled
- `add_mps_support.diff`: adds the MPS-related changes that enable MPS for both mps_server and mps_client. Refer to this patch for the changes required to enable MPS.
- MPS on the primary CUDA context
  - At the beginning of the process, call `etiEnableSharedPrimaryCtx` to enable the primary context as the MPS shared context:
    `etblSharedCtxFunc->etiEnableSharedPrimaryCtx(&shareKey, 0);`
  - Then call a CUDA runtime API such as `cudaSetDeviceFlags` to create the primary CUDA context. In this sample, in `mps_utils.h`, we call:
    `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);`
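A minimal sketch of this startup order. The proprietary `etiEnableSharedPrimaryCtx` call is shown as a comment because its export-table types come from the Drive OS MPS headers (see `mps_utils.h` in this sample); the rest is standard CUDA runtime code.

```cpp
#include <cuda_runtime.h>

int main() {
    // 1. Enable the primary context as the MPS shared context *before* any
    //    CUDA runtime call creates it. In this sample (see mps_utils.h):
    //        etblSharedCtxFunc->etiEnableSharedPrimaryCtx(&shareKey, 0);

    // 2. The first CUDA runtime API call now creates the primary context
    //    with MPS enabled; the sample uses cudaSetDeviceFlags:
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // 3. Streams, TensorRT engines, and kernels created from here on run in
    //    the MPS-shared primary context.
    return 0;
}
```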
Build the sample:
$ mkdir build && cd build
$ cmake .. && make
Generate TensorRT engines with trtexec:
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --int8 --saveEngine=ResNet50GPU.engine
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --useDLACore=0 --fp16 --saveEngine=ResNet50DLA.engine --allowGPUFallback
The sample supports running with or without MPS, allowing users to compare performance.
With MPS:
Terminal#1:
$ ./build/mps_server --mps
Terminal#2, #3, and more:
$ ./build/mps_client --mps --GPU=ResNet50GPU.engine,ResNet50GPU.engine --DLA_0=ResNet50DLA.engine --DLA_1=ResNet50DLA.engine --custom=CudaKernelTask
$ ./build/mps_client --mps --GPU=ResNet50GPU.engine,ResNet50GPU.engine --DLA_0=ResNet50DLA.engine --DLA_1=ResNet50DLA.engine --custom=CudaKernelTask
...
Performance data will be aggregated and displayed in Terminal#1
Without MPS:
Terminal#1:
$ ./build/mps_server
Terminal#2, #3, and more:
$ ./build/mps_client --GPU=ResNet50GPU.engine,ResNet50GPU.engine --DLA_0=ResNet50DLA.engine --DLA_1=ResNet50DLA.engine --custom=CudaKernelTask
$ ./build/mps_client --GPU=ResNet50GPU.engine,ResNet50GPU.engine --DLA_0=ResNet50DLA.engine --DLA_1=ResNet50DLA.engine --custom=CudaKernelTask
Performance data will be aggregated and displayed in Terminal#1.
- The app also allows you to:
  - Benchmark TensorRT GPU and TensorRT DLA inference with multiple CUDA streams in one process (one CUDA context); see the multi-stream sketch after this list
  - Benchmark TensorRT GPU and TensorRT DLA inference in multiple processes without MPS
  - Benchmark TensorRT GPU and TensorRT DLA inference in multiple processes with MPS
- With this app, you can simulate arranging deep learning tasks across multiple processes with MPS
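For the first mode, here is a minimal, self-contained sketch (not taken from the sample) of issuing independent work on several CUDA streams inside a single CUDA context; the kernel and sizes are illustrative only:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Toy kernel standing in for real inference work.
__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 4;   // illustrative stream count
    const int kN = 1 << 20;   // elements per stream
    float* buf = nullptr;
    cudaMalloc(&buf, (size_t)kStreams * kN * sizeof(float));

    std::vector<cudaStream_t> streams(kStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Each stream gets an independent slice of the buffer, so the launches
    // can overlap within the single CUDA context.
    for (int s = 0; s < kStreams; ++s)
        busyKernel<<<(kN + 255) / 256, 256, 0, streams[s]>>>(buf + (size_t)s * kN, kN);

    for (auto& s : streams) {
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
    cudaFree(buf);
    return 0;
}
```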
- Files: mps_server.cpp, mps_client.cpp
- Testing device: Drive Orin-X, Drive OS 6.0.6.0
- Unit: img/sec
| model | 2 processes without MPS | 2 processes with MPS | 2 CUDA streams in 1 CUDA context | MPS / streams ratio |
| --- | --- | --- | --- | --- |
| Resnet50_224_b1 | 1380 | 1498 | 1630 | 0.920 |
| Resnet50_224_b8 | 3163 | 3192 | 3608 | 0.885 |
| Resnet50_224_b16 | 3640 | 3648 | 3955 | 0.922 |
| Resnet50_224_b32 | 3802 | 4064 | 4198 | 0.968 |
| Resnet50_224_b64 | 3939 | 4269 | 4366 | 0.978 |
| yolov4_416_b1 | 254 | 297 | 300 | 0.990 |
| yolov4_416_b8 | 373 | 416 | 416 | 1.000 |
| yolov4_416_b16 | 387 | 430 | 432 | 0.995 |
| yolov4_416_b32 | 397 | 441 | 445 | 0.991 |
| yolov4_416_b64 | 403 | 446 | 451 | 0.990 |
Ensure that CUDA_DEVICE_MAX_CONNECTIONS is set to a value equal to or larger than the number of CUDA streams ($STREAM_NUM) in the CUDA context. This avoids initialization failures due to over-allocation of limited GPU resources, and it avoids false dependencies among the CUDA streams, which can cause unexpectedly long CUDA stream synchronization times.
You can run `export CUDA_DEVICE_MAX_CONNECTIONS=$STREAM_NUM` in the same terminal as the application, or call `setenv("CUDA_DEVICE_MAX_CONNECTIONS", "$STREAM_NUM", 1)` in code to set it. The maximum value of CUDA_DEVICE_MAX_CONNECTIONS on Orin is 32.
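A minimal sketch of setting it programmatically, assuming a hypothetical stream count of 8; the environment variable must be set before the first CUDA runtime call in the process:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int kStreamNum = 8;  // hypothetical stream count; must be <= 32 on Orin
    char value[16];
    std::snprintf(value, sizeof(value), "%d", kStreamNum);
    // Must happen before the first CUDA runtime call in the process.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", value, 1);

    cudaFree(0);  // first runtime call; the context is created with the setting applied
    return 0;
}
```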
MPS on the primary CUDA context (RECOMMENDED)
- The best way to use MPS is to enable the primary CUDA context as an "MPS context": call `etiEnableSharedPrimaryCtx` at the beginning of the process. By doing so, when you call a CUDA runtime API such as `cudaSetDeviceFlags` or `cudaFree`, the primary context will be created with MPS enabled.
Create an explicit CUDA context
- Users can create and manage an MPS CUDA context themselves by calling:
  `etiSharedCtxCreate(&cuda_context, 0, &(mps_resource.createParams));`
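A sketch of how the explicit path might be wired up. The `etiSharedCtxCreate` call and `mps_resource.createParams` come from this sample's patches and are shown in a comment because their types require the Drive OS MPS headers; `cuInit` and `cuCtxSetCurrent` are standard CUDA driver API calls:

```cpp
#include <cuda.h>

int main() {
    cuInit(0);
    CUcontext cuda_context = nullptr;

    // Create a CUDA context that participates in MPS sharing (call from this
    // sample; mps_resource.createParams is defined in the sample's patches):
    //     etiSharedCtxCreate(&cuda_context, 0, &(mps_resource.createParams));

    // Make the shared context current for this thread; subsequent CUDA work
    // issued from this thread then runs inside the MPS context.
    cuCtxSetCurrent(cuda_context);
    return 0;
}
```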
- After enabling MPS, there will be only one CUDA context visible in the nsys log, since all contexts running on MPS are bound into one.
- MPS does not cause higher CPU utilization. If you see high CPU load during `cudaStreamSynchronize`, check the device flags with `cudaGetDeviceFlags` (see the sketch at the end of this section).
- MPS supports compute only, so graphics contexts cannot run on MPS at this time.
- CUDA objects are local to their CUDA context; sharing them across CUDA contexts can cause errors, e.g., CUDA error 705: `cudaErrorPeerAccessNotEnabled`.
- The MPS shared key and the device key differ among processes.
- You can check with Nsight Systems whether GPU context switches occur inside the application. Sample command:
nsys profile -t cuda,cudnn,osrt,nvtx --gpuctxsw true --duration=30 --accelerator-trace=tegra-accelerators --process-scope=system-wide ./build/mps_client --mps --GPU=ResNet50GPU.engine
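For the CPU-utilization note above, a minimal sketch of inspecting the device scheduling flags with `cudaGetDeviceFlags`; the printed interpretations follow the standard CUDA flag semantics:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    unsigned int flags = 0;
    cudaGetDeviceFlags(&flags);
    if (flags & cudaDeviceScheduleBlockingSync)
        std::printf("Blocking sync: the CPU sleeps while waiting on the GPU.\n");
    else if (flags & cudaDeviceScheduleSpin)
        std::printf("Spin sync: the CPU busy-waits; expect high CPU utilization.\n");
    else
        std::printf("Default/yield scheduling (flags=0x%x).\n", flags);
    return 0;
}
```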