Hello! I am currently testing the ResNet50 model on an NVIDIA A100 (40 GB) GPU. When compiled with the CUTLASS BYOC backend, ResNet50 runs about 2.9x faster than when compiled against cuDNN and cuBLAS. I find it surprising that CUTLASS achieves such a large speed-up — is there anything I am overlooking? Below are the specifics of my testing process:
Performance using cutlass:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
1.7029 1.5734 1.8596 1.5677 0.1418
Performance using cudnn+cublas:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
4.7330 4.6858 5.0248 4.6459 0.0859
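As a quick sanity check of the 2.9x figure, the speed-up can be recomputed directly from the summary statistics reported above (the per-run latencies themselves are not in the report):

```python
# Summary statistics copied from the two execution-time tables above (all in ms).
cutlass = {"mean": 1.7029, "median": 1.5734, "max": 1.8596, "min": 1.5677, "std": 0.1418}
cudnn_cublas = {"mean": 4.7330, "median": 4.6858, "max": 5.0248, "min": 4.6459, "std": 0.0859}

# Speed-up of the CUTLASS build relative to the cuDNN+cuBLAS build.
speedup_median = cudnn_cublas["median"] / cutlass["median"]
speedup_mean = cudnn_cublas["mean"] / cutlass["mean"]
print(f"median speed-up: {speedup_median:.2f}x")
print(f"mean speed-up:   {speedup_mean:.2f}x")
```

The median ratio comes out near 2.98x and the mean ratio near 2.78x, so "2.9 times faster" is consistent with the reported numbers.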
CUDA version: 11.7
TVM version: commit 1d145f112115ca20a0cd2e37a726b1d1519cac4b
config.cmake
@@ -46,7 +46,7 @@
# - ON: enable CUDA with cmake's auto search
# - OFF: disable CUDA
# - /path/to/cuda: use specific path to cuda toolkit
-set(USE_CUDA OFF)
+set(USE_CUDA ON)
# Whether enable ROCM runtime
@@ -142,7 +142,7 @@ set(USE_MICRO_STANDALONE_RUNTIME OFF)
# - OFF: disable llvm, note this will disable CPU codegen
#   which is needed for most cases
# - /path/to/llvm-config: enable specific LLVM when multiple llvm-dev is available.
-set(USE_LLVM OFF)
+set(USE_LLVM /usr/bin/llvm-config-11)
#---------------------------------------------
# Contrib libraries
@@ -217,10 +217,10 @@ set(USE_EDGETPU OFF)
# - ON: enable cuDNN with cmake's auto search in CUDA directory
# - OFF: disable cuDNN
# - /path/to/cudnn: use specific path to cuDNN path
-set(USE_CUDNN OFF)
+set(USE_CUDNN ON)
# Whether use cuBLAS
-set(USE_CUBLAS OFF)
+set(USE_CUBLAS ON)
# Whether use MIOpen
set(USE_MIOPEN OFF)
@@ -416,7 +416,7 @@ set(USE_GTEST AUTO)
# Enable using CUTLASS as a BYOC backend
# Need to have USE_CUDA=ON
-set(USE_CUTLASS OFF)
+set(USE_CUTLASS ON)
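For completeness, a typical way to apply the flags in the diff above is an out-of-tree TVM build. The commit hash is the one reported above; the clone location and the edit step are assumptions, and your CUDA and llvm-config paths may differ:

```shell
# Sketch of a TVM source build using the config.cmake changes shown above.
git clone --recursive https://github.com/apache/tvm && cd tvm
git checkout 1d145f112115ca20a0cd2e37a726b1d1519cac4b
mkdir build && cp cmake/config.cmake build/
# Edit build/config.cmake as in the diff above
# (USE_CUDA, USE_LLVM, USE_CUDNN, USE_CUBLAS, USE_CUTLASS).
cd build && cmake .. && make -j"$(nproc)"
```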
In order to run resnet50/run.py against TVM's main-branch API, I made some modifications to the code: https://github.com/umiswing/tvm-cutlass-eval/commit/3b4bd377763d8d8eb3a0817fbed6cde9e6708bf3
Link to reproduce the test (python run.py): https://github.com/umiswing/tvm-cutlass-eval/blob/master/resnet50/run.py