Hello! I am currently testing the ResNet50 model on an NVIDIA A100 (40 GB) GPU. When compiled with the CUTLASS BYOC backend, ResNet50 runs about 2.9x faster than when compiled against cuDNN and cuBLAS. I find it surprising that CUTLASS achieves such a large speed-up — is there anything I am overlooking? Below are the specifics of my testing process:
Performance using cutlass:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
1.7029 1.5734 1.8596 1.5677 0.1418
Performance using cudnn+cublas:
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
4.7330 4.6858 5.0248 4.6459 0.0859
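As a quick sanity check of the 2.9x figure, the speed-up can be recomputed directly from the summary statistics reported above (the per-run latencies themselves are not in the report):

```python
# Summary statistics copied from the two execution-time tables above (all in ms).
cutlass = {"mean": 1.7029, "median": 1.5734, "max": 1.8596, "min": 1.5677, "std": 0.1418}
cudnn_cublas = {"mean": 4.7330, "median": 4.6858, "max": 5.0248, "min": 4.6459, "std": 0.0859}

# Speed-up of the CUTLASS build relative to the cuDNN+cuBLAS build.
speedup_median = cudnn_cublas["median"] / cutlass["median"]
speedup_mean = cudnn_cublas["mean"] / cutlass["mean"]
print(f"median speed-up: {speedup_median:.2f}x")
print(f"mean speed-up:   {speedup_mean:.2f}x")
```

The median ratio comes out near 2.98x and the mean ratio near 2.78x, so "2.9 times faster" is consistent with the reported numbers.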
CUDA version: 11.7
TVM version: commit 1d145f112115ca20a0cd2e37a726b1d1519cac4b
config.cmake
@@ -46,7 +46,7 @@
# - ON: enable CUDA with cmake's auto search
# - OFF: disable CUDA
# - /path/to/cuda: use specific path to cuda toolkit
-set(USE_CUDA OFF)
+set(USE_CUDA ON)
# Whether enable ROCM runtime
@@ -142,7 +142,7 @@ set(USE_MICRO_STANDALONE_RUNTIME OFF)
# - OFF: disable llvm, note this will disable CPU codegen
#   which is needed for most cases
# - /path/to/llvm-config: enable specific LLVM when multiple llvm-dev is available.
-set(USE_LLVM OFF)
+set(USE_LLVM /usr/bin/llvm-config-11)
#---------------------------------------------
# Contrib libraries
@@ -217,10 +217,10 @@ set(USE_EDGETPU OFF)
# - ON: enable cuDNN with cmake's auto search in CUDA directory
# - OFF: disable cuDNN
# - /path/to/cudnn: use specific path to cuDNN path
-set(USE_CUDNN OFF)
+set(USE_CUDNN ON)
# Whether use cuBLAS
-set(USE_CUBLAS OFF)
+set(USE_CUBLAS ON)
# Whether use MIOpen
set(USE_MIOPEN OFF)
@@ -416,7 +416,7 @@ set(USE_GTEST AUTO)
# Enable using CUTLASS as a BYOC backend
# Need to have USE_CUDA=ON
-set(USE_CUTLASS OFF)
+set(USE_CUTLASS ON)
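For completeness, a typical way to apply the flags in the diff above is an out-of-tree TVM build. The commit hash is the one reported above; the clone location and the edit step are assumptions, and your CUDA and llvm-config paths may differ:

```shell
# Sketch of a TVM source build using the config.cmake changes shown above.
git clone --recursive https://github.com/apache/tvm && cd tvm
git checkout 1d145f112115ca20a0cd2e37a726b1d1519cac4b
mkdir build && cp cmake/config.cmake build/
# Edit build/config.cmake as in the diff above
# (USE_CUDA, USE_LLVM, USE_CUDNN, USE_CUBLAS, USE_CUTLASS).
cd build && cmake .. && make -j"$(nproc)"
```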
In order to run resnet50/run.py against TVM's main-branch API, I made some modifications to the code: https://github.com/umiswing/tvm-cutlass-eval/commit/3b4bd377763d8d8eb3a0817fbed6cde9e6708bf3
Link to reproduce the test (python run.py): https://github.com/umiswing/tvm-cutlass-eval/blob/master/resnet50/run.py