When testing the performance of DeepSeek-v2-lite using Megatron and TransformerEngine, I encountered an issue where GroupedLinear exhibits unusually high duration. The TEGroupedLinear forward operation typically takes about 1ms as observed in the nsys timeline, but there are anomalous events that exceed 200ms. What could be causing this issue?
I cannot share the timeline for some reason. The table below lists the durations of abnormal and normal events extracted from the nsys timeline. Why is there such a large difference in duration between nvte_multi_stream_cublas_gemm and TERowParallelGroupedLinear, and why does the start time of the abnormal nvte_multi_stream_cublas_gemm event lag behind the start time of TERowParallelGroupedLinear by about 200 ms?
Also, if I directly run TEGroupedLinear with input tensors of the same shapes in a standalone micro-benchmark, the timing returns to normal. Is the training workflow affecting the execution efficiency of the kernel?
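For reference, the micro-benchmark mentioned above can be sketched roughly as below. This is a minimal sketch, not the original code: it assumes TE's GroupedLinear module is constructed as `GroupedLinear(num_gemms, in_features, out_features, ...)` and that its forward takes the packed input plus per-group row counts (`m_splits`); all sizes are illustrative, not the exact DeepSeek-v2-lite configuration.

```python
# Minimal sketch of a standalone TEGroupedLinear micro-benchmark (not the original
# training code). Assumes te.GroupedLinear(num_gemms, in_features, out_features)
# with a forward that takes the packed input and per-group row counts (m_splits);
# sizes below are illustrative, not the exact DeepSeek-v2-lite configuration.
import torch
import transformer_engine.pytorch as te

num_gemms, in_features, out_features = 64, 2048, 1408
m_splits = [256] * num_gemms                      # rows routed to each expert/group

layer = te.GroupedLinear(num_gemms, in_features, out_features,
                         bias=False, params_dtype=torch.bfloat16)
inp = torch.randn(sum(m_splits), in_features, device="cuda", dtype=torch.bfloat16)

# Warm up, then time the forward pass with CUDA events.
for _ in range(10):
    layer(inp, m_splits)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    layer(inp, m_splits)
end.record()
torch.cuda.synchronize()
print(f"avg TEGroupedLinear forward: {start.elapsed_time(end) / iters:.3f} ms")
```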
| Name | Start | Duration | TID |
| --- | --- | --- | --- |
| #TEGroupLinear forward | 3.67097 s | 206.707 ms | 179103 |
| ##TERowParallelGroupedLinear forward | 3.67099 s | 206.660 ms | 179103 |
| ###nvte_multi_stream_cublas_gemm | 3.87704 s | 387.909 μs | 179103 |
| #TEGroupLinear forward | 1.4103 s | 3.373 ms | 179103 |
| ##TERowParallelGroupedLinear forward | 1.41032 s | 3.327 ms | 179103 |
| ###nvte_multi_stream_cublas_gemm | 1.41077 s | 1.008 ms | 179103 |
| #TEGroupLinear forward | 2.58523 s | 3.103 ms | 179103 |
| ##TERowParallelGroupedLinear forward | 2.58525 s | 3.055 ms | 179103 |
| ###nvte_multi_stream_cublas_gemm | 2.58579 s | 1.128 ms | 179103 |
Optimization for Hopper?
Also, is there a plan to optimize GroupedLinear for the Hopper architecture? With the DeepSeek-v2 parameters, the TFLOPS on H800 show no significant improvement over A800, and overall performance is quite poor. The test results and code are as follows:
```text
# H800
Average execution time: 0.0011188620805740357 s, tflops: 253.35369430928066
Average execution time: 0.001063387517929077 s, tflops: 133.2852966376957

# A800
Average execution time: 0.0018983731222152712 s, tflops: 149.32145752527958
Average execution time: 0.0013353574371337891 s, tflops: 106.13931283613297
```
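For context, TFLOPS figures like those above are presumably obtained by dividing the total grouped-GEMM FLOPs by the measured time. A minimal sketch of that arithmetic follows; the shapes and timing are illustrative placeholders, not the DeepSeek-v2 configuration or the original benchmark code.

```python
# Sketch of the TFLOPS arithmetic behind numbers like the ones above
# (shapes and timing are illustrative; the original benchmark code is not shown here).
def grouped_gemm_tflops(m_splits, in_features, out_features, elapsed_s):
    # Each group is an (m_i x k) @ (k x n) GEMM costing ~2 * m_i * k * n FLOPs.
    total_flops = sum(2 * m * in_features * out_features for m in m_splits)
    return total_flops / elapsed_s / 1e12

# Example: 64 groups of 256 rows each, 2048 -> 1408, measured at ~1.12 ms.
print(grouped_gemm_tflops([256] * 64, 2048, 1408, 1.12e-3))
```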
Can your questions be summarized into the following two?
Q: Why is there such a large difference in duration between nvte_multi_stream_cublas_gemm and TERowParallelGroupedLinear?
A: It's the CPU overhead of PyTorch ops, such as torch.split() and torch.empty() (2 × num_gemms calls) under fused_multi_cast_transpose. You can capture them in Nsight Systems by adding the context `with torch.autograd.profiler.emit_nvtx(enabled=True)` to your code during profiling. It's not trivial to eliminate these overheads.
Q: Some abnormal iterations consume much more time than usual, while in the micro-benchmark there is no such problem.
A: I have no idea about this issue. Maybe you can enable NVTX for torch ops using the context mentioned above and see what it is actually doing there.
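For reference, a minimal sketch of the profiling setup suggested above is shown below: it wraps a few training steps so that PyTorch ops appear as NVTX ranges in the nsys timeline. The `train_step` callable and data iterator are hypothetical placeholders for whatever the training loop actually uses.

```python
# Sketch: expose per-op NVTX ranges for a few training steps when profiling with
# Nsight Systems, e.g. `nsys profile -c cudaProfilerApi python train.py`.
import torch

def profile_steps(train_step, data_iter, num_steps=3):
    torch.cuda.profiler.start()                            # opens the nsys capture range
    with torch.autograd.profiler.emit_nvtx(enabled=True):  # NVTX markers for torch ops
        for _ in range(num_steps):
            train_step(next(data_iter))                    # hypothetical training-step call
    torch.cuda.profiler.stop()                             # closes the capture range
```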