Labels: enhancement, performance
Description
GemmKernels.jl is shaping up to be usable as a replacement for the GPUArrays fallback matmul implementation, with much better performance. For example, here are timings for a 2048x2048x2048 Float32 matmul.
GPUArrays:
```julia
julia> @benchmark CUDA.@sync GPUArrays.generic_matmatmul!(dC, dA, dB, true, false)
BenchmarkTools.Trial: 576 samples with 1 evaluation.
 Range (min … max): 6.001 ms … 9.519 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 8.698 ms ┊ GC (median): 0.00%
 Time (mean ± σ): 8.677 ms ± 352.550 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 ▃ █▅▁
 ▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▃▃▃▄▅▅█▇████▅▅▄▃▃▃ ▃
 6 ms Histogram: frequency by time 9.22 ms <
 Memory estimate: 4.81 KiB, allocs estimate: 88.
```
GemmKernels:
```julia
julia> @benchmark CUDA.@sync GemmKernels.mul!(dC, dA, dB)
BenchmarkTools.Trial: 7421 samples with 1 evaluation.
 Range (min … max): 579.075 μs … 7.580 ms ┊ GC (min … max): 0.00% … 81.60%
 Time (median): 670.235 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 669.440 μs ± 81.284 μs ┊ GC (mean ± σ): 0.12% ± 0.95%
 ▁ ▂█▅▂
 ▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▂▂▃▂▃▆▆█▄▅████▆▄▃▃▃▂▂▂▂ ▃
 579 μs Histogram: frequency by time 692 μs <
 Memory estimate: 6.34 KiB, allocs estimate: 115.
```
CUBLAS:
```julia
julia> @benchmark CUDA.@sync mul!(dC, dA, dB)
BenchmarkTools.Trial: 9659 samples with 1 evaluation.
 Range (min … max): 404.287 μs … 593.765 μs ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 513.306 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 512.696 μs ± 10.767 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 ▁ ▁ ▁▆▇██▇▅▃▃▂▁ ▂
 ▄██▆▃▁▁▁▁▃▁▁▁▁▁▃▅▁▁▁▁▁▃▃▁▁▁▅█▄▁▃▁▁▃▁▅█▅▄▄▆███████████████████ █
 404 μs Histogram: log(frequency) by time 530 μs <
 Memory estimate: 3.81 KiB, allocs estimate: 84.
```
And for completeness, OpenBLAS on the CPU:
```julia
julia> @benchmark mul!(C, A, B)
BenchmarkTools.Trial: 376 samples with 1 evaluation.
 Range (min … max): 12.956 ms … 15.188 ms ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 13.197 ms ┊ GC (median): 0.00%
 Time (mean ± σ): 13.294 ms ± 356.341 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 ▁▅▄▇███▇▅▅▃▂▅▂▂▁
 ████████████████▆▁▆▁▅▆▆▅▅▅▁▇▁█▅▅▁▅▅▁▁▅▁▁▁▁▁▁▅▁▅▅▅▁▁▁▁▁▅▅▁▁▁▅ ▇
 13 ms Histogram: log(frequency) by time 15.1 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
```
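To make the four results easier to compare, here is a rough sketch that converts the median times above into effective throughput. It assumes the usual 2·N³ floating-point operation count for an N×N×N matmul (one multiply and one add per inner-product term); the variable names are only illustrative and not part of any package.

```julia
# Median times copied from the benchmark output above, converted to seconds.
medians = [
    "GPUArrays"   => 8.698e-3,
    "GemmKernels" => 670.235e-6,
    "CUBLAS"      => 513.306e-6,
    "OpenBLAS"    => 13.197e-3,
]

n = 2048
# Assumption: count one multiply and one add per inner-product term.
flop = 2 * n^3

# Effective throughput in TFLOP/s for each implementation.
tflops = [name => flop / t / 1e12 for (name, t) in medians]

for (name, rate) in tflops
    println(rpad(name, 12), round(rate; digits=2), " TFLOP/s")
end

# Relative comparisons based on the medians.
times = Dict(medians)
speedup_vs_fallback = times["GPUArrays"] / times["GemmKernels"]
fraction_of_cublas  = times["CUBLAS"] / times["GemmKernels"]

println("GemmKernels vs GPUArrays fallback: ", round(speedup_vs_fallback; digits=1), "x faster")
println("GemmKernels vs CUBLAS: ", round(100 * fraction_of_cublas; digits=1), "% of CUBLAS throughput")
```

Going by these medians, GemmKernels is about 13x faster than the GPUArrays fallback and reaches roughly 77% of CUBLAS throughput on this problem size.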