Port oss f16_fast_gemv into fbcode #3610
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D68470488
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `fp16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.f16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `fp16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.f16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
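For reference, a minimal sketch of how the newly exposed op might be called from Python. The argument order, shapes, and dtype here are assumptions (an fp16 activation of shape (M, K) against an fp16 weight of shape (N, K)), not the confirmed signature of `torch.ops.fbgemm.f16_fast_gemv`:

```python
import torch

# Hypothetical usage sketch; assumes the op computes x @ w.T for an fp16
# activation x of shape (M, K) and an fp16 weight w of shape (N, K).
M, N, K = 1, 1280, 8192
x = torch.randn(M, K, dtype=torch.half, device="cuda")
w = torch.randn(N, K, dtype=torch.half, device="cuda")

out = torch.ops.fbgemm.f16_fast_gemv(x, w)

# Quick correctness check against a standard fp16 GEMV.
ref = torch.nn.functional.linear(x, w)
torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```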
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `fp16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.f16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `fp16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.f16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

Perf numbers comparing `f16_baseline`, `fp16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, and `machete_bf16i4`:

### Benchmark Results

| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | fp16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite_fp8 | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | fp16_oss_fast_gemv | 0.010 | 1.763 | 1765.033 |
| 1 | 8192 | 1024 | cuda_lite_fp8 | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | fp16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite_fp8 | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | fp16_oss_fast_gemv | 0.026 | 2.268 | 1134.843 |
| 1 | 8192 | 3584 | cuda_lite_fp8 | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |

Note that the precision of the `fast_gemv` kernel and `cuda_lite` does not match yet, so fp8 support is needed for a fairer comparison.

Heuristic sweep results for the 4 problem sizes we care about: P1722806148

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
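As a reading aid for the table, here is a small sketch of how the derived TFLOPS and GB/s columns can be approximately reproduced from M, N, K and the elapsed time. The exact formulas used by the benchmark script are an assumption (2·M·N·K FLOPs, and bytes moved counted per tensor at its element width), so expect small rounding differences from the reported values:

```python
# Hypothetical reproduction of the table's derived metrics; the exact formulas
# the benchmark script uses are an assumption, not taken from this PR.
def gemv_metrics(m, n, k, elapsed_ms, x_bytes=2, w_bytes=2, out_bytes=2):
    """Estimate TFLOPS and GB/s from a GEMM/GEMV problem size and elapsed time."""
    seconds = elapsed_ms * 1e-3
    flops = 2 * m * n * k  # count a multiply and an add for every (m, n, k) combination
    moved = m * k * x_bytes + n * k * w_bytes + m * n * out_bytes
    return flops / seconds / 1e12, moved / seconds / 1e9

# First table row: M=1, N=1280, K=8192, bf16_baseline at 0.024 ms
print(gemv_metrics(1, 1280, 8192, 0.024))              # roughly (0.87, 875)
# fp8 weights halve the dominant N*K term, which is why cuda_lite_fp8's
# GB/s is about half of its TFLOPS * 1000 at M=1.
print(gemv_metrics(1, 1280, 8192, 0.015, w_bytes=1))
```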
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `bf16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.bf16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `bf16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.bf16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

Perf numbers comparing `bf16_baseline`, `bf16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, and `machete_bf16i4`:

### Benchmark Results on H100

| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | bf16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite_fp8 | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | bf16_oss_fast_gemv | 0.015 | 1.100 | 1100.900 |
| 1 | 8192 | 1024 | cuda_lite_fp8 | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | bf16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite_fp8 | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | bf16_oss_fast_gemv | 0.041 | 1.427 | 1427.166 |
| 1 | 8192 | 3584 | cuda_lite_fp8 | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |

Note that the precision of the `fast_gemv` kernel and `cuda_lite` does not match yet, so fp8 support is needed for a fairer comparison. Also, the no_cuda_graph flag was enabled when running quantize_bench.

Heuristic sweep results for the 4 problem sizes we care about: P1722806148

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
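The two tests mentioned in item 3 are not shown in this thread; below is a minimal sketch of what they might look like, assuming the op takes a bf16 activation of shape (M, K) and a bf16 weight of shape (N, K). The argument order, helper names, and tolerances are all assumptions:

```python
import torch

def _inputs(m=1, n=1280, k=8192):
    # Hypothetical shapes/dtypes, mirroring the (M, N, K) problem sizes benchmarked above.
    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(n, k, dtype=torch.bfloat16, device="cuda")
    return x, w

def test_correctness():
    x, w = _inputs()
    out = torch.ops.fbgemm.bf16_fast_gemv(x, w)
    ref = torch.nn.functional.linear(x, w)  # plain bf16 GEMV as reference
    torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)

def test_torch_compile():
    x, w = _inputs()

    def gemv(a, b):
        return torch.ops.fbgemm.bf16_fast_gemv(a, b)

    eager = gemv(x, w)
    compiled = torch.compile(gemv)
    torch.testing.assert_close(compiled(x, w), eager)
```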