Port oss f16_fast_gemv into fbcode #3610
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D68470488
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `fp16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.f16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `fp16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.f16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
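For reference, a minimal sketch of how the newly exposed op might be called from Python. The argument order, shapes, and dtype here are assumptions (an fp16 activation of shape (M, K) against an fp16 weight of shape (N, K)), not the confirmed signature of `torch.ops.fbgemm.f16_fast_gemv`:

```python
import torch

# Hypothetical usage sketch; assumes the op computes x @ w.T for an fp16
# activation x of shape (M, K) and an fp16 weight w of shape (N, K).
M, N, K = 1, 1280, 8192
x = torch.randn(M, K, dtype=torch.half, device="cuda")
w = torch.randn(N, K, dtype=torch.half, device="cuda")

out = torch.ops.fbgemm.f16_fast_gemv(x, w)

# Quick correctness check against a standard fp16 GEMV.
ref = torch.nn.functional.linear(x, w)
torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```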
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `fp16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.f16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `fp16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.f16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

Perf numbers comparing `f16_baseline`, `fp16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, and `machete_bf16i4`:

### Benchmark Results

| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | fp16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite_fp8 | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | fp16_oss_fast_gemv | 0.010 | 1.763 | 1765.033 |
| 1 | 8192 | 1024 | cuda_lite_fp8 | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | fp16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite_fp8 | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | fp16_oss_fast_gemv | 0.026 | 2.268 | 1134.843 |
| 1 | 8192 | 3584 | cuda_lite_fp8 | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |

Note that the precision of the `fast_gemv` kernel and `cuda_lite` does not match yet, so fp8 support is needed for a fairer comparison.

Heuristic sweep results for the 4 problem sizes we care about: P1722806148

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
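As a reading aid for the table, here is a small sketch of how the derived TFLOPS and GB/s columns can be approximately reproduced from M, N, K and the elapsed time. The exact formulas used by the benchmark script are an assumption (2·M·N·K FLOPs, and bytes moved counted per tensor at its element width), so expect small rounding differences from the reported values:

```python
# Hypothetical reproduction of the table's derived metrics; the exact formulas
# the benchmark script uses are an assumption, not taken from this PR.
def gemv_metrics(m, n, k, elapsed_ms, x_bytes=2, w_bytes=2, out_bytes=2):
    """Estimate TFLOPS and GB/s from a GEMM/GEMV problem size and elapsed time."""
    seconds = elapsed_ms * 1e-3
    flops = 2 * m * n * k  # count a multiply and an add for every (m, n, k) combination
    moved = m * k * x_bytes + n * k * w_bytes + m * n * out_bytes
    return flops / seconds / 1e12, moved / seconds / 1e9

# First table row: M=1, N=1280, K=8192, bf16_baseline at 0.024 ms
print(gemv_metrics(1, 1280, 8192, 0.024))              # roughly (0.87, 875)
# fp8 weights halve the dominant N*K term, which is why cuda_lite_fp8's
# GB/s is about half of its TFLOPS * 1000 at M=1.
print(gemv_metrics(1, 1280, 8192, 0.015, w_bytes=1))
```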
Summary: X-link: facebookresearch/FBGEMM#688

This diff includes:
1. Port the OSS FastGEMV `bf16` kernel into fbcode and expose it to Python as step 1: `torch.ops.fbgemm.bf16_fast_gemv` (https://github.com/wangsiping97/FastGEMV/blob/1fdff6f74aade033c02727a419afd6a4b4bfbc3f/fast_gemv.cu#L14)
2. Add `bf16_oss_fast_gemv` to the quantize ops benchmark script.
3. Add two simple tests for the custom op `torch.ops.fbgemm.bf16_fast_gemv` covering:
   - `torch.compile()` compatibility
   - correctness

Perf numbers comparing `bf16_baseline`, `bf16_oss_fast_gemv`, `cuda_lite`, `marlin_bf16i4`, and `machete_bf16i4`:

### Benchmark Results on H100

| **M** | **N** | **K** | **Method** | **Elapsed Time (ms)** | **TFLOPS** | **GB/s** |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1280 | 8192 | bf16_baseline | 0.024 | 0.860 | 861.042 |
| 1 | 1280 | 8192 | bf16_oss_fast_gemv | 0.019 | 1.126 | 1127.391 |
| 1 | 1280 | 8192 | cuda_lite_fp8 | 0.015 | 1.357 | 679.032 |
| 1 | 1280 | 8192 | marlin_bf16i4 | 0.027 | 0.768 | 192.612 |
| 1 | 1280 | 8192 | machete_bf16i4 | 0.026 | 0.810 | 203.219 |
| 1 | 8192 | 1024 | bf16_baseline | 0.018 | 0.952 | 953.176 |
| 1 | 8192 | 1024 | bf16_oss_fast_gemv | 0.015 | 1.100 | 1100.900 |
| 1 | 8192 | 1024 | cuda_lite_fp8 | 0.014 | 1.198 | 600.054 |
| 1 | 8192 | 1024 | marlin_bf16i4 | 0.015 | 1.144 | 287.150 |
| 1 | 8192 | 1024 | machete_bf16i4 | 0.014 | 1.187 | 298.096 |
| 1 | 7168 | 8192 | bf16_baseline | 0.073 | 1.609 | 1608.983 |
| 1 | 7168 | 8192 | bf16_oss_fast_gemv | 0.069 | 1.697 | 1697.308 |
| 1 | 7168 | 8192 | cuda_lite_fp8 | 0.044 | 2.679 | 1340.093 |
| 1 | 7168 | 8192 | marlin_bf16i4 | 0.033 | 3.590 | 898.436 |
| 1 | 7168 | 8192 | machete_bf16i4 | 0.039 | 3.017 | 755.147 |
| 1 | 8192 | 3584 | bf16_baseline | 0.045 | 1.312 | 1312.239 |
| 1 | 8192 | 3584 | bf16_oss_fast_gemv | 0.041 | 1.427 | 1427.166 |
| 1 | 8192 | 3584 | cuda_lite_fp8 | 0.026 | 2.271 | 1136.151 |
| 1 | 8192 | 3584 | marlin_bf16i4 | 0.021 | 2.808 | 703.164 |
| 1 | 8192 | 3584 | machete_bf16i4 | 0.024 | 2.460 | 615.990 |

Note that the precision of the `fast_gemv` kernel and `cuda_lite` does not match yet, so fp8 support is needed for a fairer comparison. Also, the no_cuda_graph flag was enabled when running quantize_bench.

Heuristic sweep results for the 4 problem sizes we care about: P1722806148

**Next step:** add fp8 mixed-precision support for the fast GEMV kernel, which is what we ultimately want.

Differential Revision: D68470488
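The two tests mentioned in item 3 are not shown in this thread; below is a minimal sketch of what they might look like, assuming the op takes a bf16 activation of shape (M, K) and a bf16 weight of shape (N, K). The argument order, helper names, and tolerances are all assumptions:

```python
import torch

def _inputs(m=1, n=1280, k=8192):
    # Hypothetical shapes/dtypes, mirroring the (M, N, K) problem sizes benchmarked above.
    x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(n, k, dtype=torch.bfloat16, device="cuda")
    return x, w

def test_correctness():
    x, w = _inputs()
    out = torch.ops.fbgemm.bf16_fast_gemv(x, w)
    ref = torch.nn.functional.linear(x, w)  # plain bf16 GEMV as reference
    torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)

def test_torch_compile():
    x, w = _inputs()

    def gemv(a, b):
        return torch.ops.fbgemm.bf16_fast_gemv(a, b)

    eager = gemv(x, w)
    compiled = torch.compile(gemv)
    torch.testing.assert_close(compiled(x, w), eager)
```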