Commit 3582d66
FP8 Grouped Gemm Optimization (#3655)
Summary:
X-link: facebookresearch/FBGEMM#731
While optimizing MoE, we found that small overheads were a major bottleneck for grouped GEMM performance. This diff tackles a few of them, specifically the overhead from torch.dynamo wrapping `quantize_fp8_row` and from having to slice input tensors before calling `f8f8bf16_rowwise_grouped`.
To fix the former, we enable `triton_quantize_fp8_row` to be called directly, skipping dynamo compatibility. In cases where AOTI isn't needed, this removes a bit of overhead.
To fix the latter, we templatize `f8f8bf16_rowwise_grouped_dynamic` to accept `at::Tensor` instead of lists. We introduce a new wrapper called `f8f8bf16_rowwise_grouped_stacked` to maintain the behavior where `zero_start_index_M` isn't provided but a user wants a single contiguous output tensor.
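The difference between a list-based and a stacked calling convention can be sketched in plain Python. This is only an illustration under assumptions: `grouped_gemm_list`, `grouped_gemm_stacked`, `matmul_nt`, and `m_sizes` are hypothetical names rather than FBGEMM's API, and the matrices are nested lists rather than FP8 tensors.

```python
# Illustrative sketch only: hypothetical helpers, not the FBGEMM kernels.
# Matrices are plain nested lists. Each weight is stored as (N, K), so every
# group computes X_g @ W_g^T.

def matmul_nt(x, w):
    """Return x @ w^T for x of shape (M, K) and w of shape (N, K)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in w] for row in x]

def grouped_gemm_list(x_groups, w_groups):
    """List-based convention: the caller has already sliced X per group."""
    return [matmul_nt(x, w) for x, w in zip(x_groups, w_groups)]

def grouped_gemm_stacked(x_stacked, w_groups, m_sizes):
    """Stacked convention: one contiguous X plus per-group row counts.

    Row offsets are derived inside the function and the results land in a
    single contiguous output, so the caller passes no per-group slices.
    """
    out = []
    start = 0
    for w, m in zip(w_groups, m_sizes):
        out.extend(matmul_nt(x_stacked[start:start + m], w))
        start += m
    return out

# Two groups with 2 rows and 1 row, K = 2, N = 2.
x_stacked = [[1, 2], [3, 4], [5, 6]]
w_groups = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
m_sizes = [2, 1]

stacked_out = grouped_gemm_stacked(x_stacked, w_groups, m_sizes)
list_out = grouped_gemm_list([x_stacked[:2], x_stacked[2:]], w_groups)
```

Both conventions produce the same rows. The stacked form simply moves the per-group bookkeeping out of the caller, which is the kind of overhead this diff targets.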
In microbenchmarks, we've found these seemingly small changes can improve TFLOPs by 2x for small workloads.
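To see why a fixed overhead matters so much for small workloads, consider a rough back-of-the-envelope calculation. All numbers below (group sizes, kernel time, overhead time) are hypothetical values chosen for illustration, not measurements from this diff, and the helper names are assumptions.

```python
# Hypothetical numbers for illustration; not measurements from this commit.

def grouped_gemm_flops(m_sizes, n, k):
    """FLOPs for a grouped GEMM: 2 * M_g * N * K summed over all groups."""
    return sum(2 * m * n * k for m in m_sizes)

def tflops(flops, seconds):
    """Convert a FLOP count and a wall time into TFLOPs per second."""
    return flops / seconds / 1e12

m_sizes = [64] * 16          # 16 small groups of 64 rows each (assumed)
n = k = 1024                 # assumed problem dimensions
flops = grouped_gemm_flops(m_sizes, n, k)

kernel_s = 20e-6             # assumed time spent in the GEMM kernel itself
overhead_s = 20e-6           # assumed host-side overhead (wrapping, slicing)

with_overhead = tflops(flops, kernel_s + overhead_s)
without_overhead = tflops(flops, kernel_s)
speedup = without_overhead / with_overhead
```

When the fixed overhead is as large as the kernel time, removing it doubles the measured throughput even though the kernel itself is unchanged. For large workloads the same overhead would be a negligible fraction of the total.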
Reviewed By: jiawenliu64
Differential Revision: D69072529
1 parent dced756 commit 3582d66
File tree
5 files changed, +204 -110 lines changed
- fbgemm_gpu/experimental
  - gemm/triton_gemm
  - gen_ai
    - bench
    - src/quantize
      - cutlass_extensions
    - test/quantize