
Implement persistent matmul scheduling #3812

Open

jacobhinkle wants to merge 39 commits into main from jh/persistent_kernel_impl
Conversation

@jacobhinkle (Collaborator) commented Feb 3, 2025:

Stacked on #3642

This is a followup to #3792 that implements persistent scheduling.
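For context, "persistent" scheduling here means sizing the grid to the SM count and having each CTA loop over multiple output tiles, rather than launching one CTA per tile. A minimal CUDA-style sketch of the idea (illustrative only; the names and structure are not nvFuser's generated code):

    // Each CTA strides over the flattened output-tile space. The grid is
    // sized to roughly one CTA per SM, so CTAs persist across tiles.
    __global__ void persistentMatmulSketch(int64_t tiles_m, int64_t tiles_n) {
      int64_t num_tiles = tiles_m * tiles_n;
      for (int64_t tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int64_t tile_m = tile / tiles_n; // row tile index
        int64_t tile_n = tile % tiles_n; // column tile index
        // ... load A/B tiles for (tile_m, tile_n), MMA over K, epilogue ...
      }
    }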

There is a current limitation that affects both persistent scheduling and "grid swizzling": if MatmulOp or LinearOp is present in the fusion, we hit inlining errors. This is because in that case we have a non-trivial AxisMapping on the MmaOp, and the missing input dimensions are not tracked through the scheduling transforms (merges and splits) required for either grid swizzling or persistent scheduling. Because of this, I introduced three new parametrized tests matching the original MLPBenchmarkTests but with a _BroadcastInputs suffix; these use fusedMultiplySum instead of linear (see the sketch below). The persistent variants of the non-BroadcastInputs tests are skipped until we fix the inlining issue.
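A hedged sketch of the _BroadcastInputs pattern, following the style of existing matmul tests (makeContigConcreteTensor and fusedMultiplySum are the usual nvFuser test and ops helpers; shapes and dtypes are illustrative):

    Fusion fusion;
    FusionGuard fg(&fusion);
    // Inputs are pre-broadcast to aligned 3D shapes so the fusion contains an
    // MmaOp (via fusedMultiplySum) rather than MatmulOp/LinearOp, avoiding the
    // non-trivial AxisMapping that currently breaks inlining.
    auto tv0 = makeContigConcreteTensor({-1, 1, -1}, DataType::BFloat16); // [M, 1, K]
    auto tv1 = makeContigConcreteTensor({1, -1, -1}, DataType::BFloat16); // [1, N, K]
    fusion.addInput(tv0);
    fusion.addInput(tv1);
    auto tv2 = fusedMultiplySum(tv0, tv1, {2}); // multiply, then sum over K -> [M, N]
    fusion.addOutput(tv2);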

I currently observe a correctness issue in the MLPBenchmarkTest.FwdEpilogueFusion_BroadcastInputs test regardless of parametrization, meaning we get incorrect results even for data-parallel scheduling. I confirmed this test also fails on main, so I currently skip it with a warning message.

jacobhinkle and others added 29 commits, beginning December 23, 2024 20:54. Commit message excerpts:

    I think this covers the motivation for #3616

    There is still one case that fails, which we should fix. I'll create an issue for it.
@jacobhinkle requested a review from rdspring1, February 3, 2025 16:04

@jacobhinkle (Collaborator, Author) commented:
!test

github-actions bot commented Feb 3, 2025:

Review updated until commit 0cbb3e6

Description

  • Implement persistent scheduling for Hopper matmul kernels

  • Add support for warp specialization on Hopper by default

  • Parametrize MLP Benchmark tests to include persistent and warp specialization configurations (a sketch of the parameter struct follows this list)

  • Add new tests for broadcast inputs in MLP benchmarks
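The test parameter struct implied by the changes is a hypothetical reconstruction here: the two field names are grounded in the INSTANTIATE snippet quoted later in this conversation, but the defaults are assumptions.

    // Hypothetical reconstruction from tests/cpp/test_matmul.cpp; the fields
    // appear in the review snippets below, the default values are assumed.
    struct MLPBenchmarkTestParams {
      bool warp_specialization = false;
      bool persistent_kernel = false;
    };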


Changes walkthrough 📝

Relevant files (Enhancement):

csrc/scheduler/hopper_multi_matmul.cpp: Add persistent scheduling support (+41/-16)

  • Include matmul_heuristic.h
  • Remove temporary check for persistent scheduling
  • Add persistent kernel scheduling logic
  • Update block parallelization based on tiling strategy (sketched below)

csrc/scheduler/matmul_utils.cpp: Set warp specialization default (+4/-0)

  • Set warp specialization as default on Hopper

tests/cpp/test_matmul.cpp: Update MLPBenchmarkTest for persistent kernels (+235/-80)

  • Parametrize MLPBenchmarkTest to include persistent and warp specialization
  • Add tests for broadcast inputs
  • Skip persistent kernel tests for unsupported operations
  • Adjust parameters for smem and register constraints
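A hedged sketch of how the tiling strategy might drive block parallelization, using nvFuser scheduling primitives; the axis positions, tv, and num_sms are illustrative stand-ins, not the actual hopper_multi_matmul.cpp code:

    // Data-parallel: one CTA per output tile. Persistent: merge the two tile
    // axes, outer-split by SM count, bind the outer axis to BIDx, and let
    // each CTA serially iterate over its share of tiles.
    if (params_->tiling_strategy == MatmulParams::TilingStrategy::OneTilePerCTA) {
      tv->axis(0)->parallelize(ParallelType::BIDy); // M tile axis
      tv->axis(1)->parallelize(ParallelType::BIDx); // N tile axis
    } else {
      tv->merge(0, 1);                              // [tiles_m * tiles_n]
      tv->split(0, num_sms, /*inner_split=*/false); // [num_sms, tiles per CTA]
      tv->axis(0)->parallelize(ParallelType::BIDx);
    }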

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Possible Issue

The PR introduces persistent kernel scheduling, but there are known inlining errors when MatmulOp or LinearOp are present in the fusion. This should be addressed to ensure correctness.

    if (params_->tiling_strategy != MatmulParams::TilingStrategy::OneTilePerCTA) {
      NVF_CHECK(
          params_->splitk_factor == 1,
          "Hopper matmul scheduler does not support scheduling persistent split-K kernels");
    }

    NVF_CHECK(
        params_->tiling_strategy !=
            MatmulParams::TilingStrategy::DistributeStagesAcrossSMs,
        "Hopper matmul scheduler does not support distributing stages across SMs a la stream-K");

Failing Tests

The FwdEpilogueFusion_BroadcastInputs test is currently failing. This should be investigated and resolved to ensure the correctness of the persistent kernel implementation.

    TEST_P(MLPBenchmarkTest, FwdEpilogueFusion_BroadcastInputs) {
      GTEST_SKIP() << "THIS TEST IS CURRENTLY FAILING" << std::endl;

Test Skips

Multiple tests are skipped due to unsupported features or known issues. These should be addressed or documented to ensure comprehensive testing.

    auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA);
    auto a_ref = at::randn({M, K}, options);

@rdspring1 (Collaborator) left a comment:

LGTM.

Looks like some overlap with #3642. Do you plan to merge it first?

    MLPBenchmarkTestParams{
        .warp_specialization = true,
        .persistent_kernel = true}),
    [](const testing::TestParamInfo<MLPBenchmarkTestParams>& info) {
      std::stringstream ss;
      ss << (info.param.persistent_kernel ? "persistent" : "data_parallel");
      ss << (info.param.persistent_kernel ? "persistent" : "dataparallel");
@rdspring1 (Collaborator) commented:

Is this rename from a bad merge?

@jacobhinkle (Collaborator, Author) replied:

I just hadn't merged #3642 in a while, I think.

    @@ -308,6 +308,10 @@ bool fillDefaultHopperHeuristic(

    mparams->tile_sizes = {cta_tile, warp_tile};

    // Use warp specialization on hopper by default
@rdspring1 (Collaborator) commented:

I thought using warp specialization by default was causing some test failures.

@jacobhinkle (Collaborator, Author) replied:

Not anymore. I think that was before integrating the warp tile split.
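For reference, a hedged sketch of what such a default could look like inside fillDefaultHopperHeuristic; the MatmulParams field and enum names here are assumptions, not copied from the diff:

    // Hypothetical: prefer warp-specialized circular buffering on Hopper
    // (field and enum names are assumed, not confirmed by this PR's diff).
    mparams->circular_buffering_strategy =
        MatmulParams::CircularBufferingStrategy::WarpSpecialized;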

@jacobhinkle force-pushed the jh/persistent_kernel_impl branch from 52d2bca to 07c93c6, February 6, 2025 14:03

@jacobhinkle (Collaborator, Author) commented:

!test

@jacobhinkle (Collaborator, Author) commented:

Grrr. Failures on Ampere. Will fix before merging.

@jacobhinkle (Collaborator, Author) commented:

!test

@jacobhinkle requested a review from rdspring1, February 7, 2025 16:19