Skip to content

Disable igemm bwd v1r1 on gfx942 #3682

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 23, 2025
Merged

Conversation

randyspauldingamd
Copy link
Contributor

@randyspauldingamd randyspauldingamd commented Apr 16, 2025

Kernel contains asm instructions which do not exist on these asics. Disables the following solvers and the associated tests:

ConvHipImplicitGemmBwdDataV1R1
ConvHipImplicitGemmBwdDataV4R1
ConvHipImplicitGemmV4R1Fwd
ConvHipImplicitGemmV4R1WrW
ConvHipImplicitGemmV4R4Fwd
ConvHipImplicitGemmV4R4WrW

@@ -645,6 +645,9 @@ bool ConvHipImplicitGemmBwdDataV1R1::IsApplicable(const ExecutionContext& ctx,
return false;
if(!ctx.use_hip_kernels)
return false;
const std::string name = ctx.GetStream().GetDeviceName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good, but need to repeat with 5 more
ConvHipImplicitGemmBwdDataV1R1
ConvHipImplicitGemmBwdDataV4R1
ConvHipImplicitGemmV4R1Fwd
ConvHipImplicitGemmV4R1WrW
ConvHipImplicitGemmV4R4Fwd
ConvHipImplicitGemmV4R4WrW

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kamil! I included those, and I'm running lots of additional tests in search of any more.

Comment on lines 649 to 650
if(name == "gfx942" || name == "gfx950")
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to explicitly enable this solver for some gpus, otherwise we'll have to keep updating the list for every new device.

Copy link
Contributor Author

@randyspauldingamd randyspauldingamd Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. The primary goal of this was to give Adam ideas for temporary workarounds to get his stuff working. The proper fixes are obviously much more involved , and I expect this will never go in. I'll leave this up for reference for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes have a temporary feel to them, so I shall not be making this extra change at this time.

@randyspauldingamd
Copy link
Contributor Author

Fixing these test crashes was determined to be necessary for ROCm 6.5. Time does not allow correcting the kernels in question, so the solvers are being disabled.

Note: all of the failures were with FP32 and this change causes some other types to be disabled unnecessarily. This issue will be discussed among the team and shall not be addressed in this PR nor planned at this time.

Comment on lines 120 to 121
if(problem.IsFp32() && (name == "gfx942" || name == "gfx950"))
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this solver only an issue for FP32?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually a few type-specific issues in these solvers. I originally had them all properly flagged, but then I couldn't find a quick way to replicate it in the tests and removed them. This one got missed. I'll clean it up.

@bghimireamd
Copy link
Contributor

bghimireamd commented Apr 23, 2025

Basically :

  1. disabling these solvers ConvHipImplicitGemmBwdDataV1R1, ConvHipImplicitGemmBwdDataV4R1, ConvHipImplicitGemmV4R1Fwd, ConvHipImplicitGemmV4R1WrW, ConvHipImplicitGemmV4R4Fwd, ConvHipImplicitGemmV4R4WrW in gfx942" and "gfx950".
  2. also removed Gpu::gfx94X from testing against those solves

May be add this to description too.

LGTM

@randyspauldingamd randyspauldingamd dismissed shurale-nkn’s stale review April 23, 2025 19:25

Did not respond in a timely fashion

@randyspauldingamd randyspauldingamd merged commit 8a81e7a into develop Apr 23, 2025
13 of 66 checks passed
@randyspauldingamd randyspauldingamd deleted the rjs/no_igemmV1R1_on_942 branch April 23, 2025 19:26
BrianHarrisonAMD added a commit that referenced this pull request May 5, 2025
BrianHarrisonAMD added a commit that referenced this pull request May 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants