HIP mixed TurboQuant vec FA on gfx900/gfx906 #99
Open
2bigO wants to merge 2 commits into TheTom:feature/turboquant-kv-cache
Overview
This PR adds a HIP-only guard that disables mixed TurboQuant vec FlashAttention by default on `gfx900`/`gfx906`, while keeping same-type turbo vec FA enabled.

On this class of AMD GPUs, mixed KV setups such as `q8_0 + turbo4` built successfully but failed at runtime: the HIP vec FA path was selected even though the mixed turbo vec specializations were not reliable on these targets. The result was a warmup-time abort in `fattn.cu`.

This change:
- adds the `GGML_HIP_DISABLE_MIXED_TURBO_VEC_FA` guard
- applies it by default on `gfx900` and `gfx906`

This keeps newer HIP targets unchanged by default and makes `q8_0 / turbo4` usable on gfx906-class hardware.

Additional information
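As a rough illustration of the selection rule described in the overview, the sketch below models the guard as a small predicate. It is not the code from this PR: `KvType`, `is_mixed_turbo`, `is_guarded_arch`, and `vec_fa_allowed` are hypothetical names, the architecture is passed as a plain string, and `guard_enabled` merely stands in for `GGML_HIP_DISABLE_MIXED_TURBO_VEC_FA`, whose actual mechanism (build option, define, or otherwise) is not stated here.

```cpp
#include <string>

// Hypothetical stand-in for the KV cache types mentioned in this PR.
enum class KvType { F16, Q8_0, Turbo4 };

inline bool is_turbo(KvType t) { return t == KvType::Turbo4; }

// "Mixed turbo" here means K and V use different types and at least one is a turbo type.
inline bool is_mixed_turbo(KvType k, KvType v) {
    return k != v && (is_turbo(k) || is_turbo(v));
}

// The two targets the guard is described as covering.
inline bool is_guarded_arch(const std::string & arch) {
    return arch == "gfx900" || arch == "gfx906";
}

// Returns false only when the guard is on, the target is gfx900/gfx906,
// and the K/V combination is mixed turbo. Same-type turbo and all other
// targets keep the vec FA path (assumed behavior, per the overview).
inline bool vec_fa_allowed(const std::string & arch, KvType k, KvType v, bool guard_enabled) {
    if (guard_enabled && is_guarded_arch(arch) && is_mixed_turbo(k, v)) {
        return false;
    }
    return true;
}
```

Under these assumptions, `vec_fa_allowed("gfx906", KvType::Q8_0, KvType::Turbo4, true)` is rejected, while the same-type case `Turbo4`/`Turbo4` on the same target is still allowed.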
I hit this on a ROCm/gfx906 deployment while bringing up `llama-server` with TurboQuant KV compression. The issue was not a build-only problem: after fixing link-time vec symbol mismatches, warmup still crashed because runtime kernel selection could still route mixed turbo KV into the vec path.

This patch is intentionally narrow:
- it applies only to `gfx900`/`gfx906`

Requirements
PR 2 — stop installing tests by default
Overview
This PR changes `LLAMA_TESTS_INSTALL` to default to `OFF`.

Tests can still be built and installed explicitly, but installing them by default makes partial/package builds more fragile than necessary. In my case, `cmake --install` failed because install rules expected test binaries that were not built as part of the requested target set.

Turning test installation off by default keeps normal runtime/library installs cleaner and avoids unexpected install failures in downstream packaging/container builds.
Additional information
This does not remove test installation support. It only changes the default so that projects embedding or packaging llama.cpp do not need to opt out of installing test binaries they are not using.
Requirements