Add bitonic topk #3862


Open · wants to merge 98 commits into base: develop

Conversation

@pfultz2 (Collaborator) commented Mar 3, 2025

This implements a faster GPU topk.

  • Updated the ref version of topk to take a parameter for the indices, and also updated it to handle any layout.
  • Added a GPU bitonic topk version. This does a bitonic sort per wavefront and then a partial sort in shared memory to get the final topk values.
  • Added a rewrite_topk pass that splits large topks into 2 operators. This needs the indices to be passed along, as they won't be the same for one batch.
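The split in the last bullet can be sketched host-side like this. This is only an illustration of the idea, not the MIGraphX pass itself; `topk_with_indices` is a hypothetical helper standing in for a topk operator with index output:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not the MIGraphX API): top-k largest values of a
// range together with their source indices.
std::vector<std::pair<float, std::size_t>>
topk_with_indices(const std::vector<float>& x, std::size_t k)
{
    std::vector<std::pair<float, std::size_t>> pairs(x.size());
    for(std::size_t i = 0; i < x.size(); i++)
        pairs[i] = {x[i], i};
    std::partial_sort(pairs.begin(),
                      pairs.begin() + k,
                      pairs.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });
    pairs.resize(k);
    return pairs;
}

// Sketch of the split: topk over each chunk first, then a final topk over
// the concatenated partial results. The indices must be carried through
// because each chunk reorders its elements differently.
std::vector<std::pair<float, std::size_t>>
split_topk(const std::vector<float>& x, std::size_t k, std::size_t nchunks)
{
    std::size_t chunk = (x.size() + nchunks - 1) / nchunks;
    std::vector<std::pair<float, std::size_t>> partial;
    for(std::size_t c = 0; c < nchunks; c++)
    {
        std::size_t start = c * chunk;
        std::size_t stop  = std::min(start + chunk, x.size());
        if(start >= stop)
            break;
        std::vector<float> piece(x.begin() + start, x.begin() + stop);
        for(auto [v, i] : topk_with_indices(piece, std::min(k, piece.size())))
            partial.push_back({v, i + start}); // remap local to global index
    }
    // Second-stage topk over the partial winners, mapping back to the
    // original indices saved in the first stage.
    std::vector<float> vals;
    vals.reserve(partial.size());
    for(const auto& p : partial)
        vals.push_back(p.first);
    std::vector<std::pair<float, std::size_t>> out;
    for(auto [v, i] : topk_with_indices(vals, std::min(k, vals.size())))
        out.push_back({v, partial[i].second});
    return out;
}
```

The key point is the index remap in the first stage: the final topk selects among partial winners, so without the carried indices the result could not be mapped back to original positions.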

template <class T, class U>
constexpr bool float_equal(T x, U y)
{
if constexpr(is_integral<T>{} or is_integral<U>{})
Contributor

Shouldn't both T & U be of integral type?

@pfultz2 pfultz2 requested a review from Copilot April 1, 2025 15:19
@Copilot Copilot AI left a comment

Copilot reviewed 15 out of 35 changed files in this pull request and generated no comments.

Files not reviewed (20)
  • src/CMakeLists.txt: Language not supported
  • src/include/migraphx/op/topk.hpp: Language not supported
  • src/include/migraphx/raw_data.hpp: Language not supported
  • src/include/migraphx/rewrite_topk.hpp: Language not supported
  • src/include/migraphx/shape.hpp: Language not supported
  • src/include/migraphx/tensor_view.hpp: Language not supported
  • src/rewrite_reduce.cpp: Language not supported
  • src/rewrite_topk.cpp: Language not supported
  • src/shape.cpp: Language not supported
  • src/targets/gpu/jit/topk.cpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/algorithm.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/bit.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/dpp.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/float_equal.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/functional.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/index.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/integral_constant.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/math.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/operators.hpp: Language not supported
  • src/targets/gpu/kernels/include/migraphx/kernels/shape.hpp: Language not supported


constexpr uint64_t bit_ceil(uint64_t x) noexcept
{
if(x <= 1)
Contributor

This if clause can be removed for faster overall GPU performance.

Collaborator Author

I don't see a difference.

Contributor

The if clause is redundant here, and thus extra instructions for a GPU. Not sure what kind of difference you are looking for :-)
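As an aside, a self-contained sketch of a bit_ceil in this style (not necessarily the PR's exact code): with the `__builtin_clzll` formulation below, the `x <= 1` guard is actually load-bearing, since `x == 0` or `x == 1` would otherwise shift by 64, which is undefined behavior:

```cpp
#include <cstdint>

// Round up to the next power of two. The x <= 1 guard handles x == 0 and
// x == 1, where __builtin_clzll(x - 1) below would be called on 0
// (undefined) or produce a shift by 64 (also undefined).
constexpr std::uint64_t bit_ceil(std::uint64_t x) noexcept
{
    if(x <= 1)
        return 1;
    return std::uint64_t{1} << (64 - __builtin_clzll(x - 1));
}
```

Whether the guard is removable thus depends on the formulation; for a loop-based version it may genuinely be redundant.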

{
friend constexpr bool operator<(const topk_pair& x, const topk_pair& y)
{
if(not float_equal(x.key, y.key))
Contributor

float_equal does unnecessary calculations for the logic required below; it would be more efficient to just check x.key < y.key first.

This logic could be rewritten to be faster along these lines:

if(x.key < y.key) return true;
if(x.key > y.key) return false;
return x.val < y.val;

Collaborator Author

That doesn't really make a difference; most likely CSE removes the duplicate comparisons.

Contributor

Incorrect or duplicate code is best handled in the coding stage :-)

MIGRAPHX_ASSERT(trimmed_n <= n);

array<pair, nper_lane> local_buf;
for(index_int i : range(nper_lane))
Contributor

Why initialize the whole array when it is being (partially or completely) filled in later?

Collaborator Author

Because we can't use uninitialized values.

__shared__ pair buf[aligned_n];
// Copy to LDS
idx.local_stride(aligned_n, [&](auto i) {
auto key = i < x.get_shape().elements() ? x[i] : init;
Contributor

Is x.get_shape().elements() a constexpr?

Collaborator Author

Yes, but more importantly this returns an integral_constant, so it gets folded in the AST.
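The integral_constant behavior can be illustrated standalone with std::integral_constant (the MIGraphX kernels use their own implementation; `static_shape` and `padded_size` below are made-up names for illustration):

```cpp
#include <cstddef>
#include <type_traits>

// Mimics a shape type that carries its element count in the type, so that
// elements() returns an integral_constant rather than a runtime value.
template <std::size_t N>
struct static_shape
{
    static constexpr auto elements()
    {
        return std::integral_constant<std::size_t, N>{};
    }
};

// Because elements() returns an integral_constant, its value is available
// as a constant expression even through a deduced function parameter.
template <class Shape>
constexpr std::size_t padded_size(Shape)
{
    constexpr std::size_t n = decltype(Shape::elements())::value;
    return n + 1;
}
```

This is why the `i < x.get_shape().elements()` bound in the LDS copy can fold away at compile time.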

template <class T, class U>
struct topk_pair
: conditional_t<(sizeof(T) >= sizeof(U)), topk_pair_t_u<T, U>, topk_pair_u_t<T, U>>,
partially_ordered<topk_pair<T, U>>
Contributor

Incidentally, partially_ordered is an incorrect choice for topk_pair, which is actually a strongly ordered, or at least weakly ordered, type.


template <class T>
struct partially_ordered
{
Contributor

Shouldn't this struct also contain the primitives/operators that were defined for less_than_comparable? Is there a reason that > isn't defined in this case? I am not a fan of these kinds of preprocessor macros.
Regardless, this ordering type of partially_ordered isn't correct. We should be able to exactly compare two topk tuples.

Collaborator Author

That's because a < b != b > a when the keys are equal, because the order of the indices is always < even when the key comparison is >. It's probably better to remove the comparison operators and use a custom comparator instead, as this gets confusing.

@pfultz2 (Collaborator Author) commented Apr 8, 2025

So the stable sorting is 2.5x slower on some configs. We can recover some perf by lowering the split threshold.

Most of the perf cost comes from the wavefront sorting; as the number of elements to sort gets larger, it seems to increase the register pressure. For now, I think this could be merged, and I can investigate the perf issues in the future.

@@ -36,6 +36,7 @@ struct module;
/// Rewrite topk operators ideally to better performing operators
struct MIGRAPHX_EXPORT rewrite_topk
{
std::size_t split_threshold = 8192;
Contributor

I think these naked constants need better handling. For example, there is n_threshold, which is double this split_threshold. Better to derive it as n_threshold/2.

Collaborator Author

split_threshold sets the n_threshold. This is so we can set the threshold different from the default.

Collaborator Author

Let me set n_threshold to 0 in the constructor instead to avoid confusion here.

return [=](auto p1, auto p2) {
auto [x, i] = p1;
auto [y, j] = p2;
if(not float_equal(x, y))
Contributor

compare_pair should function along the lines of std::less(), and there is no reason to first compare using float_equal, fail that test, and then run a compare. The comparison should be for less_than or something like that in the first calculating step. This will make it logical and also more efficient. Thanks.

Collaborator Author
@pfultz2 commented Apr 14, 2025

This is doing a lexicographical-like comparison: if the first elements are not equal, then do the comparison of the first elements only, and if they are equal, do a comparison of the second elements. One difference is that the first elements are compared with the custom comparator that is passed in, and the second elements are always compared with < because the indices always have the same order regardless of what largest is set to.
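For illustration, a comparator of this shape could be written as follows. This is a sketch only, with float_equal stubbed as an exact comparison (the real one is wider), and the compare_pair body simplified:

```cpp
#include <utility>

// Simplified stand-in for float_equal: exact comparison only.
constexpr bool float_equal_simple(float x, float y)
{
    return not(x < y or x > y);
}

// Sketch of compare_pair: keys use the supplied comparator (e.g. > for
// largest), while tied keys always fall back to ascending indices.
template <class Compare>
constexpr auto compare_pair(Compare compare)
{
    return [=](auto p1, auto p2) {
        auto [x, i] = p1;
        auto [y, j] = p2;
        if(not float_equal_simple(x, y))
            return compare(x, y);
        return i < j;
    };
}
```

With `compare` set to `>`, largest-first order still breaks ties by ascending source index, which is the asymmetry being described.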

Contributor

constexpr bool float_equal(T x, U y)
{
    if constexpr(is_integral<T>{} and is_integral<U>{})
        return x == y;
    return not(x < y or x > y);
}

Consider the first case, when x1 < y1: the above float_equal call would first compare <, and then do a follow-up compare(), which is another std::less(). This should be just one comparison, not two steps. The above code translates to: if(x1 < y1) return std::less(x1, y1). All that was required was a single comparison.

In the second case, when x1 > y1, this code translates to if(x1 < y1 or x1 > y1) return std::less(x1, y1). This has two extra comparisons.

So the lambda could be written along these lines (no need for float_equal()):

if(x < y) return true;
if(x > y) return false;
return (i < j);

Collaborator Author

float_equal doesn't do x < y, it does std::nextafter(x, std::numeric_limits<T>::lowest()) <= y and std::nextafter(x, std::numeric_limits<T>::max()) >= y.

So the lambda could be written along these lines:

No it can't, because we aren't doing x < y, we are doing compare(x, y).

This is for the ref version anyway, so I would prefer to keep it simple and straightforward.
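For reference, the ulp-widened equality described above can be sketched as a host-side illustration (not the kernel's float_equal itself):

```cpp
#include <cmath>
#include <limits>

// Equality widened by one ulp in each direction: y counts as equal to x
// if it lies within [nextafter(x, lowest), nextafter(x, max)].
template <class T>
bool float_equal_ulp(T x, T y)
{
    return std::nextafter(x, std::numeric_limits<T>::lowest()) <= y and
           std::nextafter(x, std::numeric_limits<T>::max()) >= y;
}
```

This is why replacing it with plain `<`/`>` comparisons is not an exact rewrite: values one ulp apart compare equal here but not under `not(x < y or x > y)`.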

@migraphx-bot (Collaborator)

| Test | Batch | Rate new (6915ef) | Rate old (ecb974) | Diff | Compare |
|------|------:|------------------:|------------------:|-----:|---------|
| torchvision-resnet50 | 64 | 3,253.49 | 3,231.64 | 0.68% | |
| torchvision-resnet50_fp16 | 64 | 6,897.82 | 6,867.39 | 0.44% | |
| torchvision-densenet121 | 32 | 2,442.88 | 2,432.50 | 0.43% | |
| torchvision-densenet121_fp16 | 32 | 4,239.06 | 4,212.33 | 0.63% | |
| torchvision-inceptionv3 | 32 | 1,623.24 | 1,613.30 | 0.62% | |
| torchvision-inceptionv3_fp16 | 32 | 2,707.12 | 2,696.30 | 0.40% | |
| cadene-inceptionv4 | 16 | 754.25 | 750.25 | 0.53% | |
| cadene-resnext64x4 | 16 | 814.37 | 809.71 | 0.58% | |
| slim-mobilenet | 64 | 6,690.65 | 6,654.55 | 0.54% | |
| slim-nasnetalarge | 64 | 197.39 | 203.03 | -2.78% | |
| slim-resnet50v2 | 64 | 3,453.13 | 3,434.71 | 0.54% | |
| bert-mrpc-onnx | 8 | 1,150.68 | 1,142.05 | 0.76% | |
| bert-mrpc-tf | 1 | 475.59 | 464.19 | 2.46% | |
| pytorch-examples-wlang-gru | 1 | 476.32 | 476.27 | 0.01% | |
| pytorch-examples-wlang-lstm | 1 | 437.02 | 442.88 | -1.32% | |
| torchvision-resnet50_1 | 1 | 808.35 | 813.23 | -0.60% | |
| cadene-dpn92_1 | 1 | 423.61 | 421.24 | 0.56% | |
| cadene-resnext101_1 | 1 | 393.47 | 392.62 | 0.22% | |
| onnx-taau-downsample | 1 | 397.19 | 395.87 | 0.33% | |
| dlrm-criteoterabyte | 1 | 31.95 | 31.80 | 0.46% | |
| dlrm-criteoterabyte_fp16 | 1 | 51.03 | 50.96 | 0.13% | |
| agentmodel | 1 | 9,042.00 | 9,458.32 | -4.40% | 🔴 |
| unet_fp16 | 2 | 58.61 | 58.33 | 0.48% | |
| resnet50v1_fp16 | 1 | 1,050.42 | 1,071.23 | -1.94% | |
| resnet50v1_int8 | 1 | 857.26 | 893.66 | -4.07% | 🔴 |
| bert_base_cased_fp16 | 64 | 1,171.05 | 1,162.33 | 0.75% | |
| bert_large_uncased_fp16 | 32 | 363.50 | 353.92 | 2.71% | |
| bert_large_fp16 | 1 | 201.73 | 194.83 | 3.55% | 🔆 |
| distilgpt2_fp16 | 16 | 2,230.94 | 2,215.11 | 0.71% | |
| yolov5s | 1 | 515.35 | 543.55 | -5.19% | 🔴 |
| tinyllama | 1 | 43.85 | 43.59 | 0.60% | |
| vicuna-fastchat | 1 | 44.06 | 44.05 | 0.02% | |
| whisper-tiny-encoder | 1 | 412.78 | 411.57 | 0.29% | |
| whisper-tiny-decoder | 1 | 411.53 | 411.31 | 0.05% | |
| llama2_7b | 1 | nan | nan | nan% | |
| qwen1.5-7b | 1 | 23.56 | 23.41 | 0.62% | |
| phi3-3.8b | 1 | nan | nan | nan% | |
| mask-rcnn | 1 | 21.04 | 18.55 | 13.44% | 🔆 |
| llama3-8b | 1 | 21.73 | 21.65 | 0.37% | |
| whisper-large-encoder | 1 | 10.22 | 10.17 | 0.49% | |
| whisper-large-decoder | 1 | 98.18 | 97.78 | 0.41% | |
| mistral-7b | 1 | 23.76 | 23.63 | 0.51% | |
| FLUX.1-schnell | 1 | 893.26 | 904.60 | -1.25% | |
| nan | nan | nan | nan | nan% | |

This build is not recommended to merge 🔴

@migraphx-bot (Collaborator)


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

❌ llama2_7b: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:264: parse_from: PARSE_FROM: Failed reading onnx file: /new-saved-models/llama2_7b/decoder_model.onnx


❌ qwen1.5-7b: ERROR - check error output
usage: accuracy_checker.py [-h] [--onnx ONNX] [--tf TF] [--provider PROVIDER]
[--batch BATCH] [--fill1] [--fill0] [--fp16]
[--argmax] [--verbose] [--tolerance TOLERANCE]
[--input-dim INPUT_DIM] [--target TARGET]
[--ort-run] [--ort-logging]
[--disable-offload-copy] [--disable-fast-math]
[--exhaustive_tune]
accuracy_checker.py: error: unrecognized arguments: input_ids attention_mask position_ids 1 256 @attention_mask 1 256 @position_ids 1 256


❌ phi3-3.8b: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/onnx/onnx_parser.cpp:264: parse_from: PARSE_FROM: Failed reading onnx file: /new-saved-models/phi3-3.8b/model.onnx


🔴 mask-rcnn: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ llama3-8b: PASSED: MIGraphX meets tolerance

❌ whisper-large-encoder: ERROR - check error output
Traceback (most recent call last):
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 340, in
main()
File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 205, in main
model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /src/AMDMIGraphX/src/include/migraphx/op/convolution.hpp:100: normalize_compute_shape: CONVOLUTION: mismatched channel numbers


     ✅ whisper-large-decoder: PASSED: MIGraphX meets tolerance

     ✅ mistral-7b: PASSED: MIGraphX meets tolerance

     ✅ FLUX.1-schnell: PASSED: MIGraphX meets tolerance

6 participants