using rocFFT in an OpenCL application #120

pszi1ard · 2018-04-10T13:49:55Z

What is the expected behavior

I would expect the ROCm platform to provide an OpenCL API for the rocFFT library.

What actually happens

Only HIP bindings seems to be available

How to reproduce

Try to use rocFFT with OpenCL

Environment

| Hardware | description |
Any

Software	version
ROCK	1.7.137
ROCR	1.1.7-12-gf0de514
HCC	1.2.18063
Library	git master

Note that this is a critical dependency for our work on bringing feature-parity with CUDA in the next GROMACS release.

bragadeesh · 2018-04-10T23:49:48Z

At present, there are no opencl bindings for rocFFT, because it requires some sort of interop functionality between HIP and OpenCL. We don't have that. There's no translation from HIP created memory buffer to a cl_mem object and vice-versa, for example.

The way I see it, there are 2 options.

Since you have CUDA based code, you could try to hipify that and use HIP as solution on AMD. This way you could switch to using rocFFT; rocFFT has a hipFFT interface similar to cuFFT.
If you want to keep the opencl infrastructure, clFFT is your best bet, but it is in maintenance mode with known failures; you can try using it on the rocm stack and let us know what doesn't work for you

whchung · 2018-04-11T13:54:07Z

@gstoner for awareness

@pszi1ard @bragadeesh I'd like to point out it's not entirely impossible, but does require some work in existing ROCm stack. To gain enough attention, perhaps the better place for the ticket shall be at: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime

No matter which programming languages you use for GPU computing on AMD hardware, on ROCm platform eventually they are compiled as "HSA code objects". HIP runtime has no trouble loading kernels compiled by OpenCL compiler, as long as the kernel arguments are properly prepared to match API and ABI requirements asked by HIP runtime. Basically that's how ROCm ports of DNN frameworks such as Caffe / TensorFlow / PyTorch / Caffe2 MXNet / CNTK are made possible to load & run computation kernels in MIOpen, which are mostly written in OpenCL.

On the other hand, as @pszi1ard pointed out, we probably need to extend OpenCL runtime on ROCm be able to load, prepare kernel arguments per HIP application ABI, and invoke them.

pszi1ard · 2018-04-11T14:34:57Z

Thanks for the quick feedback!

To the two options @bragadeesh suggested:

Can't/won't do; we want standards-based portable code, that's why we're working on OpenCL.
clFFT is broken on Vega, and has questionable performance (from the benchmarks I've seen before).

@whchung

To gain enough attention, perhaps the better place for the ticket shall be at:
https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime

What should the ticket state? I assumed filing an issue/RFE against rocFFT is the right thing as it's rocFFT that should have the bindings to take cl_mem objects as arguments.

To @whchung 's further points: it does not help us what the platform compiles code to because we want to develop the application, not the compilers/toolchain and we want to use OpenCL. From that pov, while it's just a nuance, what would be ideal is if rocFFT actually had an OpenCL API rather than the OpenCL runtime had some special HIP-capable/compatible extensions.

whchung · 2018-04-11T15:12:39Z

@pszi1ard What I proposed was to retrieve kernels within rocfft and load / launch it with clCreateProgramWithBinary, that's possible with perhaps just a few tweaks in https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime , so I proposed to raise the ticket there.

Since it seems your desired goal is to change rocfft API so it takes cl_mem objects as arguments. That nullifies my proposal and I'll leave that for @bragadeesh and @gstoner to decide the priority for that.

tingxingdong · 2018-04-11T15:12:59Z

Can't/won't do; we want standards-based portable code, that's why we're working on OpenCL.

If you only care about AMD and NVIDIA GPU for the "portable code" here, I mean not Intel, FPGA, etc. You can hipify your code. You hip code will run on Nvidia. You do not need to maintain a CUDA version.
The HIP code will automatically call CUDA for you. But, you must have the source code.

https://gpuopen.com/hip-to-be-squared-an-introductory-hip-tutorial/

====================
like here, you only maintain *.cpp HIP code. HIP run it on NVIDIA Geforce Titan
TITAN1:~/ben/hip/samples/square$ hipcc square.cpp -o square.hip.out

TITAN1:~/ben/hip/samples/square$ ./square.hip.out
info: running on device GeForce GTX TITAN X
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

====================
The identical *.cpp code recompile and run it on AMD Fiji, as well.

Fiji1:~/hip/samples/square$ hipcc square.cpp -o square.hip.out

Fiji1:~/hip/samples/square$ ./square.hip.out
info: running on device Fiji
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB))
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

pszi1ard · 2018-04-11T18:10:46Z

@whchung Thanks for clarifying; actually, it was not entirely clear what you suggested, but now that I understand it better, I do think using clCreateProgramWithBinary to load precompiled rocfft kernels could be a sensible route too -- especially if that allows earlier OpenCL support. Hence, I'll file an RFE.

As a side-note, the main question is whether in the short run we should just hope that clFFT gets fixed (for Vega) and it is not too bad in performance or whether there is a chance there'll be some form of rocFFT support that will also be competitive in performance -- I know this is a broader question, but this is the original question that led me here. Note that we plan to release GROMACS code that would rely on FFTs this fall (and we hope that it will be more than just functional, but also competitive).

@tingxingdong Short answer: we don't want to hipify as, to be frank, before there is no major tractions around HIP, it's just technical debt that we'd be adding to our code-base. Additionally, we want portability not just to NVIDIA and AMD.

bragadeesh · 2018-04-11T23:17:08Z

@pszi1ard just to be clear, what @whchung suggesting is not directly usable by you. If we get such a support, then there is considerable rework that needs to be done in the rocFFT library to support a opencl interface. For you to use @whchung idea directly, you would have to take single 1D kernels and do all transposing and copying of data, essentially writing about half of fft functionality yourself.

We are discussing this internally on what is best way forward, we will let you know.

Can you give more info on the problems you are interested in? Is it all 3D FFT? single precision? real or complex? what factors for the sizes (pow 2,3,5 etc)?

yupinov · 2018-04-12T11:09:47Z

Hi @bragadeesh and everyone,
I'm working on the Gromacs OpenCL implementation together with @pszi1ard.
What we are interested in is indeed, 3D FFT, real to and from hermitian interleaved, single precision. Each dimension can realistically be from 24 up to 192. We can scale the grid dimensions, so the large prime factor support is not very important, while nice to have. Here is the issue I filed recently against clFFT, asking about the rocm support status. clMathLibraries/clFFT#218

pszi1ard · 2018-04-12T11:49:46Z

@bragadeesh Thanks for the correction -- I should have realized myself that a 3D FFT (typically) computation consist of more than just a single kernel invocation, so it won't be as easy as loading a cl kernel from a binary for the full 3D transform.

That said, depending on how much effort it is and how much performance benefit it brings, I it might be worth for you to provide fused single-kernel small 3D transforms. In our experience from other platforms, the overheads involved the large-transform optimized multi-kernel 3D transforms seem so high, that moderately optimized fused 3D kernels (e.g. for factors 2 or 2/3) could end up being a lot faster.

Additionally, on the longer term we would definitely consider rolling our own 3D transforms based on the 1D FFT kernels, but I think these would need to be device-side callable for it to be worth it (considering kernel launch overheads and that we could overlap our grid generation with FFTs).

gstoner · 2018-04-12T14:07:52Z

Let’s build open Rav for what you want I have section in github for this. You write in markdown Get Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Szilárd Páll <[email protected]> Sent: Thursday, April 12, 2018 6:49:47 AM To: ROCmSoftwarePlatform/rocFFT Cc: Gregory Stoner; Assign Subject: Re: [ROCmSoftwarePlatform/rocFFT] using rocFFT in an OpenCL application (#120) @bragadeesh<https://github.com/bragadeesh> Thanks for the correction -- I should have realized myself that a 3D FFT (typically) computation consist of more than just a single kernel invocation, so it won't be as easy as loading a cl kernel from a binary for the full 3D transform. That said, depending on how much effort it is and how much performance benefit it brings, I it might be worth for you to provide fused single-kernel small 3D transforms. In our experience from other platforms, the overheads involved the large-transform optimized multi-kernel 3D transforms seem so high, that moderately optimized fused 3D kernels (e.g. for factors 2 or 2/3) could end up being a lot faster. Additionally, on the longer term we would definitely consider rolling our own 3D transforms based on the 1D FFT kernels, but I think these would need to be device-side callable for it to be worth it (considering kernel launch overheads and that we could overlap our grid generation with FFTs). — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub<#120 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AD8DuauUS1yHiQctmc3ze_WIUodJqGylks5tnz9bgaJpZM4TOS9k>.

pszi1ard · 2018-04-13T09:30:18Z

Let’s build open Rav for what you want I have section in github for this. You write in markdown

Can you please clarify what you mean? Do you want me to write down something? What is "open Rav"?

gstoner · 2018-04-13T13:23:29Z

IOS changed RFQ to RAV. I asking we build out RFQ-SRS for what you need.

gstoner · 2018-04-13T13:25:28Z

Here is the RFC template and place to manage them.
https://github.com/RadeonOpenCompute/rfcs

pszi1ard · 2018-05-15T13:35:44Z

@gstoner I filed and RFC pull request, also noted some meta-stuff about the RFC format there.

Is there still a point to file a separate bug report or RFC for the OpenCL runtime feature @whchung suggested?

pszi1ard · 2018-08-24T14:48:44Z

Ping. Quite some time has passed and I've yet to receive feedback here or on the RFC.

bragadeesh · 2018-08-28T16:54:31Z

@pszi1ard
we don't have any update on the addition of cl interfaces for rocFFT; as I mentioned before it requires a lot of plumbing in the stack which we don't have consensus on yet
we did put resources to get clFFT compiler issues fixed and it is progressing

pszi1ard · 2018-08-29T08:54:53Z

Thanks for the update!

we did put resources to get clFFT compiler issues fixed and it is progressing

Do you have a release ETA?

bragadeesh · 2018-08-29T22:58:41Z

Unfortunately do not have timeline, other than to say clFFT validation on rocm platform is getting attention, what hardware do you plan to use with rocm?

pszi1ard · 2018-08-29T23:06:33Z

We use RX 560s in CI and do development/testin on Vega (and Fiji).

I was however hoping to recommend ROCm to our users as the recommended platform for our next release (ETA ~end of 2018), but for that we'd need a stable, if not performant FFT library. From that point of view, it would be great if all ROCm-supported hardware was at least validated / correct with clFFT.

psteinb · 2019-01-10T13:11:50Z

Hi to all, just stumbled upon this issue and was wondering if @pszi1ard could make the earlier statement

clFFT is broken on Vega, and has questionable performance (from the benchmarks I've seen before).
more precise.

What we've seen so far, is that clfft under ROCM on vega64 works decently:

I hope the screenshot contains most needed details on the benchmark @tdd11235813 did.

pszi1ard · 2019-02-18T09:41:04Z

@psteinb Last time I checked (with ROCm 2.0) there were still failing regression tests, see clMathLibraries/clFFT#218

In terms of performance, I'm doubtful the performance is competitive with the state of the art. It may be that clFFT is slower on the GV100 is slower, but that's the wrong comparison IMHO, In this particular case the right comparison is cuFFT which is a lot faster (up to 5x in the small 3D transforms regime we care about).

pszi1ard · 2019-04-25T11:04:03Z

@bragadeesh any updates? can we expect any changes on either clFFT or rocFFT in the foreseeable future? Performance with clFFT is still very poor an in fact it seems to be regressing [1];

GROMACS regression in ROCm 2.3 ROCm#773

bragadeesh · 2019-04-30T00:54:35Z

@pszi1ard on the rocFFT side supporting opencl interface is not getting high priority at this time; and clFFT not actively developed; are you still locked to opencl? is HIP an option? Let me explore what can be done.

OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before, if you can point me to relevant sizes of interest, that would be helpful.

pszi1ard · 2019-04-30T11:04:19Z

@pszi1ard on the rocFFT side supporting opencl interface is not getting high priority at this time;

@bragadeesh quite unhappy to hear.

and clFFT not actively developed;

That is what I've implied based on the level of activity. Is there no community interest either -- as far as you know?

are you still locked to opencl? is HIP an option? Let me explore what can be done.

Short answer: Yes /No(t really(

No, GROMACS is not "locked in", on the contrary, we are choosing open standards-based programming models and given our limited resources, and especially when it comes to hardware that has negligible use in our user-base, we can't invest in proprietary stacks.

BTW if there is easy OpenCL - HIP interop, we have quite modular code and could plug in HIP-based FFTs into the application (this is all we need: https://github.com/gromacs/gromacs/blob/master/src/gromacs/ewald/pme_gpu_3dfft_ocl.cpp). However, realistically, if we are to get something better for AMD GPUs before the ~2021 timeframe, we need something soon (before mid-September in time for our 2020 release freeze).

OTOH, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before, if you can point me to relevant sizes of interest, that would be helpful.

Sure, briefly this is what we need: R2C / C2R, float, 3D transforms, data resident on GPU (grids generated by preceding kernel). Sizes anywhere between 64-256 per dim (not only power of two) most commonly, less commonly <32 or >256; we do filter out "nasty" factors and can tweak grid size if there is a known heuristic to apply (also see the above linked file).

Let me know if you have thoughts on how to proceed.

bragadeesh · 2019-04-30T19:53:12Z

@feizheng10 @malcolmroberts
please note the sizes of interest, we can discuss offline on opencl interface

doctorcolinsmith · 2022-08-22T16:55:17Z

Closing due to no new activity.

pszi1ard · 2022-08-29T14:30:34Z

@doctorcolinsmith Can you please clarify what exactly do you mean? Interop with OpenCL is a major shortcoming of the ROCm libraries not just rocFFT.

Closgin due to no activity is quite unclear, should we interpret this as a "wontfix"? i.e. does this mean that AMD has no intentions to support rocFFT (or ROCm libs in general) from OpenCL?

whchung assigned gstoner Apr 11, 2018

pszi1ard mentioned this issue Apr 11, 2018

[RFE] allow using rocFFT kernels with clCreateProgramWithBinary ROCm/ROCm-OpenCL-Runtime#53

Open

bragadeesh mentioned this issue Aug 28, 2018

Vega/RocM support status clMathLibraries/clFFT#218

Open

psteinb unassigned gstoner Jan 10, 2019

skyreflectedinmirrors mentioned this issue Jun 10, 2020

Initial HIP Platform implementation for AMD GPUs on ROCm openmm/openmm#2736

Closed

doctorcolinsmith closed this as completed Aug 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using rocFFT in an OpenCL application #120

using rocFFT in an OpenCL application #120

pszi1ard commented Apr 10, 2018 •

edited

Loading

bragadeesh commented Apr 10, 2018 •

edited

Loading

whchung commented Apr 11, 2018

pszi1ard commented Apr 11, 2018

whchung commented Apr 11, 2018

tingxingdong commented Apr 11, 2018 •

edited

Loading

pszi1ard commented Apr 11, 2018

bragadeesh commented Apr 11, 2018

yupinov commented Apr 12, 2018 •

edited

Loading

pszi1ard commented Apr 12, 2018

gstoner commented Apr 12, 2018 via email

pszi1ard commented Apr 13, 2018

gstoner commented Apr 13, 2018

gstoner commented Apr 13, 2018

pszi1ard commented May 15, 2018

pszi1ard commented Aug 24, 2018

bragadeesh commented Aug 28, 2018

pszi1ard commented Aug 29, 2018

bragadeesh commented Aug 29, 2018

pszi1ard commented Aug 29, 2018

psteinb commented Jan 10, 2019

pszi1ard commented Feb 18, 2019

pszi1ard commented Apr 25, 2019

bragadeesh commented Apr 30, 2019

pszi1ard commented Apr 30, 2019

bragadeesh commented Apr 30, 2019

doctorcolinsmith commented Aug 22, 2022

pszi1ard commented Aug 29, 2022

using rocFFT in an OpenCL application #120

using rocFFT in an OpenCL application #120

Comments

pszi1ard commented Apr 10, 2018 • edited Loading

What is the expected behavior

What actually happens

How to reproduce

Environment

bragadeesh commented Apr 10, 2018 • edited Loading

whchung commented Apr 11, 2018

pszi1ard commented Apr 11, 2018

whchung commented Apr 11, 2018

tingxingdong commented Apr 11, 2018 • edited Loading

pszi1ard commented Apr 11, 2018

bragadeesh commented Apr 11, 2018

yupinov commented Apr 12, 2018 • edited Loading

pszi1ard commented Apr 12, 2018

gstoner commented Apr 12, 2018 via email

pszi1ard commented Apr 13, 2018

gstoner commented Apr 13, 2018

gstoner commented Apr 13, 2018

pszi1ard commented May 15, 2018

pszi1ard commented Aug 24, 2018

bragadeesh commented Aug 28, 2018

pszi1ard commented Aug 29, 2018

bragadeesh commented Aug 29, 2018

pszi1ard commented Aug 29, 2018

psteinb commented Jan 10, 2019

pszi1ard commented Feb 18, 2019

pszi1ard commented Apr 25, 2019

bragadeesh commented Apr 30, 2019

pszi1ard commented Apr 30, 2019

bragadeesh commented Apr 30, 2019

doctorcolinsmith commented Aug 22, 2022

pszi1ard commented Aug 29, 2022

pszi1ard commented Apr 10, 2018 •

edited

Loading

bragadeesh commented Apr 10, 2018 •

edited

Loading

tingxingdong commented Apr 11, 2018 •

edited

Loading

yupinov commented Apr 12, 2018 •

edited

Loading