using rocFFT in an OpenCL application #120
Comments
At present, there are no OpenCL bindings for rocFFT, because that would require some sort of interop functionality between HIP and OpenCL, which we don't have. There is no translation from a HIP-created memory buffer to a cl_mem object and vice versa, for example. The way I see it, there are two options.
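For context, here is a minimal sketch (not from the thread; the 64³ size and the data are illustrative) of how rocFFT is normally driven from HIP. The point is that every buffer rocFFT sees is a raw device pointer from hipMalloc(); there is no entry point that accepts a cl_mem, which is exactly the interop gap described above.

```cpp
// Minimal illustrative sketch: driving rocFFT from HIP.
// Every buffer is a raw HIP device pointer from hipMalloc(); there is no call
// that accepts an OpenCL cl_mem. Sizes and data are illustrative only.
#include <hip/hip_runtime.h>
#include <rocfft.h>
#include <complex>
#include <vector>

int main()
{
    rocfft_setup();

    const size_t lengths[3] = {64, 64, 64};  // illustrative 3D size
    const size_t n = lengths[0] * lengths[1] * lengths[2];

    std::vector<std::complex<float>> host(n, {1.0f, 0.0f});
    void* device = nullptr;
    hipMalloc(&device, n * sizeof(std::complex<float>));
    hipMemcpy(device, host.data(), n * sizeof(std::complex<float>), hipMemcpyHostToDevice);

    rocfft_plan plan = nullptr;
    rocfft_plan_create(&plan, rocfft_placement_inplace,
                       rocfft_transform_type_complex_forward,
                       rocfft_precision_single,
                       3, lengths, 1, nullptr);

    rocfft_execute(plan, &device, nullptr, nullptr);  // in-place: output == input

    hipMemcpy(host.data(), device, n * sizeof(std::complex<float>), hipMemcpyDeviceToHost);

    rocfft_plan_destroy(plan);
    hipFree(device);
    rocfft_cleanup();
    return 0;
}
```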
@pszi1ard @bragadeesh I'd like to point out that it's not entirely impossible, but it does require some work in the existing ROCm stack. To gain enough attention, perhaps the better place for the ticket would be https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime. No matter which programming language you use for GPU computing on AMD hardware, on the ROCm platform it is eventually compiled into "HSA code objects". On the other hand, as @pszi1ard pointed out, we would probably need to extend the OpenCL runtime.
Thanks for the quick feedback! Regarding the two options @bragadeesh suggested:
What should the ticket state? I assumed filing an issue/RFE against rocFFT was the right thing, as it is rocFFT that should have the bindings to take cl_mem buffers. To @whchung's further points: it does not help us to know what the platform compiles code into, because we want to develop the application, not the compilers/toolchain, and we want to use OpenCL. From that point of view, while it is just a nuance, what would be ideal is for rocFFT to actually have an OpenCL API, rather than for the OpenCL runtime to have some special HIP-capable/compatible extensions.
@pszi1ard What I proposed was to retrieve the kernels from within rocFFT and load them on the OpenCL side. Since it seems your desired goal is to change rocFFT itself to expose an OpenCL API, that is a bigger change than what I suggested.
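To illustrate the mechanics of that proposal, here is a minimal sketch of loading a pre-built kernel binary from an OpenCL application. Whether a kernel extracted from rocFFT (an HSA code object) is actually accepted by clCreateProgramWithBinary on the ROCm OpenCL runtime is exactly the open interop question; the file name and kernel name below are hypothetical.

```cpp
// Sketch of the "load an extracted kernel in OpenCL" idea. The binary file name
// and kernel name are hypothetical; whether a rocFFT/HSA code object is accepted
// by a given ROCm OpenCL runtime is the open question discussed in this thread.
#include <CL/cl.h>
#include <fstream>
#include <vector>

cl_kernel loadPrebuiltKernel(cl_context ctx, cl_device_id dev)
{
    std::ifstream f("fft_1d_len64.bin", std::ios::binary);  // hypothetical code object
    std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());

    const unsigned char* bins[] = { bin.data() };
    size_t lengths[] = { bin.size() };
    cl_int status = CL_SUCCESS, binStatus = CL_SUCCESS;

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, lengths, bins,
                                                &binStatus, &status);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);

    return clCreateKernel(prog, "fft_fwd_len64", &status);  // hypothetical kernel name
}
```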
"Can't/won't do; we want standards-based portable code, that's why we're working on OpenCL."

If you only care about AMD and NVIDIA GPUs for the "portable code" here (i.e. not Intel, FPGAs, etc.), you can hipify your code. Your HIP code will run on NVIDIA, so you do not need to maintain a separate CUDA version. See https://gpuopen.com/hip-to-be-squared-an-introductory-hip-tutorial/

The same square sample builds and runs on both an NVIDIA (TITAN) and an AMD (Fiji) machine:

TITAN1:~/ben/hip/samples/square$ ./square.hip.out

Fiji1:~/hip/samples/square$ hipcc square.cpp -o square.hip.out
Fiji1:~/hip/samples/square$ ./square.hip.out
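For readers who have not seen HIP code, here is a minimal sketch of the kind of kernel the square sample demonstrates (illustrative only, not the actual square.cpp from the tutorial); the same source compiles with hipcc against both the ROCm and CUDA back ends.

```cpp
// Illustrative HIP example (not the actual square.cpp): one source, two back ends.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void square(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1024;
    std::vector<float> h(n, 3.0f);
    float *dIn = nullptr, *dOut = nullptr;
    hipMalloc((void**)&dIn, n * sizeof(float));
    hipMalloc((void**)&dOut, n * sizeof(float));
    hipMemcpy(dIn, h.data(), n * sizeof(float), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(square, dim3(n / 256), dim3(256), 0, 0, dOut, dIn, n);

    hipMemcpy(h.data(), dOut, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);  // expect 9.0

    hipFree(dIn);
    hipFree(dOut);
    return 0;
}
```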
@whchung Thanks for clarifying; actually, it was not entirely clear what you suggested, but now that I understand it better, I do think using the extracted kernels could be an option worth exploring.

As a side note, the main question is whether in the short run we should just hope that clFFT gets fixed (for Vega) and that its performance is not too bad, or whether there is a chance there will be some form of rocFFT support that will also be competitive in performance. I know this is a broader question, but it is the original question that led me here. Note that we plan to release GROMACS code that would rely on these FFTs this fall (and we hope it will be more than just functional, but also competitive).

@tingxingdong Short answer: we don't want to hipify because, to be frank, until there is major traction around HIP, it is just technical debt that we would be adding to our code base. Additionally, we want portability beyond just NVIDIA and AMD.
@pszi1ard Just to be clear, what @whchung is suggesting is not directly usable by you. If we get such support, there is still considerable rework that needs to be done in the rocFFT library to support an OpenCL interface. For you to use @whchung's idea directly, you would have to take the single 1D kernels and do all the transposing and copying of data yourself, essentially writing about half of the FFT functionality on your own. We are discussing internally what the best way forward is, and we will let you know. Can you give more info on the problems you are interested in? Is it all 3D FFTs? Single precision? Real or complex? What factors for the sizes (powers of 2, 3, 5, etc.)?
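To make concrete what "doing the transposes and copies yourself" would involve, here is a rough host-side sketch of a 3D transform decomposed into three batched 1D passes with a transpose between each. Every helper below is a hypothetical, stubbed-out placeholder, not a rocFFT or clFFT API; each one is something the application would have to implement and tune itself.

```cpp
// Rough sketch of rolling a 3D FFT from batched 1D kernels. Every helper here is
// a hypothetical placeholder (stubbed out) that the application would have to
// write itself: a batched 1D FFT kernel, a transpose kernel, scratch buffers.
#include <cstddef>

struct DeviceBuffer {};  // stand-in for a device allocation (cl_mem or HIP pointer)

// Hypothetical: batched 1D complex FFT along the contiguous (fastest) dimension.
void fft1dBatched(DeviceBuffer& /*data*/, size_t /*length*/, size_t /*batch*/) {}
// Hypothetical: rotate the layout so the next dimension becomes contiguous;
// sizes are given in the source buffer's current (fastest, middle, slowest) order.
void transpose(DeviceBuffer& /*dst*/, const DeviceBuffer& /*src*/,
               size_t /*nFast*/, size_t /*nMid*/, size_t /*nSlow*/) {}

void fft3d(DeviceBuffer& data, DeviceBuffer& scratch, size_t nx, size_t ny, size_t nz)
{
    fft1dBatched(data, nx, ny * nz);       // pass 1: FFTs along x (contiguous)

    transpose(scratch, data, nx, ny, nz);  // make y contiguous
    fft1dBatched(scratch, ny, nz * nx);    // pass 2: FFTs along y

    transpose(data, scratch, ny, nz, nx);  // make z contiguous
    fft1dBatched(data, nz, nx * ny);       // pass 3: FFTs along z

    // A final transpose may be needed to restore the original layout, and
    // real-to-complex input adds pre/post-processing on top of all this.
}
```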
Hi @bragadeesh and everyone,
@bragadeesh Thanks for the correction -- I should have realized myself that a 3D FFT computation (typically) consists of more than just a single kernel invocation, so it won't be as easy as loading a cl kernel from a binary for the full 3D transform. That said, depending on how much effort it is and how much performance benefit it brings, it might be worth it for you to provide fused single-kernel small 3D transforms. In our experience from other platforms, the overheads involved in the multi-kernel 3D transforms optimized for large sizes seem so high that moderately optimized fused 3D kernels (e.g. for factors 2 or 2/3) could end up being a lot faster. Additionally, in the longer term we would definitely consider rolling our own 3D transforms based on the 1D FFT kernels, but I think these would need to be device-side callable for it to be worth it (considering kernel launch overheads and that we could overlap our grid generation with the FFTs).
Let's build open Rav for what you want. I have a section on GitHub for this. You write it in markdown.
Can you please clarify what you mean? Do you want me to write something down? What is "open Rav"?
iOS changed "RFQ" to "Rav". I'm asking that we build out an RFQ-SRS for what you need.
Here is the RFC template and the place to manage them.
Ping. Quite some time has passed and I've yet to receive feedback here or on the RFC.
@pszi1ard |
Thanks for the update!
Do you have a release ETA? |
Unfortunately, we do not have a timeline, other than to say that clFFT validation on the ROCm platform is getting attention. What hardware do you plan to use with ROCm?
We use RX 560s in CI and do development/testing on Vega (and Fiji). I was, however, hoping to recommend ROCm to our users as the preferred platform for our next release (ETA ~end of 2018), but for that we'd need a stable, if not performant, FFT library. From that point of view, it would be great if all ROCm-supported hardware were at least validated/correct with clFFT.
Hi to all, I just stumbled upon this issue and was wondering if @pszi1ard could make the earlier statement more concrete.
What we've seen so far is that clFFT under ROCm on a Vega 64 works decently:
@psteinb Last time I checked (with ROCm 2.0), there were still failing regression tests, see clMathLibraries/clFFT#218. In terms of performance, I'm doubtful it is competitive with the state of the art. It may be that clFFT on the GV100 is slower, but that's the wrong comparison IMHO; in this particular case the right comparison is cuFFT, which is a lot faster (up to 5x in the small 3D transform regime we care about).
@bragadeesh Any updates? Can we expect any changes on either clFFT or rocFFT in the foreseeable future? Performance with clFFT is still very poor and in fact it seems to be regressing [1].
@pszi1ard On the rocFFT side, supporting an OpenCL interface is not getting high priority at this time, and clFFT is not actively developed. Are you still locked to OpenCL? Is HIP an option? Let me explore what can be done. On another note, can you describe the 3D FFTs and sizes you are looking for? Sorry if you have given this info before; if you can point me to the relevant sizes of interest, that would be helpful.
@bragadeesh I'm quite unhappy to hear that.
That is what I had inferred based on the level of activity. Is there no community interest either, as far as you know?
Short answer: yes/no(t really). No, GROMACS is not "locked in"; on the contrary, we are choosing open standards-based programming models, and given our limited resources, especially when it comes to hardware that has negligible use in our user base, we can't invest in proprietary stacks. BTW, if there were easy OpenCL-HIP interop, we have quite modular code and could plug HIP-based FFTs into the application (this is all we need: https://github.com/gromacs/gromacs/blob/master/src/gromacs/ewald/pme_gpu_3dfft_ocl.cpp). However, realistically, if we are to get something better for AMD GPUs before the ~2021 timeframe, we need something soon (before mid-September, in time for our 2020 release freeze).
Sure, briefly, this is what we need: R2C/C2R, float, 3D transforms, data resident on the GPU (grids generated by a preceding kernel). Sizes are most commonly anywhere between 64 and 256 per dimension (not only powers of two), less commonly <32 or >256; we do filter out "nasty" factors and can tweak the grid size if there is a known heuristic to apply (also see the file linked above). Let me know if you have thoughts on how to proceed.
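For reference, here is a minimal sketch of a plan matching the requirements above using clFFT (roughly the shape of what pme_gpu_3dfft_ocl.cpp drives today): single-precision 3D R2C with device-resident buffers. The OpenCL context, queue, and buffers are assumed to exist already; the 64³ grid size is illustrative and error checking is omitted.

```cpp
// Sketch of a single-precision 3D R2C transform with clFFT, matching the
// requirements described above. Assumes an already-created context, queue and
// device-resident buffers; the 64^3 size is illustrative, error checks omitted.
#include <clFFT.h>

void plan_and_run_r2c(cl_context ctx, cl_command_queue queue,
                      cl_mem realGrid, cl_mem complexGrid)
{
    clfftSetupData setupData;
    clfftInitSetupData(&setupData);
    clfftSetup(&setupData);

    size_t lengths[3] = {64, 64, 64};
    clfftPlanHandle plan;
    clfftCreateDefaultPlan(&plan, ctx, CLFFT_3D, lengths);

    clfftSetPlanPrecision(plan, CLFFT_SINGLE);
    clfftSetLayout(plan, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
    clfftSetResultLocation(plan, CLFFT_OUTOFPLACE);
    clfftBakePlan(plan, 1, &queue, nullptr, nullptr);

    // Forward R2C; the inverse (C2R) pass would use CLFFT_BACKWARD with the
    // layouts swapped. Input and output stay resident on the device.
    clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue, 0, nullptr, nullptr,
                          &realGrid, &complexGrid, nullptr);

    clfftDestroyPlan(&plan);
    clfftTeardown();
}
```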
@feizheng10 @malcolmroberts
Closing due to no new activity.
@doctorcolinsmith Can you please clarify what exactly you mean? Interop with OpenCL is a major shortcoming of the ROCm libraries, not just rocFFT. Closing due to no activity is quite unclear; should we interpret this as a "wontfix"? I.e., does this mean that AMD has no intention of supporting rocFFT (or the ROCm libraries in general) from OpenCL?
Environment
Hardware: Any
Note that this is a critical dependency for our work on bringing feature-parity with CUDA in the next GROMACS release.