
ROCM EP convolution fails due to missing #19566

Open
dmnieto opened this issue Feb 19, 2024 · 12 comments
Labels
ep:ROCm (questions/issues related to ROCm execution provider), stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@dmnieto

dmnieto commented Feb 19, 2024

Describe the issue

"MIOpen Error: No invoker was registered for convolution forward." happens when trying to use any model for inference with convolution codes. This is because the caching system in the Update() call will check for previous used algorithms. But the ROCM API (contrary to CUDA) requires the algo search call as described here:
https://rocmdocs.amd.com/projects/MIOpen/en/latest/MIOpen_Porting_Guide.html
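
For illustration, the MIOpen calling pattern the porting guide describes looks roughly like the sketch below. This is a minimal sketch, not the onnxruntime code: RunConvForward is a hypothetical helper, and handle, descriptor and device-buffer setup is assumed to be done by the caller.

    // Minimal sketch of the MIOpen requirement described above: an explicit
    // miopenFindConvolutionForwardAlgorithm call must supply the algorithm
    // that miopenConvolutionForward then uses.
    #include <miopen/miopen.h>

    miopenStatus_t RunConvForward(miopenHandle_t handle,
                                  miopenTensorDescriptor_t x_desc, const void* x,
                                  miopenTensorDescriptor_t w_desc, const void* w,
                                  miopenConvolutionDescriptor_t conv_desc,
                                  miopenTensorDescriptor_t y_desc, void* y,
                                  void* workspace, size_t workspace_bytes) {
      // 1) Search for a usable algorithm for this exact (x, w, conv) problem.
      int returned_algo_count = 0;
      miopenConvAlgoPerf_t perf{};
      miopenStatus_t status = miopenFindConvolutionForwardAlgorithm(
          handle, x_desc, x, w_desc, w, conv_desc, y_desc, y,
          /*requestAlgoCount=*/1, &returned_algo_count, &perf,
          workspace, workspace_bytes, /*exhaustiveSearch=*/false);
      if (status != miopenStatusSuccess) return status;
      if (returned_algo_count == 0) return miopenStatusUnknownError;

      // 2) Only then run the forward convolution with the algorithm found above.
      const float alpha = 1.0f, beta = 0.0f;
      return miopenConvolutionForward(handle, &alpha, x_desc, x, w_desc, w,
                                      conv_desc, perf.fwd_algo, &beta, y_desc, y,
                                      workspace, workspace_bytes);
    }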

To reproduce

This has been reproduced with the latest version of immich, on all the models it uses.

Urgency

Relatively urgent, as the ROCm EP is broken.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

ROCm - any version

@skottmckay added the ep:ROCm label on Feb 20, 2024
@cloudhan
Contributor

cloudhan commented Feb 20, 2024

@PeixuanZuo Any ideas? I don't think we observed this conv issue in SD.

@jeffdaily Could you please contact the MIOpen devs about this? I think we introduced the manual cache to work around some older version of MIOpen, but it seems we need to remove the manual caching logic again?

@dmnieto
Author

dmnieto commented Feb 20, 2024 via email

@jeffdaily
Contributor

I have reached out to our MIOpen team.

@dmnieto
Author

dmnieto commented Feb 20, 2024 via email

@jeffdaily
Contributor

Can you provide more details? A Docker container, with reproduction steps?

@dmnieto
Author

dmnieto commented Feb 20, 2024 via email

@github-actions

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions bot added the stale label on Mar 22, 2024
@tianleiwu
Contributor

tianleiwu commented Jul 11, 2024

The root cause is that the cache key is the x shape alone (not a combination of the x shape and the w shape), while the w shape is not constant. With multiple threads:

  • Thread 1 uses w shape w1 and x shape x1: it runs the algo search, then adds a cache entry <x1, algo1> for w1.
  • Thread 2 uses w shape w2 and x shape x1: it clears the cache because the w shape changed. Right after this, thread 1 inserts <x1, algo1> before thread 2 looks up the cache. Thread 2 then looks up the cache with key x1, finds the <x1, algo1> entry inserted by thread 1, and chooses algo1. But algo1 cannot be applied to w2, so thread 2 raises a runtime error.

The solution is to use the x shape plus the w shape as the cache key, never clear the cache, and add a mutex guarding the cache so that only one thread can look it up or update it at a time.

The current cache is per node, so the Conv algo search has to run multiple times when a model has multiple Conv nodes. An ideal solution would be a global cache (like the PyTorch code), keyed on all conv parameters (including the device id), which would avoid duplicated algo searches.
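
A minimal sketch of that kind of cache, assuming it only needs to map shapes to a forward algorithm (ConvAlgoCache and AlgoKey are hypothetical names, not the onnxruntime implementation):

    // Illustrative sketch of a shape-keyed algorithm cache guarded by a mutex,
    // as proposed above. The key combines x shape and w shape, entries are
    // never cleared, and all access is serialized by the mutex.
    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <optional>
    #include <tuple>
    #include <vector>

    #include <miopen/miopen.h>

    struct AlgoKey {
      std::vector<int64_t> x_shape;  // input shape
      std::vector<int64_t> w_shape;  // weight shape, now part of the key
      bool operator<(const AlgoKey& other) const {
        return std::tie(x_shape, w_shape) < std::tie(other.x_shape, other.w_shape);
      }
    };

    class ConvAlgoCache {
     public:
      // Returns the cached algorithm for this (x shape, w shape) pair, if any.
      std::optional<miopenConvFwdAlgorithm_t> Find(const AlgoKey& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(key);
        if (it == cache_.end()) return std::nullopt;
        return it->second;
      }

      // Stores the result of an algo search. Entries are never removed, so a
      // cached algorithm always matches the exact shapes it was searched for.
      void Insert(const AlgoKey& key, miopenConvFwdAlgorithm_t algo) {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_.emplace(key, algo);
      }

     private:
      std::mutex mutex_;
      std::map<AlgoKey, miopenConvFwdAlgorithm_t> cache_;
    };

Along the lines of the last paragraph above, the same structure could be promoted to a process-global cache and the key widened to all conv parameters plus the device id.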

@dmnieto
Author

dmnieto commented Jul 11, 2024 via email

@mertalev

(Quoting @tianleiwu's analysis above.)

Hi @tianleiwu, I implemented these changes (using x shape + w shape in the key, not clearing cache, guarding cache with mutex) in this patch. However, based on these results, it seems the threading issue still exists. Do you have any ideas for why this might be?

I also noticed that there is already a std::lock_guard<OrtMutex> lock(s_.mutex); in ComputeInternal, called before UpdateState. Shouldn't that make this whole section serial?
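
For what it's worth, the effect of a lock_guard held across the whole compute path can be shown with a standalone example. This is generic C++, not the onnxruntime code; the names only mirror the pattern discussed here.

    // Standalone illustration: a std::lock_guard held for the entire body of
    // Compute() serializes both the state update and the subsequent "forward"
    // step across threads, analogous to lock(s_.mutex) in ComputeInternal.
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct SharedState {
      std::mutex mutex;
      int cached_algo = -1;  // stands in for the cached conv algorithm
    };

    void Compute(SharedState& s, int algo_for_this_call) {
      std::lock_guard<std::mutex> lock(s.mutex);  // held until Compute returns
      // "UpdateState": refresh the cached algorithm if needed.
      if (s.cached_algo != algo_for_this_call) {
        s.cached_algo = algo_for_this_call;
      }
      // "Forward": no other thread can change cached_algo here, because it
      // would block on the same mutex until this call returns.
      std::cout << "ran with algo " << s.cached_algo << "\n";
    }

    int main() {
      SharedState s;
      std::vector<std::thread> threads;
      for (int i = 0; i < 4; ++i) {
        threads.emplace_back(Compute, std::ref(s), i);
      }
      for (auto& t : threads) {
        t.join();
      }
      return 0;
    }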

@mertalev

mertalev commented Jan 14, 2025

The specific error is this, which is different from the original error:

Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Conv node.
Name:'Conv_426' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=b86317b041da ;
file=/code/onnxruntime/onnxruntime/core/providers/rocm/nn/conv.cc ; line=336 ;
expr=miopenConvolutionForward(miopen_handle, &alpha, s_.x_tensor, s_.x_data, s_.w_desc, s_.w_data, s_.conv_desc, s_.fwd_algo, &beta, s_.y_tensor, s_.y_data, workspace.get(), s_.workspace_bytes);

My best guess is that it's not just a threading issue, but a caching issue where it tries to use an incompatible algo. But it does work when single-threaded, so I'm not sure.

@tianleiwu
Contributor

I also noticed that there is already a std::lock_guard<OrtMutex> lock(s_.mutex); in ComputeInternal, called before UpdateState. Shouldn't that make this whole section serial?

@mertalev, you are right. That mutex should make the op thread-safe. I missed that in my previous review.
