ucx error while running HPCC application #10451

Open
Shubham-Bokadiya opened this issue Jan 26, 2025 · 2 comments

Comments

@Shubham-Bokadiya

[1737742661.094513] [cn0244:260472:0] ib_mlx5_dv.c:430 UCX ERROR mlx5dv_devx_obj_create(QP) failed on mlx5_0, syndrome 0x2c4154: Remote I/O error
[cn0244:260472] pml_ucx.c:421 Error: ucp_ep_create(proc=47) failed: Input/output error
[cn0244:260472] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 19
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 1

@keisukefukuda @yshestakov @pathscale @khamidouche @yuq

Why do errors like this occur? Any help would be appreciated.

@yosefe
Contributor

yosefe commented Feb 13, 2025

What is the current firmware (FW) version? Is it possible to upgrade to the latest one?
Also, you can try exporting UCX_IB_MLX5_DEVX_OBJECTS="" (empty string).
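
A minimal sketch of the two suggestions above, not an official UCX or Open MPI recipe. The helper names `ucx_env_without_devx` and `firmware_versions` are hypothetical. The sketch assumes the `ibv_devinfo` utility (from rdma-core) is installed on the compute node and reports the firmware version on a `fw_ver` line; if the tool is missing, the firmware helper simply returns an empty list.

```python
import os
import shutil
import subprocess


def ucx_env_without_devx(base_env=None):
    """Return a copy of the environment with UCX_IB_MLX5_DEVX_OBJECTS set to "".

    The empty string is the value suggested in the comment above; the
    input mapping is copied, not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_IB_MLX5_DEVX_OBJECTS"] = ""
    return env


def firmware_versions():
    """Return the fw_ver lines printed by ibv_devinfo, or [] if unavailable.

    Assumption: ibv_devinfo is on PATH and prints one "fw_ver" line per device.
    """
    if shutil.which("ibv_devinfo") is None:
        return []
    result = subprocess.run(
        ["ibv_devinfo"], capture_output=True, text=True, check=False
    )
    return [line.strip() for line in result.stdout.splitlines() if "fw_ver" in line]


if __name__ == "__main__":
    # Report the firmware version(s) to include in the issue.
    for line in firmware_versions():
        print(line)

    # Environment to pass to the job launcher, e.g. via subprocess.run(..., env=env).
    env = ucx_env_without_devx()
    print("UCX_IB_MLX5_DEVX_OBJECTS=%r" % env["UCX_IB_MLX5_DEVX_OBJECTS"])
```

How the variable reaches ranks on other nodes depends on the launcher. With Open MPI's `mpirun`, `-x UCX_IB_MLX5_DEVX_OBJECTS` is the usual way to forward a variable, but it is worth confirming on the cluster that an empty value is actually propagated.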

@yosefe
Contributor

yosefe commented Feb 13, 2025

Internal ref: https://redmine.mellanox.com/issues/3743237
