Description
It looks like a mutex is broken during pml/ucx cleanup after a communicator is revoked. Below are the error and the backtrace:
======================================================================
FAIL: testRevoke (test_ulfm.TestULFMSelf.testRevoke)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs/projects/SchuchartGroup/src/openmpi/mpi4py/test/test_ulfm.py", line 38, in testRevoke
self.assertTrue(comm.Is_revoked())
AssertionError: False is not true
======================================================================
FAIL: testRevoke (test_ulfm.TestULFMWorld.testRevoke)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/gpfs/projects/SchuchartGroup/src/openmpi/mpi4py/test/test_ulfm.py", line 38, in testRevoke
self.assertTrue(comm.Is_revoked())
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1963 tests in 55.431s
FAILED (failures=2, skipped=163)
python: ../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:86: opal_thread_internal_mutex_lock: Assertion `0 == ret' failed.
Thread 1 "python" received signal SIGABRT, Aborted.
0x00007ffff6a8bedc in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff6a8bedc in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007ffff6a3eb46 in raise () from /lib64/libc.so.6
#2 0x00007ffff6a28833 in abort () from /lib64/libc.so.6
#3 0x00007ffff6a2875b in __assert_fail_base.cold () from /lib64/libc.so.6
#4 0x00007ffff6a37886 in __assert_fail () from /lib64/libc.so.6
#5 0x00007fffe2c9fa3c in opal_thread_internal_mutex_lock (p_mutex=0x7fffeaf46530 <ompi_request_f_to_c_table+80>) at ../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:86
#6 0x00007fffe2c9faac in opal_mutex_lock (mutex=0x7fffeaf46508 <ompi_request_f_to_c_table+40>) at ../../opal/mca/threads/mutex.h:122
#7 0x00007fffe2ca0f09 in opal_pointer_array_set_item (table=0x7fffeaf464e0 <ompi_request_f_to_c_table>, index=3, value=0x0) at ../../opal/class/opal_pointer_array.c:285
#8 0x00007fffeaab0ff8 in mca_pml_ucx_persisternt_request_destruct (req=0x62d000cc4a38) at ../../../../../ompi/mca/pml/ucx/pml_ucx_request.c:265
#9 0x00007fffe2c8dabb in opal_obj_run_destructors (object=0x62d000cc4a38) at ../../opal/class/opal_object.h:472
#10 0x00007fffe2c8fb65 in opal_free_list_destruct (fl=0x7fffeaf09280 <ompi_pml_ucx+640>) at ../../opal/class/opal_free_list.c:96
#11 0x00007fffeaa9e49a in opal_obj_run_destructors (object=0x7fffeaf09280 <ompi_pml_ucx+640>) at ../../../../../opal/class/opal_object.h:472
#12 0x00007fffeaaa3e6d in mca_pml_ucx_cleanup () at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:400
#13 0x00007fffeaab63e9 in mca_pml_ucx_component_fini () at ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:158
#14 0x00007fffeaa981cd in mca_pml_base_finalize () at ../../../../ompi/mca/pml/base/pml_base_select.c:54
#15 0x00007fffe2cb3b04 in opal_finalize_cleanup_domain (domain=0x7fffe301cae0 <opal_init_domain>) at ../../opal/runtime/opal_finalize_core.c:128
#16 0x00007fffe2c87b70 in opal_finalize () at ../../opal/runtime/opal_finalize.c:56
#17 0x00007fffea38ac70 in ompi_rte_finalize () at ../../ompi/runtime/ompi_rte.c:1045
#18 0x00007fffea3995a7 in ompi_mpi_instance_finalize_common () at ../../ompi/instance/instance.c:951
#19 0x00007fffea399c15 in ompi_mpi_instance_finalize (instance=0x7fffeaf47bc0 <ompi_mpi_instance_default>) at ../../ompi/instance/instance.c:996
#20 0x00007fffea379b44 in ompi_mpi_finalize () at ../../ompi/runtime/ompi_mpi_finalize.c:294
#21 0x00007fffea438597 in PMPI_Finalize () at finalize_generated.c:53
#22 0x00007ffff6ebc155 in call_ll_exitfuncs (runtime=<optimized out>) at Python/pylifecycle.c:2930
#23 Py_FinalizeEx () at Python/pylifecycle.c:1929
#24 0x00007ffff6ebc7c7 in Py_Exit (sts=1) at Python/pylifecycle.c:2940
#25 0x00007ffff6ebe9e2 in handle_system_exit () at Python/pythonrun.c:771
#26 0x00007ffff7055428 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x7ffff74d1678 <_PyRuntime+166328>) at Python/pythonrun.c:781
#27 PyErr_PrintEx (set_sys_last_vars=1) at Python/pythonrun.c:876
#28 0x00007ffff6ebeda3 in PyErr_Print () at Python/pythonrun.c:882
#29 _PyRun_SimpleFileObject (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffff8078) at Python/pythonrun.c:446
#30 0x00007ffff7055664 in _PyRun_AnyFileObject (fp=fp@entry=0x615000001980, filename=filename@entry=0x7ffff5b2fec0, closeit=closeit@entry=1, flags=flags@entry=0x7fffffff8078)
at Python/pythonrun.c:79
#31 0x00007ffff705bbf6 in pymain_run_file_obj (skip_source_first_line=<optimized out>, filename=0x7ffff5b2fec0, program_name=0x7ffff3f27330) at Modules/main.c:360
#32 pymain_run_file (config=0x7ffff74b76c0 <_PyRuntime+59904>) at Modules/main.c:379
#33 pymain_run_python (exitcode=0x7fffffff8074) at Modules/main.c:601
#34 Py_RunMain () at Modules/main.c:680
#35 0x00007ffff705b819 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:734
#36 0x00007ffff6a295d0 in __libc_start_call_main () from /lib64/libc.so.6
#37 0x00007ffff6a29680 in __libc_start_main_impl () from /lib64/libc.so.6
#38 0x000000000040076e in _start ()
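The failed assertion `0 == ret` in opal_thread_internal_mutex_lock means that pthread_mutex_lock() returned a non-zero error code. According to the backtrace, the mutex in question belongs to ompi_request_f_to_c_table, and it is locked by opal_pointer_array_set_item while mca_pml_ucx_persisternt_request_destruct tears down the PML free list. The log does not show which error code came back. The sketch below is not Open MPI code, and its helper names are hypothetical. It demonstrates one well-defined way pthread_mutex_lock() can return non-zero: an error-checking mutex (assumed here, not confirmed, to match the mutex type used in this build) returns EDEADLK when its owner locks it a second time. Locking a mutex that has already been destroyed is another possibility worth checking, but POSIX leaves that undefined, so the sketch does not demonstrate it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, loosely modeled on the check that aborted above:
 * it reports any non-zero return from pthread_mutex_lock(). It returns
 * the code instead of asserting so that callers can inspect it. */
static int checked_lock_ret(pthread_mutex_t *m)
{
    int ret = pthread_mutex_lock(m);
    if (ret != 0) {
        fprintf(stderr, "pthread_mutex_lock failed: %s\n", strerror(ret));
    }
    return ret;
}

/* Initialize an error-checking mutex. Assumption: this mirrors the mutex
 * type in the failing build. A default mutex would deadlock or have
 * undefined behavior on relock instead of returning an error. */
static int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret = pthread_mutexattr_init(&attr);
    if (ret != 0) {
        return ret;
    }
    ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (ret == 0) {
        ret = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return ret;
}

/* Lock the same error-checking mutex twice from one thread and return
 * the error code of the second attempt (EDEADLK per POSIX). */
static int relock_error(void)
{
    pthread_mutex_t m;
    if (init_errorcheck_mutex(&m) != 0) {
        return -1;
    }
    int first = checked_lock_ret(&m);  /* 0: lock acquired */
    int second = checked_lock_ret(&m); /* EDEADLK: already owned */
    if (first == 0) {
        pthread_mutex_unlock(&m);
    }
    pthread_mutex_destroy(&m);
    return second;
}
```

Calling `relock_error()` should return `EDEADLK`. A wrapper that does `assert(0 == ret)` on that path would abort in the same way as the log above. This only illustrates what the assertion catches. It does not establish which error occurred in this report.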