Brief
A segmentation fault occurred when attempting to reproduce some of the PGEMM test cases from the CA3DMM paper with example_AB.exe. I am currently unable to tell whether this error is tied to a specific environment.
Compilation and Execution
Compilation
Compiled according to README.md, i.e., with the command make -f icc-mkl-anympi.make -j.
I tried my best to recover the compilation information from the binary file, as follows.
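(Commands along these lines can recover such identification strings from a binary; they are shown here only as a sketch and are not necessarily the exact ones used.)
readelf -p .comment example_AB.exe                          # compiler version strings recorded in the .comment section
strings -a example_AB.exe | grep -i -E 'icc|intel|mkl|mpi'  # toolchain and library hints embedded in the binary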
Execution
Jobs were submitted to the Slurm cluster via an sbatch script (see the SBATCH Script section below). The error occurred when executing example_AB.exe with the following parameters (the exact invocation is shown after the SBATCH Script note). The same error occurred with M=N=6000, K=1200000, while M=N=K=50,000 and M=N=100,000, K=5,000 both worked fine.
SBATCH Script
As a side note, when testing M=N=K=32768 with the same sbatch script, the job did not complete within 15 minutes. I had to cancel it because the allocation is charged, so I could not obtain any useful log information for that case.
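(For reference, the failing invocation recovered from the "Core was generated by" line in the GDB output below is the following; whether it was launched with mpirun or srun inside the sbatch script is an assumption here.)
mpirun ./example_AB.exe 1200000 6000 6000 0 0 1 1 0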
GDB Core Dumps
The error was reproduced using the method described in the Execution section above.
The backtrace shows that the segmentation fault occurs when mat_redist_engine_exec calls MPI_Neighbor_alltoallv to redistribute matrices A and B. Entering frame 11 (mat_redist_engine_exec) and printing MPI_Neighbor_alltoallv's sendbuf_h parameter shows "Address 0x1462ced84e50 out of bounds".
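(The inspection below was performed roughly as follows; the core file name is an assumption.)
gdb $USER/CA3DMM/examples/example_AB.exe core.<pid>
# then inside gdb: backtrace, frame 11, l, p sendbuf_h, p recvbuf_h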
Core was generated by `$USER/CA3DMM/examples/example_AB.exe 1200000 6000 6000 0 0 1 1 0'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000041e2e4 in __intel_avx_rep_memcpy ()
(gdb) backtrace
#0 0x000000000041e2e4 in __intel_avx_rep_memcpy ()
#1 0x00001469ca7de08f in mca_btl_self_get ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_btl_self.so
#2 0x00001469ca1aa174 in mca_pml_ob1_recv_request_get_frag ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#3 0x00001469ca1a9c5f in mca_pml_ob1_recv_request_progress_rget ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#4 0x00001469ca19e4aa in mca_pml_ob1_recv_frag_match_proc ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#5 0x00001469ca1a08f9 in mca_pml_ob1_recv_frag_callback_rget ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#6 0x00001469ca7dda30 in mca_btl_self_send ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_btl_self.so
#7 0x00001469ca1acf81 in mca_pml_ob1_send_request_start_rdma ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#8 0x00001469ca19aa81 in mca_pml_ob1_isend ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#9 0x00001469c935a3a0 in mca_coll_basic_neighbor_alltoallv ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_coll_basic.so
#10 0x00001469d7f82b1d in PMPI_Neighbor_alltoallv ()
from /opt/app/openmpi/4.0.4/intel/2020/lib/libmpi.so.40
#11 0x000000000040ad2c in mat_redist_engine_exec (engine=0x1462ced84e50, src_blk=0x1462f844aa50,
src_ld=-129717664, dst_blk=0xd693a30, dst_ld=6769816) at mat_redist.c:357
#12 0x0000000000406397 in ca3dmm_engine_exec (engine=0x1462ced84e50, src_A=0x1462f844aa50,
ldA=-129717664, src_B=0xd693a30, ldB=6769816, dst_C=0x41e2e0 <__intel_avx_rep_memcpy+672>,
ldC=1200000) at ca3dmm.c:988
#13 0x0000000000404e3e in main (argc=375, argv=0x0) at example_AB.c:169
(gdb) frame 11
#11 0x000000000040ad2c in mat_redist_engine_exec (engine=0x1462ced84e50, src_blk=0x1462f844aa50,
src_ld=-129717664, dst_blk=0xd693a30, dst_ld=6769816) at mat_redist.c:357
357 MPI_Neighbor_alltoallv(
(gdb) l
352 int *recv_displs = engine->recv_displs;
353 void *recvbuf_h = engine->recvbuf_h;
354 void *recvbuf_d = engine->recvbuf_d;
355 if (dev_type == DEV_TYPE_HOST)
356 {
357 MPI_Neighbor_alltoallv(
358 sendbuf_h, send_sizes, send_displs, engine->dtype,
359 recvbuf_h, recv_sizes, recv_displs, engine->dtype, engine->graph_comm
360 );
361 }
(gdb) p sendbuf_h
$1 = 0x1462ced84e50 <Address 0x1462ced84e50 out of bounds>
(gdb) p recvbuf_h
$2 = (void *) 0x1462c16f1410
Check Dynamic Library
Some sensitive information has been replaced, e.g., the home directory with the username.
My environment
More detailed information about the dependencies used, which may be useful.
Thank you for trying CA3DMM and reporting the error. Based on your description, I think you could try other MPI libraries first; I have little experience with OpenMPI. My impression is that OpenMPI usually needs some extra arguments to run on an InfiniBand, Omni-Path, or other high-speed network, so you may also want to check that.
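(For concreteness only: transport selection in OpenMPI is typically controlled through MCA parameters such as the ones sketched below. Whether they apply depends on how this OpenMPI 4.0.4 installation was built and on the cluster's interconnect, so treat them as illustrative rather than a recommendation.)
# InfiniBand via UCX, if OpenMPI was built with UCX support
mpirun --mca pml ucx ./example_AB.exe 1200000 6000 6000 0 0 1 1 0
# Intel Omni-Path via PSM2, if OpenMPI was built with the psm2 MTL
mpirun --mca pml cm --mca mtl psm2 ./example_AB.exe 1200000 6000 6000 0 0 1 1 0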