Segmentation fault #8

Open
kicurry opened this issue Nov 15, 2023 · 1 comment
kicurry commented Nov 15, 2023

Brief

A segmentation fault occurred while attempting to reproduce some of the PGEMM test cases from the CA3DMM paper with example_AB.exe. I am currently unable to tell whether the error is tied to a specific environment.

Compilation and Execution

Compilation

Compiled according to README.md, i.e. with the command make -f icc-mkl-anympi.make -j

I recovered the compilation information from the binary as best I could:

$ strings example_AB.exe | grep -i -B 2 example_AB.c.o
example_AB.c
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.1.217 Build 20200306
-I../include -I/opt/app/openmpi/4.0.4/intel/2020/include -Wall -g -std=gnu11 -O3 -fPIC -DUSE_MKL -fopenmp -xHost -mkl -c -o example_AB.c.o -pthread -Wl,-rpath,/home/$USERNAME/intel/oneapi/mkl/2023.2.0/lib/intel64
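
For reference, the build was presumably done with the same modules that are loaded at run time; a minimal sketch of the build steps (module names are taken from the sbatch script below, since the actual build shell history is not available):

$ module load app/intel/2020
$ module load mpi/openmpi/4.0.4/intel
$ cd CA3DMM/examples
$ make -f icc-mkl-anympi.make -j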

Execution

Jobs were submitted to the Slurm cluster via an sbatch script (see below). The error occurred when executing example_AB.exe with the following parameters:

$ ./example_AB.exe 1200000 6000 6000 0 0 1 1 0

The same error occurred with M=N=6000, K=1,200,000. However, both M=N=K=50,000 and M=N=100,000, K=5,000 worked fine.
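
To put the failing size in context (illustrative arithmetic only, not a diagnosis), taking the first three arguments as M, N, K and assuming double precision:

$ echo $(( 1200000 * 6000 * 8 / 2**30 ))      # A (M x K) and C (M x N): GiB each
53
$ echo $(( 6000 * 6000 * 8 / 2**20 ))         # B (K x N): MiB
274
$ echo $(( 1200000 * 6000 * 8 / 16 / 2**20 )) # share of A per MPI rank with 16 ranks, MiB
3433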

SBATCH Script

#!/bin/bash
#SBATCH --job-name=CA3D      # Job name
#SBATCH --output=CA3D-1200000-6000-6000-16-%j.out # Stdout (%j expands to jobId)
#SBATCH --error=CA3D-1200000-6000-6000-16-%j.err  # Stderr (%j expands to jobId)
#SBATCH --partition=cpu
#SBATCH --nodes=16                   # Number of nodes requested
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=52
#SBATCH --time=01:00:00             # walltime

export MPIRUN_OPTIONS="--bind-to none -report-bindings"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export NUM_CORES=$(( SLURM_NTASKS * SLURM_CPUS_PER_TASK ))

module load app/intel/2020
module load mpi/openmpi/4.0.4/intel

EXEC_PATH=$USER/CA3DMM/examples/example_AB.exe
mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXEC_PATH} 1200000 6000 6000 0 0 1 1 0

As an aside, a run with M=N=K=32768 using the same sbatch script did not complete within 15 minutes. Since the job incurs charges, I had to cancel it and therefore could not obtain any useful log information for that case.

GDB Core Dumps

The error was reproduced using the method described in the Execution section above.

The core dump shows that the segmentation fault occurs when mat_redist_engine_exec calls MPI_Neighbor_alltoallv to redistribute matrices A and B. Switching to the mat_redist_engine_exec frame and inspecting MPI_Neighbor_alltoallv's sendbuf_h argument shows "Address 0x1462ced84e50 out of bounds".

Core was generated by `$USER/CA3DMM/examples/example_AB.exe 1200000 6000 6000 0 0 1 1 0'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000041e2e4 in __intel_avx_rep_memcpy ()
(gdb) backtrace 
#0  0x000000000041e2e4 in __intel_avx_rep_memcpy ()
#1  0x00001469ca7de08f in mca_btl_self_get ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_btl_self.so
#2  0x00001469ca1aa174 in mca_pml_ob1_recv_request_get_frag ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#3  0x00001469ca1a9c5f in mca_pml_ob1_recv_request_progress_rget ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#4  0x00001469ca19e4aa in mca_pml_ob1_recv_frag_match_proc ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#5  0x00001469ca1a08f9 in mca_pml_ob1_recv_frag_callback_rget ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#6  0x00001469ca7dda30 in mca_btl_self_send ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_btl_self.so
#7  0x00001469ca1acf81 in mca_pml_ob1_send_request_start_rdma ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#8  0x00001469ca19aa81 in mca_pml_ob1_isend ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_pml_ob1.so
#9  0x00001469c935a3a0 in mca_coll_basic_neighbor_alltoallv ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/openmpi/mca_coll_basic.so
#10 0x00001469d7f82b1d in PMPI_Neighbor_alltoallv ()
   from /opt/app/openmpi/4.0.4/intel/2020/lib/libmpi.so.40
#11 0x000000000040ad2c in mat_redist_engine_exec (engine=0x1462ced84e50, src_blk=0x1462f844aa50, 
    src_ld=-129717664, dst_blk=0xd693a30, dst_ld=6769816) at mat_redist.c:357
#12 0x0000000000406397 in ca3dmm_engine_exec (engine=0x1462ced84e50, src_A=0x1462f844aa50, 
    ldA=-129717664, src_B=0xd693a30, ldB=6769816, dst_C=0x41e2e0 <__intel_avx_rep_memcpy+672>, 
    ldC=1200000) at ca3dmm.c:988
#13 0x0000000000404e3e in main (argc=375, argv=0x0) at example_AB.c:169
(gdb) frame 11
#11 0x000000000040ad2c in mat_redist_engine_exec (engine=0x1462ced84e50, src_blk=0x1462f844aa50, 
    src_ld=-129717664, dst_blk=0xd693a30, dst_ld=6769816) at mat_redist.c:357
357             MPI_Neighbor_alltoallv(
(gdb) l
352         int  *recv_displs = engine->recv_displs;
353         void *recvbuf_h   = engine->recvbuf_h;
354         void *recvbuf_d   = engine->recvbuf_d;
355         if (dev_type == DEV_TYPE_HOST)
356         {
357             MPI_Neighbor_alltoallv(
358                 sendbuf_h, send_sizes, send_displs, engine->dtype, 
359                 recvbuf_h, recv_sizes, recv_displs, engine->dtype, engine->graph_comm
360             );
361         }
(gdb) p sendbuf_h 
$1 = 0x1462ced84e50 <Address 0x1462ced84e50 out of bounds>
(gdb) p recvbuf_h
$2 = (void *) 0x1462c16f1410
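
As a possible follow-up on the same core file, the send counts and displacements passed to MPI_Neighbor_alltoallv could also be dumped and compared against the allocated send buffer. A sketch (provided the locals were not optimized out at -O3; the element count 8 is a placeholder, since the real number of graph neighbors is not visible in the excerpt above, and "core" stands for whatever core file name the system produced):

$ gdb -batch -ex 'frame 11' -ex 'p sendbuf_h' -ex 'p *send_sizes@8' -ex 'p *send_displs@8' ./example_AB.exe core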

Check Dynamic Library

Some sensitive information has been redacted, e.g. the home directory is replaced with $USER.

$ ldd example_AB.exe 
        linux-vdso.so.1 =>  (0x00007fffb21a3000)
        libmkl_intel_lp64.so.2 => $USER/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.so.2 (0x00001518454c0000)
        libmkl_intel_thread.so.2 => $USER/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_thread.so.2 (0x0000151841e94000)
        libmkl_core.so.2 => $USER/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_core.so.2 (0x000015183db1c000)
        libiomp5.so => /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libiomp5.so (0x000015183d71c000)
        libmpi.so.40 => /opt/app/openmpi/4.0.4/intel/2020/lib/libmpi.so.40 (0x000015183d3da000)
        libm.so.6 => /lib64/libm.so.6 (0x000015183d0d8000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000015183cec2000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000015183cca6000)
        libc.so.6 => /lib64/libc.so.6 (0x000015183c8d8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000015183c6d4000)
        /lib64/ld-linux-x86-64.so.2 (0x00001518469bc000)
        libopen-rte.so.40 => /opt/app/openmpi/4.0.4/intel/2020/lib/libopen-rte.so.40 (0x000015183c40e000)
        libopen-pal.so.40 => /opt/app/openmpi/4.0.4/intel/2020/lib/libopen-pal.so.40 (0x000015183c0ce000)
        librt.so.1 => /lib64/librt.so.1 (0x000015183bec6000)
        libutil.so.1 => /lib64/libutil.so.1 (0x000015183bcc3000)
        libz.so.1 => /lib64/libz.so.1 (0x000015183baad000)
        libimf.so => /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libimf.so (0x000015183b42a000)
        libsvml.so => /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libsvml.so (0x0000151839978000)
        libirng.so => /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libirng.so (0x000015183960e000)
        libintlc.so.5 => /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x0000151839396000)

My environment

More detailed information about the dependencies used, which may be useful:

  1. CPU: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                52
On-line CPU(s) list:   0-51
Thread(s) per core:    1
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
  2. Scheduler: Slurm
$ slurmctld -V
slurm 19.05.3-2
  3. MPI: Open MPI 4.0.4 (compiled with icc)
$ ompi_info 
                 Package: Open MPI hpctest@n6 Distribution
                Open MPI: 4.0.4
  Open MPI repo revision: v4.0.4
   Open MPI release date: Jun 10, 2020
                Open RTE: 4.0.4
  Open RTE repo revision: v4.0.4
   Open RTE release date: Jun 10, 2020
                    OPAL: 4.0.4
      OPAL repo revision: v4.0.4
       OPAL release date: Jun 10, 2020
                 MPI API: 3.1.0
            Ident string: 4.0.4
                  Prefix: /opt/app/openmpi/4.0.4/intel/2020
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: n6
           Configured by: hpctest
           Configured on: Sat Aug 29 10:04:41 CST 2020
          Configure host: n6
  Configure command line: '--prefix=/opt/app/openmpi/4.0.4/intel/2020'
                Built by: hpctest
                Built on: Sat Aug 29 10:23:49 CST 2020
              Built host: n6
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the ifort compiler and/or Open MPI,
                          does not support the following: array subsections,
                          direct passthru (where possible) to underlying Open
                          MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: icc
     C compiler absolute: /opt/intel/compilers_and_libraries_2020.1.217/linux/bin/intel64/icc
  C compiler family name: INTEL
      C compiler version: 1910.20200306
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: ifort
       Fort compiler abs: /opt/intel/compilers_and_libraries_2020.1.217/linux/bin/intel64/ifort
         Fort ignore TKR: yes (!DEC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.0.4)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.0.4)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.4)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.4)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.4)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.0.4)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.0.4)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.0.4)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.0.4)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.0.4)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.0.4)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.0.4)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.0.4)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.4)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.0.4)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.0.4)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.0.4)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.0.4)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.0.4)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.0.4)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.0.4)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.0.4)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.0.4)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.0.4)
              MCA schizo: singularity (MCA v2.1.0, API v1.0.0, Component
                          v4.0.4)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.0.4)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.0.4)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.0.4)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.0.4)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.0.4)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.0.4)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.0.4)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.0.4)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.0.4)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.0.4)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.0.4)
  4. C Compiler: Intel ICC
$ icc --version
icc (ICC) 19.1.1.217 20200306
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.
  5. GEMM library: Intel oneAPI MKL
$ spack find -v [email protected] 
-- linux-centos7-cascadelake / [email protected] -----------------------
[email protected]+cluster+envmods~ilp64+shared build_system=generic mpi_family=openmpi threads=openmp
==> 1 installed package
@huanghua1994 (Collaborator) commented:

Thank you for trying CA3DMM and reporting the error. Based on your description, I think you could try other MPI libraries first. I have little experience with Open MPI; my impression is that it usually needs some extra arguments to run over InfiniBand, Omni-Path, or other high-speed networks, so you may want to check that as well.
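
For example, transport selection in Open MPI is controlled with MCA parameters; a sketch that restricts the run to the components listed in the ompi_info output above and turns on transport-level verbosity (whether this changes anything for this particular crash is only a guess):

$ mpirun -n ${SLURM_NTASKS} --mca pml ob1 --mca btl self,vader,tcp --mca btl_base_verbose 100 ${EXEC_PATH} 1200000 6000 6000 0 0 1 1 0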
