Strange parallel performance of scalapack routines #25
Thanks for reporting this. I've done scaling tests before, and they've always seemed fine (scaled up to > 300 nodes), so I guess we need to figure out where the difference is coming from.

How exactly are you running the test? i.e. how many processes, how many nodes, how many threads? This kind of behaviour often shows up when MPI places the processes incorrectly, for example putting them all on one node rather than spreading them out across nodes. It would also be helpful if you could send the MPI command line you used to run the test, then I can see if I can reproduce it.

Another point I noticed in your test code: you've set your process grid to be …
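A quick way to check the placement is a minimal mpi4py sketch like the one below (not part of the original test): launch it with the same mpirun command as the benchmark and confirm the ranks are spread across the nodes rather than packed onto one.

```python
# Placement check: run with the same mpirun command as the benchmark and
# confirm the ranks land on different hosts rather than all on one node.
from mpi4py import MPI

comm = MPI.COMM_WORLD
host = MPI.Get_processor_name()

# Gather (rank, hostname) pairs on rank 0 and print the distribution.
pairs = comm.gather((comm.rank, host), root=0)
if comm.rank == 0:
    for rank, name in sorted(pairs):
        print(f"rank {rank:3d} -> {name}")
```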
Thank you very much for your reply. Since I only tested with 1-64 processes, the jobs used only 1-6 nodes. The environment for the tests above is Python (Anaconda3) + mpi4py + scalapy, and they were run on a supercomputer.

At the beginning I always got error messages like the following:

cn9394:UCM:7376:4f820460: 1870 us(17 us): open_hca: ibv_get_device_list() failed

I believe this is related to RDMA, so I added the following parameter to the mpi command to avoid using RDMA: -env I_MPI_DEVICE sock

By the way, for each number of processes I used a different input file, so that the processor grids are as follows: …
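For reference, ScaLAPACK generally performs best with a process grid that is as close to square as possible. Below is a small sketch of how such a grid could be chosen and passed to scalapy; as far as I recall the initmpi call follows scalapy's documented interface, but treat the exact call, and the 64x64 block size, as assumptions rather than the values from the input files mentioned above.

```python
# Sketch: choose a near-square process grid for however many ranks are
# available, then initialise scalapy with it.  The block size is an
# illustrative assumption, not the value used in the original test.
import math
from mpi4py import MPI
from scalapy import core

nproc = MPI.COMM_WORLD.size

# Largest divisor of nproc not exceeding sqrt(nproc) gives the most
# square grid (e.g. 32 -> 4 x 8, 64 -> 8 x 8).
rows = max(d for d in range(1, math.isqrt(nproc) + 1) if nproc % d == 0)
cols = nproc // rows

core.initmpi([rows, cols], block_shape=[64, 64])
```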
Adding -env I_MPI_DEVICE sock to avoid the error messages may itself cause problems. However, I have also done tests on my own cluster with:

mpirun --map-by node --mca oob_tcp_if_include "192.168.1.11/24" --mca btl_tcp_if_include "192.168.1.11/24" -hostfile hostfile -n 1 python p0.py >log.txt

The host file contains 4 nodes. The results are as follows (same columns as in the table in my first post):

1 3000 83.80727100372314 6.165726900100708

From this test we find that the single-process calculation with scipy takes almost the same time regardless of how many processes are launched, which is fine. However, for the parallel calculations the time is still always much, much longer than with scipy. Even for the 32-process case, after waiting two and a half hours (9000 seconds) the calculation still had not finished. When I use "top" to check the status on one of the nodes, I get:

KiB Mem: 65747808 total, 10933660 used, 54814148 free, 279452 buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

So the process is still running but consumes very little memory, which is abnormal; it looks like it has entered a deadlock. Do you have any idea about this, please? Please focus on this test first.
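One way to narrow this down (a sketch, assuming the hang is in communication rather than in the linear algebra itself) is to time a plain collective with mpi4py under exactly the same mpirun line. If this already stalls or is extremely slow, the problem is in the MPI transport (e.g. the sock/TCP fallback) rather than in scalapy.

```python
# Transport sanity check: launch with the exact mpirun line used for the
# benchmark.  If this allreduce is very slow or never returns, the MPI
# transport itself is the bottleneck, not the ScaLAPACK call.
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
buf = np.ones(10_000_000, dtype=np.float64)   # ~80 MB per rank
out = np.empty_like(buf)

comm.Barrier()
t0 = time.time()
comm.Allreduce(buf, out)
comm.Barrier()

if comm.rank == 0:
    print(f"Allreduce of {buf.nbytes / 1e6:.0f} MB took {time.time() - t0:.2f} s")
```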
Hi!! I am also using this package for parallel computing.
I am testing some ScaLAPACK routines with scalapy, and I find that the parallel runs with scalapy take much, much longer than simply using scipy. Take pdsyev as an example. I have done a scaling test with different numbers of processes; part of the results is shown below. The first column is the number of processes, the second is the matrix size, the third is the time (in seconds) taken by the scalapy.lowlevel.pdsyev subroutine, and the last is the time taken by scipy.linalg.eigh with only one process:
4 3000 415.0334515571594 72.1984703540802
8 3000 774.7427942752838 91.13431119918823
16 3000 2001.8645451068878 131.86768698692322
32 3000 9216.71579861641 173.42458987236023
64 3000 9288.961811304092 173.8198528289795
There are three strange points:
This test has been performed on two different supercomputers, and the same conclusion was obtained on both. Has anybody else seen this, and do you know why? Please find the code of my test in the attachment.
pdsyev.txt
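For anyone who wants to reproduce a comparison along these lines without the attachment, here is a minimal sketch (it is not the attached pdsyev.txt). It uses what I recall as scalapy's high-level eigh wrapper rather than the low-level pdsyev call, so treat the routine names as assumptions based on scalapy's README; the 2x2 grid, block size and matrix size are illustrative choices.

```python
# Minimal sketch of the scalapy timing (not the attached script).
# Assumes 4 MPI processes, e.g. "mpirun -n 4 python this_script.py".
import time
import numpy as np
from mpi4py import MPI
from scalapy import core
import scalapy.routines as rt

core.initmpi([2, 2], block_shape=[64, 64])   # 2 x 2 grid for 4 processes

n = 3000
A = None
if MPI.COMM_WORLD.rank == 0:
    A = np.random.standard_normal((n, n))
    A = 0.5 * (A + A.T)                      # symmetrise

# Distribute the matrix from rank 0 and solve the eigenproblem in parallel.
dA = core.DistributedMatrix.from_global_array(A, rank=0)

MPI.COMM_WORLD.Barrier()
t0 = time.time()
evals, evecs = rt.eigh(dA)
MPI.COMM_WORLD.Barrier()

if MPI.COMM_WORLD.rank == 0:
    print(f"scalapy eigh on a {n} x {n} matrix: {time.time() - t0:.2f} s")
```

The single-process scipy.linalg.eigh timing in the last column of the table can be measured separately on one rank for comparison.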