Regent: Scalability Issue of Cholesky.rg in legion/language/examples #663
Well, first of all, none of those loops are being index launched, because they're failing in the optimizer. The easiest way to tell is to put […]. But the other two loops cannot possibly be valid index launches. The problem is that the loops aren't properly nested:
The point of linking the slides is that it's possible to write this in a way where they are properly nested:
You'll see there aren't any loops with two or more task calls; each loop nest contains exactly one task call. The other nice thing about this is that the loop at the bottom is a square, whereas the one in the current Regent code is a triangle (i.e. the bounds […]). Bottom line is that the code does not match what I linked in that slide, and will have to be rewritten if you really want it to work well.
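One practical consequence of the triangular bounds is that later outer iterations launch fewer and fewer tasks; a quick count of the two shapes (np = 40 here is taken from the problem size used in this issue, purely for illustration):

```python
# A "square" inner loop launches np tasks on every outer iteration,
# while a triangular loop launches np - x - 1 tasks at outer
# iteration x, shrinking to zero by the end of the factorization.
np_blocks = 40  # number of blocks, taken from the runs in this thread
square = sum(np_blocks for x in range(np_blocks))
triangle = sum(np_blocks - x - 1 for x in range(np_blocks))
print(square, triangle)  # the triangle has roughly half the tasks overall
```

So even when the triangular loops do get index launched, the later launches carry much less parallelism than a square launch would.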
@magnatelee pointed out to me that 80*40 is probably quite a small problem size for Cholesky, so even aside from the issue with index launches (which you should still certainly aim to fix), increasing the problem size might help quite a bit. We'd need a Legion Prof chart to be sure.
Thanks! I have changed the code to the following:

for x = 0, np do […]

I ran the code with the flag -foverride-demand-index-launch 1 on a small problem of size 80*4 and everything is good: ELAPSED TIME = 25.915 ms. However, when I changed the size back to 80*40, I got the following error:

[0 - 7ff1d475c700] {6}{compqueue}: completion queue overflow: cq=1900000000000003 size=1024

This only seems to happen when I am on the master branch (the code at the same size runs normally on the stable branch).
Yes I have that update in my commit:
commit a4f42af
commit e4b4847
commit fcc1e34
commit 70f3699
commit 268ebe7
commit 61cbe9d
commit b8e06dc

[0 - 7f3f4c26a700] {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 1 of operation transpose_copy (UID 793) in parent task cholesky (UID 3) is using uninitialized data for field(s) ,101 of logical region (1014,1,1) (from file /home/users/adncat/yizhou/Legion/runtime/legion/legion_ops.cc:539)
[0 - 7f3f4c26a700] {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 1 of operation transpose_copy (UID 795) in parent task cholesky (UID 3) is using uninitialized data for field(s) ,101 of logical region (1054,1,1) (from file /home/users/adncat/yizhou/Legion/runtime/legion/legion_ops.cc:539)
[0 - 7f3f4c26a700] {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 1 of operation transpose_copy (UID 797) in parent task cholesky (UID 3) is using uninitialized data for field(s) ,101 of logical region (1094,1,1) (from file /home/users/adncat/yizhou/Legion/runtime/legion/legion_ops.cc:539)
[0 - 7f3f4c229700] {6}{compqueue}: completion queue overflow: cq=1900000000000003 size=1024
Just to confirm, you did re-run […]?
Yes.
Is the version of the code you are running the same as what you had in the pull request? Or do you want to put up a new version?
I created a new pull request:
Passing this over to @lightsighter . Legion is creating a bounded completion queue with a size of 1024 and it's overflowing:
(Note that you need the modified cholesky.rg from #665 for this.)
Fixed with 105cc4b
@qyz96 Can you pull and try again? |
Thanks for the quick fix! Yes, the new commit works for me now.
Ok, can you go ahead and run your code with Legion Prof, and we'll see how well it's actually performing now? (You can zip the file and attach it here.)
I am having some difficulty opening index.html in the legion_prof folder. It looks like the entire folder is too big to upload here (1 GB).
Maybe you can put it in some sort of shared location on Sherlock?
I have shared it on Dropbox:
So what I see in this profile is that the utility processor is 100% full, while the CPU is only 50% full. This is not surprising, since […]; I would change […].
I have changed the size to 200*24 and ran the program on 1 node with 24 cores (using salloc). The profile link is here: […] It looks like the CPU usage is unusually low in this profile. I tried to increase the size of the tasks by changing the matrix size to 400*24, but I got an out-of-memory problem (which I did not encounter with other systems). Also, I tried compiling with different numbers of threads (using install.py -j NUM_THREADS), but the performance seems to be always the same. Is there a way to check the number of threads the program is using? Thanks!
Never mind, I just found the flags that specify the number of cores and memory per core. I will let you know if I have further problems.
Do you know how I can deal with this bug? I am running on a node with four cores. Thanks!
A couple of things: […]
A quick way to compare with the reference implementation without actually running the reference code is to calculate the FLOP/s of the Cholesky. A high-performance DPOTRF would achieve a similar fraction of peak FLOP/s as DGEMM, and MKL DGEMM is close to the peak FLOP/s of the CPU.
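As a rough sketch of that calculation (the matrix size n = 3200, i.e. 40 blocks of size 80, and the ~211 s runtime are assumptions taken from the numbers reported in this thread):

```python
# Rough Cholesky FLOP/s estimate. The n and elapsed-time values used
# below are assumptions taken from the runs reported in this issue.
def cholesky_gflops(n, elapsed_seconds):
    flops = n ** 3 / 3.0  # leading term of the Cholesky flop count
    return flops / elapsed_seconds / 1e9

# 40 blocks of size 80 => n = 3200; ~211 s from the single-node runs above.
rate = cholesky_gflops(3200, 211.0)
print(f"{rate:.3f} GFLOP/s")
```

A result that is orders of magnitude below the CPU's peak, as here, points at runtime overhead rather than the BLAS kernels themselves.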
I have shared the profiling output here:
The trace shows you're 100% bottlenecked on the utility processor. This means you either need a larger problem size (so that the tasks get bigger), or you have to make fewer blocks (do you really need 400 blocks when you only have 24 processors?), or you need to use tracing (which means adding a timestep loop around the outside, and using […]). In general, Legion is designed to reserve 1 or 2 cores for the runtime's internal use, so it's not surprising that it ticks up at the end. But given the utility bottleneck, that's the biggest issue.
Hi, I am encountering this issue again: […] Could you help me with this? I have run the code in gdb and the output says:

Starting program: /home/users/adncat/yizhou/Legion/language/terra/terra yizhou/Legion/language/examples/cholesky.rg -ll:cpu 12 -ll:util 4 -ll:csize 8192 -fflow 0 -foverride-demand-index-launch 1 -n 20000 -np 20
Program received signal SIGILL, Illegal instruction.
Program received signal SIGILL, Illegal instruction.
Program terminated with signal SIGILL, Illegal instruction.

Thanks!
You got a SIGILL, which makes me wonder if the code was compiled correctly. What machine are you running on again? If it's a distributed machine, is it possible the node you compiled on has a different architecture than the node you're running on? |
I am running on the Sherlock cluster, but possibly on a different node each time. Does that mean I have to run install.py every time? By the way, is there a way to disable printing the following warning?

[0 - 7fb5c004b700] {4}{runtime}: [warning 1019] LEGION WARNING: Region requirements 0 and 2 of index task dgemm (UID 16794) in parent task cholesky (UID 3) are potentially interfering. It's possible that this is a false positive if there are projection region requirements and each of the point tasks are non-interfering. If the runtime is built in debug mode then it will check that the region requirements of all points are actually non-interfering. If you see no further error messages for this index task launch then everything is good. (from file /home/users/adncat/yizhou/Legion/runtime/legion/legion_tasks.cc:6993)

Thanks!
I don't know, but just for the moment, let's try that to make sure we remove it as a potential cause. You can run with […]
If you do a run with the runtime built in debug mode and it doesn't report an error, then you can safely ignore the warning about potentially interfering region requirements.
Hi, is the cholesky.rg here a distributed version that I can run on multiple nodes? I tried running with gasnetrun_ibv but it says the following: […] Thank you!
That error has nothing to do with this program. We do not recommend running with […]
Yes, I compiled with the ibv configuration, which I believe has the --enable-mpi-compat flag. I tried running with the following command:
How about this?
Note, though, that if this is OpenMPI you need some additional flags to make sure all the environment variables get passed through.
I am using the Intel MPI library: […] That suggestion does work! But I am getting a seg fault at the end: […]
I also get the following error when I try to run this command on multiple nodes: […] Thank you!
Is it possible you rebuilt or something? This isn't a symbol directly used by Regent, so it would be only used transitively through some sort of Legion API call. I wouldn't expect that to go wrong unless there was a miscompilation or perhaps a dirty compilation of some sort.
I ran the setup script ./setup_env.py and then built with ./install.py --gasnet. It seems that the error disappeared after I reinstalled with the command ./install.py. The seg fault that I was seeing on one node also disappeared:

LAUNCHER="mpirun -n 1" ./regent.py examples/cholesky.rg -fflow 0 -foverride-demand-index-launch 1 -level 5 -n 10000 -p 20 -ll:cpu 16 -ll:csize 8192

However, I did not really see any scaling with multiple nodes. I saw the following output with 8 nodes: […] I am not sure if I am running the code correctly; the exact command was: […] Am I missing something here? Am I supposed to install with GASNet support (with the --gasnet flag) and resolve the previous issue, or did I miss some configuration in the command? I could profile the process, but I am also not sure how I can generate profiling results for each node.
@qyz96 Your matrix size is too small for an 8-node run. Cholesky is not a suitable application for strong scaling; you can try weak scaling instead (1 node, 4 nodes, 16 nodes, ...). Also try some larger block sizes; the default 500*500 may not be large enough to saturate the CPU. Instead of using the actual run time, converting the time into FLOP/s tells you how close the Cholesky is to the peak FLOP/s of the machine. Last, I am not sure this Cholesky is optimized for a distributed run. A high-performance Cholesky should use a 2D block-cyclic data distribution, which is not in the default mapper, so you might need to create a custom mapper.
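For intuition, here is a minimal sketch of what a 2D block-cyclic owner map looks like (the 2x4 process grid and 8x8 block grid are made-up illustration values, not anything Legion's default mapper does):

```python
# Sketch of a 2D block-cyclic owner map: block (i, j) of the matrix
# goes to process (i mod pr, j mod pc) on a pr x pc process grid.
def block_cyclic_owner(i, j, pr, pc):
    return (i % pr) * pc + (j % pc)  # linearized process rank

# On an assumed 2x4 grid, every block row/column touches several
# processes, so no process sits idle as the active triangle shrinks.
owners = [[block_cyclic_owner(i, j, 2, 4) for j in range(8)] for i in range(8)]
print(owners[0])  # first block row cycles through processes 0..3 twice
```

This is the same distribution scheme ScaLAPACK-style dense solvers use; implementing it in Legion would mean encoding an owner map like this in a custom mapper.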
I see. I am not very familiar with using custom mappers. Do you have some references or documentation regarding that? Thanks!
Hi, […] Thank you!
I don't know much about your dgemm code, but I believe for Cholesky the number you give to […]
We had some issues with the utility processor in the past, but this looks different. Even if the utility processor is very busy, the workload should still be evenly distributed, but we are seeing that several nodes are basically idle. Here are some screenshots from the profiler; the profiling data is here: […]

Notice the 50% […]. Nodes 4 through 7 are idle. There is some utility activity towards the end; the tasks at the end are all […]. You can find the Regent code here: […]

If you look at the benchmark above, the matrix is 16 x 16 blocks (of size 512), so each loop does 16 iterations. The code seems to stop scaling when we try to use more than 16 cores, so it seems that the loops are not parallelized correctly. Is there something preventing the compiler from parallelizing the loops? The performance seems consistent with parallelizing only one of the loops. The code is quite simple. Three nested loops:
and the […]
Aha, I was wrong about the code that I wrote myself. :) In general, code like Cholesky needs a custom mapper, since the default mapper is not smart enough to handle irregular task graphs. I just didn't bother to write one for the Cholesky example, as it wasn't really intended for any performance evaluation. (And I'd be happy to help write the right custom mapper, if you are planning to use it for any serious performance comparison.) For your gemm code, though, I think what you need is a 2D index launch like the following (which may need […]):
The default mapper should be able to distribute the tasks from this code evenly across all the nodes, as long as np^2 is greater than or equal to the number of cores.
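A back-of-the-envelope check of that condition (the core counts below are assumptions based on the machines discussed in this thread):

```python
# With a 2D index launch over an np x np space, one launch produces
# np^2 point tasks; the default mapper can spread work over the whole
# machine only if there are at least as many tasks as cores.
def enough_parallelism(np_blocks, cores):
    return np_blocks ** 2 >= cores

# 20 x 20 blocks cover 8 nodes x 24 cores = 192 cores...
print(enough_parallelism(20, 192))  # True
# ...whereas a 1D launch of np = 20 tasks would not.
print(20 >= 192)  # False
```

This is why flattening the two gemm loops into a single 2D launch matters: the 1D version simply does not produce enough point tasks per launch.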
Makes sense. Thanks.
Thanks for the suggestion Wonchan, it does make a lot of sense. However, we sometimes hit random segfaults/hangs on "large" problems. I attached a GASNet backtrace to this message. Looking at the code, is there anything that seems off to you? I built Legion on commit […]
In the backtrace above, the seg fault appears to be occurring inside an MKL-accelerated routine. Can you look at that call with a debugger and confirm the arguments are reasonable? We've seen problems with other BLAS libraries (e.g. Cray's) being unhappy about being called from multiple tasks concurrently, so I'd look into how to get MKL to run in single-threaded mode and/or experiment with a different BLAS implementation.
By the way, here is how to link with the single-threaded MKL: […]
Ha, I actually didn't notice the MKL call in the backtrace, good catch. It's probably from there, I agree.
I tried OpenBLAS instead; no difference, it still hangs/segfaults. However, I may (maybe?) have found something.
and I saw this (full log below) on rank 3 before segfaulting
Notice the […]. Finally, all the other arguments (alpha, beta, n, nb, x, y, k, ...) look fine.
Hi,
I am running cholesky.rg in legion/language/examples for some runtime comparisons. I ran the code on the Sherlock cluster with different numbers of cores per task, tasks per node, and nodes. The size of the matrix is set to 80*40 and the number of blocks is set to 40. The results are shown below:
8 cores per task, 1 task per node, 1 node:
214526.257 ms
16 cores per task, 1 task per node, 1 node:
255938.125 ms
24 cores per task, 1 task per node, 1 node:
210877.760 ms
24 cores per task, 1 task per node, 2 nodes:
213083.923 ms
8 cores per task, 3 tasks per node, 1 node:
208970.563 ms
4 cores per task, 6 tasks per node, 1 node:
211100.003 ms
4 cores per task, 6 tasks per node, 2 nodes:
229718.337 ms
8 cores per task, 3 tasks per node, 2 nodes:
229127.595 ms
It appears that cholesky.rg has worse performance on multiple nodes. I understand that there may be some problems with the structure of the code, as mentioned in #574. It was said that index launches need to be used and the structure of the code needs to match the one shown in:
https://www.oerc.ox.ac.uk/sites/default/files/uploads/ProjectFiles/ASEArch/starpu.pdf#page=41
However, it seems to me that the for loop part, which is shown below,
var is = ispace(f2d, { x = n, y = n })
var cs = ispace(f2d, { x = np, y = np })
var rA = region(is, double)
var rB = region(is, double)
var pA = partition(equal, rA, cs)
var pB = partition(equal, rB, cs)
var bn = n / np
for x = 0, np do
dpotrf(x, n, bn, pB[f2d { x = x, y = x }])
for y = x + 1, np do
dtrsm(x, y, n, bn, pB[f2d { x = x, y = y }], pB[f2d { x = x, y = x }])
end
for k = x + 1, np do
dsyrk(x, k, n, bn, pB[f2d { x = k, y = k }], pB[f2d { x = x, y = k }])
for y = k + 1, np do
dgemm(x, y, k, n, bn,
pB[f2d { x = k, y = y }],
pB[f2d { x = x, y = y }],
pB[f2d { x = x, y = k }])
end
end
end
already matches the right-looking structure, or am I missing something? Also, maybe I am just not so familiar with Regent, but isn't the for loop of the decomposition in the code already using index spaces and partitions? Is the scalability issue caused by the fact that the code is parallelized but not optimized in some way?