Circuit segment fault when using the newinsts #893

Closed
eddy16112 opened this issue Jul 13, 2020 · 7 comments

@eddy16112
Contributor

I noticed a bug in master after the newinsts changes were merged. It shows up when running circuit like this:
mpirun -x LEGION_FREEZE_ON_ERROR=1 -np 3 ./circuit -ll:gpu 1 -wpp 65536 -npp 65536
I am running on a single node with REALM_NETWORKS=mpi.

Here is the backtrace:
#2 0x0000000002b9abc1 in Realm::realm_freeze (signal=11) at /home/wwu/legion-newinsts/runtime/realm/runtime_impl.cc:138
#3 <signal handler called>
#4 Realm::Rect<1, long long>::contains (this=0x10, p=...) at /home/wwu/legion-newinsts/runtime/realm/point.inl:512
#5 0x0000000001c9ccbb in Realm::InstancePieceList<1, long long>::find_piece (this=0x2b59983c7e90, p=...) at /home/wwu/legion-newinsts/runtime/realm/inst_layout.inl:392
#6 0x0000000002d8f30e in Realm::TransferIteratorBase<1, long long>::step (this=0x2b59904c3350, max_bytes=4, info=..., flags=1536, tentative=true) at /home/wwu/legion-newinsts/runtime/realm/transfer/transfer.cc:198
#7 0x0000000002da4f57 in Realm::XferDes::default_get_requests (this=0x2b59904c3650, reqs=0x2b58ec501b48, nr=1, flags=1542) at /home/wwu/legion-newinsts/runtime/realm/transfer/channel.cc:894
#8 0x0000000002da8979 in Realm::GPUXferDes::get_requests (this=0x2b59904c3650, requests=0x2b58ec501b48, nr=1) at /home/wwu/legion-newinsts/runtime/realm/transfer/channel.cc:2013
#9 0x0000000002da8c02 in Realm::GPUXferDes::progress_xd (this=0x2b59904c3650, channel=0x88d6ab0, work_until=...) at /home/wwu/legion-newinsts/runtime/realm/transfer/channel.cc:2061
#10 0x0000000002dbb63c in Realm::XDQueue<Realm::GPUChannel, Realm::GPUXferDes>::do_work (this=0x88d6ad8, work_until=...) at /home/wwu/legion-newinsts/runtime/realm/transfer/channel.inl:303
#11 0x0000000002bd3298 in Realm::BackgroundWorkManager::Worker::do_work (this=0x2b58ec502130, max_time_in_ns=-1, interrupt_flag=0x0) at /home/wwu/legion-newinsts/runtime/realm/bgwork.cc:536
#12 0x0000000002bd1368 in Realm::BackgroundWorkThread::main_loop (this=0x52035f0) at /home/wwu/legion-newinsts/runtime/realm/bgwork.cc:135
#13 0x0000000002bd451c in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x52035f0) at /home/wwu/legion-newinsts/runtime/realm/threads.inl:97
#14 0x0000000002df0268 in Realm::KernelThread::pthread_entry (data=0x5203690) at /home/wwu/legion-newinsts/runtime/realm/threads.cc:729

Sean thinks it might be a Legion bug: instances are being deleted before the copies that target them have completed.
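Sean's hypothesis can be illustrated with a minimal C++ sketch. The `Instance` type and the `async_copy`, `destroy_after`, and `copy_then_destroy` functions are hypothetical stand-ins, not Legion or Realm APIs, and the sketch does not reflect the fix that was actually pushed to master. It only shows the safe ordering: an asynchronous copy writes into its destination through a raw pointer, so the destination must not be freed until the copy's completion has been observed.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a physical instance (not a Legion/Realm type).
struct Instance {
  std::vector<long> data;
};

// Hypothetical asynchronous copy from `src` into `dst`. Like an in-flight
// transfer, it writes through a raw pointer and does not own the destination.
// It returns the sum of the destination once all writes are done.
std::shared_future<long> async_copy(const std::vector<long>& src, Instance* dst) {
  assert(dst->data.size() >= src.size());
  return std::async(std::launch::async, [&src, dst] {
           for (std::size_t i = 0; i < src.size(); ++i) dst->data[i] = src[i];
           return std::accumulate(dst->data.begin(), dst->data.end(), 0L);
         })
      .share();
}

// Deferred deletion: free the instance only after the copy that targets it
// has completed.
void destroy_after(std::unique_ptr<Instance> inst,
                   const std::shared_future<long>& copy_done) {
  copy_done.wait();  // block until the copy has finished writing
  inst.reset();      // no in-flight copy targets this instance any more
}

// Issues a copy into a fresh instance, then destroys the instance safely.
long copy_then_destroy(const std::vector<long>& src) {
  auto inst = std::make_unique<Instance>();
  inst->data.assign(src.size(), 0);
  std::shared_future<long> done = async_copy(src, inst.get());
  destroy_after(std::move(inst), done);
  return done.get();
}
```

Freeing `inst` immediately after `async_copy` returns, instead of going through `destroy_after`, would let the copy write into freed memory, which is the kind of ordering bug Sean describes.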

@lightsighter
Contributor

Pushed a fix to master. Assigning back to @eddy16112 to confirm and close.

@eddy16112
Contributor Author

I am sorry to say the bug is not fixed. I still get the same error, and the backtrace is the same as before.

@lightsighter
Contributor

Are you sure you are running on the right commit? Did you rebuild from scratch?

Your reproducer is no longer reproducing for me.

@eddy16112
Contributor Author

Yes, I am on the right commit. Can you try increasing -np to 4 or 5, or setting -ll:gpuworkthread 1, to see if you can reproduce it?

@lightsighter
Contributor

Please pull and try again.

@eddy16112
Contributor Author

So far so good! Thanks, I will close the issue now.
