Remove stale solaris configure references #13206
Conversation
A prior commit (PR open-mpi#13163) removed the stale Solaris components as we no longer support that environment. However, the PR left the Solaris configure references in the code base. This PR removes those references. It also removes a duplicate m4 file (opal_check_os_flavors.m4) that exists in the OAC configure area. All references to the OPAL version have been updated to OAC.

Signed-off-by: Ralph Castain <[email protected]>
Not sure the failure has anything to do with this PR - the singleton test is timing out. I do see the following:
testHIndexed (test_util_dtlib.TestUtilDTLib.testHIndexed) ... 20 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
Any ideas?
the unusual spelling is intentional. Signed-off-by: Ralph Castain <[email protected]>
Something is messed up in your main branch - I'm seeing a bunch of errors like this one:
testCreateGroup (test_exceptions.TestExcSession.testCreateGroup) ...
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.
MPI function: MPI_Group_from_session_pset
Reason: PMIx server unreachable
--------------------------------------------------------------------------
Looks like you are trying to test comm_spawn related functions and the "mpirun" server isn't getting spawned for some reason. The tests don't always just fail - you get lots of "proc not reachable" for the child job. Since it takes time for all those individual comm_spawn tests to fail, the overall CI test eventually times out. Again, I can't see how this is related to what is being done here. Did something sneak into your main branch?
This is expected behavior. Nothing to do at all with spawning processes. Just no server to handle pmix group construct ops.
Okay - so how do you guys get this CI to pass? I didn't touch the yaml file.
Check the mpi4py testcreatefromgroup unit test. That’s where the exception is special cased.
Just running it by hand, the problem is that mpi4py is running a bunch of singleton comm_spawn tests - and those are generating errors and an eventual hang. Here is a sample of them:
testArgsOnlyAtRoot (test_spawn.TestSpawnSingleWorld.testArgsOnlyAtRoot) ... 64 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
testCommSpawn (test_spawn.TestSpawnSingleWorldMany.testCommSpawn) ... ERROR
testErrcodes (test_spawn.TestSpawnSingleWorldMany.testErrcodes) ... ERROR
testNoArgs (test_spawn.TestSpawnSingleWorldMany.testNoArgs) ... ERROR
testToMemory (test_status.TestStatus.testToMemory) ... ERROR
test_util_dtlib (unittest.loader._FailedTest.test_util_dtlib) ... ERROR
And here is the traceback for the test that eventually hangs - note that it has called spawn 40 times!
test_apply (test_util_pool.TestProcessPool.test_apply) ... 17 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
^CTraceback (most recent call last):
File "/opt/hpc/build/mpi4py/test/main.py", line 361, in <module>
main(module=None)
File "/usr/local/lib/python3.11/unittest/main.py", line 102, in __init__
self.runTests()
File "/opt/hpc/build/mpi4py/test/main.py", line 346, in runTests
super().runTests()
File "/usr/local/lib/python3.11/unittest/main.py", line 274, in runTests
self.result = testRunner.run(self.test)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/unittest/runner.py", line 217, in run
test(result)
File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
test(result)
File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
test(result)
File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
test(result)
File "/usr/local/lib/python3.11/unittest/case.py", line 678, in __call__
return self.run(*args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/unittest/case.py", line 623, in run
self._callTestMethod(testMethod)
File "/usr/local/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
if method() is not None:
^^^^^^^^
File "/opt/hpc/build/mpi4py/test/test_util_pool.py", line 80, in test_apply
self.assertEqual(papply(sqr, (5,)), sqr(5))
^^^^^^^^^^^^^^^^^
File "/opt/hpc/build/mpi4py/build/lib.linux-aarch64-cpython-311/mpi4py/util/pool.py", line 79, in apply
return future.result()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 451, in result
self._condition.wait(timeout)
File "/usr/local/lib/python3.11/threading.py", line 327, in wait
waiter.acquire()
[rhc-node01:10447] dpm_disconnect_init: error -12 in isend to process 0
...bunch of error outputs like the one below:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[24043,0],0]) is on host: rhc-node01
Process 2 ([[51572,40],0]) is on host: rhc-node01
BTLs attempted: self sm smcuda
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
I did some investigations here and it looks like some changes in this PR somehow broke the connection management in the TCP BTL. If one runs main with mpi4py with OMPI_MCA_btl=self,tcp, things work, but with this PR it doesn't. There are some suspicious differences between the opal_config.h generated in this PR vs. main, and I suspect that's a symptom of other config differences that wind up giving the TCP BTL problems.
Okay, I'll leave this up in case someone else wants to pick it up. I don't have any more time, I'm afraid. There was a check_os_flavors.m4 file still in OMPI that is a duplicate of one in OAC - could be that there are some differences there. Otherwise, I'm not sure what pieces would cause the conflict.
Thanks @rhc54. I confirm @hppritcha's findings -- there's some weirdness in the configure output diff between main and this PR:
@@ -536,10 +536,6 @@
checking __linux__... no
checking __sun__... no
checking __sun... no
-checking for netdb.h... (cached) yes
-checking for netinet/in.h... (cached) yes
-checking for netinet/tcp.h... (cached) yes
-checking for struct sockaddr_in... (cached) yes
checking for _SC_NPROCESSORS_ONLN... yes
checking whether byte ordering is bigendian... no
checking for broken qsort... no
@@ -3608,15 +3604,13 @@
--- MCA component if:bsdx_ipv4 (m4 configuration macro)
checking for MCA component if:bsdx_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking NetBSD, FreeBSD, OpenBSD, or DragonFly... no
+checking struct sockaddr... no (cached)
checking if MCA component if:bsdx_ipv4 can compile... no
--- MCA component if:bsdx_ipv6 (m4 configuration macro)
checking for MCA component if:bsdx_ipv6 compile mode... static
-checking struct sockaddr... yes (cached)
-checking some flavor of BSD... yes
-checking if MCA component if:bsdx_ipv6 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:bsdx_ipv6 can compile... no
--- MCA component if:linux_ipv6 (m4 configuration macro)
checking for MCA component if:linux_ipv6 compile mode... static
@@ -3625,11 +3619,8 @@
--- MCA component if:posix_ipv4 (m4 configuration macro)
checking for MCA component if:posix_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking not NetBSD, FreeBSD, OpenBSD, or DragonFly... yes
-checking for struct ifreq.ifr_hwaddr... no
-checking for struct ifreq.ifr_mtu... yes
-checking if MCA component if:posix_ipv4 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:posix_ipv4 can compile... no
+++ Configuring MCA framework installdirs
checking for no configure components in framework installdirs...
I can investigate more this weekend.
Smells like it might be the difference between opal and OAC check_os_flavors.m4? Note that I do have a PR open on the OAC version - don't think it will solve this problem, but just wanted to alert you to it:
Yeah, the first hunk of that diff is definitely a difference between the OPAL and OAC version. But that shouldn't make a difference... I'm confused by the "(cached)" notation on the struct sockaddr checks. I'll investigate more this weekend.
Just for another data point: I checked the diffs for PMIx and PRRTE before/after the change, and I don't see any of the differences that you flagged. Note that PMIx has the same interface components (and configury) as OPAL. All the differences I see are benign and expected. 🤷♂️
This PR disables most of the opal/mca/if components.
This speaks for making mpi4py a required-to-merge CI test.
I believe the thought for OAC was that "check_os_flavors" really meant "identify the OS" - it did not include testing the environment for all sorts of unrelated things. In PMIx/PRRTE, we do that in a separate section. That keeps things cleaner, and it also explains why PMIx/PRRTE didn't see similar problems. Up to you how you want to resolve it - restore the OPAL version or add the detection code in a separate area. Just please don't change the OAC version by adding in a bunch of stuff that goes beyond its intended purpose.
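For illustration, here is a minimal m4 sketch of what an OS-flavor-only check of that kind looks like -- it probes predefined compiler symbols, exactly the sort of "checking __linux__... no" lines visible in the configure diff above, and does nothing network-related. The macro and variable names are invented for this example; this is not the actual OAC or OPAL implementation.

# Hedged sketch only: macro and shell variable names are illustrative.
AC_DEFUN([EXAMPLE_CHECK_OS_FLAVOR_SPECIFIC],
         [AC_MSG_CHECKING([$1])
          AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [[
#ifndef $1
#error "$1 is not defined"
#endif
          ]])],
                            [AC_MSG_RESULT([yes])
                             $2=1],
                            [AC_MSG_RESULT([no])
                             $2=0])])

# Illustrative usage (mirrors the "checking __linux__... no" output style):
# EXAMPLE_CHECK_OS_FLAVOR_SPECIFIC([__linux__], [example_have_linux])
# EXAMPLE_CHECK_OS_FLAVOR_SPECIFIC([__sun], [example_have_sun])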
You don't need 30 min of mpi4py to detect this problem, though it does have other uses.
Yes, I would agree -- there's a reason we removed the network detection stuff from the OAC "check OS" macro (because it has nothing to do with detecting the OS). So we don't really want to go back to the OPAL version, and we also don't want to add those checks into the OAC version. Worst case, we just do those checks in OMPI's top-level configure.ac. This is the conclusion I came to before I posted my comment this morning, but the reason I didn't just do that outright is that I want to understand the "(cached)" output first.
Maybe an OAC_check_network.m4? Could the caching be due to running the configure in PMIx and PRRTE prior to you doing it in OPAL? I believe all the var names are common between the projects. |
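To make that proposal concrete, here is a hedged sketch of what a hypothetical OAC_check_network.m4 might contain -- no such macro exists today, and the macro name, shell variable, and header list below are purely illustrative, loosely modeled on the checks that used to live in OPAL_CHECK_OS_FLAVORS.

# Hypothetical macro -- sketched only to show network detection kept
# separate from OS identification; names are not real OAC identifiers.
AC_DEFUN([EXAMPLE_OAC_CHECK_NETWORK],[
    AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])

    # struct sockaddr_in is declared in <netinet/in.h>, so include it
    # (when present) before probing for the type.
    AC_CHECK_TYPES([struct sockaddr_in],
                   [example_found_sockaddr_in=yes],
                   [example_found_sockaddr_in=no],
                   [AC_INCLUDES_DEFAULT
#ifdef HAVE_NETINET_IN_H
#include <netinet/in.h>
#endif])
])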
I found the issue (issues, actually). It's much more mundane than we thought. Here's the commit message for the commit I just pushed:

opal/mca/if: fix "struct sockaddr_in" and OS checks

Found a few more places we needed to adjust for changing from OPAL_CHECK_OS_FLAVORS to OAC_CHECK_OS_FLAVORS. Also, in the opal/mca/if components, we have configure.m4 scripts that explicitly check $opal_found_sockaddr. This was a problem for a few reasons:

1. We just deleted the setting of $opal_found_sockaddr from the previous OPAL_CHECK_OS_FLAVORS macro (*why* it was set in that macro isn't really clear -- "struct sockaddr" doesn't really have anything to do with checking OS flavors).
2. The old OPAL_CHECK_OS_FLAVORS macro actually checked for "struct sockaddr_in", not "struct sockaddr". This led to a lot of confusion in this round of debugging.

Also, the additional network header checks and check for struct sockaddr_in in OPAL_CHECK_OS_FLAVORS were redundant: they were already being performed in OMPI's top-level configure.ac. Deleting these redundant tests -- and indeed, deleting all of OPAL_CHECK_OS_FLAVORS -- is fine. But we did need to set a global variable for the opal/mca/if/*/configure.m4 scripts to check. This commit therefore adjusts the top-level configure.ac script to explicitly save the result of checking for "struct sockaddr_in" into $opal_found_sockaddr_in.

Finally, slightly change the AC_MSG_RESULT output in the opal/mca/if/*/configure.m4 scripts to make it clear that the check for "struct sockaddr_in" is *not* using the regular AC_CHECK_TYPES method. Instead, the "cached" results it is getting are from OPAL caching, not regular Autoconf test caching. Do this because the "(cached)" output that it previously emitted caused considerable confusion during this round of debugging (i.e., I assumed it was coming from regular Autoconf test caching, which is an entirely different mechanism).

Signed-off-by: Jeff Squyres <[email protected]>
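As a hedged illustration of the mechanism that commit message describes (a sketch under assumptions, not the actual patch), the handoff looks roughly like this: the top-level configure.ac saves its existing "struct sockaddr_in" result in $opal_found_sockaddr_in, and each opal/mca/if/*/configure.m4 consumes that variable, printing a result message that makes clear the caching is OPAL's own rather than Autoconf's. The exact result wording and the component-level variable name below are assumptions.

# --- top-level configure.ac (sketch): remember the AC_CHECK_TYPES result
AC_CHECK_TYPES([struct sockaddr_in],
               [opal_found_sockaddr_in=yes],
               [opal_found_sockaddr_in=no],
               [#include <netinet/in.h>])

# --- opal/mca/if/<component>/configure.m4 (sketch): consume the saved
# result instead of re-running an Autoconf type check; the result text
# here is illustrative, not the exact wording used in the commit.
AC_MSG_CHECKING([struct sockaddr_in])
AS_IF([test "$opal_found_sockaddr_in" = "yes"],
      [AC_MSG_RESULT([yes (already checked by OPAL)])
       example_component_happy=yes],
      [AC_MSG_RESULT([no (already checked by OPAL)])
       example_component_happy=no])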