Remove stale Solaris configure references #13206

Open
wants to merge 3 commits into base: main

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Apr 22, 2025

A prior commit (PR #13163) removed the stale Solaris components as we no longer support that environment. However, the PR left the Solaris configure references in the code base.

This PR removes those references. It also removes a duplicate m4 file (opal_check_os_flavors.m4) that exists in the OAC configure area. All references to the OPAL version have been updated to OAC.

Signed-off-by: Ralph Castain <[email protected]>
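
As an illustration of the "references updated to OAC" part of the description above (a hypothetical fragment, not a line taken from this diff), the change amounts to switching each invocation of the duplicated OPAL macro to the shared OAC copy:

    dnl Hypothetical configure.m4 fragment illustrating the rename.
    dnl Before: call the duplicated OPAL copy of the macro
    dnl   OPAL_CHECK_OS_FLAVORS
    dnl After: call the shared copy from the OAC configure area
    OAC_CHECK_OS_FLAVORS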
@rhc54 rhc54 requested review from jsquyres and bwbarrett April 22, 2025 15:20
@rhc54 rhc54 self-assigned this Apr 22, 2025
@rhc54
Contributor Author

rhc54 commented Apr 22, 2025

Not sure the failure has anything to do with this PR - the singleton test is timing out. I do see the following:

testHIndexed (test_util_dtlib.TestUtilDTLib.testHIndexed) ... 20 more processes
have sent help message help-mca-bml-r2.txt / unreachable proc

Any ideas?

bwbarrett
bwbarrett previously approved these changes Apr 22, 2025
the unusual spelling is intentional.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Apr 22, 2025

Something is messed up in your main branch - I'm seeing a bunch of errors like this one:

testCreateGroup (test_exceptions.TestExcSession.testCreateGroup) ... --------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Group_from_session_pset
  Reason:       PMIx server unreachable
--------------------------------------------------------------------------

Looks like you are trying to test comm_spawn-related functions and the "mpirun" server isn't getting spawned for some reason. The tests don't always just fail - you get lots of "proc not reachable" errors for the child job. Since it takes time for all of those individual comm_spawn tests to fail, the overall CI test eventually times out.

Again, I can't see how this is related to what is being done here. Did something sneak into your main branch?

@hppritcha
Member

This is expected behavior. Nothing to do at all with spawning processes. Just no server to handle pmix group construct ops.

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

This is expected behavior. Nothing to do at all with spawning processes. Just no server to handle pmix group construct ops.

Okay - so how do you guys get this CI to pass? I didn't touch the yaml file.

@hppritcha
Member

Check the mpi4py testcreatefromgroup unit test. That’s where the exception is special cased.

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

Running it by hand, I see that mpi4py is running a bunch of singleton comm_spawn tests, and those generate errors and an eventual hang. Here is a sample of them:

testArgsOnlyAtRoot (test_spawn.TestSpawnSingleWorld.testArgsOnlyAtRoot) ... 64 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
testCommSpawn (test_spawn.TestSpawnSingleWorldMany.testCommSpawn) ... ERROR
testErrcodes (test_spawn.TestSpawnSingleWorldMany.testErrcodes) ... ERROR
testNoArgs (test_spawn.TestSpawnSingleWorldMany.testNoArgs) ... ERROR
testToMemory (test_status.TestStatus.testToMemory) ... ERROR
test_util_dtlib (unittest.loader._FailedTest.test_util_dtlib) ... ERROR

and here is the traceback for the test that eventually hangs - note that it has called spawn 40 times!

test_apply (test_util_pool.TestProcessPool.test_apply) ... 17 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
^CTraceback (most recent call last):
  File "/opt/hpc/build/mpi4py/test/main.py", line 361, in <module>
    main(module=None)
  File "/usr/local/lib/python3.11/unittest/main.py", line 102, in __init__
    self.runTests()
  File "/opt/hpc/build/mpi4py/test/main.py", line 346, in runTests
    super().runTests()
  File "/usr/local/lib/python3.11/unittest/main.py", line 274, in runTests
    self.result = testRunner.run(self.test)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/runner.py", line 217, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/case.py", line 678, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/usr/local/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/opt/hpc/build/mpi4py/test/test_util_pool.py", line 80, in test_apply
    self.assertEqual(papply(sqr, (5,)), sqr(5))
                     ^^^^^^^^^^^^^^^^^
  File "/opt/hpc/build/mpi4py/build/lib.linux-aarch64-cpython-311/mpi4py/util/pool.py", line 79, in apply
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/usr/local/lib/python3.11/threading.py", line 327, in wait
    waiter.acquire()

[rhc-node01:10447] dpm_disconnect_init: error -12 in isend to process 0

...bunch of error outputs like the one below:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[24043,0],0]) is on host: rhc-node01
  Process 2 ([[51572,40],0]) is on host: rhc-node01
  BTLs attempted: self sm smcuda

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

@hppritcha
Member

I did some investigations here and it looks like some changes in this PR somehow broke the connection management in the TCP BTL. If one runs mpi4py against main with OMPI_MCA_btl=self,tcp, things work, but with this PR they don't. There are some suspicious differences between the opal_config.h generated by this PR and by main, and I suspect that's a symptom of other config differences that wind up giving the TCP BTL problems.
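
For reference, a rough reproduction of the comparison described above (hypothetical invocation; the test-suite path is the one that appears in the traceback quoted earlier in this thread and will differ per setup):

    # Restrict Open MPI to the self and tcp BTLs via the usual MCA
    # environment-variable convention, then run the mpi4py test suite
    # once against a main build and once against this branch.
    export OMPI_MCA_btl=self,tcp
    python /opt/hpc/build/mpi4py/test/main.py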

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

Okay, I'll leave this up in case someone else wants to pick it up. I don't have any more time, I'm afraid. There was a check_os_flavors.m4 file still in OMPI that is a duplicate of one in OAC - could be that there are some differences there. Otherwise, I'm not sure which pieces would cause the conflict.

@jsquyres
Member

Thanks @rhc54.

I confirm @hppritcha's findings -- there are odd differences in the configure results between this PR and main. I diff'ed the output from running configure on main vs. running configure on this branch. The meaningful differences appear to be these (- == main, + == this PR):

@@ -536,10 +536,6 @@
 checking __linux__... no
 checking __sun__... no
 checking __sun... no
-checking for netdb.h... (cached) yes
-checking for netinet/in.h... (cached) yes
-checking for netinet/tcp.h... (cached) yes
-checking for struct sockaddr_in... (cached) yes
 checking for _SC_NPROCESSORS_ONLN... yes
 checking whether byte ordering is bigendian... no
 checking for broken qsort... no

@@ -3608,15 +3604,13 @@
 
 --- MCA component if:bsdx_ipv4 (m4 configuration macro)
 checking for MCA component if:bsdx_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking NetBSD, FreeBSD, OpenBSD, or DragonFly... no
+checking struct sockaddr... no (cached)
 checking if MCA component if:bsdx_ipv4 can compile... no
 
 --- MCA component if:bsdx_ipv6 (m4 configuration macro)
 checking for MCA component if:bsdx_ipv6 compile mode... static
-checking struct sockaddr... yes (cached)
-checking some flavor of BSD... yes
-checking if MCA component if:bsdx_ipv6 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:bsdx_ipv6 can compile... no
 
 --- MCA component if:linux_ipv6 (m4 configuration macro)
 checking for MCA component if:linux_ipv6 compile mode... static
@@ -3625,11 +3619,8 @@
 
 --- MCA component if:posix_ipv4 (m4 configuration macro)
 checking for MCA component if:posix_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking not NetBSD, FreeBSD, OpenBSD, or DragonFly... yes
-checking for struct ifreq.ifr_hwaddr... no
-checking for struct ifreq.ifr_mtu... yes
-checking if MCA component if:posix_ipv4 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:posix_ipv4 can compile... no
 
 +++ Configuring MCA framework installdirs
 checking for no configure components in framework installdirs... 

I can investigate more this weekend.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Smells like it might be the difference between the OPAL and OAC versions of check_os_flavors.m4? Note that I do have a PR open on the OAC version - I don't think it will solve this problem, but just wanted to alert you to it:

open-mpi/oac#20

@jsquyres
Member

Yeah, the first hunk of that diff is definitely a difference between the OPAL and OAC version. But that shouldn't make a difference...

I'm confused by the "(cached)" notation on the "checking struct sockaddr..." test in the subsequent hunks. How can the first difference between those two configure outputs be a cached difference? I.e., that should be a test that was already run, and the two runs would have had to come up with different answers. But I'm not seeing that. 🤷‍♂️

I'll investigate more this weekend.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Just for another data point: I checked the diffs for PMIx and PRRTE before/after the change, and I don't see any of the differences that you flagged. Note that PMIx has the same interface components (and configury) as OPAL. All the differences I see are benign and expected. 🤷‍♂️

@bosilca
Member

bosilca commented Apr 24, 2025

This PR disables most of the if components because it removed the check for sockaddr support that the if components require. This is due to a difference between OAC's check_os_flavors.m4 and OMPI's check_os_flavors.m4: the former does not check for network support (headers and sockaddr). Going back to using OMPI's check_os_flavors.m4 should fix these issues.
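
To make that difference concrete, here is a rough sketch (not the actual file contents) of the network-related checks that existed only in the OPAL copy of check_os_flavors.m4, per the configure diff above and the commit message later in this thread; the OAC copy stops after identifying the OS and never sets $opal_found_sockaddr:

    dnl Sketch of the extra checks in the OPAL version of the macro.
    AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
    dnl Note the name mismatch: the type checked is "struct sockaddr_in",
    dnl but the result is stored in $opal_found_sockaddr, which the
    dnl opal/mca/if/*/configure.m4 scripts later consult.
    AC_CHECK_TYPES([struct sockaddr_in],
                   [opal_found_sockaddr=yes],
                   [opal_found_sockaddr=no],
                   [#include <netinet/in.h>])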

@hppritcha
Member

This argues for making mpi4py a required-to-merge CI test.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

I believe the thought for OAC was that "check_os_flavors" really meant "identify the OS" - and did not include testing the environment for all sorts of unrelated things. In PMIx/PRRTE, we do that in a separate section. Keeps things cleaner. Also explains why PMIx/PRRTE didn't see similar problems.

Up to you how you want to resolve it - restore the OPAL version or add the detection code in a separate area. Just please don't change the OAC version by adding a bunch of stuff that goes beyond its intended purpose.

This argues for making mpi4py a required-to-merge CI test.

You don't need 30 min of mpi4py to detect this problem, though it does have other uses.

@jsquyres
Member

jsquyres commented Apr 24, 2025

Yes, I would agree -- there's a reason we removed the network detection stuff from the OAC "check OS" macro (because it has nothing to do with detecting the OS). So we don't really want to go back to the OPAL version, and we also don't want to add those checks into the OAC version.

Worst case, we just do those checks in OMPI's configure.ac.

This is the conclusion I came to before I posted my comment this morning, but the reason I didn't just do that outright is that I want to understand the (cached) oddity first: I don't understand why it says those results are cached when we previously ran those tests and they agreed with the configure output from main. That makes no sense to me.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Maybe an OAC_check_network.m4? Could the caching be due to running configure in PMIx and PRRTE before doing it in OPAL? I believe all the var names are common between the projects.

@jsquyres
Member

jsquyres commented Apr 25, 2025

I found the issue (issues, actually). It's much more mundane than we thought. Here's the commit message for the commit I just pushed:


opal/mca/if: fix "struct sockaddr_in" and OS checks

Found a few more places we needed to adjust for changing from OPAL_CHECK_OS_FLAVORS to OAC_CHECK_OS_FLAVORS.

Also, in the opal/mca/if components, we have configure.m4 scripts that explicitly check $opal_found_sockaddr. This was a problem for a few reasons:

  1. We just deleted the setting of $opal_found_sockaddr from the previous OPAL_CHECK_OS_FLAVORS macro (why it was set in that macro isn't really clear -- "struct sockaddr" doesn't really have anything to do with checking OS flavors).
  2. The old OPAL_CHECK_OS_FLAVORS macro is actually checking for "struct sockaddr_in", not "struct sockaddr". This led to a lot of confusion in this round of debugging.

Also, the additional network header checks and check for struct sockaddr_in in OPAL_CHECK_OS_FLAVORS were redundant: they were already being performed in OMPI's top-level configure.ac. Deleting these redundant tests -- and indeed, deleting all of OPAL_CHECK_OS_FLAVORS -- is fine. But we did need to set a global variable for the opal/mca/if/*/configure.m4 scripts to check. This commit therefore adjusts the top-level configure.ac script to explicitly save the result of checking for "struct sockaddr_in" into $opal_found_sockaddr_in.

Finally, slightly change the AC_MSG_RESULT output in the opal/mca/if/*/configure.m4 scripts to make it clear that the check for "struct sockaddr_in" is not using the regular AC_CHECK_TYPES method. Instead, the "cached" results it is getting are from OPAL caching, not regular Autoconf test caching. Do this because the "(cached)" output that it previously emitted caused considerable confusion during this round of debugging (i.e., I assumed it was coming from regular Autoconf test caching, which is an entirely different mechanism).

Signed-off-by: Jeff Squyres <[email protected]>
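
As a rough sketch of the mechanism that commit describes (hypothetical fragments, not the literal committed code): the top-level configure.ac saves the result of its existing "struct sockaddr_in" check into a global shell variable, and each opal/mca/if component's configure.m4 then reports that saved value while making clear the caching is OPAL's own rather than Autoconf's:

    dnl In OMPI's top-level configure.ac (sketch): remember the result of
    dnl the "struct sockaddr_in" check for later use by the if components.
    AC_CHECK_TYPES([struct sockaddr_in],
                   [opal_found_sockaddr_in=yes],
                   [opal_found_sockaddr_in=no],
                   [#include <netinet/in.h>])

    dnl In an opal/mca/if/*/configure.m4 (sketch): report the saved value.
    dnl The exact AC_MSG_RESULT wording is a guess; the point is that it
    dnl distinguishes OPAL's own caching from Autoconf test caching.
    AC_MSG_CHECKING([struct sockaddr_in])
    AC_MSG_RESULT([$opal_found_sockaddr_in (OPAL cached)])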