Remove stale Solaris configure references #13206

Open
wants to merge 3 commits into base: main

Conversation

rhc54
Contributor

@rhc54 rhc54 commented Apr 22, 2025

A prior commit (PR #13163) removed the stale Solaris components as we no longer support that environment. However, the PR left the Solaris configure references in the code base.

This PR removes those references. It also removes a duplicate m4 file (opal_check_os_flavors.m4) that exists in the OAC configure area. All references to the OPAL version have been updated to OAC.

Signed-off-by: Ralph Castain <[email protected]>
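
As an illustration of the "references updated to OAC" part of the description above (a hypothetical fragment, not a line taken from this diff), the change amounts to switching each invocation of the duplicated OPAL macro to the shared OAC copy:

    dnl Hypothetical configure.m4 fragment illustrating the rename.
    dnl Before: call the duplicated OPAL copy of the macro
    dnl   OPAL_CHECK_OS_FLAVORS
    dnl After: call the shared copy from the OAC configure area
    OAC_CHECK_OS_FLAVORS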
@rhc54 rhc54 requested review from jsquyres and bwbarrett April 22, 2025 15:20
@rhc54 rhc54 self-assigned this Apr 22, 2025
@rhc54
Contributor Author

rhc54 commented Apr 22, 2025

Not sure the failure has anything to do with this PR - the singleton test is timing out. I do see the following:

testHIndexed (test_util_dtlib.TestUtilDTLib.testHIndexed) ... 20 more processes
have sent help message help-mca-bml-r2.txt / unreachable proc

Any ideas?

bwbarrett
bwbarrett previously approved these changes Apr 22, 2025
the unusual spelling is intentional.

Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Apr 22, 2025

Something is messed up in your main branch - I'm seeing a bunch of errors like this one:

testCreateGroup (test_exceptions.TestExcSession.testCreateGroup) ... --------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Group_from_session_pset
  Reason:       PMIx server unreachable
--------------------------------------------------------------------------

Looks like you are trying to test comm_spawn-related functions and the "mpirun" server isn't getting spawned for some reason. The tests don't always just fail - you get lots of "proc not reachable" errors for the child job. Since it takes time for all of those individual comm_spawn tests to fail, the overall CI test eventually times out.

Again, I can't see how this is related to what is being done here. Did something sneak into your main branch?

@hppritcha
Member

This is expected behavior. Nothing to do at all with spawning processes. Just no server to handle pmix group construct ops.

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

This is expected behavior. Nothing to do at all with spawning processes. Just no server to handle pmix group construct ops.

Okay - so how do you guys get this CI to pass? I didn't touch the yaml file.

@hppritcha
Member

Check the mpi4py testcreatefromgroup unit test. That’s where the exception is special cased.

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

Running it by hand, I see that mpi4py is running a bunch of singleton comm_spawn tests, and those generate errors and an eventual hang. Here is a sample of them:

testArgsOnlyAtRoot (test_spawn.TestSpawnSingleWorld.testArgsOnlyAtRoot) ... 64 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
testCommSpawn (test_spawn.TestSpawnSingleWorldMany.testCommSpawn) ... ERROR
testErrcodes (test_spawn.TestSpawnSingleWorldMany.testErrcodes) ... ERROR
testNoArgs (test_spawn.TestSpawnSingleWorldMany.testNoArgs) ... ERROR
testToMemory (test_status.TestStatus.testToMemory) ... ERROR
test_util_dtlib (unittest.loader._FailedTest.test_util_dtlib) ... ERROR

and here is the traceback for the test that eventually hangs - note that it has called spawn 40 times!

test_apply (test_util_pool.TestProcessPool.test_apply) ... 17 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
^CTraceback (most recent call last):
  File "/opt/hpc/build/mpi4py/test/main.py", line 361, in <module>
    main(module=None)
  File "/usr/local/lib/python3.11/unittest/main.py", line 102, in __init__
    self.runTests()
  File "/opt/hpc/build/mpi4py/test/main.py", line 346, in runTests
    super().runTests()
  File "/usr/local/lib/python3.11/unittest/main.py", line 274, in runTests
    self.result = testRunner.run(self.test)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/runner.py", line 217, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/local/lib/python3.11/unittest/case.py", line 678, in __call__
    return self.run(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/usr/local/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/opt/hpc/build/mpi4py/test/test_util_pool.py", line 80, in test_apply
    self.assertEqual(papply(sqr, (5,)), sqr(5))
                     ^^^^^^^^^^^^^^^^^
  File "/opt/hpc/build/mpi4py/build/lib.linux-aarch64-cpython-311/mpi4py/util/pool.py", line 79, in apply
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/usr/local/lib/python3.11/threading.py", line 327, in wait
    waiter.acquire()

[rhc-node01:10447] dpm_disconnect_init: error -12 in isend to process 0

...bunch of error outputs like the one below:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[24043,0],0]) is on host: rhc-node01
  Process 2 ([[51572,40],0]) is on host: rhc-node01
  BTLs attempted: self sm smcuda

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------

@hppritcha
Member

I did some investigations here and it looks like some changes in this PR somehow broke the connection management in the TCP BTL. If one runs mpi4py against main with OMPI_MCA_btl=self,tcp, things work, but with this PR they don't. There are some suspicious differences between the opal_config.h generated by this PR and by main, and I suspect that's a symptom of other config differences that wind up giving the TCP BTL problems.
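
For reference, a rough reproduction of the comparison described above (hypothetical invocation; the test-suite path is the one that appears in the traceback quoted earlier in this thread and will differ per setup):

    # Restrict Open MPI to the self and tcp BTLs via the usual MCA
    # environment-variable convention, then run the mpi4py test suite
    # once against a main build and once against this branch.
    export OMPI_MCA_btl=self,tcp
    python /opt/hpc/build/mpi4py/test/main.py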

@rhc54
Contributor Author

rhc54 commented Apr 23, 2025

Okay, I'll leave this up in case someone else wants to pick it up. I don't have any more time, I'm afraid. There was a check_os_flavors.m4 file still in OMPI that is a duplicate of one in OAC - could be that there are some differences there. Otherwise, I'm not sure which pieces would cause the conflict.

@jsquyres
Member

Thanks @rhc54.

I confirm @hppritcha's findings -- there are odd differences in the configure results between this PR and main. I diff'ed the output from running configure on main vs. running configure on this branch. The meaningful differences appear to be these (- == main, + == this PR):

@@ -536,10 +536,6 @@
 checking __linux__... no
 checking __sun__... no
 checking __sun... no
-checking for netdb.h... (cached) yes
-checking for netinet/in.h... (cached) yes
-checking for netinet/tcp.h... (cached) yes
-checking for struct sockaddr_in... (cached) yes
 checking for _SC_NPROCESSORS_ONLN... yes
 checking whether byte ordering is bigendian... no
 checking for broken qsort... no

@@ -3608,15 +3604,13 @@
 
 --- MCA component if:bsdx_ipv4 (m4 configuration macro)
 checking for MCA component if:bsdx_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking NetBSD, FreeBSD, OpenBSD, or DragonFly... no
+checking struct sockaddr... no (cached)
 checking if MCA component if:bsdx_ipv4 can compile... no
 
 --- MCA component if:bsdx_ipv6 (m4 configuration macro)
 checking for MCA component if:bsdx_ipv6 compile mode... static
-checking struct sockaddr... yes (cached)
-checking some flavor of BSD... yes
-checking if MCA component if:bsdx_ipv6 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:bsdx_ipv6 can compile... no
 
 --- MCA component if:linux_ipv6 (m4 configuration macro)
 checking for MCA component if:linux_ipv6 compile mode... static
@@ -3625,11 +3619,8 @@
 
 --- MCA component if:posix_ipv4 (m4 configuration macro)
 checking for MCA component if:posix_ipv4 compile mode... static
-checking struct sockaddr... yes (cached)
-checking not NetBSD, FreeBSD, OpenBSD, or DragonFly... yes
-checking for struct ifreq.ifr_hwaddr... no
-checking for struct ifreq.ifr_mtu... yes
-checking if MCA component if:posix_ipv4 can compile... yes
+checking struct sockaddr... no (cached)
+checking if MCA component if:posix_ipv4 can compile... no
 
 +++ Configuring MCA framework installdirs
 checking for no configure components in framework installdirs... 

I can investigate more this weekend.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Smells like it might be the difference between the OPAL and OAC versions of check_os_flavors.m4? Note that I do have a PR open on the OAC version - I don't think it will solve this problem, but just wanted to alert you to it:

open-mpi/oac#20

@jsquyres
Member

Yeah, the first hunk of that diff is definitely a difference between the OPAL and OAC version. But that shouldn't make a difference...

I'm confused by the "(cached)" notation on the "checking struct sockaddr..." test in the subsequent hunks. How can the first difference between those two configure outputs be a cached difference? I.e., that should be a test that was already run, and the two runs would have had to come up with different answers. But I'm not seeing that. 🤷‍♂️

I'll investigate more this weekend.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Just for another data point: I checked the diffs for PMIx and PRRTE before/after the change, and I don't see any of the differences that you flagged. Note that PMIx has the same interface components (and configury) as OPAL. All the differences I see are benign and expected. 🤷‍♂️

@bosilca
Member

bosilca commented Apr 24, 2025

This PR disables most of the if components because it removed the check for sockaddr support that the if components require. This is due to a difference between OAC's check_os_flavors.m4 and OMPI's check_os_flavors.m4: the former does not check for network support (headers and sockaddr). Going back to using OMPI's check_os_flavors.m4 should fix these issues.
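
To make that difference concrete, here is a rough sketch (not the actual file contents) of the network-related checks that existed only in the OPAL copy of check_os_flavors.m4, per the configure diff above and the commit message later in this thread; the OAC copy stops after identifying the OS and never sets $opal_found_sockaddr:

    dnl Sketch of the extra checks in the OPAL version of the macro.
    AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
    dnl Note the name mismatch: the type checked is "struct sockaddr_in",
    dnl but the result is stored in $opal_found_sockaddr, which the
    dnl opal/mca/if/*/configure.m4 scripts later consult.
    AC_CHECK_TYPES([struct sockaddr_in],
                   [opal_found_sockaddr=yes],
                   [opal_found_sockaddr=no],
                   [#include <netinet/in.h>])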

@hppritcha
Member

This argues for making mpi4py a required-to-merge CI test.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

I believe the thought for OAC was that "check_os_flavors" really meant "identify the OS" - and did not include testing the environment for all sorts of unrelated things. In PMIx/PRRTE, we do that in a separate section. Keeps things cleaner. Also explains why PMIx/PRRTE didn't see similar problems.

Up to you how you want to resolve it - restore the OPAL version or add the detection code in a separate area. Just please don't change the OAC version by adding a bunch of stuff that goes beyond its intended purpose.

This argues for making mpi4py a required-to-merge CI test.

You don't need 30 min of mpi4py to detect this problem, though it does have other uses.

@jsquyres
Member

jsquyres commented Apr 24, 2025

Yes, I would agree -- there's a reason we removed the network detection stuff from the OAC "check OS" macro (because it has nothing to do with detecting the OS). So we don't really want to go back to the OPAL version, and we also don't want to add those checks into the OAC version.

Worst case, we just do those checks in OMPI's configure.ac.

This is the conclusion I came to before I posted my comment this morning, but the reason I didn't just do that outright is that I want to understand the (cached) oddity first: I don't understand why it says those results are cached when we previously ran those tests and they agreed with the configure output from main. That makes no sense to me.

@rhc54
Contributor Author

rhc54 commented Apr 24, 2025

Maybe an OAC_check_network.m4? Could the caching be due to running configure in PMIx and PRRTE before doing it in OPAL? I believe all the var names are common between the projects.

@jsquyres
Member

jsquyres commented Apr 25, 2025

I found the issue (issues, actually). It's much more mundane than we thought. Here's the commit message for the commit I just pushed:


opal/mca/if: fix "struct sockaddr_in" and OS checks

Found a few more places we needed to adjust for changing from OPAL_CHECK_OS_FLAVORS to OAC_CHECK_OS_FLAVORS.

Also, in the opal/mca/if components, we have configure.m4 scripts that explicitly check $opal_found_sockaddr. This was a problem for a few reasons:

  1. We just deleted the setting of $opal_found_sockaddr from the previous OPAL_CHECK_OS_FLAVORS macro (why it was set in that macro isn't really clear -- "struct sockaddr" doesn't really have anything to do with checking OS flavors).
  2. The old OPAL_CHECK_OS_FLAVORS macro is actually checking for "struct sockaddr_in", not "struct sockaddr". This led to a lot of confusion in this round of debugging.

Also, the additional network header checks and check for struct sockaddr_in in OPAL_CHECK_OS_FLAVORS were redundant: they were already being performed in OMPI's top-level configure.ac. Deleting these redundant tests -- and indeed, deleting all of OPAL_CHECK_OS_FLAVORS -- is fine. But we did need to set a global variable for the opal/mca/if/*/configure.m4 scripts to check. This commit therefore adjusts the top-level configure.ac script to explicitly save the result of checking for "struct sockaddr_in" into $opal_found_sockaddr_in.

Finally, slightly change the AC_MSG_RESULT output in the opal/mca/if/*/configure.m4 scripts to make it clear that the check for "struct sockaddr_in" is not using the regular AC_CHECK_TYPES method. Instead, the "cached" results it is getting are from OPAL caching, not regular Autoconf test caching. Do this because the "(cached)" output that it previously emitted caused considerable confusion during this round of debugging (i.e., I assumed it was coming from regular Autoconf test caching, which is an entirely different mechanism).

Signed-off-by: Jeff Squyres <[email protected]>
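
As a rough sketch of the mechanism that commit describes (hypothetical fragments, not the literal committed code): the top-level configure.ac saves the result of its existing "struct sockaddr_in" check into a global shell variable, and each opal/mca/if component's configure.m4 then reports that saved value while making clear the caching is OPAL's own rather than Autoconf's:

    dnl In OMPI's top-level configure.ac (sketch): remember the result of
    dnl the "struct sockaddr_in" check for later use by the if components.
    AC_CHECK_TYPES([struct sockaddr_in],
                   [opal_found_sockaddr_in=yes],
                   [opal_found_sockaddr_in=no],
                   [#include <netinet/in.h>])

    dnl In an opal/mca/if/*/configure.m4 (sketch): report the saved value.
    dnl The exact AC_MSG_RESULT wording is a guess; the point is that it
    dnl distinguishes OPAL's own caching from Autoconf test caching.
    AC_MSG_CHECKING([struct sockaddr_in])
    AC_MSG_RESULT([$opal_found_sockaddr_in (OPAL cached)])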