 
WeeklyTelcon_20210629
- Austen Lauria (IBM)
 - Brendan Cunningham (Cornelis Networks)
 - Brian Barrett (AWS)
 - Geoffrey Paulsen (IBM)
 - Harumi Kuno (HPE)
 - Hessam Mirsadeghi (NVIDIA)
 - Howard Pritchard (LANL)
 - Jeff Squyres (Cisco)
 - Joseph Schuchart (HLRS)
 - Matthew Dosanjh (Sandia)
 - Michael Heinz (Cornelis Networks)
 - Naughton III, Thomas (ORNL)
 - Sam Gutierrez (LANL)
 
- Akshay Venkatesh (NVIDIA)
 - Artem Polyakov (NVIDIA)
 - Aurelien Bouteiller (UTK)
 - Brandon Yates (Intel)
 - Charles Shereda (LLNL)
 - Christoph Niethammer (HLRS)
 - David Bernholdt (ORNL)
 - Edgar Gabriel (UH)
 - Erik Zeiske (HPE)
 - Geoffroy Vallee (ARM)
 - George Bosilca (UTK)
 - Josh Hursey (IBM)
 - Joshua Ladd (NVIDIA)
 - Marisa Roman (Cornelis Networks)
 - Mark Allen (IBM)
 - Matias Cabral (Intel)
 - Nathan Hjelm (Google)
 - Noah Evans (Sandia)
 - Raghu Raja
 - Ralph Castain (Intel)
 - Scott Breyer (Sandia?)
 - Shintaro Iwasaki
 - Todd Kordenbrock (Sandia)
 - Tomislav Janjusic (NVIDIA)
 - William Zhang (AWS)
 - Xin Zhao (NVIDIA)
 
- No schedule for v4.0.7
- Might be possible to have this SOMEDAY.
 - Cisco would like v4.0.7 someday.
 
 - PR9094 - external32 - Do we want it in v4.0?
 - PR9088 - long long - Do we want it in v4.1?
 - We need both 9094 and 9088 on v4.0.x to fix the bug reported.
- Quality of what this is and what's needed.
 
 - v4.0.6 shipped last week. Looking good.
 - Mpool PR, waiting for review and to go into master first.
- Howard is testing.
 
 - 8919 NVIDIA cannot link. Some users may have already hit this.
- Tomislav will try to find someone to look at it.
 
 
- Schedule: Planning on late August (no reason for August) for accumulated bugfixes.
 - Fix huge page allocator waiting on Howard's testing.
 - Long Long one
 - 8867 - show help if libz is missing, Jeff's looking at.
 
- PMIX / PRRTE plan to release in next few weeks
 - Need to do a v5.0 rc as soon as PRRTE v2 ships.
- Need feedback if we've missed an important one.
 
 - PMIx Tools support is still not functional. Opened tickets in PRRTE.
- Not a common case for most users.
 - This also impacts the MPIR shim.
- PRRTE v2 will probably ship with broken tool support.
 
 
 - Is the driving force for PRRTE v2.0 OMPI?
- So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
 - Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
 - Or just fix it in PRRTE v2.0?
 - Is broken tool support a blocker for PRRTE v2.0?
- Don't ship OMPI v5.0 with broken Tools support.
 
 
 - Are there any objections to delaying
- Either we resource this
 
 - https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665
- Current state of PMIx tool support.
 - We'd like to get Tool support in CI, but need it to be working to enable the CI.
 
 - https://github.com/openpmix/prrte/issues/978#issuecomment-856205950
- Blocking issue for Open-MPI
 - Brian
 
 - PR 9014 - new blocker.
- fix should just be a couple of lines of code... hard to decide what we want.
 - Ralph, Jeff and Brian started talking.
 - Simplest solution was to have our own
 
 - Need people working on v5.0 stuff.
 - Need some configury changes in before we RC.
 - Issue 8850, 8990 and more
 - Brian will file 3-ish issues
- One is configure pmix
 
 - Dynamic Windows fix in for UCX.
 - Any update on debugger support?
 - Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if
 - UCC coll component is being updated to be the default when UCX is selected. PR 8969
- Intent is that this will eventually replace hcoll.
 - Quality
 
 
- Solid progress happening on Read the Docs.
 - These docs would be on the readthedocs.io site, or on our site?
- Haven't thought either way yet.
 - No strong opinion yet.
 
 
- Issue 8884 - ROMIO detects CUDA differently.
- Gilles proposed a quick fix for now.
 
 
- Now released.
 - Virtual Face to face.
 - Persistent Collectives
- So nice to get MPIX_ rename into v5.0
 - Don't think this was planned for v5.0
 - Don't know if anyone asked them this.
 - Might not matter to them
- Virtual face to face -
 
 
 - A bunch of stuff in the pipeline. Then details.
 - Plan to open Sessions pull request.
- Big, almost all in OMPI.
 - Some of it is more impacted by clang format changes.
 - New functions.
 - Considerably more functions can be called before MPI_Init/Finalize
 - Don't want to do sessions in v5.0
 - Hessam Mirsadeghi is interested in trying MPI_Sessions.
- Interested in a timeline of a release that will contain MPI_Sessions.
 
 - Sessions working group meets every Monday at noon Central time.
- https://github.com/mpiwg-sessions/sessions-issues/wiki
 - Several of the tools tests are busted on master.
- Sessions branch fixes some of these.
 - Initialize tools after finalize MPI
 
 
 - Update:
- Did some cleanup of refactoring.
 - Topology might NOT change with Sessions relative to what's currently in master
 - Extra topology work that wasn't accepted by MPI v4.0 standard.
 - Question on how we do mca versioning
 
 
 - We don't KNOW that OMPI v6.0 may not be an ABI break
 - Would be NICE to get MPIX symbols into a separate library.
- What's left in MPIX after persistent collectives?
- Short Float,
 - Pcall_req - persistent collective
 - Affinity
 
 - If they're NOT built by default, it's not too high of a priority.
- Should just be some code-shuffling.
- On the surface shouldn't be too much.
 - If they use wrapper compilers, or official mechanism
 - Top level library, since app -> MPI and app -> MPIX lib.
 - libmpi_x library can then be versioned differently.
 
 
 - Don't change to build MPIX by default.
 - Open an issue to track all of our MPI 4.0 items
- MPI Forum will want, certainly before supercomputing.
 
 - Do we want an MPI 4.0 Design meeting in place of a Tuesday meeting?
- In person meeting is off the table for many of us. We might want an out of sequence meeting.
 - Let's doodle something a couple of weeks out.
 - Doodle and send it out
 - trivial wiki page in style of other in person wiki.
 
 - Two days of 2 hour blocks - wiki *
 
- Who owns our open-SQL?
- No one?
 - What value is the viewer using to generate the ORG data?
- Looking for field in the Perl client
- It's just the username.  It's nothing simple.
- Something about how the CherryPy server is stuffing stuff into the database.
 
 
 - Thought it was in the ini file, but isn't.
 
 - Concerned that we don't have an owner.
 - Back in the day, we used MTT because there was nothing else.
- But perhaps there's something else now?
 
 
 - A lot of segfaults in UCX one-sided in IBM
 - Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPU
- Can ask around.
 - The only tests are in OSU MPI, which has support for CUDA and ROCm tests.
- Good enough for sanity.
 - No support for Intel low level stuff now.
 
 - PyTorch - machine learning framework - resembles an actual application.
- Has different backends, collectives reduction tool NCCL, but also has a CUDA backend for single/multiple nodes.
 
 
 - ECP - worried we're going to get so far behind MPICH because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU offloaded code with MPICH. Just a heads up.
- A thread on the GPU can trigger something to happen in MPI.
 - CUDA_Async Not sure of
 
 
- No discussion
 
- No update
 
- No discussion.