WeeklyTelcon_20161213
        Geoffrey Paulsen edited this page Jan 9, 2018 
        ·
        1 revision
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Geoff Paulsen
 - Artem Polyakov
 - Jeff Squyres
 - Brian Barrett
 - Howard
 - Jimmy - SPI representative.
 - Josh Hursey
 - Josh Ladd
 - Nathan Hjelm
 - Ralph
 - Todd Kordenbrock (HPE @ Sandia)
 
- Introductions.
 - Can we use their 501(c)(3) non-profit standing to gain some status?
- One difference is that with SPI, Open MPI would retain its current legal status.  Just associated with
- With Conservancy, Open MPI would be an activity of the Conservancy.
 
 - Would be reasonable to request non-profit
 - GitHub may be willing to list an organization as a non-profit (through SPI); they have said they are willing to.
- Jimmy doesn't see a meaningful difference between SPI and Conservancy.
 
 - If we join within 60 days of Nov 15th (Ralph is the liaison).
 
 - Discussion
- Probably only need SPI services, Conservancy provides more.
 - When we started this process, neither organization would reply to Ralph for 6 months.
 - If you join SPI, you are not becoming part of their organization.
 - Conservancy would prefer that we have more formal processes.
 
 
- Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.5
- Pressing need to release 1.10.5
- Waiting on PR from Nathan, then will create RC.
- The master fix is correct, but it has to be backported to 1.10.5.
 - Nathan's users want a release by the end of the week.
 
 
 - Added regression test for darray bug.
 - Mathias: PSM2 is not setting the one-sided bits correctly.
 
 
- 
Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
 - 
Known / ongoing issues to discuss
- Darray Datatype issue - 2.0.2 - do a minor point release
 - Early termination is not handled correctly - 2550 - Ralph fixed it already. 2552, 2553 (Jeff will clean up)
 - osc_pt2pt wrong answer - 2505.
- IBM has a one-line fix.  Mark thinks there is another issue in lock-all.
- Nathan: that sounds like it could be it. Can call Fence, but either in an epoch or not in an epoch. When you try to do a true extent, we return the wrong extent, and wrong lower bound. OMPI was seeing true
 
 
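
For reference on the extent remark above: MPI defines a datatype's true lower bound as the smallest displacement of any data it actually contains, and its true extent as the span from there to the end of the last data byte, ignoring any artificial lower/upper bound markers. The sketch below models a typemap as plain (displacement, size) pairs in Python. The function name and the sample typemap are hypothetical and for illustration only; this is not Open MPI code and does not reproduce the bug discussed.

```python
def true_lb_and_extent(typemap):
    """Return (true_lb, true_extent) in bytes.

    `typemap` is a list of (displacement, size) pairs, one per basic
    element actually stored in the datatype (a simplified model).
    """
    if not typemap:
        raise ValueError("typemap must contain at least one entry")
    # True lower bound: smallest displacement of real data.
    true_lb = min(disp for disp, _ in typemap)
    # True upper bound: end of the furthest-reaching element.
    true_ub = max(disp + size for disp, size in typemap)
    return true_lb, true_ub - true_lb


# Hypothetical typemap: a 4-byte element at offset 4, an 8-byte element at offset 16.
sample = [(4, 4), (16, 8)]
print(true_lb_and_extent(sample))  # prints (4, 20)
```

A regression test for this class of bug would typically compare values like these against what `MPI_Type_get_true_extent` reports for the same layout.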
 
 - 
PMIx update
- Last changes went in. Josh is rolling a new RC.
 - Josh will update a PR for the v2.x branch.
 - Should improve memory usage, but not yet ideal.
 - Fuzzy estimate: end of January.
 - Strings on KNL are 40KB, and 80KB (per remote peer).  This is not fixed in this RC.
- If we do compression, then have to do changes in OMPI. Currently clients don't free it. If we return
 - Not sure if we want compression for all strings... for example hwloc output gets put into shared memory.
 
 - Josh and Artem feel mid-January is realistic for PMIx 1.2 plus integration in Open MPI v2.1.0.
 - Fujitsu was excited about this change.  Things should get much much better.
- Fujitsu gets credit for investigating how bad this issue was. Thanks!
 
 - Artem has a PMIx perf tool (in the contrib directory of the PMIx sources).  It measures memory consumption.
- Nathan's using MPI memory usage. Calls MPI_Init, does some collectives, and then reports process and node memory usage.
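
To put the per-peer string sizes noted above (40KB and 80KB per remote peer) in perspective, here is a back-of-the-envelope estimate in Python. It assumes each process holds one such string for every remote peer and that 1 KB means 1024 bytes; both are assumptions for illustration. The peer counts are hypothetical, not figures from the meeting.

```python
KIB = 1024  # assumption: "KB" in the notes means 1024 bytes


def string_memory_bytes(remote_peers, kib_per_peer):
    """Estimated bytes one process spends on per-peer strings."""
    if remote_peers < 0 or kib_per_peer < 0:
        raise ValueError("inputs must be non-negative")
    return remote_peers * kib_per_peer * KIB


# Hypothetical job sizes: 64 and 1024 processes (so 63 and 1023 remote peers).
for peers in (63, 1023):
    for kib in (40, 80):
        mib = string_memory_bytes(peers, kib) / (KIB * KIB)
        print(f"{peers} peers x {kib} KB = {mib:.1f} MiB per process")
```

Because the cost grows linearly with the number of peers, it also multiplies again by the number of processes per node, which is why compression or shared storage was being discussed.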
 
 
 - 
OMPI 2.1
- THE blocking issue is PMIx.
 - Focus now is OMPI 2.0.2.
 
 - 
Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0 *
 
Review Master MTT testing (https://mtt.open-mpi.org/)
- Still no morning messages.  Need to pester Brian about it.  Apparently changes are not allowed until after the new year.
- Mail from our AWS instance is not getting to us.
 
 - Biggest failures we saw in 2.0.x and 2.1.x
- OSHMEM: the BTL fix resolved a bunch of things, but a few errors remain (segfaults; Put or Get to a non-registered location).
- Jeff will make a ticket for few remaining OSHMEM failures.
 
 
 - Sylvain seeing a bunch of errors in master oob/ud components
- Mostly timeouts. Not sure whether it is hanging or just really slow.
 
 - Josh - turned on Jenkins testing at IBM, may result in timeouts. Using PGI on PPC64.
 
- Face to Face in January - https://github.com/open-mpi/ompi/wiki/Meeting-2017-01
 
- Cisco, ORNL, UTK, NVIDIA
 - Mellanox, Sandia, Intel
 - LANL, Houston, IBM