 
WeeklyTelcon_20191022
Geoffrey Paulsen edited this page Oct 23, 2019 · 1 revision
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Akshay Venkatesh (NVIDIA)
 - Artem Polyakov (Mellanox)
 - Austen Lauria (IBM)
 - Brendan Cunningham (Intel)
 - Brian Barrett (AWS)
 - David Bernhold (ORNL)
 - Edgar Gabriel (UH)
 - Geoffrey Paulsen (IBM)
 - George Bosilca (UTK)
 - Harumi Kuno (HPE)
 - Jeff Squyres (Cisco)
 - Josh Hursey (IBM)
 - Matthew Dosanjh (Sandia)
 - Michael Heinz (Intel)
 - Todd Kordenbrock (Sandia)
 - William Zhang (AWS)
 
- Brandon Yates (Intel)
 - Charles Shereda (LLNL)
 - Erik Zeiske
 - Howard Pritchard (LANL)
 - Joshua Ladd (Mellanox)
 - Mark Allen (IBM)
 - Matias Cabral (Intel)
 - Nathan Hjelm (Google)
 - Noah Evans (Sandia)
 - Ralph Castain (Intel)
 - Thomas Naughton (ORNL)
 - Tom Naughton
 - Xin Zhao (Mellanox)
 - mohan (AWS)
 
- All of this in context in v5.0
 - Intel is no longer driving PRRTE work, and Ralph won't be available for PRRTE much either.
 - PRRTE will be a good PMIx development environment, but being a scalable and robust launcher is no longer its focus.
 - OMPI community could come into PRRTE, and put in production / scalability testing, features, etc.
 - Given that we have not been good at contributing to PRRTE (other than Ralph), there's another proposal:
- ORTE and PRRTE have drifted apart, so transitioning is risky.
 
 - Step 1. Make PMIX a first class citizen
- Still good to keep PMIX as a static framework (no more glue, but still under orte/mca/pmix) that basically just passes through and makes PMIX_ calls directly.
 - Allows us to still have an internal backup PMIx if no external PMIx is found.
 
 - Step 2. We can whittle down orte, since PMIX does much of this.
 - Two things PRRTE won't care about are scale and all binding patterns.
 - Only recent versions of SLURM have PMIx
 - Need to continue to support ssh.
- Not just core PMIx, still need daemons for SSH to work, but they're not part of PMIx.
 - Part of ORTE that we wouldn't be deleting.
 
 - What does Altair PbsPro and open source PbsPro do?
- Torque is different than PbsPro
 
 - Are there OLD systems that we currently support but no longer care about, and could stop supporting in v5.x?
- Who supports PMIx, and who doesn't?
 
 - If PMIx becomes a first class citizen and rest of code base just makes PMIx calls, how do we support these things?
- mpirun would still have to launch orteds via plm.
 - srun wouldn't need them.
 - But this is how it works today. Torque doesn't support PMIx at all, but TM just launches ORTEDs
 - ALPS - aprun ./a.out - requires a.out to connect up to ALPS daemons.
- Cray still supports PMI - someone would need to write a PMI -> PMIX adapter.
 
 - ORTE does not have the concept of persistent daemons.
 
 - Is there a situation where we might have a launcher launching orteds and we'd need to relay PMIx calls to the correct PMIx server layer?
- Generally we won't have that situation, since the launcher won't launch ORTEds.
 
 - George's work currently depends on PRRTE
- If ORTEDs provide PMIx_Events, would that be enough?
- No, George needs PRRTE's fault-tolerant overlay network.
 - George will scope the effort to port that feature from PRRTE to ORTE.
 
 
 - ACTION - Please gather list of resource managers, and Tools that we care about supporting in Open-MPI v5.0.x
 
- Date looks good.  Feb 17th right before MPI Forum
- 2pm Monday, and maybe most of Tuesday
 - Cisco has a Portland facility and is happy to host.
 - But willing to step aside if others want to host.
 - about 20-30 min drive from MPI Forum, will probably need a car.
 
 - It's official!  Portland, Oregon, Feb 17, 2020.
- Safe to begin booking travel now.
 
 
- OMPI has been waiting for some git submodule work in Jenkins on AWS.
- Need someone to figure out why Jenkins doesn't like Jeff's PR.
- Anyone with github account for ompi team should have access.
 - PR 6821
 - Apparently Jenkins isn't behaving as it should.
 
 - Three pieces:  Jenkins, CI, bot.
- AWS has a libfabric setup like this for testing.
 - Issue is that they're reworking the design, and will roll it out for both libfabric and Open MPI.
 
 - William Zhang talked to Brian
- Not something AWS team will work on, but Brian will work on it.
 
 - Jeff will talk to Brian as well.
 
 - Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.
- William will probably be administering Jenkins/AWS or communicating with those who will.
 
 - Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.
 - Modular thread re-write (Noah)
- UGNI and Vader BTLs were getting better performance, not sure why.
 - For modular threading library, might be interesting to decide at compile time or runtime.
 - Previously similar things seemed to be related to ICACHE.
 - Howard will look at it.
 
 
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- Will put out RCs for v3.0.5 and v3.1.5 this week.
 - Please test RCs when they become available.
 - Start drawing up a list of fixes that won't be backported to v3.0.x
- Datatype bug won't be backported, because it snowballed too big.
 - With the new 3.0.x and 3.1.x releases, will put out a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, with the advice to please upgrade.
 
 
Review v4.0.x Milestones v4.0.2
- v4.0.2 was released and we haven't had any catastrophic issues come in.
 - We're beginning to merge in new v4.0.3 PRs.
 
- Schedule: April 2020?
- Wiki - go look at items, and we should discuss a bit in weekly calls.
 - Some items:
- MPI1 removed stuff.
 
 
 
Review Master Master Pull Requests
- IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
 - Absoft 32bit fortran failures.
 
- No discussion this week.
 - See older weekly notes for prior items.
 
- No discussion this week.