-mca entry <libs> for dynamic MPI profiling interface layering #6245
Conversation
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/75dfe8b79830b93f939dde35044cb3dc
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/2a5f37b8a0484f595f08d51f6bbe9247
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/b1dea1aa3420af1c6b0ec80b8b2a01a8
Added
Note that these changes to ORTE won't be coming over to PRRTE, so you'll also need to discuss and think about what you intend to do after the changeover.
One objection. If we ever add an "entry" framework (you never know) this would break things. You need to set either the framework or component name to something non-null.
Can one of the admins verify this patch?
Force-pushed from 4d19ee8 to 2515a3e
I left this one on hold forever because orte was expected to go away, but if it's not going away I'd like to get this in at some point. I've changed the name to
The main modification to OMPI is letting things be prepended to LD_LIBRARY_PATH and LD_PRELOAD, with the LD_LIBRARY_PATH part specifically done in schizo where no other code is about to jump in and prepend something else in front of what we just prepended. And that functionality is used to flip between a libmpiprofilesupport.so that does nothing and a libmpiprofilesupport.so that allows for dynamic PMPI and layered PMPI interceptions.
Prepending LD_PRELOAD and friends is already supported in OMPI via the capabilities in PMIx. Please don't add another backdoor channel to do the same thing.
PMIx has the ability to adjust environment variables when
If this were being ported to PRRTE I think it would be more straightforward in that regard since IIRC
Is there a place in ORTE where we might be able to leverage the PMIx environment handling functionality?
All you have to do is add it to the info that will be included in the launch msg - per the code in orte/orted/pmix/pmix_server_dyn.c:
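For illustration only, a rough sketch of what such a directive could look like. The PMIx envar attributes (pmix_envar_t, PMIX_PREPEND_ENVAR) exist in PMIx 3.x, but the wrapper function and directory path below are made up, and the actual code in pmix_server_dyn.c may look quite different:

    /* Hypothetical sketch, not the actual ORTE code: build a PMIx "prepend
     * this envar" directive to be included in the info passed with the
     * launch message. */
    #include <pmix_common.h>

    static void add_ld_library_path_prepend(pmix_info_t *info)
    {
        pmix_envar_t envar;

        /* ask the launcher to prepend a (made-up) profiling dir to the
         * LD_LIBRARY_PATH of the launched procs */
        PMIX_ENVAR_LOAD(&envar, "LD_LIBRARY_PATH",
                        "/opt/ompi/lib/profilesupport", ':');
        PMIX_INFO_LOAD(info, PMIX_PREPEND_ENVAR, &envar, PMIX_ENVAR);
        PMIX_ENVAR_DESTRUCT(&envar);
    }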
Note that I am about 70% done with making PMIx a "1st class citizen" in OMPI, which means that mpirun will be calling PMIx_Spawn so that it can (if on an appropriate system) use the system PMIx server to directly launch the processes.
Force-pushed from 2515a3e to 4d782bf
Thanks, I moved the first reading of the entry MCAs into a schizo component that sets ORTE_JOB_PREPEND_ENVAR attributes to get the prepending done. Seems much nicer.
Force-pushed from 4d782bf to 3711a00
Force-pushed from 3711a00 to 58d94c9
The last update I've made here is for MPI_Pcontrol. Previously I only had it set up to be used from the top level by the app, but I could imagine a wrapper library being used to activate and then deactivate profiling under certain conditions, so it now allows that.
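Not code from this PR, just a hedged sketch of what a wrapper library using MPI_Pcontrol to switch its own measurement on and off might look like (the profiling_active flag and the timing logic are invented for illustration):

    /* Hypothetical wrapper-library sketch: MPI_Pcontrol toggles profiling,
     * whether called from the application or from another wrapper layer. */
    #include <mpi.h>

    static int profiling_active = 0;

    int MPI_Pcontrol(const int level, ...)
    {
        profiling_active = (level != 0);   /* 0 disables, nonzero enables */
        return PMPI_Pcontrol(level);       /* pass the level down the chain */
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        double t0 = 0.0;
        int rc;
        if (profiling_active) t0 = PMPI_Wtime();
        rc = PMPI_Barrier(comm);
        if (profiling_active) {
            /* e.g. accumulate PMPI_Wtime() - t0 somewhere */
        }
        return rc;
    }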
Force-pushed from 58d94c9 to f79e03f
I've updated the auto-generation of wrappers so that instead of parsing "mpi.h.in" it uses the MPI standard pythonization effort. @jsquyres, don't get too excited, this isn't a big usage like generating the code to make mpi_send() do translations of arguments and call MPI_Send(); rather, this is merely wrapping mpi_send() and calling a function pointer into the MPI library's mpi_send(). It's a fairly trivial usage of the pythonization info, but it's still a start.
Using the pythonization introduces a Python 3.8 dependency into OMPI. That's true of the entire pythonization effort in the MPI standard, and that requirement gets inherited here. So to mitigate that I left in the former non-pythonization version as "mkcode.pl" with a "hard_coded_fn_list.txt", and use a python3 --version check to decide whether to activate the pythonization version or just use the hard-coded list.
The data from the standard's *.tex is located in "generated_mpistandard_parsesets.py". That file can be re-generated by "extract.py" from a checked-out tree of the MPI standard. (@jsquyres the *.tex from there doesn't appear to have MPI_Comm_c2f etc pythonized (?))
Force-pushed from f79e03f to 5554c02
I repushed with a change to how LD_LIBRARY_PATH gets prepended through prrte. Rather than try to force fit that feature into prrte, I went with prepending a script into argv of the command being started, so if you type
The use case here is that the LD_LIBRARY_PATH really needs to have the new entry be first, not just somewhere in the list where the order the code happens in could potentially result in other things being prepended in front of it. So the script is an easy way to guarantee that
@markalle This seems problematic for a few reasons -- are there other approaches?
I like the idea of using the JSON provided by the MPI Forum so we don't have to maintain the parsing scripts in OMPI. I bet we can get an early drop of that to work with. That should help simplify things a bit. To parse the JSON with Perl we will likely need a Perl library (I use one for our CI setup to parse the GitHub responses - whatever is provided by
@markalle I would also strongly urge not to copy the python code from the MPI Standard (mutation is severe and imminent). I can easily write an emitter which dumps a json file (or any other convenient format) of C declarations, which I assume is what you need from the comment on Feb 19?
Yes, that is an open question: what exactly do we want in JSON from the Forum? Language-neutral specifications? C, mpif.h, mpi module, and mpi_f08 specifications? ...?
About the prrte argv manipulation, we had a feature like that in Platform-MPI, that would allow any cmdline like "mpirun ... program.x" to run as if it had been "mpirun ... myscript program.x". I feel like that's a smaller imposition than the type of LD_LIBRARY_PATH manipulation where I'd be asking not just to be prepended, but specifically to be the first thing prepended, eg not to let any other part of the code prepend $MPI_ROOT/lib in front of my entry. It's also possible to accomplish roughly the same thing with LD_PRELOAD without touching LD_LIBRARY_PATH, but that setting hits more processes and in the past I've had apps that failed strangely with LD_PRELOAD so I'd prefer it to just be an option rather than the only way to activate this feature. Anyway back to the python: I'm happy to switch it around and I'm not loving the python parsing that's there right now. But wouldn't the JSON output be roughly equivalent to the array of PARSESETs that I have now after running the python code out of the standard? If so, when I walk a list of
@markalle I believe @jsquyres's idea is that we output directly what you need to a JSON. If I understand correctly you want access to the C parameters of all functions? I would write an emitter (as for the binding_emitter for latex) which would translate the current "apis.json" into a specific json for you to load and use -- something like the sketch below. The "apis.json" is definitely an internal data structure that is not intended to be used outside.
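Purely as a strawman illustration (these field names are invented here, not what the emitter will actually produce), a per-function entry might look something like:

    {
      "MPI_Send": {
        "return_type": "int",
        "parameters": [
          { "name": "buf",      "type": "const void *" },
          { "name": "count",    "type": "int" },
          { "name": "datatype", "type": "MPI_Datatype" },
          { "name": "dest",     "type": "int" },
          { "name": "tag",      "type": "int" },
          { "name": "comm",     "type": "MPI_Comm" }
        ]
      }
    }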
@martinruefenacht That would be great. I could see that expanding quite a bit for other needs though. The current PR doesn't want the entries that are
But yeah, a fairly simple json output would satisfy the needs of this PR (this PR used to parse the MPI functions straight out of mpi.h, it doesn't need very much info). The next step up in complexity of code generation from json output is probably to generate everything in ompi/mpi/fortran/mpif-h/*.c. For that we'd at least need type/IN/OUT for all the args, anything that's an array would need an identification of how many dimensions and elements it has, and any "significant only at root" identifications (eg we can't just translate gatherv's displs[] for all ranks, the leaves aren't required to pass in anything for it). And we'd need to know which buf is used to identify a collective as INPLACE and which parameters are ignored in the presence of INPLACE. Maybe something like "ignored if
@markalle Okay, I will write an emitter and let you know when it is in place (in a branch for the moment). I assume the above given format is good? What else would you like to see for each function name/parameter? With respect to other needs, if it is something we have "recorded" then it wouldn't be a problem. The latter step is in progress. Jeff and I did some of that work of encoding all that information, but there will be gaps. In other words this will be possible and is the target, but it will be a while.
…interception)

I'll start with a brief outline of the code changes, then a description of what this feature is all about.

* scripts: mkcode.py, which uses a simple version of json info for MPI calls that was generated using the MPI standard.
* various mpicc-wrapper-data.txt.in etc files: these all get a couple new lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI app would now have a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so [real]
* schizo component "entry": look up the --mca ompi_tools_entry setting and add a profilesupport wrapper script in front of the executable so that LD_LIBRARY_PATH and possibly LD_PRELOAD can be modified in front of the executable.

Next I'll give a pretty long description of what profile interface layering is about. Although to be honest, I think the main part of this feature that people actually use is only the dynamic part. Eg at runtime this lets you just add "--mca ompi_tools_entry ./libmpe.so" for example without relinking your app and it would run as if you had linked with -lmpe.

------------------------------------------------------------------------
MPI Profiling Interface Layering

The MPI standard defines a PMPI profiling interface, but in its normal usage only one profiling library can be used at a time.

A profiling wrapper library defines some subset of MPI_* entrypoints, and inside those redefinitions, it calls some combination of MPI_* and PMPI_* symbols. For example

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }

In trying to use two unrelated wrapper libraries at the same time, it's in general not possible to link them in such a way that proper layering occurs. For example, consider two wrapper libraries:

1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI collectives to gather statistics about how evenly the ranks enter the collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there the wrapper library's call to PMPI_* would resolve into the bottom level libmpi.so. Only one of the libraries would have its MPI_Init and MPI_Finalize routines called.

----------------------------------------------------------------
Defining consistent layering behavior:

With dynamically loaded symbols, it is possible to design a consistent approach to layering for an arbitrary number of wrapper libraries.

When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two possibilities for what MPI calls it can make. It can call another MPI_* entry, or it can call a PMPI_* entry. In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then "libmpi.so" if not found.
And the calls to PMPI_* would resolve to "libmpi.so".

In the case of multi-level wrapping, the equivalent behavior is for MPI_* to resolve to the current level, and PMPI_* to resolve to the next level down in the list of libraries.

We would set up an array of handles

    void *lib[MAX_LEVELS];

containing the dlopened handles for all the libraries, eg

    lib[0] = dlopen("libwrap3.so", );
    lib[1] = dlopen("libwrap2.so", );
    lib[2] = dlopen("libwrap1.so", );
    lib[3] = dlopen("libmpi.so", );

For each MPI function an array of function pointers would exist, one for each library:

    int (*(fptr_MPI_Send[MAX_LEVELS]))();

Some of the entries would be null, but the bottom level should always be defined.

And a thread-local global would define the calldepth for the thread, initially 0. The calldepth needs to be thread-local data so the implementation can be thread-safe, since different threads would have unrelated call stacks.

Model implementation of MPI_Send and PMPI_Send:

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag, MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev; // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (nextlev < nwrapper_levels && !fptr_MPI_Send[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }

If code like the above is in a library called libmpiprofilesupport.so and an MPI app is linked against that library, then an example sequence of events might be

- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so

And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in libmpiprofilesupport.so, and that library decides which function pointer to go into next.

----------------------------------------------------------------
Fortran

I believe Open MPI makes the profiling design choice that wrapping the C-language MPI_Send() automatically results in Fortran mpi_send() being wrapped.

The "calldepth" behavior from above isn't a perfect match for this situation. For comparison, the traditional behavior for an application linked against a single libwrap.so would go as follows. An "ldd" on the application might show dependencies

    libwrap.so
    libmpi_mpifh.so
    libmpi.so

and an application call to fortran mpi_send() would find the first definition in libmpi_mpifh.so.0 where it would call MPI_Send() which would go back up to libwrap.so which might call PMPI_Send() which would then go down to libmpi.so. The linker/loader starts at the top in its search for each symbol.

The layered profiling approach with a "calldepth" always goes down the list of libraries. If fortran is at the bottom of the list, those symbols would end up not being intercepted.

I think the easiest way to maintain the desired behavior is to re-order the libraries in the dynamic case as

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL);

So the fortran wrapper is always first, and libmpi.so is always last, with all the wrapper libraries in between.

Internally the --mca ompi_tools_entry feature produces a list of entrypoint levels with three types:
1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries, libmpi.so and libmpi_mpifh.so
3. just the fortran calls from the base MPI implementation

and allows "fort" to be manually included in the list of libraries. So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax

    --mca ompi_tools_entry fort,libwrap.so

would produce a list of entry levels as

    level[0] = fortran symbols from libmpi_mpifh.so
    level[1] = libwrap.so
    level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so

and if an application called fortran mpi_send, the call sequence would be

- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so

So including the OMPI fortran wrappers in the list in front of libwrap.so enables automatic wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible to use the syntax

    --mca ompi_tools_entry libwrap.so,fort

which puts the fortran symbols at the bottom of the list where all the base MPI symbols are.

----------------------------------------------------------------
Performance

On a machine where OMPI pingpong takes 0.22 usec, the --mca ompi_tools_entry option slowed the pingpong to 0.24 usec.

This is enough of an impact that I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub library, there is no performance impact when the feature isn't activated.

----------------------------------------------------------------
Weaknesses

I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so, it's only linked against libmpiprofilesupport.so which has a dlopen() of libwrap.so).

This commit includes a "preload" feature (that can be included in the --mca ompi_tools_entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free for example.

Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library.
----------------------------------------------------------------
Documentation for this feature:

[Dynamic MPI Profiling interface with layering]

The MCA option --mca ompi_tools_entry <list> can be used to enable interception from profiling libraries at runtime.

In traditional MPI usage a wrapper library "libwrap.so" can be built that redefines selected MPI calls, and using that library would require relinking the application with -lwrap to enable the interception. But this feature allows such interception to be enabled at runtime without relinking by using the mpirun option

    --mca ompi_tools_entry libwrap.so

The --mca ompi_tools_entry feature can be used as any of

    --mca ompi_tools_entry /path/to/libwrap.so
    --mca ompi_tools_entry libwrap.so    (if the library will be found via LD_LIBRARY_PATH)
    --mca ompi_tools_entry wrap          (shortcut for libwrap.so)

Note that this feature doesn't automatically set LD_LIBRARY_PATH so that "libwrap.so" can be found. That could be done by using the additional option

    -x OMPI_LD_LIBRARY_PATH_PREPEND=<dir>:<dir>:...

To layer multiple libraries, a comma separated list can be used:

    --mca ompi_tools_entry libwrap1.so,libwrap2.so

A few other keywords can be added into the --mca ompi_tools_entry list:

    --mca ompi_tools_entry v        : verbose option, list the opened libraries
    --mca ompi_tools_entry preload  : cause the wrapper libraries to be added to an LD_PRELOAD.
                                      This could be needed if a library redefines non-MPI symbols
    --mca ompi_tools_entry fortran  : a layer that minimally wraps the Fortran MPI calls on top of
                                      the C calls, eg defining mpi_foo and calling PMPI_Foo
    --mca ompi_tools_entry fort     : short for fortran above

By default Fortran wrappers are placed at the top of the list and the base product is always placed at the bottom, so

    --mca ompi_tools_entry libwrap.so

would be equivalent to

    --mca ompi_tools_entry fortran,libwrap.so

and would produce a layering as

    level[0] = fortran symbols defining mpi_foo and calling PMPI_Foo
    level[1] = libwrap.so
    level[2] = MPI from the base product

In this way if libwrap.so defined MPI_Send and an application used Fortran mpi_send, the MPI_Send call in libwrap.so would be triggered. If that behavior is not desired, the fortran wrappers can be essentially disabled by moving them to the bottom of the list, eg

    --mca ompi_tools_entry libwrap.so,fortran

----------------------------------------------------------------
Signed-off-by: Mark Allen <[email protected]>
Okay, I took out all the MPI standard python and put in a small-ish json file that just contains the data I needed for this PR, and which can certainly be restructured and added to.
I like the json, but by taking away all the MPI standard scripts the current PR just has a big blob of json appearing by magic. If I used an early checkout of the standard that has bugs, or if my script that picked what info to put into the json structures had its own bugs, those would be pretty invisible. We'll certainly have opportunities to re-address this since the json I created just now doesn't have enough info to autogenerate everything in
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/6e6650c9a0b99171b12fc05ecafa585d
I'll start with a brief outline of the code changes, then a description of
what this feature is all about.
* a script that parses mpi.h.in to figure out prototypes for all the MPI functions and constructs wrapper functions for them all
* various mpicc-wrapper-data.txt.in etc files: these all get a couple new lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI app would now have a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so [real]
* plm level: gets some general purpose LD_PRELOAD and LD_LIBRARY_PATH prepend features, plus a "-mca entry" specific feature that looks at the -mca entry setting and uses the new OMPI_LD_PRELOAD_PREPEND feature to modify LD_PRELOAD if needed
* another LD_LIBRARY_PATH prepend feature, but it's just used for -mca entry rather than being a general purpose feature. The only reason for this second LD_LIBRARY_PATH prepend feature is to be able to put a setting in front of all the others. If I just use the plm-level feature above, the final LD_LIBRARY_PATH would be
      $MPI_ROOT/lib : ... : ...
  and only by intercepting here can we get in front of that first $MPI_ROOT/lib
Next I'll give a pretty long description of what profile interface
layering is about. Although to be honest, I think the main part of
this feature that people actually use is only the dynamic part. Eg at
runtime this lets you just add "-mca entry ./libmpe.so" for example
without relinking your app and it would run as if you had linked
with -lmpe.
MPI Profiling Interface Layering
The MPI standard defines a PMPI profiling interface, but in its normal usage only one profiling library can be used at a time.
A profiling wrapper library defines some subset of MPI_* entrypoints, and inside those redefinitions, it calls some combination of MPI_* and PMPI_* symbols. For example
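(the same MPI_Allgather/MPI_Wtime wrapper sketch shown in the commit message above)

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }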
In trying to use two unrelated wrapper libraries at the same time, it's in general not possible to link them in such a way that proper layering occurs. For example, consider two wrapper libraries:
1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI collectives to gather statistics about how evenly the ranks enter the collectives.
With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there the wrapper library's call to PMPI_* would resolve into the bottom level libmpi.so. Only one of the libraries would have its MPI_Init and MPI_Finalize routines called.
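For concreteness, a hedged sketch of what the hypothetical "libJobLog.so" might contain (libJobLog is the made-up example name from above; the logging logic here is invented for illustration):

    /* Hypothetical "libJobLog.so": wraps only MPI_Init and MPI_Finalize. */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    static time_t job_start;

    int MPI_Init(int *argc, char ***argv)
    {
        job_start = time(NULL);
        return PMPI_Init(argc, argv);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            fprintf(stderr, "job log: runtime %ld seconds\n",
                    (long)(time(NULL) - job_start));
        }
        return PMPI_Finalize();
    }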
Defining consistent layering behavior:
With dynamically loaded symbols, it is possible to design a consistent approach to layering for an arbitrary number of wrapper libraries.
When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two possibilities for what MPI calls it can make. It can call another MPI_* entry, or it can call a PMPI_* entry. In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then "libmpi.so" if not found. And the calls to PMPI_* would resolve to "libmpi.so".
In the case of multi-level wrapping, the equivalent behavior is for MPI_* to resolve to the current level, and PMPI_* to resolve to the next level down in the list of libraries.
We would set up an array of handles containing the dlopened handles for all the libraries.
For each MPI function an array of function pointers would exist, one for each library (see the sketch below).
Some of the entries would be null, but the bottom level should always be defined.
And a thread-local global would define the calldepth for the thread, initially 0. The calldepth needs to be thread-local data so the implementation can be thread-safe, since different threads would have unrelated call stacks.
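A sketch of the declarations referenced above, following the commit message (the dlopen flags mirror the RTLD_NOW|RTLD_GLOBAL usage from the Fortran reordering example, and depthkey is the thread-local calldepth key used in the model code below):

    void *lib[MAX_LEVELS];                  /* dlopened handles, one per level */
    int (*(fptr_MPI_Send[MAX_LEVELS]))();   /* per-function pointers, one per level; some may be null */
    pthread_key_t depthkey;                 /* thread-local calldepth, initially 0 */

    lib[0] = dlopen("libwrap3.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap2.so", RTLD_NOW|RTLD_GLOBAL);
    lib[2] = dlopen("libwrap1.so", RTLD_NOW|RTLD_GLOBAL);
    lib[3] = dlopen("libmpi.so",   RTLD_NOW|RTLD_GLOBAL);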
Model implementation of MPI_Send and PMPI_Send:
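(reproduced from the commit message above; PMPI_Send would differ only in starting from nextlev = entrylev + 1)

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag, MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev; // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (nextlev < nwrapper_levels && !fptr_MPI_Send[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }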
If code like the above is in a library called libmpiprofilesupport.so and an MPI app is linked against that library, then an example sequence of events might be
- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so
And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in libmpiprofilesupport.so, and that library decides which function pointer to go into next.
Fortran
I believe Open MPI makes the profiling design choice that wrapping the C-language MPI_Send() automatically results in Fortran mpi_send() being wrapped.
The "calldepth" behavior from above isn't a perfect match for this situation. For comparison, the traditional behavior for an application linked against a single libwrap.so would go as follows. An "ldd" on the application might show dependencies
libwrap.so
libmpi_mpifh.so
libmpi.so
and an application call to fortran mpi_send() would find the first definition in libmpi_mpifh.so.0 where it would call MPI_Send() which would go back up to libwrap.so which might call PMPI_Send() which would then go down to libmpi.so. The linker/loader starts at the top in its search for each symbol.
The layered profiling approach with a "calldepth" always goes down the list of libraries. If fortran is at the bottom of the list, those symbols would end up not being intercepted.
I think the easiest way to maintain the desired behavior is to re-order the libraries in the dynamic case as
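(as listed in the commit message above)

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL);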
So the fortran wrapper is always first, and libmpi.so is always last, with all the wrapper libraries inbetween.
Internally the -mca entry feature produces a list of entrypoint levels with three types:
1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries, libmpi.so and libmpi_mpifh.so
3. just the fortran calls from the base MPI implementation
and allows "fort" to be manually included in the list of libraries.
So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax
-mca entry fort,libwrap.so
would produce a list of entry levels as
level[0] = fortran symbols from libmpi_mpifh.so
level[1] = libwrap.so
level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so
and if an application called fortran mpi_send, the call sequence would be
- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so
So including the OMPI fortran wrappers in the list in front of libwrap.so enables automatic
wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible
to use the syntax
-mca entry libwrap.so,fort
which puts the fortran symbols at the bottom of the list where all the base MPI symbols are.
Performance
On a machine where OMPI pingpong takes 0.22 usec, the -mca entry option slowed the pingpong to 0.24 usec.
This is enough of an impact that I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub library, there is no performance impact when the feature isn't activated.
Weaknesses
I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so, it's only linked against libmpiprofilesupport.so which has a dlopen() of libwrap.so).
This commit includes a "preload" feature (that can be included in the -mca entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free for example.
Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library.
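As a hedged sketch only (fptr_MPI_Pcontrol and nwrapper_levels follow the naming of the MPI_Send model above; this is not necessarily how the PR implements it), the interception layer can do no better than forwarding the integer level to each library:

    int MPI_Pcontrol(const int level, ...)
    {
        int i, rval = MPI_SUCCESS;
        for (i = 0; i < nwrapper_levels; ++i) {
            if (fptr_MPI_Pcontrol[i]) {
                rval = fptr_MPI_Pcontrol[i](level);   /* the extra varargs are lost */
            }
        }
        return rval;
    }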
Documentation for this feature:
[Dynamic MPI Profiling interface with layering]
Signed-off-by: Mark Allen [email protected]