-mca entry <libs> for dynamic MPI profiling interface layering #6245
Conversation
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/75dfe8b79830b93f939dde35044cb3dc
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/2a5f37b8a0484f595f08d51f6bbe9247
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/b1dea1aa3420af1c6b0ec80b8b2a01a8
Added
Note that these changes to ORTE won't be coming over to PRRTE, so you'll also need to discuss and think about what you intend to do after the changeover.
One objection. If we ever add an "entry" framework (you never know) this would break things. You need to set either the framework or component name to something non-null.
Can one of the admins verify this patch?
Force-pushed from 4d19ee8 to 2515a3e
I left this one on hold forever because orte was expected to go away, but if it's not going away I'd like to get this in at some point. I've changed the name to
The main modification to OMPI is letting things be prepended to LD_LIBRARY_PATH and LD_PRELOAD, with the LD_LIBRARY_PATH part specifically done in schizo where no other code is about to jump in and prepend something else in front of what we just prepended. And that functionality is used to flip between a libmpiprofilesupport.so that does nothing and a libmpiprofilesupport.so that allows for dynamic PMPI and layered PMPI interceptions.
Prepending LD_PRELOAD and friends is already supported in OMPI via the capabilities in PMIx. Please don't add another backdoor channel to do the same thing.
PMIx has the ability to adjust environment variables when
If this were being ported to PRRTE I think it would be more straightforward in that regard since IIRC
Is there a place in ORTE where we might be able to leverage the PMIx environment handling functionality?
All you have to do is add it to the info that will be included in the launch msg - per the code in orte/orted/pmix/pmix_server_dyn.c:
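For illustration only, a rough sketch of what such a directive could look like. The PMIx envar attributes (pmix_envar_t, PMIX_PREPEND_ENVAR) exist in PMIx 3.x, but the wrapper function and directory path below are made up, and the actual code in pmix_server_dyn.c may look quite different:

    /* Hypothetical sketch, not the actual ORTE code: build a PMIx "prepend
     * this envar" directive to be included in the info passed with the
     * launch message. */
    #include <pmix_common.h>

    static void add_ld_library_path_prepend(pmix_info_t *info)
    {
        pmix_envar_t envar;

        /* ask the launcher to prepend a (made-up) profiling dir to the
         * LD_LIBRARY_PATH of the launched procs */
        PMIX_ENVAR_LOAD(&envar, "LD_LIBRARY_PATH",
                        "/opt/ompi/lib/profilesupport", ':');
        PMIX_INFO_LOAD(info, PMIX_PREPEND_ENVAR, &envar, PMIX_ENVAR);
        PMIX_ENVAR_DESTRUCT(&envar);
    }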
Note that I am about 70% done with making PMIx a "1st class citizen" in OMPI, which means that mpirun will be calling PMIx_Spawn so that it can (if on an appropriate system) use the system PMIx server to directly launch the processes.
Force-pushed from 2515a3e to 4d782bf
Thanks, I moved the first reading of the entry MCAs into a schizo component that sets ORTE_JOB_PREPEND_ENVAR attributes to get the prepending done. Seems much nicer.
Force-pushed from 4d782bf to 3711a00
Force-pushed from 3711a00 to 58d94c9
The last update I've made here is for MPI_Pcontrol. Previously I only had it set up to be used from the top level by the app, but I could imagine a wrapper library being used to activate and then deactivate profiling under certain conditions, so it now allows that.
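Not code from this PR, just a hedged sketch of what a wrapper library using MPI_Pcontrol to switch its own measurement on and off might look like (the profiling_active flag and the timing logic are invented for illustration):

    /* Hypothetical wrapper-library sketch: MPI_Pcontrol toggles profiling,
     * whether called from the application or from another wrapper layer. */
    #include <mpi.h>

    static int profiling_active = 0;

    int MPI_Pcontrol(const int level, ...)
    {
        profiling_active = (level != 0);   /* 0 disables, nonzero enables */
        return PMPI_Pcontrol(level);       /* pass the level down the chain */
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        double t0 = 0.0;
        int rc;
        if (profiling_active) t0 = PMPI_Wtime();
        rc = PMPI_Barrier(comm);
        if (profiling_active) {
            /* e.g. accumulate PMPI_Wtime() - t0 somewhere */
        }
        return rc;
    }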
Force-pushed from 58d94c9 to f79e03f
I've updated the auto-generation of wrappers so that instead of parsing "mpi.h.in" it uses the MPI standard pythonization effort. @jsquyres, don't get too excited, this isn't a big usage like generating the code to make mpi_send() do translations of arguments and call MPI_Send(); rather, this is merely wrapping mpi_send() and calling a function pointer into the MPI library's mpi_send(). It's a fairly trivial usage of the pythonization info, but it's still a start.
Using the pythonization introduces a Python 3.8 dependency into OMPI. That's true of the entire pythonization effort in the MPI standard, and that requirement gets inherited here. So to mitigate that I left in the former non-pythonization version as "mkcode.pl" with a "hard_coded_fn_list.txt", and use a python3 --version check to decide whether to activate the pythonization version or just use the hard-coded list.
The data from the standard's *.tex is located in "generated_mpistandard_parsesets.py". That file can be re-generated by "extract.py" from a checked-out tree of the MPI standard. (@jsquyres the *.tex from there doesn't appear to have MPI_Comm_c2f etc pythonized (?))
Force-pushed from f79e03f to 5554c02
I repushed with a change to how LD_LIBRARY_PATH gets prepended through prrte. Rather than try to force fit that feature into prrte, I went with prepending a script into argv of the command being started, so if you type
The use case here is that the LD_LIBRARY_PATH really needs to have the new entry be first, not just somewhere in the list where the order the code happens in could potentially result in other things being prepended in front of it. So the script is an easy way to guarantee that
@markalle This seems problematic for a few reasons -- are there other approaches?
I like the idea of using the JSON provided by the MPI Forum so we don't have to maintain the parsing scripts in OMPI. I bet we can get an early drop of that to work with. That should help simplify things a bit. To parse the JSON with Perl we will likely need a Perl library (I use one for our CI setup to parse the GitHub responses - whatever is provided by
@markalle I would also strongly urge not to copy the python code from the MPI Standard (mutation is severe and imminent). I can easily write an emitter which dumps a json file (or any other convenient format) of C declarations, which I assume is what you need from the comment on Feb 19?
Yes, that is an open question: what exactly do we want in JSON from the Forum? Language-neutral specifications? C, mpif.h, mpi module, and mpi_f08 specifications? ...?
About the prrte argv manipulation, we had a feature like that in Platform-MPI, that would allow any cmdline like "mpirun ... program.x" to run as if it had been "mpirun ... myscript program.x". I feel like that's a smaller imposition than the type of LD_LIBRARY_PATH manipulation where I'd be asking not just to be prepended, but specifically to be the first thing prepended, eg not to let any other part of the code prepend $MPI_ROOT/lib in front of my entry. It's also possible to accomplish roughly the same thing with LD_PRELOAD without touching LD_LIBRARY_PATH, but that setting hits more processes and in the past I've had apps that failed strangely with LD_PRELOAD so I'd prefer it to just be an option rather than the only way to activate this feature. Anyway back to the python: I'm happy to switch it around and I'm not loving the python parsing that's there right now. But wouldn't the JSON output be roughly equivalent to the array of PARSESETs that I have now after running the python code out of the standard? If so, when I walk a list of
@markalle I believe @jsquyres's idea is that we output directly what you need to a JSON. If I understand correctly you want access to the C parameters of all functions? I would write an emitter (as for the binding_emitter for latex) which would translate the current "apis.json" into a specific json for you to load and use -- something like the sketch below. The "apis.json" is definitely an internal data structure that is not intended to be used outside.
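Purely as a strawman illustration (these field names are invented here, not what the emitter will actually produce), a per-function entry might look something like:

    {
      "MPI_Send": {
        "return_type": "int",
        "parameters": [
          { "name": "buf",      "type": "const void *" },
          { "name": "count",    "type": "int" },
          { "name": "datatype", "type": "MPI_Datatype" },
          { "name": "dest",     "type": "int" },
          { "name": "tag",      "type": "int" },
          { "name": "comm",     "type": "MPI_Comm" }
        ]
      }
    }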
@martinruefenacht That would be great. I could see that expanding quite a bit for other needs though. The current PR doesn't want the entries that are
But yeah, a fairly simple json output would satisfy the needs of this PR (this PR used to parse the MPI functions straight out of mpi.h, it doesn't need very much info). The next step up in complexity of code generation from json output is probably to generate everything in ompi/mpi/fortran/mpif-h/*.c. For that we'd at least need type/IN/OUT for all the args, anything that's an array would need an identification of how many dimensions and elements it has, and any "significant only at root" identifications (eg we can't just translate gatherv's displs[] for all ranks, the leaves aren't required to pass in anything for it). And we'd need to know which buf is used to identify a collective as INPLACE and which parameters are ignored in the presence of INPLACE. Maybe something like "ignored if
@markalle Okay, I will write an emitter and let you know when it is in place (in a branch for the moment). I assume the above given format is good? What else would you like to see for each function name/parameter? With respect to other needs, if it is something we have "recorded" then it wouldn't be a problem. The latter step is in progress. Jeff and I did some of that work of encoding all that information, but there will be gaps. In other words this will be possible and is the target, but it will be a while.
…interception)

I'll start with a brief outline of the code changes, then a description of what this feature is all about.

* scripts: mkcode.py, which uses a simple version of json info for MPI calls that was generated using the MPI standard.
* various mpicc-wrapper-data.txt.in etc files: these all get a couple new lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI app would now have a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so [real]
* schizo component "entry": look up the --mca ompi_tools_entry setting and add a profilesupport wrapper script in front of the executable so that LD_LIBRARY_PATH and possibly LD_PRELOAD can be modified in front of the executable.

Next I'll give a pretty long description of what profile interface layering is about. Although to be honest, I think the main part of this feature that people actually use is only the dynamic part. Eg at runtime this lets you just add "--mca ompi_tools_entry ./libmpe.so" for example without relinking your app and it would run as if you had linked with -lmpe.

------------------------------------------------------------------------
MPI Profiling Interface Layering

The MPI standard defines a PMPI profiling interface, but in its normal usage only one profiling library can be used at a time.

A profiling wrapper library defines some subset of MPI_* entrypoints, and inside those redefinitions, it calls some combination of MPI_* and PMPI_* symbols. For example

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }

In trying to use two unrelated wrapper libraries at the same time, it's in general not possible to link them in such a way that proper layering occurs. For example, consider two wrapper libraries:

1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI collectives to gather statistics about how evenly the ranks enter the collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there the wrapper library's call to PMPI_* would resolve into the bottom level libmpi.so. Only one of the libraries would have its MPI_Init and MPI_Finalize routines called.

----------------------------------------------------------------
Defining consistent layering behavior:

With dynamically loaded symbols, it is possible to design a consistent approach to layering for an arbitrary number of wrapper libraries.

When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two possibilities for what MPI calls it can make. It can call another MPI_* entry, or it can call a PMPI_* entry. In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then "libmpi.so" if not found.
And the calls to PMPI_* would resolve to "libmpi.so".

In the case of multi-level wrapping, the equivalent behavior is for MPI_* to resolve to the current level, and PMPI_* to resolve to the next level down in the list of libraries.

We would set up an array of handles

    void *lib[MAX_LEVELS];

containing the dlopened handles for all the libraries, eg

    lib[0] = dlopen("libwrap3.so", );
    lib[1] = dlopen("libwrap2.so", );
    lib[2] = dlopen("libwrap1.so", );
    lib[3] = dlopen("libmpi.so", );

For each MPI function an array of function pointers would exist, one for each library:

    int (*(fptr_MPI_Send[MAX_LEVELS]))();

Some of the entries would be null, but the bottom level should always be defined.

And a thread-local global would define the calldepth for the thread, initially 0. The calldepth needs to be thread-local data so the implementation can be thread-safe, since different threads would have unrelated call stacks.

Model implementation of MPI_Send and PMPI_Send:

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag, MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev; // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (nextlev < nwrapper_levels && !fptr_MPI_Send[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }

If code like the above is in a library called libmpiprofilesupport.so and an MPI app is linked against that library, then an example sequence of events might be

- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so

And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in libmpiprofilesupport.so, and that library decides which function pointer to go into next.

----------------------------------------------------------------
Fortran

I believe Open MPI makes the profiling design choice that wrapping the C-language MPI_Send() automatically results in Fortran mpi_send() being wrapped.

The "calldepth" behavior from above isn't a perfect match for this situation. For comparison, the traditional behavior for an application linked against a single libwrap.so would go as follows. An "ldd" on the application might show dependencies

    libwrap.so
    libmpi_mpifh.so
    libmpi.so

and an application call to fortran mpi_send() would find the first definition in libmpi_mpifh.so.0 where it would call MPI_Send() which would go back up to libwrap.so which might call PMPI_Send() which would then go down to libmpi.so. The linker/loader starts at the top in its search for each symbol.

The layered profiling approach with a "calldepth" always goes down the list of libraries. If fortran is at the bottom of the list, those symbols would end up not being intercepted.

I think the easiest way to maintain the desired behavior is to re-order the libraries in the dynamic case as

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL);

So the fortran wrapper is always first, and libmpi.so is always last, with all the wrapper libraries in between.

Internally the --mca ompi_tools_entry feature produces a list of entrypoint levels with three types:
1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries, libmpi.so and libmpi_mpifh.so
3. just the fortran calls from the base MPI implementation

and allows "fort" to be manually included in the list of libraries. So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax

    --mca ompi_tools_entry fort,libwrap.so

would produce a list of entry levels as

    level[0] = fortran symbols from libmpi_mpifh.so
    level[1] = libwrap.so
    level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so

and if an application called fortran mpi_send, the call sequence would be

- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so

So including the OMPI fortran wrappers in the list in front of libwrap.so enables automatic wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible to use the syntax

    --mca ompi_tools_entry libwrap.so,fort

which puts the fortran symbols at the bottom of the list where all the base MPI symbols are.

----------------------------------------------------------------
Performance

On a machine where OMPI pingpong takes 0.22 usec, the --mca ompi_tools_entry option slowed the pingpong to 0.24 usec.

This is enough of an impact that I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub library, there is no performance impact when the feature isn't activated.

----------------------------------------------------------------
Weaknesses

I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so, it's only linked against libmpiprofilesupport.so which has a dlopen() of libwrap.so).

This commit includes a "preload" feature (that can be included in the --mca ompi_tools_entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free for example.

Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library.
----------------------------------------------------------------
Documentation for this feature:

[Dynamic MPI Profiling interface with layering]

The MCA option --mca ompi_tools_entry <list> can be used to enable interception from profiling libraries at runtime.

In traditional MPI usage a wrapper library "libwrap.so" can be built that redefines selected MPI calls, and using that library would require relinking the application with -lwrap to enable the interception. But this feature allows such interception to be enabled at runtime without relinking by using the mpirun option

    --mca ompi_tools_entry libwrap.so

The --mca ompi_tools_entry feature can be used as any of

    --mca ompi_tools_entry /path/to/libwrap.so
    --mca ompi_tools_entry libwrap.so    (if the library will be found via LD_LIBRARY_PATH)
    --mca ompi_tools_entry wrap          (shortcut for libwrap.so)

Note that this feature doesn't automatically set LD_LIBRARY_PATH so that "libwrap.so" can be found. That could be done by using the additional option

    -x OMPI_LD_LIBRARY_PATH_PREPEND=<dir>:<dir>:...

To layer multiple libraries, a comma separated list can be used:

    --mca ompi_tools_entry libwrap1.so,libwrap2.so

A few other keywords can be added into the --mca ompi_tools_entry list:

    --mca ompi_tools_entry v        : verbose option, list the opened libraries
    --mca ompi_tools_entry preload  : cause the wrapper libraries to be added to an LD_PRELOAD.
                                      This could be needed if a library redefines non-MPI symbols
    --mca ompi_tools_entry fortran  : a layer that minimally wraps the Fortran MPI calls on top of
                                      the C calls, eg defining mpi_foo and calling PMPI_Foo
    --mca ompi_tools_entry fort     : short for fortran above

By default Fortran wrappers are placed at the top of the list and the base product is always placed at the bottom, so

    --mca ompi_tools_entry libwrap.so

would be equivalent to

    --mca ompi_tools_entry fortran,libwrap.so

and would produce a layering as

    level[0] = fortran symbols defining mpi_foo and calling PMPI_Foo
    level[1] = libwrap.so
    level[2] = MPI from the base product

In this way if libwrap.so defined MPI_Send and an application used Fortran mpi_send, the MPI_Send call in libwrap.so would be triggered. If that behavior is not desired, the fortran wrappers can be essentially disabled by moving them to the bottom of the list, eg

    --mca ompi_tools_entry libwrap.so,fortran

----------------------------------------------------------------
Signed-off-by: Mark Allen <[email protected]>
Okay, I took out all the MPI standard python and put in a small-ish json file that just contains the data I needed for this PR, and which can certainly be restructured and added to.
I like the json, but by taking away all the MPI standard scripts the current PR just has a big blob of json appearing by magic. If I used an early checkout of the standard that has bugs, or if my script that picked what info to put into the json structures had its own bugs, those would be pretty invisible. We'll certainly have opportunities to re-address this since the json I created just now doesn't have enough info to autogenerate everything in
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/6e6650c9a0b99171b12fc05ecafa585d
I'll start with a brief outline of the code changes, then a description of
what this feature is all about.
* a script that parses mpi.h.in to figure out prototypes for all the MPI functions and constructs wrapper functions for them all
* various mpicc-wrapper-data.txt.in etc files: these all get a couple new lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI app would now have a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so [real]
* plm level: gets some general purpose LD_PRELOAD and LD_LIBRARY_PATH prepend features, plus a "-mca entry" specific feature that looks at the -mca entry setting and uses the new OMPI_LD_PRELOAD_PREPEND feature to modify LD_PRELOAD if needed
* another LD_LIBRARY_PATH prepend feature, but it's just used for -mca entry rather than being a general purpose feature. The only reason for this second LD_LIBRARY_PATH prepend feature is to be able to put a setting in front of all the others. If I just use the plm-level feature above, the final LD_LIBRARY_PATH would be
      $MPI_ROOT/lib : ... : ...
  and only by intercepting here can we get in front of that first $MPI_ROOT/lib
Next I'll give a pretty long description of what profile interface
layering is about. Although to be honest, I think the main part of
this feature that people actually use is only the dynamic part. Eg at
runtime this lets you just add "-mca entry ./libmpe.so" for example
without relinking your app and it would run as if you had linked
with -lmpe.
MPI Profiling Interface Layering
The MPI standard defines a PMPI profiling interface, but in its normal usage only one profiling library can be used at a time.
A profiling wrapper library defines some subset of MPI_* entrypoints, and inside those redefinitions, it calls some combination of MPI_* and PMPI_* symbols. For example
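(the same MPI_Allgather/MPI_Wtime wrapper sketch shown in the commit message above)

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }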
In trying to use two unrelated wrapper libraries at the same time, it's in general not possible to link them in such a way that proper layering occurs. For example, consider two wrapper libraries:
1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI collectives to gather statistics about how evenly the ranks enter the collectives.
With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there the wrapper library's call to PMPI_* would resolve into the bottom level libmpi.so. Only one of the libraries would have its MPI_Init and MPI_Finalize routines called.
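For concreteness, a hedged sketch of what the hypothetical "libJobLog.so" might contain (libJobLog is the made-up example name from above; the logging logic here is invented for illustration):

    /* Hypothetical "libJobLog.so": wraps only MPI_Init and MPI_Finalize. */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    static time_t job_start;

    int MPI_Init(int *argc, char ***argv)
    {
        job_start = time(NULL);
        return PMPI_Init(argc, argv);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            fprintf(stderr, "job log: runtime %ld seconds\n",
                    (long)(time(NULL) - job_start));
        }
        return PMPI_Finalize();
    }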
Defining consistent layering behavior:
With dynamically loaded symbols, it is possible to design a consistent approach to layering for an arbitrary number of wrapper libraries.
When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two possibilities for what MPI calls it can make. It can call another MPI_* entry, or it can call a PMPI_* entry. In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then "libmpi.so" if not found. And the calls to PMPI_* would resolve to "libmpi.so".
In the case of multi-level wrapping, the equivalent behavior is for MPI_* to resolve to the current level, and PMPI_* to resolve to the next level down in the list of libraries.
We would set up an array of handles containing the dlopened handles for all the libraries.
For each MPI function an array of function pointers would exist, one for each library (see the sketch below).
Some of the entries would be null, but the bottom level should always be defined.
And a thread-local global would define the calldepth for the thread, initially 0. The calldepth needs to be thread-local data so the implementation can be thread-safe, since different threads would have unrelated call stacks.
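A sketch of the declarations referenced above, following the commit message (the dlopen flags mirror the RTLD_NOW|RTLD_GLOBAL usage from the Fortran reordering example, and depthkey is the thread-local calldepth key used in the model code below):

    void *lib[MAX_LEVELS];                  /* dlopened handles, one per level */
    int (*(fptr_MPI_Send[MAX_LEVELS]))();   /* per-function pointers, one per level; some may be null */
    pthread_key_t depthkey;                 /* thread-local calldepth, initially 0 */

    lib[0] = dlopen("libwrap3.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap2.so", RTLD_NOW|RTLD_GLOBAL);
    lib[2] = dlopen("libwrap1.so", RTLD_NOW|RTLD_GLOBAL);
    lib[3] = dlopen("libmpi.so",   RTLD_NOW|RTLD_GLOBAL);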
Model implementation of MPI_Send and PMPI_Send:
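(reproduced from the commit message above; PMPI_Send would differ only in starting from nextlev = entrylev + 1)

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag, MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev; // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (nextlev < nwrapper_levels && !fptr_MPI_Send[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }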
If code like the above is in a library called libmpiprofilesupport.so and an MPI app is linked against that library, then an example sequence of events might be
- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so
And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in libmpiprofilesupport.so, and that library decides which function pointer to go into next.
Fortran
I believe Open MPI makes the profiling design choice that wrapping the C-language MPI_Send() automatically results in Fortran mpi_send() being wrapped.
The "calldepth" behavior from above isn't a perfect match for this situation. For comparison, the traditional behavior for an application linked against a single libwrap.so would go as follows. An "ldd" on the application might show dependencies
libwrap.so
libmpi_mpifh.so
libmpi.so
and an application call to fortran mpi_send() would find the first definition in libmpi_mpifh.so.0 where it would call MPI_Send() which would go back up to libwrap.so which might call PMPI_Send() which would then go down to libmpi.so. The linker/loader starts at the top in its search for each symbol.
The layered profiling approach with a "calldepth" always goes down the list of libraries. If fortran is at the bottom of the list, those symbols would end up not being intercepted.
I think the easiest way to maintain the desired behavior is to re-order the libraries in the dynamic case as
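(as listed in the commit message above)

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL);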
So the fortran wrapper is always first, and libmpi.so is always last, with all the wrapper libraries inbetween.
Internally the -mca entry feature produces a list of entrypoint levels with three types:
1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries, libmpi.so and libmpi_mpifh.so
3. just the fortran calls from the base MPI implementation
and allows "fort" to be manually included in the list of libraries.
So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax
-mca entry fort,libwrap.so
would produce a list of entry levels as
level[0] = fortran symbols from libmpi_mpifh.so
level[1] = libwrap.so
level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so
and if an application called fortran mpi_send, the call sequence would be
- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so
So including the OMPI fortran wrappers in the list in front of libwrap.so enables automatic
wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible
to use the syntax
-mca entry libwrap.so,fort
which puts the fortran symbols at the bottom of the list where all the base MPI symbols are.
Performance
On a machine where OMPI pingpong takes 0.22 usec, the -mca entry option slowed the pingpong to 0.24 usec.
This is enough of an impact that I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub library, there is no performance impact when the feature isn't activated.
Weaknesses
I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so, it's only linked against libmpiprofilesupport.so which has a dlopen() of libwrap.so).
This commit includes a "preload" feature (that can be included in the -mca entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free for example.
Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library.
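As a hedged sketch only (fptr_MPI_Pcontrol and nwrapper_levels follow the naming of the MPI_Send model above; this is not necessarily how the PR implements it), the interception layer can do no better than forwarding the integer level to each library:

    int MPI_Pcontrol(const int level, ...)
    {
        int i, rval = MPI_SUCCESS;
        for (i = 0; i < nwrapper_levels; ++i) {
            if (fptr_MPI_Pcontrol[i]) {
                rval = fptr_MPI_Pcontrol[i](level);   /* the extra varargs are lost */
            }
        }
        return rval;
    }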
Documentation for this feature:
[Dynamic MPI Profiling interface with layering]
Signed-off-by: Mark Allen [email protected]