Commit fd390eb
--mca ompi_tools_entry <libs> for PMPI layering (and dynamic wrapper interception)

I'll start with a brief outline of the code changes, then a description of
what this feature is all about.

* a script mkcode.py that uses a simple version of JSON info for the MPI
  calls that was generated using the MPI standard.
* various mpicc-wrapper-data.txt.in etc files: these all get a couple of new
  lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI
  app would now show a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so in two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so                 [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so  [real]
* schizo component entry: looks up the --mca ompi_tools_entry setting and
  adds a profilesupport wrapper script in front of the executable so that
  LD_LIBRARY_PATH and possibly LD_PRELOAD can be modified before the
  executable runs.

Next I'll give a pretty long description of what profile interface layering
is about, although to be honest, I think the main part of this feature that
people actually use is the dynamic part. E.g. at runtime this lets you just
add "--mca ompi_tools_entry ./libmpe.so", for example, without relinking your
app, and it would run as if you had linked with -lmpe.

------------------------------------------------------------------------
MPI Profiling Interface Layering

The MPI standard defines a PMPI profiling interface, but in its normal usage
only one profiling library can be used at a time. A profiling wrapper library
defines some subset of the MPI_* entrypoints, and inside those redefinitions
it calls some combination of MPI_* and PMPI_* symbols. For example:

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt, void *rbuf,
                      int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }

When trying to use two unrelated wrapper libraries at the same time, it is in
general not possible to link them in such a way that proper layering occurs.
For example, consider two wrapper libraries:

1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every
   MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI
   collectives, to gather statistics about how evenly the ranks enter the
   collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper
libraries, and from there the wrapper library's call to PMPI_* would resolve
into the bottom-level libmpi.so. Only one of the libraries would have its
MPI_Init and MPI_Finalize routines called.

----------------------------------------------------------------
Defining consistent layering behavior:

With dynamically loaded symbols, it is possible to design a consistent
approach to layering for an arbitrary number of wrapper libraries.

When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two
possibilities for what MPI calls it can make: it can call another MPI_*
entry, or it can call a PMPI_* entry. In the case of ordinary single-level
wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then
"libmpi.so" if not found, and the calls to PMPI_* would resolve to
"libmpi.so".
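As a concrete sketch of that single-level case, a minimal hypothetical
libwrap.so could contain something like the following (an illustrative sketch
only, not code from this commit; MPI_Bcast is an arbitrary choice):

    /* hypothetical "libwrap.so" for the ordinary single-level case */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Bcast(void *buf, int count, MPI_Datatype dt, int root, MPI_Comm comm)
    {
        /* MPI_Wtime resolves into libwrap.so first if redefined here,
         * otherwise it falls through to libmpi.so */
        double t0 = MPI_Wtime();

        /* PMPI_Bcast resolves straight into libmpi.so */
        int rc = PMPI_Bcast(buf, count, dt, root, comm);

        printf("MPI_Bcast of %d elements took %g sec\n", count, MPI_Wtime() - t0);
        return rc;
    }
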
In the case of multi-level wrapping, the equivalent behavior is for MPI_* to
resolve to the current level, and PMPI_* to resolve to the next level down in
the list of libraries.

We would set up an array of handles

    void *lib[MAX_LEVELS];

containing the dlopened handles for all the libraries, e.g.

    lib[0] = dlopen("libwrap3.so", );
    lib[1] = dlopen("libwrap2.so", );
    lib[2] = dlopen("libwrap1.so", );
    lib[3] = dlopen("libmpi.so", );

For each MPI function an array of function pointers would exist, one for each
library:

    int (*(fptr_MPI_Send[MAX_LEVELS]))();

Some of the entries would be null, but the bottom level should always be
defined. And a thread-local global would define the calldepth for the thread,
initially 0. The calldepth needs to be thread-local data so the
implementation can be thread-safe, since different threads would have
unrelated call stacks.

Model implementation of MPI_Send and PMPI_Send:

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag,
                 MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev; // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (nextlev<nwrapper_levels && !fptr_MPI_Send[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                   entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }

If code like the above is in a library called libmpiprofilesupport.so and an
MPI app is linked against that library, then an example sequence of events
might be

- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so

And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in
libmpiprofilesupport.so, and that library decides which function pointer to
go into next.

----------------------------------------------------------------
Fortran

I believe Open MPI makes the profiling design choice that wrapping the
C-language MPI_Send() automatically results in Fortran mpi_send() being
wrapped. The "calldepth" behavior from above isn't a perfect match for this
situation.

For comparison, the traditional behavior for an application linked against a
single libwrap.so would go as follows. An "ldd" on the application might show
the dependencies

    libwrap.so
    libmpi_mpifh.so
    libmpi.so

and an application call to Fortran mpi_send() would find the first definition
in libmpi_mpifh.so.0, where it would call MPI_Send(), which would go back up
to libwrap.so, which might call PMPI_Send(), which would then go down to
libmpi.so. The linker/loader starts at the top in its search for each symbol.

The layered profiling approach with a "calldepth" always goes down the list
of libraries. If Fortran is at the bottom of the list, those symbols would
end up not being intercepted. I think the easiest way to maintain the desired
behavior is to re-order the libraries in the dynamic case as

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL); So the fortran wrapper is always first, and libmpi.so is always last, with all the wrapper libraries inbetween. Internally the --mca ompi_tools_entry feature produces a list of entrypoint levels with three types: 1. wrapper libraries like libwrap.so 2. base MPI implementation which is two libraries libmpi.so and libmpi_mpifh.so 3. just the fortran calls from the base MPI implementation and allows "fort" to be manually included in the list of libraries. So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax --mca ompi_tools_entry fort,libwrap.so would produce a list of entry levels as level[0] = fortran symbols from libmpi_mpifh.so level[1] = libwrap.so level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so and if an application called fortran mpi_send, the call sequence would be - app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so - calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so - libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so - calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so - libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so - calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so so including the OMPI fortran wrappers in the list in front of libwrap.so enabled automatic wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible tp use the syntax --mca ompi_tools_entry libwrap.so,fort which puts the fortran symbols at the bottom of the list where all the base MPI symbols are. ---------------------------------------------------------------- Performance On a machine where OMPI pingpong takes 0.22 usec, the --mca ompi_tools_entry option slowed the pingpong to 0.24 usec. This is enough of an impact I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub library there is no performance impact when the feature isn't activated. ---------------------------------------------------------------- Weaknesses I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so, it's only linked against libmpiprofilesupport.so which has a dlopen() of libwrap.so). This commit includes a "preload" feature (that can be included in the --mca ompi_tools_entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free for example. Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library. 
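One step the model MPI_Send code above leaves implicit is how the
fptr_MPI_Send[] table gets filled in from the dlopened handles. A minimal
sketch of that setup follows; the names track the model code above, while
MAX_LEVELS's value and the function name are my own, and this is not the
committed wrap.c:

    #include <dlfcn.h>

    #define MAX_LEVELS 16

    extern void *lib[MAX_LEVELS];        /* dlopened handles, top to bottom   */
    extern int   nwrapper_levels;        /* number of valid entries in lib[]  */

    int (*fptr_MPI_Send[MAX_LEVELS])();  /* one slot per level, may stay NULL */

    void setup_MPI_Send_table(void)
    {
        int lev;
        for (lev = 0; lev < nwrapper_levels; ++lev) {
            /* dlsym() returns NULL if this level doesn't redefine MPI_Send;
             * the dispatcher skips NULL entries and falls through to the next
             * level down (the bottom level, libmpi.so, always defines it) */
            fptr_MPI_Send[lev] = (int (*)()) dlsym(lib[lev], "MPI_Send");
        }
    }

One such table is needed per MPI function, which is presumably what the
mkcode.py script generates from the JSON description of the MPI calls.
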
----------------------------------------------------------------
Documentation for this feature:

[Dynamic MPI Profiling interface with layering]

The MCA option --mca ompi_tools_entry <list> can be used to enable
interception from profiling libraries at runtime. In traditional MPI usage a
wrapper library "libwrap.so" can be built that redefines selected MPI calls,
and using that library would require relinking the application with -lwrap to
enable the interception. This feature instead allows such interception to be
enabled at runtime, without relinking, by using the mpirun option

    --mca ompi_tools_entry libwrap.so

The --mca ompi_tools_entry feature can be used as any of

    --mca ompi_tools_entry /path/to/libwrap.so
    --mca ompi_tools_entry libwrap.so    (if the library will be found via LD_LIBRARY_PATH)
    --mca ompi_tools_entry wrap          (shortcut for libwrap.so)

Note that this feature doesn't automatically set LD_LIBRARY_PATH so that
"libwrap.so" can be found. That could be done by using the additional option

    -x OMPI_LD_LIBRARY_PATH_PREPEND=<dir>:<dir>:...

To layer multiple libraries, a comma-separated list can be used:

    --mca ompi_tools_entry libwrap1.so,libwrap2.so

A few other keywords can be added into the --mca ompi_tools_entry list:

    --mca ompi_tools_entry v       : verbose option, list the opened libraries
    --mca ompi_tools_entry preload : cause the wrapper libraries to be added
                                     to an LD_PRELOAD. This could be needed
                                     if a library redefines non-MPI symbols
    --mca ompi_tools_entry fortran : a layer that minimally wraps the Fortran
                                     MPI calls on top of the C calls, e.g.
                                     defining mpi_foo and calling PMPI_Foo
    --mca ompi_tools_entry fort    : short for fortran above

By default the Fortran wrappers are placed at the top of the list and the
base product is always placed at the bottom, so

    --mca ompi_tools_entry libwrap.so

would be equivalent to

    --mca ompi_tools_entry fortran,libwrap.so

and would produce a layering as

    level[0] = fortran symbols defining mpi_foo and calling PMPI_Foo
    level[1] = libwrap.so
    level[2] = MPI from the base product

In this way, if libwrap.so defined MPI_Send and an application used Fortran
mpi_send, the MPI_Send call in libwrap.so would be triggered. If that
behavior is not desired, the Fortran wrappers can be essentially disabled by
moving them to the bottom of the list, e.g.

    --mca ompi_tools_entry libwrap.so,fortran

----------------------------------------------------------------

Signed-off-by: Mark Allen <[email protected]>
Signed-off-by: Mark Allen <[email protected]>
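For reference, an entire libwrap.so of the kind the documentation above talks
about can be as small as the sketch below (a hypothetical example; the file
names and command lines in the comment are assumptions, not part of this
commit):

    /* wrap_send.c -- hypothetical wrapper counting MPI_Send calls.
     *
     * Possible build/run, assuming a GNU-like toolchain:
     *   mpicc -shared -fPIC -o libwrap.so wrap_send.c
     *   mpirun -np 2 --mca ompi_tools_entry ./libwrap.so ./app.x
     */
    #include <mpi.h>
    #include <stdio.h>

    static long nsends = 0;

    int MPI_Send(const void *buf, int count, MPI_Datatype dt, int dest,
                 int tag, MPI_Comm comm)
    {
        ++nsends;   /* count every send this rank makes */
        return PMPI_Send(buf, count, dt, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d made %ld MPI_Send calls\n", rank, nsends);
        return PMPI_Finalize();
    }
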
1 parent ee9baf0, commit fd390eb

24 files changed: +1437 -7 lines
config/ompi_config_files.m4
Lines changed: 4 additions & 0 deletions

@@ -6,6 +6,7 @@
 # Copyright (c) 2018 Los Alamos National Security, LLC. All rights
 # reserved.
 # Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
+# Copyright (c) 2020 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -46,6 +47,7 @@ AC_DEFUN([OMPI_CONFIG_FILES],[
 ompi/mpi/tool/profile/Makefile
 
 ompi/tools/ompi_info/Makefile
+ompi/tools/profilesupport/Makefile
 ompi/tools/wrappers/Makefile
 ompi/tools/wrappers/mpicc-wrapper-data.txt
 ompi/tools/wrappers/mpic++-wrapper-data.txt
@@ -57,5 +59,7 @@ AC_DEFUN([OMPI_CONFIG_FILES],[
 ompi/tools/wrappers/mpijavac.pl
 ompi/tools/mpisync/Makefile
 ompi/tools/mpirun/Makefile
+ompi/tools/pmpi_dynamic_layering/stub/Makefile
+ompi/tools/pmpi_dynamic_layering/real/Makefile
 ])
 ])

ompi/runtime/ompi_mpi_params.c
Lines changed: 21 additions & 0 deletions

@@ -19,6 +19,7 @@
  * All rights reserved.
  * Copyright (c) 2016-2019 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
+ * Copyright (c) 2020      IBM Corporation. All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -83,6 +84,8 @@ static bool show_file_mca_params = false;
 static bool show_enviro_mca_params = false;
 static bool show_override_mca_params = false;
 static bool ompi_mpi_oversubscribe = false;
+static char *entry_string = NULL;
+static char *entry_base_string = NULL;
 
 int ompi_mpi_register_params(void)
 {
@@ -340,6 +343,24 @@ int ompi_mpi_register_params(void)
                                  MCA_BASE_VAR_SCOPE_READONLY,
                                  &ompi_mpi_spc_dump_enabled);
 
+    /* -mca tools_entry <libraries-and-options> */
+    (void) mca_base_var_register("ompi", "tools", NULL, "entry",
+                                 "options for dynamic PMPI wrapper layering",
+                                 MCA_BASE_VAR_TYPE_STRING, NULL,
+                                 0, 0,
+                                 OPAL_INFO_LVL_3,
+                                 MCA_BASE_VAR_SCOPE_READONLY,
+                                 &entry_string);
+
+    /* -mca tools_entry_base <libraries> */
+    (void) mca_base_var_register("ompi", "tools", NULL, "entry_base",
+                                 "base libraries for dynamic PMPI wrapper layering",
+                                 MCA_BASE_VAR_TYPE_STRING, NULL,
+                                 0, 0,
+                                 OPAL_INFO_LVL_3,
+                                 MCA_BASE_VAR_SCOPE_READONLY,
+                                 &entry_base_string);
+
     return OMPI_SUCCESS;
 }
 

ompi/tools/Makefile.am
Lines changed: 7 additions & 1 deletion

@@ -11,7 +11,7 @@
 # Copyright (c) 2004-2005 The Regents of the University of California.
 # All rights reserved.
 # Copyright (c) 2014 Intel, Inc. All rights reserved.
-# Copyright (c) 2020 IBM Corporation. All rights reserved.
+# Copyright (c) 2019-2020 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -24,11 +24,17 @@
 SUBDIRS += \
     tools/mpirun \
     tools/ompi_info \
+    tools/profilesupport \
     tools/wrappers \
+    tools/pmpi_dynamic_layering/stub \
+    tools/pmpi_dynamic_layering/real \
     tools/mpisync
 
 DIST_SUBDIRS += \
     tools/mpirun \
     tools/ompi_info \
+    tools/profilesupport \
     tools/wrappers \
+    tools/pmpi_dynamic_layering/stub \
+    tools/pmpi_dynamic_layering/real \
     tools/mpisync
ompi/tools/pmpi_dynamic_layering/real/Makefile.am (new file)
Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+#
+# Copyright (c) 2016-2020 IBM Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# $HEADER$
+#
+
+include $(top_srcdir)/Makefile.ompi-rules
+
+psupportdir = $(libdir)/profilesupport
+psupport_LTLIBRARIES = libmpiprofilesupport.la
+here_src = $(top_srcdir)/ompi/tools/pmpi_dynamic_layering/real
+
+libmpiprofilesupport_la_SOURCES = wrap.c constructed_wrapper_done.h
+#libmpiprofilesupport_la_LDFLAGS= -avoid-version
+libmpiprofilesupport_la_LDFLAGS = -version-info $(libmpi_so_version)
+
+nodist_prog_SOURCES = constructed_wrapper_done.h
+BUILT_SOURCES = constructed_wrapper_done.h
+
+OMPI_OMIT_MPI1_COMPAT_DECLS=1
+if OMPI_ENABLE_MPI1_COMPAT
+OMPI_OMIT_MPI1_COMPAT_DECLS=0
+endif
+
+# The mkcode.py script relies on finding some json output based off the MPI
+# standard
+constructed_wrapper_done.h: mkcode.py mpi_calls.json
+	$(here_src)/mkcode.py
+
+clean-local:
+	rm -f constructed_wrapper_*.h
New file
Lines changed: 114 additions & 0 deletions

@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2019-2020 IBM Corporation. All rights reserved.
+ * $COPYRIGHT$
+ */
+
+#include <mpi.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+/* intended for use with pingpong or similar programs
+ * look at time between a send and the next recv
+ *
+ * bin i goes up to (2^(i/8 + 2)) * (8 + i%8)/8 ns
+ * so every 80 bins represent 1024x, so by 500 bins the times are huge
+ */
+
+#define NBINS 500
+
+long long ns[NBINS];
+int bucket[NBINS];
+double tprev = -1;
+double tnow;
+
+void ptable();
+
+int
+MPI_Init(int *argc, char **argv[]) {
+    int rv, i;
+    rv = PMPI_Init(argc, argv);
+    for (i=0; i<NBINS; ++i) {
+        ns[i] = (1<<(i/8 + 2)) * (8 + i%8) / 8;
+        bucket[i] = 0;
+    }
+    return(rv);
+}
+
+int
+MPI_Finalize() {
+    int i, myrank, nranks;
+
+    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
+    for (i=0; i<nranks; ++i) {
+        MPI_Barrier(MPI_COMM_WORLD);
+        if (myrank == i) {
+            printf("========[R%d]======================================\n",
+                myrank);
+            ptable();
+            fflush(stdout);
+        }
+        sleep(1);
+        MPI_Barrier(MPI_COMM_WORLD);
+    }
+
+    return PMPI_Finalize();
+}
+
+int
+MPI_Send(const void *buf, int count, MPI_Datatype dt, int peer, int tag,
+    MPI_Comm comm)
+{
+    tprev = MPI_Wtime();
+    return PMPI_Send(buf, count, dt, peer, tag, comm);
+}
+
+int
+MPI_Recv(void *buf, int count, MPI_Datatype dt, int peer, int tag,
+    MPI_Comm comm, MPI_Status *status)
+{
+    int rv, i;
+    double t;
+    rv = PMPI_Recv(buf, count, dt, peer, tag, comm, status);
+    tnow = MPI_Wtime();
+    t = (tnow - tprev) * 1000000000;
+    i = 0;
+    while (i<NBINS && t>ns[i]) { ++i; }
+    ++bucket[i];
+    return rv;
+}
+
+void
+ptable() {
+    int i, startidx, endidx, xsize, maxbucket, nx, j;
+
+    startidx = 0; /* incr to first non-0 bucket */
+    endidx = 0; /* becomes last non-0 bucket */
+    maxbucket = 0;
+
+    for (i=0; i<NBINS; ++i) {
+        if (startidx == 0 && bucket[i]) { startidx = i; }
+        if (bucket[i]) { endidx = i; }
+        if (bucket[i] > maxbucket) { maxbucket = bucket[i]; }
+    }
+    xsize = maxbucket / 68;
+    if (xsize == 0) { xsize = 1; }
+
+    for (i=startidx; i<endidx; ++i) {
+        nx = bucket[i] / xsize + 1;
+
+        if (ns[i] < 10000) { printf("%4dn ", ns[i]); }
+        else if (ns[i]/1024 < 10000) { printf("%4du ", ns[i]/1024); }
+        else if (ns[i]/1024/1024 < 10000) {
+            printf("%4dm ", ns[i]/1024/1024);
+        }
+        else {
+            printf("%4d ", ns[i]/1024/1024/1024);
+        }
+        for (j=0; j<nx; ++j) {
+            printf("x");
+        }
+        printf("\n");
+    }
+}
