Commit 3711a00

-mca tools_entry <libs> for MPI profiling interface layering (and dynamic wrapper interception)
I'll start with a brief outline of the code changes, then a description of
what this feature is all about.

* scripts normalize_mpih.pl and mkcode.pl: parse mpi.h.in to figure out
  prototypes for all the MPI functions and construct wrapper functions for
  them all
* various mpicc-wrapper-data.txt.in etc files: these all get a couple of new
  lines
      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so
  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI
  app would now have a "libmpiprofilesupport.so" in front of libmpi.so etc.
* Makefile changes: build a new libmpiprofilesupport.so in two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so                 [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so  [real]
* schizo component entry: look up the -mca tools_entry setting and use the
  ORTE_JOB_PREPEND_ENVAR attribute to get LD_LIBRARY_PATH and possibly
  LD_PRELOAD modified to support the specified profiling libraries

Next I'll give a pretty long description of what profile interface layering
is about. Although to be honest, I think the main part of this feature that
people actually use is only the dynamic part. Eg at runtime this lets you
just add "-mca tools_entry ./libmpe.so" for example, without relinking your
app, and it would run as if you had linked with -lmpe.

------------------------------------------------------------------------
MPI Profiling Interface Layering

The MPI standard defines a PMPI profiling interface, but in its normal usage
only one profiling library can be used at a time. A profiling wrapper library
defines some subset of MPI_* entrypoints, and inside those redefinitions, it
calls some combination of MPI_* and PMPI_* symbols. For example

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;
        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime() {
        // insert hypothetical high-resolution replacement here, for example
    }

In trying to use two unrelated wrapper libraries at the same time, it's in
general not possible to link them in such a way that proper layering occurs.
For example, consider two wrapper libraries:

1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every
   MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI
   collectives, to gather statistics about how evenly the ranks enter the
   collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper
libraries, and from there the wrapper library's call to PMPI_* would resolve
into the bottom-level libmpi.so. Only one of the libraries would have its
MPI_Init and MPI_Finalize routines called.
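As a concrete (purely hypothetical) illustration of that collision, the two
wrappers might look roughly like this; the names and bodies are made up for
this example and are not part of this commit:

    /* libJobLog.c -> libJobLog.so : hypothetical job logger */
    #include <mpi.h>
    #include <stdio.h>
    static double t_start;
    int MPI_Init(int *argc, char ***argv) {
        int rv = PMPI_Init(argc, argv);
        t_start = MPI_Wtime();
        return rv;
    }
    int MPI_Finalize(void) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) printf("job ran %.1f seconds\n", MPI_Wtime() - t_start);
        return PMPI_Finalize();
    }

    /* libCollPerf.c -> libCollPerf.so : hypothetical collective profiler */
    #include <mpi.h>
    static long long nbarriers;
    int MPI_Init(int *argc, char ***argv) { nbarriers = 0; return PMPI_Init(argc, argv); }
    int MPI_Barrier(MPI_Comm comm) { ++nbarriers; return PMPI_Barrier(comm); }
    /* ... its MPI_Finalize would report nbarriers; other collectives similar ... */

With ordinary linking, whichever of the two appears first on the link line
gets its MPI_Init and MPI_Finalize called; the other library's MPI_Init is
never reached, so (in this example) its counters are never initialized or
reported.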
----------------------------------------------------------------
Defining consistent layering behavior:

With dynamically loaded symbols, it is possible to design a consistent
approach to layering for an arbitrary number of wrapper libraries.

When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two
possibilities for what MPI calls it can make: it can call another MPI_*
entry, or it can call a PMPI_* entry. In the case of ordinary single-level
wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then
"libmpi.so" if not found, and the calls to PMPI_* would resolve to
"libmpi.so". In the case of multi-level wrapping, the equivalent behavior is
for MPI_* to resolve to the current level, and PMPI_* to resolve to the next
level down in the list of libraries.

We would set up an array of handles

    void *lib[MAX_LEVELS];

containing the dlopened handles for all the libraries, eg

    lib[0] = dlopen("libwrap3.so", ...);
    lib[1] = dlopen("libwrap2.so", ...);
    lib[2] = dlopen("libwrap1.so", ...);
    lib[3] = dlopen("libmpi.so", ...);

For each MPI function an array of function pointers would exist, one for each
library:

    int (*(fptr_MPI_Send[MAX_LEVELS]))();

Some of the entries would be null, but the bottom level should always be
defined. And a thread-local global would define the calldepth for the thread,
initially 0. The calldepth needs to be thread-local data so the
implementation can be thread-safe, since different threads would have
unrelated call stacks.

Model implementation of MPI_Send and PMPI_Send:

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag,
                 MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev;  // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (!fptr_MPI_Send[nextlev] && nextlev<nwrapper_levels) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }

If code like the above is in a library called libmpiprofilesupport.so and an
MPI app is linked against that library, then an example sequence of events
might be

- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, it resolves into libmpiprofilesupport.so
- above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so

And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in
libmpiprofilesupport.so, and that library decides which function pointer to
go into next.

----------------------------------------------------------------
Fortran

I believe Open MPI makes the profiling design choice that wrapping the
C-language MPI_Send() automatically results in Fortran mpi_send() being
wrapped. The "calldepth" behavior from above isn't a perfect match for this
situation.

For comparison, the traditional behavior for an application linked against a
single libwrap.so would go as follows. An "ldd" on the application might show
dependencies

    libwrap.so
    libmpi_mpifh.so
    libmpi.so

and an application call to Fortran mpi_send() would find the first definition
in libmpi_mpifh.so.0, where it would call MPI_Send(), which would go back up
to libwrap.so, which might call PMPI_Send(), which would then go down to
libmpi.so. The linker/loader starts at the top in its search for each symbol.

The layered profiling approach with a "calldepth" always goes down the list
of libraries. If Fortran is at the bottom of the list, those symbols would
end up not being intercepted. I think the easiest way to maintain the desired
behavior is to re-order the libraries in the dynamic case as

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so", RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so", RTLD_NOW|RTLD_GLOBAL);

so the Fortran wrapper is always first, and libmpi.so is always last, with
all the wrapper libraries in between.
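For reference, here is a minimal hand-written sketch of how the handle list,
one per-function pointer table, and the thread-local calldepth described
above could be set up. This is NOT the code generated by mkcode.pl; the
library names, MAX_LEVELS, and helper names are assumptions, and a real
implementation also has to keep dlsym() from picking a symbol out of a
library's dependencies, which is skipped here:

    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LEVELS 16

    static void *lib[MAX_LEVELS];                /* dlopen handles, top to bottom   */
    static int (*fptr_MPI_Send[MAX_LEVELS])();   /* per-level MPI_Send, may be NULL */
    static int nwrapper_levels;
    static pthread_key_t depthkey;               /* per-thread calldepth            */

    static void setup_levels(void)
    {
        /* fortran wrappers first, user wrappers in the middle, base MPI last */
        const char *names[] = { "libmpi_mpifh.so", "libwrap.so", "libmpi.so" };
        int i;

        nwrapper_levels = 3;
        pthread_key_create(&depthkey, free);
        for (i = 0; i < nwrapper_levels; ++i) {
            lib[i] = dlopen(names[i], RTLD_NOW | RTLD_GLOBAL);
            if (!lib[i]) { fprintf(stderr, "dlopen %s: %s\n", names[i], dlerror()); exit(1); }
            fptr_MPI_Send[i] = (int (*)()) dlsym(lib[i], "MPI_Send");
        }
        if (!fptr_MPI_Send[nwrapper_levels - 1]) {
            fprintf(stderr, "bottom level must define MPI_Send\n");
            exit(1);
        }
    }

    static int *get_calldepth(void)   /* lazily give each thread a depth of 0 */
    {
        int *d = (int *) pthread_getspecific(depthkey);
        if (!d) {
            d = (int *) calloc(1, sizeof(int));
            pthread_setspecific(depthkey, d);
        }
        return d;
    }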
Internally the -mca tools_entry feature produces a list of entrypoint levels
with three types:

1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries, libmpi.so and
   libmpi_mpifh.so
3. just the fortran calls from the base MPI implementation

and allows "fort" to be manually included in the list of libraries. So if one
had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the
syntax

    -mca tools_entry fort,libwrap.so

would produce a list of entry levels as

    level[0] = fortran symbols from libmpi_mpifh.so
    level[1] = libwrap.so
    level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so

and if an application called fortran mpi_send, the call sequence would be

- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so

So including the OMPI fortran wrappers in the list in front of libwrap.so
enables automatic wrapping of mpi_send by only redefining MPI_Send. If this
behavior is not desired, it's possible to use the syntax

    -mca tools_entry libwrap.so,fort

which puts the fortran symbols at the bottom of the list where all the base
MPI symbols are.

----------------------------------------------------------------
Performance

On a machine where OMPI pingpong takes 0.22 usec, the -mca tools_entry option
slowed the pingpong to 0.24 usec. This is enough of an impact that I wouldn't
consider putting the code from libmpiprofilesupport.so into the main
libmpi.so library. But as long as the code is isolated in its own
libmpiprofilesupport.so and LD_LIBRARY_PATH is used to point at a stub
library, there is no performance impact when the feature isn't activated.

----------------------------------------------------------------
Weaknesses

I think the main weakness is that the above design only handles MPI_* calls.
If a wrapper library wanted to intercept both MPI_* calls and all
malloc()/free() calls for example, the above would result in the
malloc()/free() not being intercepted (the app isn't linked against
libwrap.so, it's only linked against libmpiprofilesupport.so, which has a
dlopen() of libwrap.so). This commit includes a "preload" feature (that can
be included in the -mca tools_entry list) that would be needed with such
libraries. Then the malloc/free/etc from libwrap.so would be used. There
wouldn't be any kind of layering for the non-MPI symbols if multiple
libraries were trying to intercept malloc/free, for example.

Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument
list is impossible to maintain as far as I know. The proper behavior would be
that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call
MPI_Pcontrol with those arguments in every wrapper library in the list. But
there's no way in stdarg.h to extract and re-call with an unknown list of
arguments. The best we can do is call MPI_Pcontrol(level) at each library.
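To make that MPI_Pcontrol limitation concrete, here is a sketch of roughly
what the layered entrypoint can do. It follows the MPI_Send model above;
fptr_MPI_Pcontrol, nwrapper_levels, and depthkey are the same assumed
bookkeeping, not actual names from this commit. The caller's extra arguments
are simply dropped because stdarg gives no portable way to re-issue the call
with them:

    #include <mpi.h>
    #include <pthread.h>

    extern int nwrapper_levels;           /* same bookkeeping as the MPI_Send model */
    extern pthread_key_t depthkey;
    extern int (*fptr_MPI_Pcontrol[])();

    int MPI_Pcontrol(const int level, ...)
    {
        int rval, entrylev, nextlev;
        int *calldepth = (int *) pthread_getspecific(depthkey);

        entrylev = *calldepth;
        nextlev = entrylev;
        while (nextlev < nwrapper_levels && !fptr_MPI_Pcontrol[nextlev]) { ++nextlev; }
        if (nextlev >= nwrapper_levels) { return MPI_SUCCESS; }  /* nothing below defines it */

        *calldepth = nextlev;
        rval = fptr_MPI_Pcontrol[nextlev](level);  /* only "level" survives; the "..." is lost */
        *calldepth = entrylev;
        return rval;
    }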
----------------------------------------------------------------
Documentation for this feature:

[Dynamic MPI Profiling interface with layering]

The MCA option -mca tools_entry <list> can be used to enable interception
from profiling libraries at runtime. In traditional MPI usage a wrapper
library "libwrap.so" can be built that redefines selected MPI calls, and
using that library would require relinking the application with -lwrap to
enable the interception. But this feature allows such interception to be
enabled at runtime, without relinking, by using the mpirun option

    -mca tools_entry libwrap.so

The -mca tools_entry feature can be used as any of

    -mca tools_entry /path/to/libwrap.so
    -mca tools_entry libwrap.so      (if the library will be found via LD_LIBRARY_PATH)
    -mca tools_entry wrap            (shortcut for libwrap.so)

Note that this feature doesn't automatically set LD_LIBRARY_PATH so that
"libwrap.so" can be found. That could be done by using the additional option

    -x OMPI_LD_LIBRARY_PATH_PREPEND=<dir>:<dir>:...

To layer multiple libraries, a comma separated list can be used:

    -mca tools_entry libwrap1.so,libwrap2.so

A few other keywords can be added into the -mca tools_entry list:

    -mca tools_entry v       : verbose option, list the opened libraries
    -mca tools_entry preload : cause the wrapper libraries to be added to an
                               LD_PRELOAD. This could be needed if a library
                               redefines non-MPI symbols
    -mca tools_entry fortran : a layer that minimally wraps the Fortran MPI
                               calls on top of the C calls, eg defining
                               mpi_foo and calling PMPI_Foo
    -mca tools_entry fort    : short for fortran above

By default the Fortran wrappers are placed at the top of the list and the
base product is always placed at the bottom, so

    -mca tools_entry libwrap.so

would be equivalent to

    -mca tools_entry fortran,libwrap.so

and would produce a layering as

    level[0] = fortran symbols defining mpi_foo and calling PMPI_Foo
    level[1] = libwrap.so
    level[2] = MPI from the base product

In this way if libwrap.so defined MPI_Send and an application used Fortran
mpi_send, the MPI_Send call in libwrap.so would be triggered. If that
behavior is not desired, the fortran wrappers can be essentially disabled by
moving them to the bottom of the list, eg

    -mca tools_entry libwrap.so,fortran

----------------------------------------------------------------

Signed-off-by: Mark Allen <[email protected]>
1 parent db52da4 commit 3711a00

26 files changed: +1629 −6 lines changed

config/ompi_config_files.m4

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@
 # Copyright (c) 2018 Los Alamos National Security, LLC. All rights
 # reserved.
 # Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
+# Copyright (c) 2019 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -56,5 +57,7 @@ AC_DEFUN([OMPI_CONFIG_FILES],[
     ompi/tools/wrappers/ompi-fort.pc
     ompi/tools/wrappers/mpijavac.pl
     ompi/tools/mpisync/Makefile
+    ompi/tools/pmpi_dynamic_layering/stub/Makefile
+    ompi/tools/pmpi_dynamic_layering/real/Makefile
     ])
 ])

ompi/tools/Makefile.am

Lines changed: 5 additions & 0 deletions
@@ -11,6 +11,7 @@
 # Copyright (c) 2004-2005 The Regents of the University of California.
 # All rights reserved.
 # Copyright (c) 2014 Intel, Inc. All rights reserved.
+# Copyright (c) 2019 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -23,9 +24,13 @@
 SUBDIRS += \
     tools/ompi_info \
     tools/wrappers \
+    tools/pmpi_dynamic_layering/stub \
+    tools/pmpi_dynamic_layering/real \
     tools/mpisync

 DIST_SUBDIRS += \
     tools/ompi_info \
     tools/wrappers \
+    tools/pmpi_dynamic_layering/stub \
+    tools/pmpi_dynamic_layering/real \
     tools/mpisync
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#
+# Copyright (c) 2016 IBM
+#
+# Copyright (c) 2019 IBM Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# $HEADER$
+#
+
+include $(top_srcdir)/Makefile.ompi-rules
+
+psupportdir = $(libdir)/profilesupport
+psupport_LTLIBRARIES = libmpiprofilesupport.la
+here_src = $(top_srcdir)/ompi/tools/pmpi_dynamic_layering/real
+
+libmpiprofilesupport_la_SOURCES = wrap.c constructed_wrapper_done.h
+#libmpiprofilesupport_la_LDFLAGS= -avoid-version
+libmpiprofilesupport_la_LDFLAGS = -version-info $(libmpi_so_version)
+
+nodist_prog_SOURCES = constructed_wrapper_done.h
+BUILT_SOURCES = constructed_wrapper_done.h
+
+OMPI_OMIT_MPI1_COMPAT_DECLS=1
+if OMPI_ENABLE_MPI1_COMPAT
+OMPI_OMIT_MPI1_COMPAT_DECLS=0
+endif
+
+constructed_wrapper_done.h: normalize_mpih.pl mkcode.pl $(top_srcdir)/ompi/include/mpi.h.in
+	rm -f cppoutput flist && \
+	cpp -DOMPI_OMIT_MPI1_COMPAT_DECLS=$(OMPI_OMIT_MPI1_COMPAT_DECLS) \
+	    $(top_srcdir)/ompi/include/mpi.h.in > cppoutput && \
+	perl $(here_src)/normalize_mpih.pl $(top_srcdir)/ompi/include/mpi.h.in cppoutput > flist && \
+	perl $(here_src)/mkcode.pl
+
+clean-local:
+	rm -f constructed_wrapper_*.h
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+/*
+ * Copyright (c) 2019 IBM Corporation. All rights reserved.
+ * $COPYRIGHT$
+ */
+
+#include <mpi.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+/* intended for use with pingpong or similar programs
+ * look at time between a send and the next recv
+ *
+ * bin i goes up to (2^(i/8 + 2)) * (8 + i%8)/8 ns
+ * so every 80 bins represent 1024x, so by 500 bins the times are huge
+ */
+
+#define NBINS 500
+
+long long ns[NBINS];
+int bucket[NBINS];
+double tprev = -1;
+double tnow;
+
+void ptable();
+
+int
+MPI_Init(int *argc, char **argv[]) {
+    int rv, i;
+    rv = PMPI_Init(argc, argv);
+    for (i=0; i<NBINS; ++i) {
+        ns[i] = (1<<(i/8 + 2)) * (8 + i%8) / 8;
+        bucket[i] = 0;
+    }
+    return(rv);
+}
+
+int
+MPI_Finalize() {
+    int i, myrank, nranks;
+
+    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
+    for (i=0; i<nranks; ++i) {
+        MPI_Barrier(MPI_COMM_WORLD);
+        if (myrank == i) {
+            printf("========[R%d]======================================\n",
+                myrank);
+            ptable();
+            fflush(stdout);
+        }
+        sleep(1);
+        MPI_Barrier(MPI_COMM_WORLD);
+    }
+
+    return PMPI_Finalize();
+}
+
+int
+MPI_Send(const void *buf, int count, MPI_Datatype dt, int peer, int tag,
+    MPI_Comm comm)
+{
+    tprev = MPI_Wtime();
+    return PMPI_Send(buf, count, dt, peer, tag, comm);
+}
+
+int
+MPI_Recv(void *buf, int count, MPI_Datatype dt, int peer, int tag,
+    MPI_Comm comm, MPI_Status *status)
+{
+    int rv, i;
+    double t;
+    rv = PMPI_Recv(buf, count, dt, peer, tag, comm, status);
+    tnow = MPI_Wtime();
+    t = (tnow - tprev) * 1000000000;
+    i = 0;
+    while (i<NBINS && t>ns[i]) { ++i; }
+    ++bucket[i];
+    return rv;
+}
+
+void
+ptable() {
+    int i, startidx, endidx, xsize, maxbucket, nx, j;
+
+    startidx = 0; /* incr to first non-0 bucket */
+    endidx = 0; /* becomes last non-0 bucket */
+    maxbucket = 0;
+
+    for (i=0; i<NBINS; ++i) {
+        if (startidx == 0 && bucket[i]) { startidx = i; }
+        if (bucket[i]) { endidx = i; }
+        if (bucket[i] > maxbucket) { maxbucket = bucket[i]; }
+    }
+    xsize = maxbucket / 68;
+    if (xsize == 0) { xsize = 1; }
+
+    for (i=startidx; i<endidx; ++i) {
+        nx = bucket[i] / xsize + 1;
+
+        if (ns[i] < 10000) { printf("%4dn ", ns[i]); }
+        else if (ns[i]/1024 < 10000) { printf("%4du ", ns[i]/1024); }
+        else if (ns[i]/1024/1024 < 10000) {
+            printf("%4dm ", ns[i]/1024/1024);
+        }
+        else {
+            printf("%4d ", ns[i]/1024/1024/1024);
+        }
+        for (j=0; j<nx; ++j) {
+            printf("x");
+        }
+        printf("\n");
+    }
+}
