
Commit 5554c02

-mca prrte_tools_entry <libs> for PMPI layering (and dynamic wrapper interception)
I'll start with a brief outline of the code changes, then a description of what this feature is all about.

* scripts mkcode.py and all of the parse_mpi_standard/ directory: uses *.tex and *.py from the pythonization effort in the MPI standard to figure out prototypes for all the MPI functions, and constructs wrapper functions for them all

* various mpicc-wrapper-data.txt.in etc files: these all get a couple of new lines

      libs_profilesupport=-lmpiprofilesupport
      profilesupport_lib_file=libmpiprofilesupport.so

  and opal_wrapper.c puts this early in the link line, so an "ldd" of an MPI app would now show "libmpiprofilesupport.so" in front of libmpi.so etc.

* Makefile changes: build a new libmpiprofilesupport.so in two places:
  1. $MPI_ROOT/lib/libmpiprofilesupport.so [stub that defines nothing]
  2. $MPI_ROOT/lib/profilesupport/libmpiprofilesupport.so [real]

* schizo component entry: look up the -mca prrte_tools_entry setting and use the ORTE_JOB_PREPEND_ENVAR attribute to get LD_LIBRARY_PATH and possibly LD_PRELOAD modified to support the specified profiling libraries

Next I'll give a fairly long description of what profile interface layering is about. Although to be honest, I think the main part of this feature that people actually use is only the dynamic part. E.g., at runtime this lets you just add "-mca prrte_tools_entry ./libmpe.so" without relinking your app, and it would run as if you had linked with -lmpe.

------------------------------------------------------------------------
MPI Profiling Interface Layering

The MPI standard defines a PMPI profiling interface, but in its normal usage only one profiling library can be used at a time. A profiling wrapper library defines some subset of MPI_* entrypoints, and inside those redefinitions it calls some combination of MPI_* and PMPI_* symbols. For example

    int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                      void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
    {
        int rval;
        double t1, t2, t3;

        t1 = MPI_Wtime();
        MPI_Barrier(comm);
        t2 = MPI_Wtime();
        rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
        t3 = MPI_Wtime();
        // record time waiting vs time spent in allgather..
        return(rval);
    }

    double MPI_Wtime()
    {
        // insert hypothetical high-resolution replacement here, for example
    }

In trying to use two unrelated wrapper libraries at the same time, it's in general not possible to link them in such a way that proper layering occurs. For example, consider two wrapper libraries:

1. "libJobLog.so" wrapping MPI_Init and MPI_Finalize to keep a log of every MPI job, listing hosts, runtimes, and cpu times.
2. "libCollPerf.so" wrapping MPI_Init and MPI_Finalize, and all the MPI collectives, to gather statistics about how evenly the ranks enter the collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there the wrapper library's call to PMPI_* would resolve into the bottom-level libmpi.so. Only one of the libraries would have its MPI_Init and MPI_Finalize routines called.
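As a purely hypothetical illustration (not code from this commit), the first of those libraries might look something like the following; the timing and logging details are made up:

    /* hypothetical libJobLog.so: wraps only MPI_Init and MPI_Finalize,
     * doing the real work through the PMPI_ entry points */
    #include <mpi.h>
    #include <stdio.h>

    static double joblog_t0;

    int MPI_Init(int *argc, char ***argv)
    {
        int rval = PMPI_Init(argc, argv);
        joblog_t0 = PMPI_Wtime();
        return rval;
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (0 == rank) {
            fprintf(stderr, "[joblog] elapsed %.1f seconds\n",
                    PMPI_Wtime() - joblog_t0);
        }
        return PMPI_Finalize();
    }

It might be built with something like "mpicc -shared -fPIC -o libJobLog.so joblog.c". libCollPerf.so would look similar but would also wrap the collectives, and with plain linking only one of the two MPI_Init/MPI_Finalize definitions would ever be reached.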
----------------------------------------------------------------
Defining consistent layering behavior:

With dynamically loaded symbols, it is possible to design a consistent approach to layering for an arbitrary number of wrapper libraries. When a wrapper library "libwrap.so" redefines an MPI_* symbol, there are two possibilities for what MPI calls it can make: it can call another MPI_* entry, or it can call a PMPI_* entry.

In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into "libwrap.so" first and then "libmpi.so" if not found, and the calls to PMPI_* would resolve to "libmpi.so". In the case of multi-level wrapping, the equivalent behavior is for MPI_* to resolve to the current level, and PMPI_* to resolve to the next level down in the list of libraries.

We would set up an array of handles

    void *lib[MAX_LEVELS];

containing the dlopened handles for all the libraries, eg

    lib[0] = dlopen("libwrap1.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap2.so", RTLD_NOW|RTLD_GLOBAL);
    lib[2] = dlopen("libwrap3.so", RTLD_NOW|RTLD_GLOBAL);
    lib[3] = dlopen("libmpi.so",   RTLD_NOW|RTLD_GLOBAL);

For each MPI function an array of function pointers would exist, one for each library:

    int (*(fptr_MPI_Send[MAX_LEVELS]))();

Some of the entries would be null, but the bottom level should always be defined. And a thread-local global would define the calldepth for the thread, initially 0. The calldepth needs to be thread-local data so the implementation can be thread-safe, since different threads would have unrelated call stacks.

Model implementation of MPI_Send and PMPI_Send:

    int MPI_Send(void* buf, int count, MPI_Datatype dt, int to, int tag,
                 MPI_Comm comm)
    {
        int rval;
        int entrylev, nextlev;
        int *calldepth;

        calldepth = (int*) pthread_getspecific(depthkey); // thread-local global
        entrylev = *calldepth;
        nextlev = entrylev;     // (or nextlev = entrylev + 1 for PMPI_Send)

        if (nextlev >= nwrapper_levels) { --nextlev; }
        while (!fptr_MPI_Send[nextlev] && nextlev < nwrapper_levels) { ++nextlev; }
        if (nextlev >= nwrapper_levels) {
            printf("Fatal Error: unable to find symbol at level>=%d for %s\n",
                   entrylev, "MPI_Send");
            exit(1);
        }

        *calldepth = nextlev;
        rval = fptr_MPI_Send[nextlev](buf, count, dt, to, tag, comm);
        *calldepth = entrylev;

        return(rval);
    }

If code like the above is in a library called libmpiprofilesupport.so and an MPI app is linked against that library, then an example sequence of events might be

- app.x calls MPI_Send, the linker resolves it in libmpiprofilesupport.so
- the above code sees calldepth=0, calls fptr_MPI_Send[0] == MPI_Send in libwrap1.so
- libwrap1.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- the above code increments calldepth to 1, calls fptr_MPI_Send[1] == MPI_Send in libwrap2.so

And so on. At each MPI/PMPI call, the linker/loader resolves to the symbol in libmpiprofilesupport.so, and that library decides which function pointer to go into next.
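For completeness, here is a minimal sketch of how the lib[] handles, the fptr_MPI_Send[] table, and the thread-local calldepth used above might be set up. This is illustrative only and is not the committed libmpiprofilesupport.so code; the names setup_levels and get_calldepth, the MAX_LEVELS value, and the lazy per-thread allocation are assumptions:

    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LEVELS 16

    static void *lib[MAX_LEVELS];               /* dlopen handles, top level first */
    static int (*fptr_MPI_Send[MAX_LEVELS])();  /* per-level MPI_Send, may be NULL */
    static int nwrapper_levels;
    static pthread_key_t depthkey;              /* key for the thread-local calldepth */

    static void setup_levels(char **libnames, int nlibs)
    {
        int i;

        nwrapper_levels = nlibs;
        for (i = 0; i < nlibs; ++i) {
            /* open each level in order: top wrapper first, libmpi.so last */
            lib[i] = dlopen(libnames[i], RTLD_NOW | RTLD_GLOBAL);
            if (NULL == lib[i]) {
                fprintf(stderr, "dlopen(%s) failed: %s\n", libnames[i], dlerror());
                exit(1);
            }
            /* a NULL entry is fine; the model MPI_Send above skips levels
               that don't define the symbol */
            fptr_MPI_Send[i] = (int (*)()) dlsym(lib[i], "MPI_Send");
        }
        pthread_key_create(&depthkey, free);
    }

    /* the thread-local calldepth, created lazily and starting at 0 per thread */
    static int *get_calldepth(void)
    {
        int *d = (int *) pthread_getspecific(depthkey);
        if (NULL == d) {
            d = calloc(1, sizeof(int));
            pthread_setspecific(depthkey, d);
        }
        return d;
    }

A dlsym() returning NULL for some level is expected; the dispatch loop in the model MPI_Send simply skips those levels until it finds one that defines the symbol.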
----------------------------------------------------------------
Fortran

I believe Open MPI makes the profiling design choice that wrapping the C-language MPI_Send() automatically results in Fortran mpi_send() being wrapped. The "calldepth" behavior from above isn't a perfect match for this situation.

For comparison, the traditional behavior for an application linked against a single libwrap.so would go as follows. An "ldd" on the application might show the dependencies

    libwrap.so
    libmpi_mpifh.so
    libmpi.so

and an application call to Fortran mpi_send() would find the first definition in libmpi_mpifh.so, where it would call MPI_Send(), which would go back up to libwrap.so, which might call PMPI_Send(), which would then go down to libmpi.so. The linker/loader starts at the top in its search for each symbol.

The layered profiling approach with a "calldepth" always goes down the list of libraries. If Fortran is at the bottom of the list, those symbols would end up not being intercepted.

I think the easiest way to maintain the desired behavior is to re-order the libraries in the dynamic case as

    lib[0] = dlopen("libmpi_mpifh.so", RTLD_NOW|RTLD_GLOBAL);
    lib[1] = dlopen("libwrap.so",      RTLD_NOW|RTLD_GLOBAL);
    .. other wrapper libraries ..
    lib[n] = dlopen("libmpi.so",       RTLD_NOW|RTLD_GLOBAL);

so the Fortran wrapper is always first and libmpi.so is always last, with all the wrapper libraries in between.

Internally the -mca prrte_tools_entry feature produces a list of entrypoint levels with three types:

1. wrapper libraries like libwrap.so
2. the base MPI implementation, which is two libraries: libmpi.so and libmpi_mpifh.so
3. just the Fortran calls from the base MPI implementation

and it allows "fort" to be manually included in the list of libraries. So if one had a library libwrap.so that defined MPI_Send and called PMPI_Send, then the syntax

    -mca prrte_tools_entry fort,libwrap.so

would produce a list of entry levels as

    level[0] = fortran symbols from libmpi_mpifh.so
    level[1] = libwrap.so
    level[2] = all the base MPI symbols from libmpi.so and libmpi_mpifh.so

and if an application called Fortran mpi_send, the call sequence would be

- app.x calls mpi_send, the linker resolves it in libmpiprofilesupport.so
- calldepth=0, calls fptr_mpi_send[0] == mpi_send in libmpi_mpifh.so
- libmpi_mpifh.so's mpi_send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=1, calls fptr_MPI_Send[1] == MPI_Send in libwrap.so
- libwrap.so's MPI_Send calls PMPI_Send, which resolves into libmpiprofilesupport.so
- calldepth=2, calls fptr_MPI_Send[2] == MPI_Send in libmpi.so

So including the OMPI Fortran wrappers in the list in front of libwrap.so enables automatic wrapping of mpi_send by only redefining MPI_Send. If this behavior is not desired, it's possible to use the syntax

    -mca prrte_tools_entry libwrap.so,fort

which puts the Fortran symbols at the bottom of the list where all the base MPI symbols are.

----------------------------------------------------------------
Performance

On a machine where OMPI pingpong takes 0.22 usec, the -mca prrte_tools_entry option slowed the pingpong to 0.24 usec. That is enough of an impact that I wouldn't consider putting the code from libmpiprofilesupport.so into the main libmpi.so library. But as long as the code is isolated in its own libmpiprofilesupport.so, and LD_LIBRARY_PATH is used to point at a stub library, there is no performance impact when the feature isn't activated.

----------------------------------------------------------------
Weaknesses

I think the main weakness is that the above design only handles MPI_* calls. If a wrapper library wanted to intercept both MPI_* calls and all malloc()/free() calls, for example, the above would result in the malloc()/free() not being intercepted (the app isn't linked against libwrap.so; it's only linked against libmpiprofilesupport.so, which has a dlopen() of libwrap.so). This commit includes a "preload" feature (that can be included in the -mca prrte_tools_entry list) that would be needed with such libraries. Then the malloc/free/etc from libwrap.so would be used. There wouldn't be any kind of layering for the non-MPI symbols if multiple libraries were trying to intercept malloc/free, for example.

Another minor weakness is that MPI_Pcontrol(level, ...)'s variable argument list is impossible to maintain as far as I know. The proper behavior would be that if an application calls MPI_Pcontrol(1, "test", 3.14), we should call MPI_Pcontrol with those arguments in every wrapper library in the list. But there's no way in stdarg.h to extract and re-call with an unknown list of arguments. The best we can do is call MPI_Pcontrol(level) at each library.
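To make that compromise concrete, here is an illustrative sketch (again, not the committed code) of what a layered MPI_Pcontrol can realistically do, assuming a fptr_MPI_Pcontrol[] table filled the same way as fptr_MPI_Send[]:

    #include <mpi.h>

    /* assumed to exist, populated the same way as fptr_MPI_Send */
    extern int nwrapper_levels;
    extern int (*fptr_MPI_Pcontrol[])(const int, ...);

    int MPI_Pcontrol(const int level, ...)
    {
        int lev;
        int rval = MPI_SUCCESS;

        /* the variadic arguments after "level" cannot be re-packed and
         * re-passed, so each layer that defines MPI_Pcontrol is simply
         * called with the level alone */
        for (lev = 0; lev < nwrapper_levels; ++lev) {
            if (fptr_MPI_Pcontrol[lev]) {
                rval = fptr_MPI_Pcontrol[lev](level);
            }
        }
        return rval;
    }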
----------------------------------------------------------------
Documentation for this feature:

[Dynamic MPI Profiling interface with layering]

The MCA option -mca prrte_tools_entry <list> can be used to enable interception from profiling libraries at runtime. In traditional MPI usage a wrapper library "libwrap.so" can be built that redefines selected MPI calls, and using that library would require relinking the application with -lwrap to enable the interception. This feature allows such interception to be enabled at runtime, without relinking, by using the mpirun option

    -mca prrte_tools_entry libwrap.so

The -mca prrte_tools_entry feature can be used as any of

    -mca prrte_tools_entry /path/to/libwrap.so
    -mca prrte_tools_entry libwrap.so    (if the library will be found via LD_LIBRARY_PATH)
    -mca prrte_tools_entry wrap          (shortcut for libwrap.so)

Note that this feature doesn't automatically set LD_LIBRARY_PATH so that "libwrap.so" can be found. That can be done with the additional option

    -x OMPI_LD_LIBRARY_PATH_PREPEND=<dir>:<dir>:...

To layer multiple libraries, a comma-separated list can be used:

    -mca prrte_tools_entry libwrap1.so,libwrap2.so

A few other keywords can be added into the -mca prrte_tools_entry list:

    -mca prrte_tools_entry v        : verbose option, list the opened libraries
    -mca prrte_tools_entry preload  : cause the wrapper libraries to be added to an
                                      LD_PRELOAD. This could be needed if a library
                                      redefines non-MPI symbols
    -mca prrte_tools_entry fortran  : a layer that minimally wraps the Fortran MPI
                                      calls on top of the C calls, eg defining
                                      mpi_foo and calling PMPI_Foo
    -mca prrte_tools_entry fort     : short for fortran above

By default Fortran wrappers are placed at the top of the list and the base product is always placed at the bottom, so

    -mca prrte_tools_entry libwrap.so

would be equivalent to

    -mca prrte_tools_entry fortran,libwrap.so

and would produce a layering as

    level[0] = fortran symbols defining mpi_foo and calling PMPI_Foo
    level[1] = libwrap.so
    level[2] = MPI from the base product

In this way, if libwrap.so defined MPI_Send and an application used Fortran mpi_send, the MPI_Send call in libwrap.so would be triggered. If that behavior is not desired, the Fortran wrappers can be essentially disabled by moving them to the bottom of the list, eg

    -mca prrte_tools_entry libwrap.so,fortran

----------------------------------------------------------------
Signed-off-by: Mark Allen <[email protected]>
1 parent 3366f3e commit 5554c02

40 files changed: 15226 additions, 7 deletions

config/ompi_config_files.m4  (3 additions, 0 deletions)

@@ -6,6 +6,7 @@
 # Copyright (c) 2018 Los Alamos National Security, LLC. All rights
 # reserved.
 # Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
+# Copyright (c) 2020 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -58,5 +59,7 @@ AC_DEFUN([OMPI_CONFIG_FILES],[
 ompi/tools/wrappers/mpijavac.pl
 ompi/tools/mpisync/Makefile
 ompi/tools/mpirun/Makefile
+ompi/tools/pmpi_dynamic_layering/stub/Makefile
+ompi/tools/pmpi_dynamic_layering/real/Makefile
 ])
 ])

ompi/runtime/ompi_mpi_params.c  (21 additions, 0 deletions)

@@ -19,6 +19,7 @@
 * All rights reserved.
 * Copyright (c) 2016-2019 Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
+* Copyright (c) 2020 IBM Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
@@ -83,6 +84,8 @@ static bool show_file_mca_params = false;
 static bool show_enviro_mca_params = false;
 static bool show_override_mca_params = false;
 static bool ompi_mpi_oversubscribe = false;
+static char *entry_string = NULL;
+static char *entry_base_string = NULL;

 int ompi_mpi_register_params(void)
 {
@@ -340,6 +343,24 @@ int ompi_mpi_register_params(void)
                                  MCA_BASE_VAR_SCOPE_READONLY,
                                  &ompi_mpi_spc_dump_enabled);

+    /* -mca tools_entry <libraries-and-options> */
+    (void) mca_base_var_register("ompi", "tools", NULL, "entry",
+                                 "options for dynamic PMPI wrapper layering",
+                                 MCA_BASE_VAR_TYPE_STRING, NULL,
+                                 0, 0,
+                                 OPAL_INFO_LVL_3,
+                                 MCA_BASE_VAR_SCOPE_READONLY,
+                                 &entry_string);
+
+    /* -mca tools_entry_base <libraries> */
+    (void) mca_base_var_register("ompi", "tools", NULL, "entry_base",
+                                 "base libraries for dynamic PMPI wrapper layering",
+                                 MCA_BASE_VAR_TYPE_STRING, NULL,
+                                 0, 0,
+                                 OPAL_INFO_LVL_3,
+                                 MCA_BASE_VAR_SCOPE_READONLY,
+                                 &entry_base_string);
+
     return OMPI_SUCCESS;
 }

ompi/tools/Makefile.am  (5 additions, 1 deletion)

@@ -11,7 +11,7 @@
 # Copyright (c) 2004-2005 The Regents of the University of California.
 # All rights reserved.
 # Copyright (c) 2014 Intel, Inc. All rights reserved.
-# Copyright (c) 2020 IBM Corporation. All rights reserved.
+# Copyright (c) 2019-2020 IBM Corporation. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -25,10 +25,14 @@ SUBDIRS += \
 tools/mpirun \
 tools/ompi_info \
 tools/wrappers \
+tools/pmpi_dynamic_layering/stub \
+tools/pmpi_dynamic_layering/real \
 tools/mpisync

 DIST_SUBDIRS += \
 tools/mpirun \
 tools/ompi_info \
 tools/wrappers \
+tools/pmpi_dynamic_layering/stub \
+tools/pmpi_dynamic_layering/real \
 tools/mpisync
New file (38 additions, 0 deletions):

    A portion of the MPI standard is checked in here:
        binding-tool/*.py

    Those are the tools that make sense of blocks like
        function_name("MPI_Send")

        parameter(
            "buf", "BUFFER", desc="initial address of send buffer", constant=True
        )
        parameter(
            "count", "POLYXFER_NUM_ELEM", desc="number of elements in send buffer"
        )
        parameter(
            "datatype", "DATATYPE", desc="datatype of each send buffer element"
        )
        parameter("dest", "RANK", desc="rank of destination")
        parameter("tag", "TAG", desc="message tag")
        parameter("comm", "COMMUNICATOR")
    that are scattered throughout the standard's *.tex file.

    The *.tex files from the MPI standard are not checked in here, but an
    extraction tool and the result of running it are:
        extract.py
        generated_mpistandard_parsesets.py

    The "extract.py" tool does a find of *.tex and prints all the
    python blocks it finds to stdout.  The results in
    "generated_mpistandard_parsesets.py" are what happens if this is
    run in a check out of the MPI standard (with Jeff Squyres'
    pythonization checked in).

    A user of this MPI standard data can use calls like
        sys.path.append("path/to/parse_mpi_standard")
        from generated_mpistandard_parsets import \
            run_everything_from_the_standard, MPI_FN_DATA
        run_everything_from_the_standard()
        # At this point the dict MPI_FN_DATA is populated
        PARSESET = MPI_FN_DATA['MPI_Send']
New file (129 additions, 0 deletions):

    '''

    '''


    from typing import Mapping, MutableMapping
    import copy
    import logging
    from collections import defaultdict


    import common
    from userfuncs import *


    DEFAULT_ATTRIBUTES = {
        # validity attributes
        'lis_expressible': True,
        'c_expressible': True,
        'f08_expressible': True,
        'f90_expressible': True,

        # TODO
        'f90_use_colons': False,
        # weird index overloading
        'f90_index_overload': None,

        # callback attributes
        'callback': False,
        'predefined_function': None,

        # semantic attributes
        'deprecated': False,
        'execute_once': False,

        # control dependencies
        # 'needs_any': ['MPI_Init', 'MPI_Init_thread'],
        # 'needs_all': [],
        # 'leads_any': ['MPI_Finalize'],
        # 'leads_all': [],
    }


    DEFAULT_TEMPORARIES = {
        'has_ierror': True,

        # whether this binding is from a definition
        'reference': None,

        # which expressible languages should be rendered
        # at definition mpi-binding
        'renders': ['lis', 'c', 'f08', 'f90'],
    }


    def reset_parseset() -> None:
        """
        Resets the global variable PARSESET.
        """

        # global PARSESET
        # PARSESET = dict()
        PARSESET.clear()

        # name -> LIS/C/F08, name_f90 -> F90
        # these must be unique, otherwise you have big problems
        PARSESET['name'] = None
        PARSESET['name_f90'] = None

        # API attributes
        PARSESET['parameters'] = list()
        PARSESET['return_kind'] = None
        PARSESET['attributes'] = copy.deepcopy(DEFAULT_ATTRIBUTES)

        # not needed in apis.json
        PARSESET['temporaries'] = copy.deepcopy(DEFAULT_TEMPORARIES)


    def parsing_done() -> None:
        """
        This hook is run after all the parsing is done for a given binding.
        """

        if PARSESET['temporaries']['has_ierror']:
            parameter('ierror',
                      'ERROR_CODE',
                      optional=True,
                      direction='out',
                      suppress='c_parameter lis_parameter')

        # default to an INT return type
        if PARSESET['return_kind'] is None:
            # MR this is not strictly true, Fortran just doesn't have a return type
            # since they are subroutines
            return_type('ERROR_CODE')


    def execute_binding(binding: str, require_definition: bool = False) -> Mapping:
        '''
        Executes the content of a matched binding block.
        '''

        reset_parseset()

        logging.info("Rendering mpi-binding to LaTeX:")
        for line in binding.split('\n'):
            if line:
                logging.info('>> %s', line)

        # if definition required
        # then ignore bindings without function_name
        if require_definition and 'function_name' not in binding:
            return None

        # execute the lines read from the mpi-binding
        # pylint: disable=exec-used
        exec(common.remove_unexpected_indentation(binding))

        parsing_done()

        # write to dataset
        logging.debug('parsed %s', PARSESET)
        parseset = copy.deepcopy(PARSESET)

        # clear parsed information
        logging.debug('clearing PARSESET: %s',
                      str(PARSESET))

        return parseset
