Nested parallelism in OpenMP


We use nested parallelism in multi-event clustering to exploit the hardware compute capability of multicore architectures. The most intuitive way to implement nested parallelism in an OpenMP code is simply to nest two (or more) OpenMP parallel regions. In the multi-event clustering case, the outer parallel region spans events and the inner parallel region spans the strips within a single event. Care must be taken to give each level of parallel regions the right number of threads with the proper thread affinity.
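A minimal sketch of this nesting pattern is shown below; the loop bounds and the cluster_strip routine are placeholders for illustration, not the actual clustering code:

```c
#include <omp.h>
#include <stdio.h>

/* Placeholder for the per-strip clustering work. */
static void cluster_strip(int event, int strip)
{
    printf("event %d strip %d on thread %d (nesting level %d)\n",
           event, strip, omp_get_thread_num(), omp_get_level());
}

int main(void)
{
    const int n_events = 4;   /* illustrative sizes, not the real workload */
    const int n_strips = 8;

    /* Outer parallel region: the team is spread across events. */
    #pragma omp parallel for
    for (int e = 0; e < n_events; ++e) {
        /* Inner parallel region: a second team works on the strips of this
           event. Without nested parallelism enabled, this region runs with
           a single thread. */
        #pragma omp parallel for
        for (int s = 0; s < n_strips; ++s)
            cluster_strip(e, s);
    }
    return 0;
}
```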

By default, OpenMP turns off nested parallelism: only a single thread is used whenever a parallel region is encountered inside another parallel region. Nested parallelism can be turned on at run time either by calling omp_set_nested() in the code or by setting the environment variable OMP_NESTED=TRUE in the job environment. Similarly, the number of threads at each level can be controlled at run time with the num_threads() clause or through the environment variable OMP_NUM_THREADS=a,b (where the thread counts for the two levels are separated by a comma).

The Intel compiler supports an experimental feature known as "hot teams", which reduces the overhead of spawning threads for the innermost loop of a nested region by keeping a pool of threads alive (but idle) while the non-nested parallel code executes. Hot teams are controlled by two environment variables: KMP_HOT_TEAMS_MODE and KMP_HOT_TEAMS_MAX_LEVEL. To keep unused team members alive during non-nested regions, we set KMP_HOT_TEAMS_MODE=1. Since we have at most two levels in our case, we set KMP_HOT_TEAMS_MAX_LEVEL=2.

For nested parallelism, it is also important to place threads with the proper affinity. Thread affinity can be enforced at run time with the proc_bind clause, and OpenMP 4.0 also provides two environment variables for handling thread placement. To place team leaders on separate cores far apart and the members of each team close together, we set OMP_PROC_BIND=spread,close and OMP_PLACES=cores. Note that the OpenMP clauses and runtime functions (e.g., omp_set_nested, num_threads, proc_bind) take precedence over the corresponding environment variables (OMP_NESTED, OMP_NUM_THREADS, OMP_PROC_BIND, etc.).
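For illustration, the same controls can be applied directly in the source; the thread counts below (2 outer, 8 inner) and the cluster_strip prototype are placeholders rather than the settings used in the clustering code:

```c
#include <omp.h>

void cluster_strip(int event, int strip);   /* per-strip work, as in the sketch above */

void cluster_all_events_tuned(int n_events, int n_strips)
{
    omp_set_nested(1);   /* same effect as OMP_NESTED=TRUE, and overrides it */

    /* Outer team: 2 threads spread far apart (e.g. one per socket).
       The counts 2 and 8 are illustrative only. */
    #pragma omp parallel for num_threads(2) proc_bind(spread)
    for (int e = 0; e < n_events; ++e) {
        /* Inner team: 8 threads packed close to their team leader. */
        #pragma omp parallel for num_threads(8) proc_bind(close)
        for (int s = 0; s < n_strips; ++s)
            cluster_strip(e, s);
    }
}
```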

We show a performance comparison (running time in seconds) on a two-socket system with Intel Broadwell processors below:

Nested OpenMP Performance Comparison

Overall, we find that setting environment variables as below leads to the best performance:
export OMP_NESTED=TRUE
export OMP_NUM_THREADS=a,b
export OMP_PLACES=cores
export OMP_PROC_BIND=spread,close
export KMP_HOT_TEAMS_MODE=1 (Intel only)
export KMP_HOT_TEAMS_MAX_LEVEL=2 (Intel only)
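
As an illustration only, on a hypothetical two-socket node with 16 cores per socket one could map the outer level to sockets and the inner level to the cores of a socket, i.e. fill in a,b as:
export OMP_NUM_THREADS=2,16
with the other variables set exactly as listed above. The split that performs best depends on the machine and the workload.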

Setting OMP_DISPLAY_ENV=TRUE and KMP_AFFINITY=verbose (Intel only) is helpful for checking the thread placement.

References:

  1. https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/721790
  2. "Cosmic Microwave Background Analysis: Nested Parallelism in Practice", High Performance Parallelism Pearls, Volume 2, p. 178.