|
3 | 3 | == Introduction |
4 | 4 |
|
5 | 5 | [.text-justify] |
6 | | -Kokkos' *hierarchical parallelism* is a paradigm that enables the exploitation of multiple levels of *shared-memory parallelism*, allowing developers to leverage increased parallelism in their computations for potential performance improvements. This framework supports various levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures. |
| 6 | +Kokkos' *hierarchical parallelism* is a paradigm that enables the exploitation of multiple levels of *shared-memory parallelism*, allowing developers to leverage increased parallelism in their computations for potential performance improvements. This framework supports various levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures [1][2][6]. |
7 | 7 |
|
8 | 8 | [.text-justify] |
9 | 9 | The paradigm employs a two-tiered approach: an outer level, often implemented using a league of teams, which divides the overall workload into larger chunks, and an inner level, typically comprising threads within a team, which focuses on finer-grained parallelism within these chunks. *Thread teams*, a fundamental concept in Kokkos, represent collections of threads that can synchronize and share a common scratch pad memory. |
10 | 10 |
|
11 | 11 |
|
12 | | -== hierarchical parallelism |
| 12 | +== Hierarchical parallelism |
13 | 13 |
|
14 | | -[.text-justify] |
15 | | -At the heart of Kokkos' *hierarchical parallelism* lies the ability to exploit multiple levels of *shared-memory parallelism*. |
16 | | -This approach allows developers to map complex algorithms to the hierarchical nature of modern hardware, from multi-core CPUs to many-core GPUs and leverage more parallelism in their computations, potentially leading to significant performance improvements. The framework supports various levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures [1][2][6]. |
| 14 | +At the heart of Kokkos' *hierarchical parallelism* lies the ability to exploit multiple levels of *shared-memory parallelism*. |
| 15 | +This approach allows developers to map complex algorithms to the hierarchical nature of modern hardware, from multi-core CPUs to many-core GPUs, and to leverage more parallelism in their computations, potentially leading to significant performance improvements. The framework supports various levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures. |
17 | 16 |
|
18 | 17 | *Similarities and Differences Between Outer and Inner Levels of Parallelism* |
19 | 18 |
|
@@ -50,43 +49,43 @@ Well-coordinated teams can significantly boost performance by: |
50 | 49 |
|
51 | 50 | [source, c++] |
52 | 51 | ---- |
53 | | - struct HierarchicalParallelism { |
54 | | - Kokkos::View<double**> matrix; |
55 | | - HierarchicalParallelism(int N, int M) : matrix("matrix", N, M) {} |
56 | | - KOKKOS_INLINE_FUNCTION |
57 | | - void operator()(const Kokkos::TeamPolicy<>::member_type& team_member) const { |
58 | | - const int i = team_member.league_rank(); |
59 | | - Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), |
60 | | - [&] (const int j) { |
61 | | - matrix(i, j) = i * matrix.extent(1) + j; |
62 | | - }); |
63 | | - |
64 | | - team_member.team_barrier(); |
65 | | - if (team_member.team_rank() == 0) { |
66 | | - double sum = 0.0; |
67 | | - Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), |
68 | | - [&] (const int j, double& lsum) { |
69 | | - lsum += matrix(i, j); |
70 | | - }, sum); |
71 | | - |
72 | | - Kokkos::single(Kokkos::PerTeam(team_member), [&] () { |
73 | | - matrix(i, 0) = sum; |
74 | | - }); |
75 | | - } |
76 | | - } |
77 | | - }; |
78 | | -
|
79 | | - int main(int argc, char* argv[]) { |
80 | | - Kokkos::initialize(argc, argv); |
81 | | - { |
82 | | - const int N = 1000; |
83 | | - const int M = 100; |
84 | | - HierarchicalParallelism functor(N, M); |
85 | | - Kokkos::parallel_for(Kokkos::TeamPolicy<>(N, Kokkos::AUTO), functor); |
86 | | - } |
87 | | - Kokkos::finalize(); |
88 | | - return 0 |
| 52 | +struct HierarchicalParallelism { |
| 53 | + Kokkos::View<double**> matrix; |
| 54 | + HierarchicalParallelism(int N, int M) : matrix("matrix", N, M) {} |
| 55 | + KOKKOS_INLINE_FUNCTION |
| 56 | + void operator()(const Kokkos::TeamPolicy<>::member_type& team_member) const { |
| 57 | + const int i = team_member.league_rank(); |
| 58 | + Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), |
| 59 | + [&] (const int j) { |
| 60 | + matrix(i, j) = i * matrix.extent(1) + j; |
| 61 | + }); |
| 62 | +
|
| 63 | + team_member.team_barrier(); |
| 64 | + { // every team thread enters this block: the TeamThreadRange reduction below is a team-wide collective |
| 65 | + double sum = 0.0; |
| 66 | + Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), |
| 67 | + [&] (const int j, double& lsum) { |
| 68 | + lsum += matrix(i, j); |
| 69 | + }, sum); |
| 70 | +
|
| 71 | + Kokkos::single(Kokkos::PerTeam(team_member), [&] () { |
| 72 | + matrix(i, 0) = sum; |
| 73 | + }); |
89 | 74 | } |
| 75 | + } |
| 76 | +}; |
| 77 | +
|
| 78 | +int main(int argc, char* argv[]) { |
| 79 | + Kokkos::initialize(argc, argv); |
| 80 | + { |
| 81 | + const int N = 1000; |
| 82 | + const int M = 100; |
| 83 | + HierarchicalParallelism functor(N, M); |
| 84 | + Kokkos::parallel_for(Kokkos::TeamPolicy<>(N, Kokkos::AUTO), functor); |
| 85 | + } |
| 86 | + Kokkos::finalize(); |
| 87 | + return 0; |
| 88 | + } |
90 | 89 | ---- |
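As a rough analogy only, the two-level structure of the listing above can be mimicked in plain C++ with `std::thread`. This sketch does not use Kokkos: the names `team_fill_and_sum` and `run_league` are hypothetical, the outer "league" loop runs serially here (Kokkos may run league members concurrently), and joining the threads stands in for `team_barrier()`.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not a Kokkos API): one "team" processes row i.
// Each team thread takes a strided share of the columns, loosely
// mirroring how TeamThreadRange splits an inner range across a team.
double team_fill_and_sum(std::vector<double>& row, std::size_t i,
                         std::size_t team_size) {
    assert(team_size > 0);
    const std::size_t M = row.size();
    std::vector<double> partial(team_size, 0.0);  // one partial sum per team thread
    std::vector<std::thread> team;
    for (std::size_t t = 0; t < team_size; ++t) {
        team.emplace_back([&, t] {
            for (std::size_t j = t; j < M; j += team_size) {
                row[j] = static_cast<double>(i * M + j);  // fill phase
                partial[t] += row[j];                     // thread-local reduction
            }
        });
    }
    for (auto& th : team) th.join();  // stands in for team_barrier()

    double sum = 0.0;
    for (double p : partial) sum += p;  // combine the per-thread partial sums
    if (M > 0) row[0] = sum;            // a single writer stores the team result
    return sum;
}

// Hypothetical driver: the outer level, one team per row (the "league").
std::vector<std::vector<double>> run_league(std::size_t N, std::size_t M,
                                            std::size_t team_size) {
    std::vector<std::vector<double>> matrix(N, std::vector<double>(M, 0.0));
    for (std::size_t i = 0; i < N; ++i) {
        team_fill_and_sum(matrix[i], i, team_size);
    }
    return matrix;
}
```

For instance, `run_league(3, 4, 2)` leaves the row sums 6, 22, and 38 in the first column of rows 0, 1, and 2. Note that every team thread takes part in the reduction and only the final store is done by a single writer.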
91 | 90 |
|
92 | 91 | Hierarchical parallelism is implemented as follows: |
@@ -236,7 +235,7 @@ Explanations: |
236 | 235 | *** Scratch memory can be used with the TeamPolicy to provide thread-private or team-private memory. |
237 | 236 | *** Scratch memory exposes on-chip, user-managed caches (e.g. on NVIDIA GPUs). |
238 | 237 | *** The size must be determined before launching a kernel. |
239 | | -*** Two levels are available: large/slow and small/fast. |
| 238 | +*** Two levels are available: large/slow and small/fast. |
240 | 239 |
|
241 | 240 |
|
242 | 241 | * *Token* |
|