At the heart of Kokkos' *hierarchical parallelism* lies the ability to exploit multiple levels of *shared-memory parallelism*.
This approach allows developers to map complex algorithms to the hierarchical nature of modern hardware, from multi-core CPUs to many-core GPUs, and to leverage more parallelism in their computations, potentially leading to significant performance improvements. The framework supports several levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures.
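These three levels appear directly in the API. The following is a minimal sketch of fully nested team, thread, and vector parallelism, assuming a working Kokkos installation; the view name and extents are illustrative, and `Kokkos::AUTO` for the vector length assumes a recent Kokkos version:

.`Three nested levels of parallelism`
[source, c++]
----
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 128, M = 32, K = 8;
    Kokkos::View<double***> A("A", N, M, K);

    using policy = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for("nested", policy(N, Kokkos::AUTO, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy::member_type& team) {
        const int i = team.league_rank();  // outer level: one team per i
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, M), [&](int j) {
          // middle level: threads of the team split the j loop
          Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, K), [&](int k) {
            // innermost level: vector lanes split the k loop
            A(i, j, k) = static_cast<double>(i + j + k);
          });
        });
      });
  }
  Kokkos::finalize();
}
----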
=== Similarities and Differences Between Outer and Inner Levels of Parallelism
- **Outer Level (League)**: The outermost level of parallelism, often referred to as the "league," typically corresponds to coarse-grained work distribution. This level is suitable for dividing large workloads across multiple compute units or NUMA domains.
- **Differences**: Inner levels have access to fast, shared memory resources and synchronization primitives, while outer levels are more independent and lack direct communication mechanisms.
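This contrast is visible directly in the API. In the sketch below (assuming a Kokkos installation; `league_size` is an illustrative value), threads within a team can wait for each other with `team_barrier()`, while no corresponding call spans the league:

[source, c++]
----
const int league_size = 64;

using policy = Kokkos::TeamPolicy<>;
Kokkos::parallel_for("sync_demo", policy(league_size, Kokkos::AUTO),
  KOKKOS_LAMBDA(const policy::member_type& team) {
    // Inner level: threads of one team share scratch memory and
    // can synchronize with each other.
    team.team_barrier();
    // Outer level: there is no barrier across teams of the league.
    // Teams must be independent; a league-wide synchronization point
    // is only reached when the kernel itself completes.
  });
----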
=== Thread Teams
Kokkos introduces the concept of *thread teams*, which organizes parallel work into a two-dimensional structure:
- **League**: A collection of teams that can execute independently.
- **Team**: A group of threads that can synchronize and share resources.
- **Thread**: The basic unit of parallel execution within a team.
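Conceptually, this is a two-dimensional index space: a thread is identified by its team's position in the league (`league_rank`) and its own position within the team (`team_rank`). A plain-C++ sketch of how a backend might flatten that pair into a single hardware thread index (the helper function is hypothetical, not part of the Kokkos API):

[source, c++]
----
#include <cassert>

// Hypothetical helper: flatten (league_rank, team_rank) into one
// global index, as a backend might when assigning hardware threads.
int global_thread_id(int league_rank, int team_rank, int team_size) {
  return league_rank * team_size + team_rank;
}

int main() {
  const int team_size = 4;
  // Thread 2 of team 3 is global thread 3*4 + 2 = 14.
  assert(global_thread_id(3, 2, team_size) == 14);
  return 0;
}
----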
This hierarchical structure allows for efficient mapping of algorithms to hardware:
- On *GPUs*, *teams* often correspond to thread blocks, with threads mapping to CUDA threads or vectorized operations.
- On *CPUs*, *teams* might represent groups of cores, with threads corresponding to individual CPU threads or SIMD lanes.
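Because this mapping differs per backend, a portable kernel usually leaves the team size to the library via `Kokkos::AUTO`. A sketch, assuming a Kokkos installation; `N`, `M`, and the view `A` are illustrative:

[source, c++]
----
const int N = 1024, M = 256;
Kokkos::View<double**> A("A", N, M);

using policy = Kokkos::TeamPolicy<>;
// One team per row; the backend picks a team size suited to the
// hardware (e.g. a thread-block size on GPUs, a core group on CPUs).
Kokkos::parallel_for("fill_rows", policy(N, Kokkos::AUTO),
  KOKKOS_LAMBDA(const policy::member_type& team) {
    const int i = team.league_rank();
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, M),
                         [&](int j) { A(i, j) = static_cast<double>(i + j); });
  });
----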
=== Performance Improvement with Well-Coordinated Teams
Well-coordinated teams can significantly boost performance by:
- **Optimizing Memory Access**: Teams can cooperatively load data into shared memory, reducing global memory accesses.
- **Load Balancing**: The two-level structure allows for dynamic work distribution, adapting to varying workloads across different parts of the computation.
- **Hardware Utilization**: By matching the team structure to hardware capabilities, Kokkos can achieve high occupancy and efficient resource usage [3].
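The first point can be sketched with team scratch memory, Kokkos' portable abstraction over fast on-chip storage such as GPU shared memory. This is a sketch assuming a Kokkos installation; the sizes and view names are illustrative:

[source, c++]
----
const int N = 1024, M = 256;
Kokkos::View<double**> A("A", N, M);

using policy      = Kokkos::TeamPolicy<>;
using scratch_pad = Kokkos::View<double*,
    Kokkos::DefaultExecutionSpace::scratch_memory_space,
    Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Request enough level-0 scratch per team to hold one row.
const size_t bytes = scratch_pad::shmem_size(M);

Kokkos::parallel_for("stage_rows",
  policy(N, Kokkos::AUTO).set_scratch_size(0, Kokkos::PerTeam(bytes)),
  KOKKOS_LAMBDA(const policy::member_type& team) {
    scratch_pad row(team.team_scratch(0), M);
    // Cooperative load: each thread stages part of the row into scratch...
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, M),
                         [&](int j) { row(j) = A(team.league_rank(), j); });
    // ...and the barrier makes all loads visible before the team reuses them.
    team.team_barrier();
  });
----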
=== Example Implementation
.`HierarchicalParallelism`
[source, c++]
----
struct HierarchicalParallelism {
  Kokkos::View<double**> matrix;

  HierarchicalParallelism(int N, int M) : matrix("matrix", N, M) {}
};
----