Commit abd6fe5

Merge branch 'feature/kokkos' of github.com:feelpp/parallel-programming into feature/kokkos
2 parents 3f99cf7 + eef3c7a commit abd6fe5

File tree: 6 files changed (+332 / -441 lines)

docs/modules/kokkos/examples/src/02_views_2D.cpp

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@ int main(int argc, char *argv[]) {
   // Print the view elements
   Kokkos::parallel_for(
       "PrintView", 10, KOKKOS_LAMBDA(const int i) {
-        printf("view(%d) = %f\n", i, view(i, 0));
-        printf("view(%d) = %f\n", i, view(i, 1));
+        printf("view(%d, 0) = %f\n", i, view(i, 0));
+        printf("view(%d, 1) = %f\n", i, view(i, 1));
       });
 }
 Kokkos::finalize();

docs/modules/kokkos/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
 *** xref:basic-concepts/execution-spaces.adoc[Execution Spaces]
 *** xref:basic-concepts/memory-spaces.adoc[Memory Spaces]
 *** xref:basic-concepts/mirrors.adoc[Mirrors]
+**** xref:basic-concepts/mirrors_sol_code.adoc[Solution from Kokkos tutorial]
 *** xref:basic-concepts/memory-access-patterns.adoc[Memory Access Patterns]

 ** xref:advanced-concepts/index.adoc[Advanced Concepts]

docs/modules/kokkos/pages/advanced-concepts/hierarchical-parallelism.adoc

Lines changed: 103 additions & 49 deletions
@@ -14,7 +14,7 @@ The paradigm employs a two-tiered approach: an outer level, often implemented us
 At the heart of Kokkos' *hierarchical parallelism* lies the ability to exploit multiple levels of *shared-memory parallelism*.
 This approach allows developers to map complex algorithms to the hierarchical nature of modern hardware, from multi-core CPUs to many-core GPUs, and to leverage more parallelism in their computations, potentially leading to significant performance improvements. The framework supports various levels of parallelism, including thread teams, threads within a team, and vector lanes, which can be nested to create complex parallel structures.

-*Similarities and Differences Between Outer and Inner Levels of Parallelism*
+=== Similarities and Differences Between Outer and Inner Levels of Parallelism

 - **Outer Level (League)**: The outermost level of parallelism, often referred to as the "league," typically corresponds to coarse-grained work distribution. This level is suitable for dividing large workloads across multiple compute units or NUMA domains.
@@ -24,75 +24,83 @@ This approach allows developers to map complex algorithms to the hierarchical na

 - **Differences**: Inner levels have access to fast, shared memory resources and synchronization primitives, while outer levels are more independent and lack direct communication mechanisms.

-*Thread Teams*
+=== Thread Teams

 Kokkos introduces the concept of *thread teams*, which organizes parallel work into a two-dimensional structure:

 - **League**: A collection of teams that can execute independently.
 - **Team**: A group of threads that can synchronize and share resources.
 - **Thread**: The basic unit of parallel execution within a team.

-This hierarchical structure allows for efficient mapping of algorithms to hardware:
+This hierarchical structure allows for efficient mapping of algorithms to hardware:

 - On *GPUs*, *teams* often correspond to thread blocks, with threads mapping to CUDA threads or vectorized operations.
 - On *CPUs*, *teams* might represent groups of cores, with threads corresponding to individual CPU threads or SIMD lanes.

-*Performance Improvement with Well-Coordinated Teams*
+=== Performance Improvement with Well-Coordinated Teams

 Well-coordinated teams can significantly boost performance by:

 - **Optimizing Memory Access**: Teams can cooperatively load data into shared memory, reducing global memory accesses.
 - **Load Balancing**: The two-level structure allows for dynamic work distribution, adapting to varying workloads across different parts of the computation.
 - **Hardware Utilization**: By matching the team structure to hardware capabilities, Kokkos can achieve high occupancy and efficient resource usage [3].

-*Example*
+=== Example of implementation
+
+.`HierarchicalParallelism`
 [source, c++]
 ----
 struct HierarchicalParallelism {
   Kokkos::View<double**> matrix;
   HierarchicalParallelism(int N, int M) : matrix("matrix", N, M) {}
+
   KOKKOS_INLINE_FUNCTION
   void operator()(const Kokkos::TeamPolicy<>::member_type& team_member) const {
     const int i = team_member.league_rank();
-    Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, matrix.extent(1)),
+    Kokkos::parallel_for(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), <2>
       [&] (const int j) {
         matrix(i, j) = i * matrix.extent(1) + j;
       });

     team_member.team_barrier();
     if (team_member.team_rank() == 0) {
       double sum = 0.0;
-      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team_member, matrix.extent(1)),
+      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team_member, matrix.extent(1)), <2>
         [&] (const int j, double& lsum) {
           lsum += matrix(i, j);
         }, sum);

-      Kokkos::single(Kokkos::PerTeam(team_member), [&] () {
+      Kokkos::single(Kokkos::PerTeam(team_member), [&] () { <3>
         matrix(i, 0) = sum;
       });
     }
   }
 };
+----

+.Execution
+[source, c++]
+----
 int main(int argc, char* argv[]) {
   Kokkos::initialize(argc, argv);
   {
     const int N = 1000;
     const int M = 100;
     HierarchicalParallelism functor(N, M);
-    Kokkos::parallel_for(Kokkos::TeamPolicy<>(N, Kokkos::AUTO), functor);
+    Kokkos::parallel_for(Kokkos::TeamPolicy<>(N, Kokkos::AUTO), functor); <1>
   }
   Kokkos::finalize();
-  return 0
-}
+  return 0;
+}
 ----

 Hierarchical parallelism is implemented as follows:

-- The top level uses `Kokkos::TeamPolicy` to parallelize on the rows of the matrix.
-- `Kokkos::TeamThreadRange` is used to parallelize operations on columns within each team.
-- `Kokkos::single` is used to ensure that some operations are performed only once per team.
+. The top level uses `Kokkos::TeamPolicy` to parallelize over the rows of the matrix.
+. `Kokkos::TeamThreadRange` is used to parallelize operations on columns within each team.
+. `Kokkos::single` is used to ensure that some operations are performed only once per team.


 == Scratch Memory
@@ -130,6 +138,60 @@ To effectively use scratch memory:
 2. Create scratch views within kernels using `ScratchView` or `team_scratch()`/`thread_scratch()`.
 3. Use team barriers (`team.team_barrier()`) to ensure data consistency when sharing scratch memory among threads.

+.Example of Scratch Memory Usage
+[source, c++]
+----
+struct ScratchMemoryExample {
+  Kokkos::View<double*> data;
+  ScratchMemoryExample(int N) : data("data", N) {}
+
+  KOKKOS_INLINE_FUNCTION
+  void operator()(const Kokkos::TeamPolicy<>::member_type& team_member) const {
+    const int team_size = team_member.team_size();
+    const int team_rank = team_member.team_rank();
+    const int league_rank = team_member.league_rank();
+
+    // Allocate team scratch memory
+    double* team_scratch = (double*)team_member.team_shmem().get_shmem(team_size * sizeof(double)); <1>
+
+    // Each thread initializes its scratch memory
+    team_scratch[team_rank] = league_rank * team_size + team_rank;
+
+    // Synchronize to ensure all threads have written to scratch memory
+    team_member.team_barrier(); <3>
+
+    // Perform a reduction within the team
+    double team_sum = 0.0;
+    Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team_member, team_size),
+      [&](const int i, double& lsum) {
+        lsum += team_scratch[i];
+      }, team_sum);
+
+    // Only one thread writes the result back to global memory
+    if (team_rank == 0) {
+      data(league_rank) = team_sum;
+    }
+  }
+
+  // Specify the amount of scratch memory needed
+  size_t team_shmem_size(int team_size) const {
+    return team_size * sizeof(double);
+  }
+};
+
+int main(int argc, char* argv[]) {
+  Kokkos::initialize(argc, argv);
+  {
+    const int N = 1000;
+    ScratchMemoryExample functor(N);
+    Kokkos::parallel_for(Kokkos::TeamPolicy<>(N / 10, Kokkos::AUTO).set_scratch_size(0, Kokkos::PerTeam(functor.team_shmem_size(10))), functor); <2>
+  }
+  Kokkos::finalize();
+  return 0;
+}
+----

 == Unique Token

@@ -154,36 +216,35 @@ Kokkos offers two scopes for unique tokens: *Global Scope* and *Instance Scope*.
 - **Instance Scope**: Tokens are unique only within a specific instance of `UniqueToken`.

-*Example*
-
+.Tokens
 [source, c++]
 ----
-Kokkos::initialize(argc, argv);
-{
-// Size of the array
-const int N = 100;
-// Kokkos view to store the results
-Kokkos::View<int*> results("results", N);
-// Create a UniqueToken (based on thread execution)
-Kokkos::Experimental::UniqueToken<Kokkos::DefaultExecutionSpace> unique_token;
-// Number of available threads
-const int num_threads = unique_token.size();
-std::cout << "Number of threads: " << num_threads << std::endl;
-Kokkos::parallel_for("UniqueTokenExample", N, KOKKOS_LAMBDA(const int i) {
-// Get a unique identifier for this thread
-int token = unique_token.acquire();
-results(i) = token;
-unique_token.release(token);
-});
-// Copy the results to the host for display
-auto host_results = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), results);
-std::cout << "Results: ";
-for (int i = 0; i < N; ++i) {
-std::cout << host_results(i) << " ";
-}
-std::cout << std::endl;
-}
-Kokkos::finalize();
+Kokkos::initialize(argc, argv);
+{
+    // Size of the array
+    const int N = 100;
+    // Kokkos view to store the results
+    Kokkos::View<int*> results("results", N);
+    // Create a UniqueToken (based on thread execution)
+    Kokkos::Experimental::UniqueToken<Kokkos::DefaultExecutionSpace> unique_token; <1>
+    // Number of available threads
+    const int num_threads = unique_token.size();
+    std::cout << "Number of threads: " << num_threads << std::endl;
+    Kokkos::parallel_for("UniqueTokenExample", N, KOKKOS_LAMBDA(const int i) {
+        // Get a unique identifier for this thread
+        int token = unique_token.acquire(); <2>
+        results(i) = token;
+        unique_token.release(token); <3>
+    });
+    // Copy the results to the host for display
+    auto host_results = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), results);
+    std::cout << "Results: ";
+    for (int i = 0; i < N; ++i) {
+        std::cout << host_results(i) << " ";
+    }
+    std::cout << std::endl;
+}
+Kokkos::finalize();
 ----

 Explanations:
@@ -203,7 +264,7 @@ Explanations:
 **Copying results**:
 Data is copied to the host using `Kokkos::create_mirror_view_and_copy` for display.

-...
+

 == References
@@ -250,11 +311,4 @@ Explanations:
 *** UniqueToken can be sized to restrict ids to a range.
 *** A Global UniqueToken is available.

-
 ****
-
-
-
-
-

docs/modules/kokkos/pages/basic-concepts/mirrors_sol_code.adoc

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ add_executable(kokkos_mirror 05_kokkos_mirrors.cpp)
 target_link_libraries(kokkos_mirror Kokkos::kokkos)
 ----

-[%dynamic, cpp, filename="05_kokkos_mirrors.cpp"]
+[source, cpp, filename="05_kokkos_mirrors.cpp", compile=cmake]
 ----
 include::example$src/05_kokkos_mirrors.cpp[]
 ----

0 commit comments