
Commit c88c17b

gsitaram, bobrobey, gmarkomanolis, jychang48 committed
Affinity blogs part 1 and 2
Co-authored-by: bobrobey <[email protected]>
Co-authored-by: George Markomanolis <[email protected]>
Co-authored-by: Justin Chang <[email protected]>
1 parent d0cc152 commit c88c17b

15 files changed: +54355 -0 lines changed

affinity/AUTHORS

+12
@@ -0,0 +1,12 @@
Corresponding Author:

- Gina Sitaraman

Authors:

- Bob Robey
- Georgios Markomanolis

Reviewer:

- Justin Chang

affinity/Affinity_Part1.md

+250
@@ -0,0 +1,250 @@
# Affinity part 1 - Affinity, placement, and order

Modern hardware architectures are increasingly complex with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphics Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), etc. Peripherals such as
GPUs or memory controllers will often be local to a CPU socket. Such designs present
interesting challenges in optimizing memory access times, data transfer times, etc.
Depending on how the system is built, how its hardware components are connected,
and the workload being run, it may be advantageous to use the resources of the
system in a specific way. In this article, we discuss the role of affinity,
placement, and order in improving performance for High Performance Computing (HPC)
workloads. A short case study is also presented to familiarize you with performance
considerations on a node of the [Frontier](https://www.olcf.ornl.gov/frontier/)
supercomputer. In a [follow-up article](./Affinity_Part2.md), we aim to equip you
with the tools you need to understand your system's hardware topology and set up
affinity for your application accordingly.

## A brief introduction to NUMA systems

In Non-Uniform Memory Access (NUMA) systems, resources are logically partitioned into
multiple *domains* or *nodes*. Even though all processor cores can read or write from
any memory on the system, each processor core has *local* memory that is attached to it,
and *non-local* or *remote* memory that is attached to other processor cores or that it
shares with other processors. Accessing data from *local* memory is faster than
accessing data from *remote* memory, and remote access latency is especially high when
the access crosses a socket-to-socket interconnect. Local accesses also reduce memory
contention among CPU cores, resulting in higher bandwidth. Therefore, in such systems,
it is important to spread the processes and their threads across multiple NUMA domains
so that all resources of the system are used uniformly.

NUMA systems can be configured with multiple *domains* per socket. The NUMA domains Per
Socket (NPS) configuration is set at boot time, typically by the administrators of
large compute clusters. In dual-socket nodes, for instance, it is common to find NPS1
or NPS4 configurations where each socket is set up to have 1 or 4 NUMA domains. All the
memory controllers, processor cores, NICs, GPUs, and other similar resources are
partitioned among the various NUMA domains based on how they are physically connected
to each other.

Consider a dual-socket node with 16 memory channels in all. In the NPS1 case, there is
one NUMA domain per socket, each with 8 memory channels, and memory accesses are
interleaved across all 8 channels, resulting in uniform bandwidth. In contrast, in an
NPS4 configuration, each of the 4 NUMA domains in a socket interleaves memory accesses
across only 2 memory channels. The reduced contention may increase achieved memory
bandwidth if the processes are spread across the various NUMA domains.

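A quick way to see this layout on a given node is to query it with libnuma. The minimal
sketch below, assuming libnuma and its headers are installed (build with something like
`gcc numa_layout.c -lnuma`; the file name is only illustrative), lists the NUMA domains,
the domain each logical CPU belongs to, and the local memory of each domain.

```c
/* Minimal sketch: report the NUMA layout of the current node using libnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA domain(s), %d logical CPU(s)\n", nodes, cpus);

    /* Which NUMA domain does each logical CPU belong to? */
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("CPU %3d -> NUMA node %d\n", cpu, numa_node_of_cpu(cpu));

    /* How much local memory does each NUMA domain have? */
    for (int node = 0; node < nodes; node++) {
        long long free_bytes = 0;
        long long size_bytes = numa_node_size64(node, &free_bytes);
        printf("NUMA node %d: %lld MB total, %lld MB free\n",
               node, size_bytes >> 20, free_bytes >> 20);
    }
    return 0;
}
```
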
## Affinity, placement, and order - introduction and motivation

Scheduling processes and their threads on processor cores is controlled by the Operating
System (OS). The OS manages preemption of processes when resources are scarce and shared
among many processes. When the OS decides to reschedule a process, it may choose a new
processor core. In that case, any cached data has to be moved to the caches closer to
the new core, which increases latency and lowers performance for the workload. The OS
does not know that the processes of a parallel job belong together, so in a
multi-process job such process movement and the associated data movement may cause all
the other processes to wait longer at synchronization barriers. OS schedulers need
assistance from the software developer to efficiently manage CPU and GPU affinity.

### Affinity

Affinity is a way for processes to indicate a preference for hardware components so that
a given process is always scheduled to the same set of compute cores and is able to
access data from *local* memory efficiently. Processes are typically pinned to resources
belonging to the same NUMA domain. Setting affinity improves cache reuse and NUMA memory
locality, reduces contention for resources, lowers latency, and reduces variability from
run to run. Affinity is extremely important for processes running on CPU cores and the
resulting placement of their data in CPU memory. On systems with CPUs and GPUs, affinity
is less critical unless there is a bottleneck with the location of data in host memory.
If data in host memory is not in the same NUMA domain as the GPU, then memory copies
between host and device, page migration, and direct memory access may be affected.

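On Linux, a process can also set its own binding explicitly with the `sched_setaffinity`
system call; launchers and MPI or OpenMP runtimes usually do this on our behalf, as we
will see in Part 2. The minimal sketch below pins the calling process to CPUs 0-7, an
arbitrary choice standing in for the cores of one NUMA domain, and verifies the result.

```c
/* Minimal Linux-only sketch: pin the calling process to a set of cores. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    printf("Running on CPU %d before pinning\n", sched_getcpu());

    /* Build an affinity mask containing CPUs 0-7 (an illustrative choice). */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &mask);

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Read the mask back to confirm where the OS may now schedule us. */
    cpu_set_t check;
    CPU_ZERO(&check);
    sched_getaffinity(0, sizeof(check), &check);
    printf("Now allowed on %d CPU(s); currently on CPU %d\n",
           CPU_COUNT(&check), sched_getcpu());
    return 0;
}
```
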
For parallel processes, affinity is more than binding; we also have to pay attention to
placement and order. Let us look at these two ideas more closely.

### Placement

Placement determines where on the node the processes of a job run. Our goal is to make
the best use of the resources available to our workload, and how we achieve that differs
from workload to workload. Consider some scenarios that illustrate this point:

- We may want to use all resources such as CPU cores, caches, GPUs, NICs, memory
  controllers, etc.
- If processes have multiple threads (OpenMP&reg;), we may require each thread to run on
  a separate CPU core
- In some cases, to avoid thrashing of caches, we may want to use only one Hardware
  Thread (HWT) per physical core
- In cases where there is not enough memory per process, we may want to skip some CPU
  cores
- We may want to reserve some cores for system operations such as servicing GPU
  interrupts, in order to reduce jitter for timing purposes
- Message Passing Interface (MPI) prefers "gang scheduling", but the OS does not know
  that the processes are connected

On today's hardware, controlling placement may help avoid oversubscription of compute
resources and thereby avoid unnecessary contention for shared resources. Proper
placement can also prevent non-uniform use of compute resources, where some resources
are busy while others sit idle. Placing processes too far apart may result in
sub-optimal communication performance. Most importantly, with proper placement we can
prevent the operating system from migrating processes. We must note that affinity
controls in the OS and MPI have greatly improved and changed over the years.

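A simple way to check the placement you are actually getting is to have each thread
report the hardware thread it runs on. The minimal OpenMP sketch below does just that;
it assumes a Linux system (for `sched_getcpu`) and an OpenMP compiler flag such as
`-fopenmp`.

```c
/* Minimal sketch: each OpenMP thread reports the hardware thread it runs on. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* sched_getcpu() returns the id of the logical CPU (hardware thread). */
        printf("OpenMP thread %2d of %2d is running on CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```

Running it with different settings of the standard `OMP_PLACES` and `OMP_PROC_BIND`
environment variables shows how the thread placement changes.
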
### Order

Order defines how the processes of a parallel job are distributed across the sockets of
the node. There are many ways to order processes, and we can choose the right one if we
understand our application's communication pattern. For instance, placing processes
that communicate with each other close together, perhaps on the same socket, lowers
communication latency. A heavy workload, on the other hand, may be better balanced if
it is scattered across all available compute resources.

In many job scheduling systems, the default ordering mechanism is *Round-Robin* or
*Cyclic*, where processes are distributed in a round-robin fashion across sockets as
shown in the figure below. In this example, 8 MPI ranks are scheduled across two
4-core sockets. Cyclic ordering helps maximize the cache available to each process and
evenly utilize the resources of a node.

![!](images/cyclic_ordering.svg)

Another commonly used ordering mechanism is *Packed* or *Close*, where consecutive MPI
ranks are assigned to cores in the same socket until it is filled, before moving on to
the next socket. Packed ordering is illustrated in the figure below for the same case
of 8 MPI ranks scheduled across two sockets. Closely packing processes can improve
performance through data locality if the ranks that communicate the most access data
in the same memory node and share caches.

![!](images/packed_ordering.svg)

Choosing rank order carefully helps optimize communication. We know that intra-node
communication is faster than inter-node communication. The application or domain expert
may know the best placement for the application at hand; for example, stencil
near-neighbors are best placed next to each other. Tools such as HPE's CrayPat profiler
or the `grid_order` utility can be used to detect the communication pattern between MPI
ranks and generate an optimized rank order in a file that can then be supplied to Cray
MPICH when running the workload. Slurm binding options may also be available at large
computing sites.

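Before tuning the order, it helps to see the mapping the launcher gives you. In the
minimal sketch below, every MPI rank reports the node and hardware thread it landed on,
which makes cyclic versus packed ordering easy to spot; it assumes a Linux system and an
MPI compiler wrapper such as `mpicc`.

```c
/* Minimal sketch: each MPI rank reports the node and hardware thread it landed on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char node[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(node, &len);

    printf("Rank %3d of %3d on node %s, CPU %3d\n", rank, size, node, sched_getcpu());

    MPI_Finalize();
    return 0;
}
```
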
## Case study: Placement considerations on a Frontier node

Oak Ridge National Laboratory (ORNL)'s
[Frontier supercomputer](https://www.olcf.ornl.gov/frontier/)
is a system based on HPE Cray's EX architecture with optimized 3rd Gen AMD EPYC&trade;
CPUs and AMD Instinct&trade; MI250X GPUs. In the figure below depicting the topology of
a Frontier node, we see that the 64-core CPU is connected to 4 MI250X GPUs via
high-speed Infinity Fabric&trade; links. Each MI250X GPU consists of two Graphics
Compute Dies (GCDs), each with 64 GB of High Bandwidth Memory (HBM). The CPU is
connected to 512 GB of DDR4 memory. The two GCDs in each GPU have four Infinity
Fabric&trade; links between them; GCDs of different GPUs are also connected via
Infinity Fabric&trade; links, but with fewer links between them. There are 4 NICs
connected directly to the odd-numbered GCDs. Lastly, the CPU is configured in NPS4
mode, so each group of 16 cores belongs to a NUMA domain. Simultaneous Multi-Threading
(SMT) is enabled, so there are two HWTs per physical core.

![!](images/lumi_node_diagram.svg)

On this complex architecture, it is important to choose rank order and placement
carefully to optimize communication. Let us look at a few aspects of this architecture
and attempt to prescribe best practices for each.

### Consideration 1 - Each GCD is connected to 8 CPU cores in a NUMA domain

In the simplified figure below, we see that each GCD is connected to 8 CPU cores that
belong to the same NUMA domain. For instance, CPU cores 0-7 are closest to GCD 4 and
CPU cores 48-55 are closest to GCD 0. Therefore, pinning a process and its threads to
the cores closest to the GCD it uses would improve the efficiency of Host-to-Device
(H2D) and Device-to-Host (D2H) transfers.

![!](images/Frontier_Node_Diagram_Simple.svg)

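One way to check this mapping from inside an application is to have each rank print the
GCD it selected next to the CPU it is running on, and compare the pair against the node
diagram. The minimal sketch below chooses a GCD from the node-local rank (just one
possible mapping) and reports the device's PCI bus ID; it assumes the code is built with
`hipcc` together with the site's MPI wrapper, and that the launcher has already bound
each rank to its cores, for example through the Slurm binding options covered in Part 2.

```c
/* Minimal sketch: map each rank to a GCD by node-local rank and report the pairing. */
#define _GNU_SOURCE
#include <hip/hip_runtime.h>
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Node-local rank, used here to choose one of the GCDs visible on this node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_gcds = 0;
    hipGetDeviceCount(&num_gcds);
    int my_gcd = local_rank % num_gcds;   /* illustrative mapping only */
    hipSetDevice(my_gcd);

    char bus_id[32];
    hipDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), my_gcd);

    printf("Rank %3d (local %d): CPU %3d, GCD %d (PCI %s)\n",
           rank, local_rank, sched_getcpu(), my_gcd, bus_id);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```
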
### Consideration 2 - Memory bandwidth is highest between GCDs of the same MI250X GPU

As seen in the figure below, there are four Infinity Fabric&trade; links between the two
GCDs of an MI250X GPU, for a combined 200 GB/s peak bandwidth in each direction. This
can be advantageous for reducing communication latency if we place the pairs of ranks
that communicate the most on the GCDs of the same GPU. Note that even though the
bandwidth differs between different pairs of GCDs, communication using device buffers
will be at least as fast as communication using host buffers.

![!](images/bandwidths.svg)

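Communication over these links relies on the GCDs being able to access each other's HBM
directly. The minimal sketch below, built with `hipcc`, checks whether the first two
visible GCDs have such peer access and enables it in one direction; it is only a
connectivity check, not a bandwidth measurement.

```c
/* Minimal sketch: check and enable peer access between the first two visible GCDs. */
#include <hip/hip_runtime.h>
#include <stdio.h>

int main(void) {
    int num_gcds = 0;
    hipGetDeviceCount(&num_gcds);
    if (num_gcds < 2) {
        printf("Fewer than two GCDs visible, nothing to check\n");
        return 0;
    }

    int can_access = 0;
    hipDeviceCanAccessPeer(&can_access, 0, 1);
    printf("GCD 0 %s access GCD 1 memory directly\n", can_access ? "can" : "cannot");

    if (can_access) {
        hipSetDevice(0);
        /* After this call, copies between buffers on GCD 0 and GCD 1 can take the
           direct device-to-device path. */
        hipDeviceEnablePeerAccess(1, 0);
    }
    return 0;
}
```
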
### Consideration 3 - NICs are attached to odd-numbered GCDs

In the figure below, we see that the four NICs on a Frontier node are directly connected
to the odd-numbered GCDs. Hence, inter-node MPI communication using device buffers
(GPU-aware MPI) is expected to be faster. HPE Cray's MPI implementation, for instance,
provides environment variables to select the mapping between each process and its
default NIC. You can find more information about this using `man mpi` on Cray systems.

![!](images/nics.svg)

### Consideration 4 - Multiple processes on the same GCD

AMD GPUs natively support running multiple MPI ranks on the same device, where the
processes share the available resources, improving utilization. Depending on the
application's communication pattern, packing the ranks that communicate the most onto
the same device can improve performance. In the figure below, 4 MPI ranks are running
on GCD 4. These 4 ranks are pinned to CPU cores 0, 2, 4, and 6, respectively.

![!](images/multiple_mpi_ranks.svg)

In this case study, we examined the topology of a Frontier node, which helped us
understand how we may want to bind, place, and order processes when running our
workloads. A similar analysis is required on any system you work with in order to
extract a little more performance from your jobs. We hope these ideas help you ask the
right questions when optimizing your runs for a new system.

## Conclusion

In parallel applications, affinity involves placement, order, and binding. Setting
affinity is a critical piece of the optimization puzzle for hybrid applications on the
complex hardware architectures of today. Choosing the right binding, placement, and
order can help improve achieved memory bandwidth, improve the achieved bandwidth of
data transfers between host and device, optimize communication, and avoid excessive
thread or process migration. To achieve proper affinity for a given application, we
need to know the hardware topology. Understanding the performance limiters of the
application can help design the best strategy for using the available resources.
Knowing the communication pattern between processes can guide their placement. We also
need to know how to control placement for the processes and threads of our application.
The tools to understand system topology and techniques for setting affinity will be
discussed in [Part 2](./Affinity_Part2.md) of the Affinity blog series.

### References

- [Frontier, the first exascale computer](https://www.olcf.ornl.gov/frontier/)
- [Frontier User Guide, Oak Ridge Leadership Compute Facility, Oak Ridge National Laboratory (ORNL)](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#id2)
- Parallel and High Performance Computing, Robert Robey and Yuliana Zamora, Manning
  Publications, May 2021
- [OpenMP&reg; Specification](https://www.openmp.org/)
- [MPICH](https://www.mpich.org/)
- [OpenMPI](https://www.open-mpi.org/)
- [Slurm](https://slurm.schedmd.com/)
- Performance Analysis of CP2K Code for Ab Initio Molecular Dynamics on CPUs and GPUs,
  Dewi Yokelson, Nikolay V. Tkachenko, Robert Robey, Ying Wai Li, and Pavel A. Dub,
  *Journal of Chemical Information and Modeling 2022 62 (10)*, 2378-2386, DOI:
  10.1021/acs.jcim.1c01538

### Disclaimers

The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture
Review Board.

HPE is a registered trademark of Hewlett Packard Enterprise Company and/or its
affiliates.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

### Acknowledgements

We thank Bill Brantley and Leopold Grinberg for their guidance and feedback.
