# Affinity part 1 - Affinity, placement, and order

Modern hardware architectures are increasingly complex, with multiple sockets,
many cores in each Central Processing Unit (CPU), Graphics Processing Units
(GPUs), memory controllers, Network Interface Cards (NICs), and more. Peripherals such
as GPUs or memory controllers are often local to a particular CPU socket. Such designs
present interesting challenges for optimizing memory access times, data transfer times,
and so on. Depending on how the system is built, how its hardware components are
connected, and the workload being run, it may be advantageous to use the resources of
the system in a specific way. In this article, we discuss the role of affinity,
placement, and order in improving performance for High Performance Computing (HPC)
workloads. A short case study is also presented to familiarize you with performance
considerations on a node of the [Frontier](https://www.olcf.ornl.gov/frontier/)
supercomputer. In a [follow-up article](./Affinity_Part2.md), we aim to equip you with
the tools you need to understand your system's hardware topology and set up affinity
for your application accordingly.

## A brief introduction to NUMA systems

In Non-Uniform Memory Access (NUMA) systems, resources are logically partitioned into
multiple *domains* or *nodes*. Even though every processor core can read or write any
memory on the system, each core has *local* memory attached to it and *non-local* or
*remote* memory that is attached to other cores or shared with other processors.
Accessing data in *local* memory is faster than accessing data in *remote* memory, and
latency is especially high for accesses that cross the socket-to-socket interconnect.
Local accesses also reduce memory contention among CPU cores, which increases the
achievable bandwidth. Therefore, on such systems, it is important to spread processes
and their threads across the NUMA domains so that all resources of the system are used
uniformly.

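To make the idea of *local* memory concrete, the sketch below allocates a buffer
directly on a chosen NUMA domain using libnuma instead of relying on first-touch
placement. This is only a minimal illustration, assuming libnuma and its headers are
installed (link with `-lnuma`); the domain number and buffer size are arbitrary.

```cpp
// Minimal sketch: place a buffer on a specific NUMA domain with libnuma.
// Build with: g++ alloc_onnode.cpp -lnuma   (file name is illustrative)
#include <cstdio>
#include <cstring>
#include <numa.h>   // libnuma: numa_available, numa_alloc_onnode, numa_free

int main() {
    if (numa_available() < 0) {
        std::printf("libnuma reports NUMA is not available\n");
        return 1;
    }
    const size_t bytes = 64UL * 1024 * 1024;   // 64 MB, illustrative size
    const int domain   = 0;                    // arbitrary domain for this example

    // Allocate pages on the chosen domain instead of first-touch placement
    void* buf = numa_alloc_onnode(bytes, domain);
    if (buf == nullptr) {
        std::printf("allocation on domain %d failed\n", domain);
        return 1;
    }
    std::memset(buf, 0, bytes);                // touch the pages so they are backed
    std::printf("placed %zu bytes on NUMA domain %d\n", bytes, domain);
    numa_free(buf, bytes);
    return 0;
}
```
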
NUMA systems can be configured with multiple *domains* per socket. The NUMA domains Per
Socket (NPS) configuration is set at boot time, typically by the administrators of
large compute clusters. On dual-socket nodes, for instance, it is common to find NPS1
or NPS4 configurations, where each socket is set up with 1 or 4 NUMA domains
respectively. All the memory controllers, processor cores, NICs, GPUs, and similar
resources are partitioned among the NUMA domains based on how they are physically
connected to each other.

Consider a dual-socket node with 16 memory channels in total. In the NPS1 case, there
is one NUMA domain per socket, each with 8 memory channels, and memory accesses are
interleaved across all 8 channels, resulting in uniform bandwidth. In contrast, in an
NPS4 configuration, each of the 4 NUMA domains in a socket interleaves memory accesses
across only 2 memory channels. This reduced contention can increase the achieved memory
bandwidth, provided the processes are spread across the various NUMA domains.

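The number of NUMA domains a job actually sees therefore depends on the NPS mode
chosen at boot. Below is a minimal sketch for checking this from within a program using
the libnuma API (link with `-lnuma`); the follow-up article covers dedicated tools for
inspecting the topology in more detail.

```cpp
// Minimal sketch: ask the OS how many NUMA domains and logical CPUs it exposes,
// and which domain the calling process is currently running in.
// Build with: g++ numa_query.cpp -lnuma   (file name is illustrative)
#include <cstdio>
#include <numa.h>    // libnuma
#include <sched.h>   // sched_getcpu (g++ defines _GNU_SOURCE by default)

int main() {
    if (numa_available() < 0) {
        std::printf("libnuma reports NUMA is not available\n");
        return 1;
    }
    std::printf("NUMA domains : %d\n", numa_num_configured_nodes());
    std::printf("logical CPUs : %d\n", numa_num_configured_cpus());

    int cpu = sched_getcpu();   // hardware thread we are running on right now
    std::printf("currently on CPU %d in NUMA domain %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```
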
## Affinity, placement and order - introduction and motivation

Scheduling processes and their threads on processor cores is controlled by the Operating
System (OS). The OS manages preemption of processes when resources are scarce and shared
among many processes. When the OS decides to reschedule a process, it may choose a new
processor core for it. In that case, any cached data has to be moved to the caches
closer to the new core, which increases latency and lowers performance for the workload.
The OS is also unaware that the processes or threads of a parallel job belong together:
in a multi-process job, such process movement and the associated data movement may cause
all the other processes to wait longer at synchronization barriers. OS schedulers
therefore need assistance from the software developer to manage CPU and GPU affinity
efficiently.

### Affinity

Affinity is a way for processes to indicate a preference for specific hardware
components, so that a given process is always scheduled on the same set of compute cores
and can access data in *local* memory efficiently. Processes are typically pinned to
resources belonging to the same NUMA domain. Setting affinity improves cache reuse and
NUMA memory locality, reduces contention for resources, lowers latency, and reduces
run-to-run variability. Affinity is extremely important for processes running on CPU
cores and for the resulting placement of their data in CPU memory. On systems with both
CPUs and GPUs, affinity is less critical unless the location of data in host memory
becomes a bottleneck: if data in host memory is not in the same NUMA domain as the GPU,
then memory copies between host and device, page migration, and direct memory access
may be affected.

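At the OS level, binding a process to a set of cores looks roughly like the sketch
below. In practice this is usually delegated to the job launcher, the MPI library, or
the OpenMP runtime rather than coded by hand; the core numbers used here are purely
illustrative.

```cpp
// Sketch: bind the calling process to cores 0-7 (illustrative core IDs, e.g. one
// NUMA domain). Real jobs usually let srun/mpirun or the OpenMP runtime do this,
// but the underlying mechanism is the Linux affinity mask shown here.
#include <cstdio>
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity, sched_getcpu

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int core = 0; core < 8; ++core) {
        CPU_SET(core, &mask);
    }
    // pid 0 means "the calling process"; from now on the OS will only schedule
    // this process (and threads that inherit the mask) on cores 0-7
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pinned to cores 0-7, currently on core %d\n", sched_getcpu());
    return 0;
}
```
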
For parallel processes, affinity is more than binding; we also have to pay attention to
placement and order. Let us look at these two ideas more closely.

### Placement

Placement indicates where the processes of a job are placed. Our goal is to maximize the
resources available to our workload, and what that takes differs from one workload to
another. Consider a few scenarios that illustrate this point:

- We may want to use all resources such as CPU cores, caches, GPUs, NICs, memory
controllers, etc.
- If processes have multiple threads (OpenMP®), we may require each thread to run on
a separate CPU core
- In some cases, to avoid thrashing of caches, we may want to use only one Hardware
Thread (HWT) per physical core
- In cases where there is not enough memory per process, we may want to skip some CPU
cores
- We may want to reserve some cores for system operations, such as servicing GPU
interrupts, to reduce jitter for timing purposes
- Message Passing Interface (MPI) jobs benefit from "gang scheduling", but the OS does
not know that the processes belong together

On today's hardware, controlling placement helps avoid oversubscribing compute resources
and the unnecessary contention that comes with it. Proper placement also helps avoid
non-uniform use of compute resources, where some resources are overloaded while others
sit idle. Placing processes too far apart can result in sub-optimal communication
performance. And most importantly, process placement lets us prevent the operating
system from migrating processes. We must note that affinity controls in the OS and in
MPI implementations have greatly improved and changed over the years.

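One practical way to confirm that placement took effect is to have each thread report
where it is actually running. The sketch below uses standard OpenMP runtime calls; the
output depends on the `OMP_PLACES` and `OMP_PROC_BIND` settings and on any binding
applied by the launcher.

```cpp
// Sketch: each OpenMP thread reports the hardware thread and OpenMP place it
// landed on. Build with -fopenmp; results depend on OMP_PLACES / OMP_PROC_BIND.
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu

int main() {
    #pragma omp parallel
    {
        int tid   = omp_get_thread_num();
        int cpu   = sched_getcpu();          // hardware thread executing this thread
        int place = omp_get_place_num();     // index into OMP_PLACES, or -1 if unbound
        #pragma omp critical
        std::printf("thread %2d on cpu %3d (place %d of %d)\n",
                    tid, cpu, place, omp_get_num_places());
    }
    return 0;
}
```
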
### Order

Order defines how the processes of a parallel job are distributed across the sockets of
a node. There are many ways to order processes, and we can choose the right one for our
application if we understand its communication pattern. For instance, if processes that
communicate with each other are placed close together, perhaps on the same socket, we
lower communication latency. A heavy workload, on the other hand, may be better balanced
if it is scattered across all available compute resources.

In many job scheduling systems, the default ordering mechanism is *Round-Robin* or
*Cyclic*, where processes are distributed in a round-robin fashion across sockets as
shown in the figure below. In this example, 8 MPI ranks are scheduled across two
4-core sockets. Cyclic ordering helps maximize the cache available to each process and
evenly utilizes the resources of a node.

Another commonly used ordering mechanism is called *Packed* or *Close*, where
consecutive MPI ranks are assigned to processors in the same socket until it is filled
before any rank is scheduled on the next socket. Packed ordering is illustrated in the
figure below for the same case of 8 MPI ranks scheduled across two sockets. Packing
processes closely can improve performance through data locality if the ranks that
communicate the most access data in the same memory node and share caches.

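To make the two orderings concrete, here is a small sketch that computes where each of
the 8 ranks lands in the two-socket, 4-cores-per-socket example above. The core
numbering (cores 0-3 on socket 0, cores 4-7 on socket 1) is an assumption made purely
for illustration.

```cpp
// Sketch: rank-to-core mapping for 8 ranks on 2 sockets x 4 cores each, under
// cyclic (round-robin across sockets) and packed (fill one socket first) order.
// Assumes cores 0-3 sit on socket 0 and cores 4-7 on socket 1.
#include <cstdio>

int main() {
    const int ranks = 8, sockets = 2, cores_per_socket = 4;
    std::printf("rank   cyclic (socket, core)   packed (socket, core)\n");
    for (int r = 0; r < ranks; ++r) {
        // Cyclic: alternate sockets, advancing to the next core on every pass
        int cyc_socket = r % sockets;
        int cyc_core   = cyc_socket * cores_per_socket + r / sockets;
        // Packed: fill all cores of socket 0 before moving on to socket 1
        int pak_socket = r / cores_per_socket;
        int pak_core   = r;
        std::printf("%4d        (%d, %d)                  (%d, %d)\n",
                    r, cyc_socket, cyc_core, pak_socket, pak_core);
    }
    return 0;
}
```
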
Choosing rank order carefully helps optimize communication. We know that intra-node
communication is faster than inter-node communication, and the application or domain
expert may know the best placement for the application at hand. For example, stencil
near-neighbors are best placed next to each other. Tools such as HPE's CrayPat profiler
or the `grid_order` utility can be used to detect the communication pattern between MPI
ranks and generate an optimal rank order in a file that can then be supplied to Cray
MPICH when running the workload. Slurm binding options may also be available at large
computing sites.

## Case study: Placement considerations on a Frontier node

Oak Ridge National Laboratory (ORNL)'s
[Frontier supercomputer](https://www.olcf.ornl.gov/frontier/)
is a system based on HPE Cray's EX architecture with optimized 3rd Gen AMD EPYC™
CPUs and AMD Instinct™ MI250X GPUs. In the figure depicting the topology of a
Frontier node below, we see that the 64-core CPU is connected to 4 MI250X GPUs via
high-speed Infinity Fabric™ links. We also observe that each MI250X GPU consists
of two Graphics Compute Dies (GCDs), each with 64 GB of High Bandwidth Memory (HBM).
The CPU is connected to 512 GB of DDR4 memory. The two GCDs within each GPU are joined
by four Infinity Fabric™ links, while GCDs belonging to different GPUs are also
connected via Infinity Fabric™ links, but fewer of them. There are 4 NICs connected
directly to the odd-numbered GCDs. Lastly, the CPU is configured in NPS4 mode, so every
16 cores belong to one NUMA domain. Simultaneous Multi-Threading (SMT) is enabled, so
there are two HWTs per physical core.

On this complex architecture, it is important to choose rank order and placement
carefully to optimize communication. Let us look at a few aspects of this architecture
and attempt to prescribe best practices for each.

### Consideration 1 - Each GCD is connected to 8 CPU cores in a NUMA domain

In the simplified figure below, we see that each GCD is connected to 8 CPU cores, and
that these belong to the same NUMA domain. For instance, CPU cores 0-7 are closest to
GCD 4 and CPU cores 48-55 are closest to GCD 0. Therefore, pinning a process and its
threads to the cores closest to the GCD it uses improves the efficiency of
Host-to-Device (H2D) and Device-to-Host (D2H) transfers.

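Below is a sketch of how an application might act on this: each node-local MPI rank
selects a GCD and pins itself to the matching 8-core group. The lookup tables are
illustrative; only the GCD 4 / cores 0-7 and GCD 0 / cores 48-55 pairings come from the
figure, so verify the full mapping on your own system (Part 2 discusses how) before
relying on it. The sketch assumes one rank per GCD and is compiled with `hipcc` and the
MPI wrappers.

```cpp
// Sketch: pick a GCD and a matching CPU-core range from the node-local MPI rank.
// The two tables are illustrative; only the GCD 4 <-> cores 0-7 and
// GCD 0 <-> cores 48-55 pairings are taken from the figure above.
#include <cstdio>
#include <mpi.h>
#include <sched.h>
#include <hip/hip_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Rank within the node (assumes exactly 8 ranks per node, one per GCD)
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    const int gcd_for_rank[8] = {4, 5, 2, 3, 6, 7, 0, 1};        // assumed mapping
    const int first_core[8]   = {0, 8, 16, 24, 32, 40, 48, 56};  // assumed mapping

    hipSetDevice(gcd_for_rank[local_rank]);      // use the GCD nearest to our cores

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < 8; ++c) {                // the 8 cores of that NUMA domain
        CPU_SET(first_core[local_rank] + c, &mask);
    }
    sched_setaffinity(0, sizeof(mask), &mask);

    std::printf("local rank %d -> GCD %d, cores %d-%d\n", local_rank,
                gcd_for_rank[local_rank], first_core[local_rank],
                first_core[local_rank] + 7);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```
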
### Consideration 2 - Memory bandwidth is highest between GCDs of the same MI250X GPU

As seen in the figure below, there are four Infinity Fabric™ links between the two
GCDs of an MI250X GPU, for a combined peak bandwidth of 200 GB/s in each direction. We
can take advantage of this to reduce communication latency by placing pairs of ranks
that communicate the most on the two GCDs of the same GPU. Note that even though the
bandwidth differs between different pairs of GCDs, communication using device buffers
will be at least as fast as communication using host buffers.

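Communicating directly from device buffers looks like the sketch below, where a device
allocation is handed straight to MPI. This assumes the MPI library is GPU-aware and
that its GPU support is enabled as documented for your system; it would be compiled
with `hipcc` and the MPI wrappers, and error handling is omitted for brevity.

```cpp
// Sketch: rank 0 sends a message to rank 1 directly from GCD memory (HBM).
// Requires a GPU-aware MPI build so that device pointers can be passed to MPI.
#include <cstdio>
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       // ~8 MB of doubles, illustrative size
    double* d_buf = nullptr;
    hipMalloc(&d_buf, n * sizeof(double));       // buffer lives in device HBM
    hipMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0) {
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    // device pointer
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d doubles via device buffers\n", n);
    }

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}
```
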
### Consideration 3 - NICs are attached to odd-numbered GCDs

In the figure below, we see that the four NICs on a Frontier node are directly connected
to the odd-numbered GCDs. Hence, inter-node MPI communication using device buffers
(GPU-aware MPI) is expected to be faster. HPE Cray's MPI implementation, for instance,
provides environment variables to pick the ideal mapping between a process and the
default NIC; you can find more information about this using `man mpi` on Cray systems.

### Consideration 4 - Multiple processes on the same GCD

AMD GPUs natively support running multiple MPI ranks on the same device, with the
processes sharing the available resources and thereby improving utilization. Depending
on the application's communication pattern, packing the ranks that communicate the most
onto the same device can improve performance. In the figure shown below, 4 MPI ranks run
on GCD 4, and these 4 ranks are pinned to CPU cores 0, 2, 4 and 6 respectively.

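A sketch of that scenario is shown below: four node-local ranks all select GCD 4 while
each binds to its own core with a stride of two, mirroring the figure. The device ID
and the core stride are illustrative, and error handling is omitted for brevity.

```cpp
// Sketch: four node-local ranks share one GCD while each binds to its own core
// (cores 0, 2, 4, 6), as in the figure above. Device 4 and the stride are
// illustrative; compile with hipcc and the MPI wrappers.
#include <cstdio>
#include <mpi.h>
#include <sched.h>
#include <hip/hip_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);       // expected to be 0..3 in this example

    hipSetDevice(4);                             // all four ranks target the same GCD

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2 * local_rank, &mask);              // ranks 0..3 -> cores 0, 2, 4, 6
    sched_setaffinity(0, sizeof(mask), &mask);

    std::printf("local rank %d sharing GCD 4 from core %d\n",
                local_rank, 2 * local_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```
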
In this case study, we examined the topology of a Frontier node, which helped us
understand how we may want to bind, place, and order processes when running our
workloads. A similar analysis is worthwhile on any system you use in order to extract
a little more performance from your jobs. We hope these ideas help you ask the right
questions when optimizing your runs for a new system.

## Conclusion

In parallel applications, affinity involves placement, order and binding. Setting
affinity is a critical piece of the optimization puzzle for hybrid applications on
today's complex hardware architectures. Choosing the right binding, placement and order
can improve achieved memory bandwidth, improve the achieved bandwidth of data transfers
between host and device, optimize communication, and avoid excessive thread or process
migration. To achieve proper affinity for a given application, we need to know the
hardware topology. Understanding the performance limiters of the application helps in
designing the best strategy for using the available resources, and knowing the
communication pattern between processes can guide their placement. We also need to know
how to control placement for the processes and threads of our application. The tools
for understanding system topology and the techniques for setting affinity will be
discussed in [Part 2](./Affinity_Part2.md) of the Affinity blog series.

### References

- [Frontier, the first exascale computer](https://www.olcf.ornl.gov/frontier/)
- [Frontier User Guide, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory (ORNL)](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#id2)
- Parallel and High Performance Computing, Robert Robey and Yuliana Zamora, Manning
Publications, May 2021
- [OpenMP® Specification](https://www.openmp.org/)
- [MPICH](https://www.mpich.org/)
- [OpenMPI](https://www.open-mpi.org/)
- [Slurm](https://slurm.schedmd.com/)
- Performance Analysis of CP2K Code for Ab Initio Molecular Dynamics on CPUs and GPUs,
Dewi Yokelson, Nikolay V. Tkachenko, Robert Robey, Ying Wai Li, and Pavel A. Dub,
*Journal of Chemical Information and Modeling 2022 62 (10)*, 2378-2386, DOI:
10.1021/acs.jcim.1c01538

### Disclaimers

The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture
Review Board.

HPE is a registered trademark of Hewlett Packard Enterprise Company and/or its
affiliates.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

### Acknowledgements

We thank Bill Brantley and Leopold Grinberg for their guidance and feedback.