- GPUs process multiple data streams simultaneously
- They have lower clock speeds than CPUs
- They have higher memory bandwidth than CPUs (optimized for throughput)
- SIMT - Threads are explicit to the programmer: you can read a thread's ID and reason about the thread as a unit of execution.
- SIMD - Lanes are implicit: they are built into the GPU hardware and execute threads simultaneously.
GPUs expose the SIMT programming model, while execution is implemented by SIMD hardware in the GPU's compute units/streaming multiprocessors.
-
Kernel - think C functions: a unit of code that is launched once and returns once, but is executed many times over multiple threads.
-
Thread/Work-item - simplest unit of execution: a sequence of instructions. Think of vector addition C[i] = A[i] + B[i]; each element's addition can be seen as a thread (a CUDA sketch follows the list below). For example:
- Load A[i] into Vector0 for all lanes
- Load B[i] into Vector1 for all lanes
- Add Vector0 and Vector1 and store into Vector2 for all lanes
- Store Vector2 into memory (C[i]) for all lanes
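As a concrete illustration, here is a minimal CUDA sketch of that per-element thread. The kernel name and the float element type are illustrative choices, not from these notes:

```cuda
// Each thread computes one element: C[i] = A[i] + B[i].
// SIMT in action: the thread reads its own ID via blockIdx/blockDim/threadIdx.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)  // guard threads that land past the end of the arrays
        C[i] = A[i] + B[i];
}
```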
-
Thread block/Work-group - smallest unit of thread coordination exposed to programmers.
- Composed of several threads
- Each block is assigned to an SM/CU, and an SM/CU can accommodate several blocks
- Arbitrarily sized (typically multiples of the warp size)
-
Warp/Wavefront - group of threads that are scheduled together and execute in parallel; all threads in a warp execute the same instruction at the same time
- Each thread block is composed of warps/wavefronts, which execute in lockstep
- All threads in a warp are scheduled onto a single SM/CU; a single SM/CU can typically hold and execute multiple warps/wavefronts
- Executing a warp/wavefront usually takes more than a single clock cycle
- For modern GPUs, the warp/wavefront size is typically 32 threads (AMD GCN uses 64); see the sketch after this list
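A thread can recover its warp and lane indices from its thread ID. A small CUDA sketch (the kernel name is mine; warpSize is CUDA's built-in variable):

```cuda
#include <cstdio>

__global__ void whoAmI() {
    int lane = threadIdx.x % warpSize;  // lane within the warp (0..warpSize-1)
    int warp = threadIdx.x / warpSize;  // warp index within the block
    printf("block %d thread %d -> warp %d, lane %d\n",
           blockIdx.x, threadIdx.x, warp, lane);
}
```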
-
Streaming Multiprocessor (SM)/Compute Unit (CU) - The processing heart of GPUs
- Contains the warp/wavefront scheduler and SIMD units (plus others like caches, scalar units, local data share, etc., which are not important for our purposes)
-
Kernel Launch
- Host (CPU) launches the kernel with a grid of thread blocks/work groups; each block contains many threads/work items (see the CUDA launch sketch after this list).
- GPU scheduler sends these blocks to the SMs/CUs.
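In CUDA terms, a host-side launch of the vecAdd kernel sketched earlier might look like this (sizes are illustrative, and d_A, d_B, d_C are assumed to be device buffers already allocated with cudaMalloc):

```cuda
int n = 1 << 20;                                              // element count (illustrative)
int threadsPerBlock = 256;                                    // a multiple of the warp size
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / threadsPerBlock)
vecAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);     // grid of blocks, blocks of threads
```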
-
A wavefront is dispatched/issued to each SIMD unit inside the SM/CU using a scheduling algorithm such as round-robin (a sketch follows this list)
- Within a block, threads are grouped into wavefronts (AMD GCN1 - 64 threads).
- All wavefronts in a block are guaranteed to reside in the same CU.
- The SM's/CU's scheduler can hold wavefronts from many blocks (GCN1: 40 wavefronts/CU, up to 10 wavefronts/SIMD unit). The number of wavefronts a SIMD unit can hold depends on its resource availability (occupancy).
- If a wavefront on a SIMD unit stalls (e.g., waiting for memory), the SIMD can switch to another ready wavefront from its buffer, keeping the hardware busy -- aka 'hiding memory latency'.
- At each clock cycle:
- The warp scheduler issues a warp/wavefront to one of the SIMD units.
- At most 1 instruction per wavefront may be issued.
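A simplified software model of this round-robin issue loop; this is an illustrative sketch of the behavior described above, not the actual hardware algorithm:

```cuda
#include <cstddef>
#include <vector>

struct Wavefront {
    bool done  = false;  // finished its instruction stream
    bool ready = true;   // false while stalled (e.g., waiting on memory)
};

struct SimdUnit {
    std::vector<Wavefront> waves;  // resident wavefronts (up to 10 on GCN1)
    std::size_t next = 0;          // round-robin pointer

    // Each cycle, pick the next ready wavefront, skipping stalled ones;
    // switching past stalls is what hides memory latency.
    Wavefront *select() {
        for (std::size_t i = 0; i < waves.size(); ++i) {
            Wavefront &w = waves[(next + i) % waves.size()];
            if (w.ready && !w.done) {
                next = (next + i + 1) % waves.size();
                return &w;  // at most one instruction issued for this wave
            }
        }
        return nullptr;  // everything stalled or done this cycle
    }
};
```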
-
Each SIMD unit executes the same instruction for all the threads in a wavefront in lockstep
-
Repeat until all wavefronts in the SM/CU have finished executing
GPU (top level) components:
- Global Data Memory (think DRAM)
- Global Instruction/Program Memory
- Device Control Register - stores metadata describing how the GPU should execute a launched kernel, e.g., the total thread count for the kernel (a sketch of these fields follows this list)
- Block Dispatcher - organizes threads into blocks that can be executed in parallel on a CU and dispatches these blocks to available CUs
- Blocks that can be launched - as many as needed for the kernel workload (queued if all CUs are taken up)
- Memory Controller - coordinates global memory accesses from CUs
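The fields latched into the device control register might look roughly like this (a sketch; the struct and field names are my invention):

```cuda
// Hypothetical contents of the device control register, written by the host
// at kernel launch and read by the block dispatcher.
struct DeviceControl {
    unsigned thread_count;  // total threads for the launched kernel
    unsigned block_dim;     // threads per block (%blockDim)
    unsigned num_blocks;    // ceil(thread_count / block_dim), queued across CUs
};
```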
Compute Unit (CU) components:
- Wavefront Dispatcher - dispatches wavefronts (64 threads/wave) to SIMD units
- Considers a wave from one of the SIMD units for execution, selected in a round-robin fashion between SIMDs
- Instruction Fetch - fetches instructions from memory into SIMD units
- Instruction Decoder - breaks down an instruction into opcode, source/destination registers, immediate, etc.
SIMD unit components:
- Holds up to one wavefront at a time
- Program Counter
- ALU (16 lanes)
- Load/Store Unit (16)
- Vector Register File - Registers to store data for up to 1 wavefront
- [x] Block Dispatcher
- [x] Wavefront Dispatcher
- [x] Instruction Fetcher/Decoder
- [x] Scheduler (SIMD Controller)
- [x] SIMD Unit (includes PC, ALU/LSU lanes, etc.)
- [ ] Compute Unit
- [ ] Memory, Memory Controllers
- [ ] Instruction Buffer (hard)
- [ ] Caching (hard x10)
- [ ] Wave Scheduler (why would you do this? especially in Verilog...)
- Memory/Memory Controllers are simulated in the SIMD testbench for now.
Doubleword: 64 bits
Word: 32 bits
Instruction format: Each instruction takes exactly one word.
Registers: 64 bits x 32 registers for each SIMD lane (need 5b for register addresses within each lane)
- 16 lanes/SIMD unit means 16 x 32 registers = 512 registers/SIMD unit
- 4 KB of data per SIMD unit
Program Memory: 32 bits x 64 words (up to 64 instructions, need 6b for addresses)
Data Memory: 64 bits x 128 words = 1 KB of data (need 7b for addresses)
Instructions have this format: | opcode: 6b | Rd: 7b | Rm: 7b | Rn: 7b | Other: 5b |
| Mnemonic | Instruction Operation | Opcode | Notes |
|---|---|---|---|
| LDUR | LDUR rd, rm | 000000 | Rd = global_mem[Rm] |
| STUR | STUR rn, rm | 000001 | global_mem[Rm] = Rn |
| ADD | ADD rd, rm, rn | 000010 | Rd = Rm + Rn |
| SUB | SUB rd, rm, rn | 000011 | Rd = Rm - Rn |
| MUL | MUL rd, rm, rn | 000100 | Rd = Rm * Rn |
| DIV | DIV rd, rm, rn | 000101 | Rd = Rm / Rn |
| AND | AND rd, rm, rn | 000110 | Rd = Rm bitwise_AND Rn |
| ORR | ORR rd, rm, rn | 000111 | Rd = Rm bitwise_OR Rn |
| CONST | CONST rd, imm_19 | 001000 | Rd = imm_19 (imm_19 = Rm_Rn_Other, 7b + 7b + 5b = 19b) |
| RET | thread done | 111111 | 111111 x...x |
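A quick sanity check of this format, assuming the fields are packed most-significant-first in the order listed above (that ordering is my assumption):

```cuda
#include <cstdio>
#include <cstdint>

// Pack |opcode:6|Rd:7|Rm:7|Rn:7|Other:5| into one 32-bit instruction word.
uint32_t encode(uint32_t op, uint32_t rd, uint32_t rm, uint32_t rn, uint32_t other) {
    return (op << 26) | (rd << 19) | (rm << 12) | (rn << 5) | other;
}

int main() {
    // ADD R10, R8, R9 -> opcode 000010
    uint32_t w = encode(0b000010u, 10, 8, 9, 0);
    printf("0x%08x\n", (unsigned)w);  // prints 0x08508120
    // Undo it, the way the instruction decoder would.
    printf("opcode=%u rd=%u rm=%u rn=%u other=%u\n",
           (unsigned)(w >> 26), (unsigned)((w >> 19) & 0x7f),
           (unsigned)((w >> 12) & 0x7f), (unsigned)((w >> 5) & 0x7f),
           (unsigned)(w & 0x1f));
    return 0;
}
```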
Each SIMD lane has 64 bits x 32 registers.
R0-R27: general purpose data
R28-R30: %blockIdx, %blockDim, and %threadIdx respectively
R31: zero
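These special registers mirror CUDA's built-in blockIdx, blockDim, and threadIdx variables.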
Important parameters:
%blockIdx: block's ID within the grid (0 through numberOfBlocks-1) - same for all threads in a block
%blockDim: number of threads per block - same for all blocks
%threadIdx: thread's ID within a block (0 through blockDim-1) - unique per thread within a block
Global Thread Id Calculation:
%blockIdx * %blockDim + %threadIdx
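For example, thread 5 of block 2 with 32 threads per block gets global ID 2 * 32 + 5 = 69.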
Within a block: %threadIdx = wave_id * wave_size + (warp_cycle * SIMD_width + lane_id)
.threads 32
.data 0 1 2 3 ... 31; matrix A (1 x 32)
.data 0 1 2 3 ... 31; matrix B (1 x 32)
MUL R4, %blockIdx, %blockDim
ADD R4, R4, %threadIdx ; i = blockIdx * blockDim + threadIdx
CONST R5, #0 ; baseA (matrix A base address)
CONST R6, #32 ; baseB = baseA + 32 (matrix B base address)
CONST R7, #64 ; baseC = baseB + 32 (matrix C base address)
ADD R8, R5, R4 ; addr(A[i]) = baseA + i
LDUR R8, R8 ; load A[i] from global memory
ADD R9, R6, R4 ; addr(B[i]) = baseB + i
LDUR R9, R9 ; load B[i] from global memory
ADD R10, R8, R9 ; C[i] = A[i] + B[i]
ADD R11, R7, R4 ; addr(C[i]) = baseC + i
STUR R10, R11 ; store C[i] in global memory
RET ; end of kernel
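Sanity check: since A[i] = B[i] = i here, the kernel should leave C[i] = 2i for i = 0..31 at data-memory addresses 64-95.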




