# ⚡️ A100 CUDA 12.8 — GRPO Training Stack (PyTorch + TRL + vLLM + FlashAttention + W&B)
This guide assumes **Ubuntu 22.04+**, a working **CUDA 12.8** driver/toolkit, and an **A100** GPU.
The same steps work on H100 & Blackwell cards; adjust compute capability flags if you build from source.
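If you do need to adjust those flags, the compute capability can be read straight from the driver (an A100 reports 8.0; the `compute_cap` query field requires a reasonably recent driver), and `TORCH_CUDA_ARCH_LIST` is the usual knob for PyTorch-ecosystem source builds:

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. before a source build: export TORCH_CUDA_ARCH_LIST="8.0"
```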
---
## 0 Prerequisites
```bash
# Sanity-check that Ubuntu sees the GPU & CUDA 12.8
nvidia-smi      # should list the A100 + a driver built against CUDA 12.8
nvcc --version  # CUDA compilation tools, release 12.8
```

---

## 1 System packages

```bash
sudo apt update && sudo apt -y full-upgrade

# Core build chain + Python headers (needed for PyTorch / FlashAttention wheels)
sudo apt install -y build-essential pkg-config git curl wget ca-certificates \
    libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev \
    liblzma-dev ninja-build

# Python headers + venv support
sudo apt install -y python3-dev python3.12-dev python3.12-venv
```

We already have Python 3.12.3 system-wide, but a tool like pyenv lets us isolate project envs and pin micro-versions as needed.
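If you do want pyenv for micro-version pinning, a minimal sketch (the installer script is pyenv's own; you still need to add the init lines it prints to your shell rc):

```bash
curl -fsSL https://pyenv.run | bash   # installs pyenv under ~/.pyenv
# add the printed init lines to ~/.bashrc, then restart the shell
pyenv install 3.12.3
pyenv local 3.12.3                    # writes .python-version for this repo
```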
---

## 2 Python virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip wheel setuptools
```

---

## 3 PyTorch nightly (cu128)

```bash
pip install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu128  # cu128 wheels
```

Official cu128 wheels are distributed via the PyTorch nightly index.
Verify that the wheels see the GPU:

```bash
python - <<'PY'
import torch
print("PyTorch", torch.__version__, "CUDA", torch.version.cuda)
print("GPU OK:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
PY
```

---

## 4 TRL + Hugging Face stack

```bash
pip install trl transformers datasets evaluate accelerate
```

(The current TRL 0.19.x line supports GRPO, DPO, reward modeling, and more.)
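Optional: confirm which versions landed in the venv:

```bash
python -c "import trl, transformers; print('TRL', trl.__version__, '| Transformers', transformers.__version__)"
```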
---

## 5 vLLM

```bash
# Pull the wheel compiled against cu128 and the PyTorch you already installed
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
```

vLLM ships wheels built with CUDA 12.8, so no manual compile is needed.
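A quick import smoke test; note that current TRL releases can also offload GRPO rollout generation to vLLM (via `use_vllm=True` in `GRPOConfig`, but check the docs for your version):

```bash
python -c "import vllm; print('vLLM', vllm.__version__)"
```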
---

## 6 FlashAttention

Flash‑Attn requires a short source build to link against your local CUDA toolkit (2.8+ targets CUDA ≥ 12.3, with 12.8 recommended for best perf):

```bash
pip install flash-attn --no-build-isolation  # fastest path
# or, if you need FlashAttention-3 for Hopper/H100:
# git clone https://github.com/Dao-AILab/flash-attention.git && cd flash-attention
# python setup.py install
```
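Once the build finishes, a quick import check; to actually use it, opt in per model with Transformers' `attn_implementation` argument:

```bash
python - <<'PY'
import flash_attn
print("flash-attn", flash_attn.__version__)

# Usage sketch: opt in at load time, e.g.
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-1.3b", attn_implementation="flash_attention_2",
# )
PY
```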
---

## 7 Liger Kernel

Then let's add Liger Kernel for some extra speed and memory savings:

```bash
pip install liger-kernel
```
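Liger's fused kernels can be switched on through the Hugging Face trainer stack; a minimal sketch, assuming a Transformers version that exposes the `use_liger_kernel` flag and a Liger-supported model architecture:

```bash
python - <<'PY'
# Sketch: TRL configs inherit from transformers.TrainingArguments,
# so the use_liger_kernel flag should flow through to GRPO runs too.
from trl import GRPOConfig

args = GRPOConfig(output_dir="out", use_liger_kernel=True)
print("Liger enabled:", args.use_liger_kernel)
PY
```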
---

## 8 Weights & Biases

```bash
pip install wandb
wandb login  # paste your API key
```

Lightweight install, no GPU code.
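You can also route runs to the right place up front via environment variables (the values below are examples, not defaults):

```bash
export WANDB_PROJECT="grpo-a100"   # hypothetical project name
export WANDB_ENTITY="your-team"    # your W&B username or team
```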
---

## 9 Data tooling

```bash
pip install pandas pyarrow numpy matplotlib seaborn tqdm rich
```

These cover Parquet IO (pyarrow), general data munging (pandas), and quick exploratory plots (matplotlib, seaborn) without clashing with the deep-learning stack.
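A quick Parquet round-trip to confirm pandas and pyarrow work together (the file path is arbitrary):

```bash
python - <<'PY'
import pandas as pd

df = pd.DataFrame({"prompt": ["hello", "world"], "score": [0.1, 0.9]})
df.to_parquet("/tmp/sanity.parquet")           # pyarrow engine by default
print(pd.read_parquet("/tmp/sanity.parquet"))  # should round-trip cleanly
PY
```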
---

## 10 Launch a GRPO run

🎉 You’re ready to launch GRPO runs using TRL. Want to know more? https://huggingface.co/docs/trl/main/en/grpo_trainer
A GRPO run is driven by a short Python script built around `GRPOTrainer` (see the sketch below). For distributed multi-GPU launches on the A100, wrap the same script with `accelerate launch` or a DeepSpeed config as needed.
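A minimal sketch, assuming the public `trl-lib/tldr` prompt dataset and a toy length-based reward; `facebook/opt-1.3b`, the logging cadence, and the W&B reporting mirror the settings used in this guide, and the reward function is a placeholder you would swap for your own callable or a reward-model ID:

```bash
python - <<'PY'
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt-style dataset works; trl-lib/tldr is a small public example.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 20 characters.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="facebook/opt-1.3b",
    reward_funcs=reward_len,        # or a reward-model ID string
    args=GRPOConfig(
        output_dir="opt-1.3b-grpo",
        logging_steps=10,
        report_to="wandb",
    ),
    train_dataset=dataset,
)
trainer.train()
PY
```

Save the heredoc body as e.g. `train_grpo.py` and run `accelerate launch train_grpo.py` once you scale past one GPU.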
Happy training!
---
**Anything missing?**
If you need Triton, cuDNN, or additional monitoring (e.g., `gpustat`, `nvtop`), just add them under a new bullet—this scaffold is meant to stay minimal and conflict‑free.
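For example, a monitoring add-on could look like this (package names as published on apt and PyPI):

```bash
sudo apt install -y nvtop   # interactive GPU process viewer
pip install gpustat
gpustat -i                  # watch mode, refreshes periodically
```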