GPU kernels for E8 nearest-point (Triton)

The E8 nearest-point computation currently runs on the CPU and takes about 60 s for 3500 tokens on Mistral-7B, which makes it the latency bottleneck. The algorithm is embarrassingly parallel, since each 8-dimensional group is decoded independently, so it maps naturally onto a Triton kernel. Target: under 100 ms for the full compression pipeline on an A100.
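
To make the per-group independence concrete, here is a minimal batched reference in NumPy. It is a sketch rather than the project's actual implementation: it assumes the standard construction E8 = D8 ∪ (D8 + 1/2), decoded by rounding in both cosets and keeping the closer candidate, and the names `nearest_d8` and `nearest_e8` are hypothetical. A reference like this could also serve as a correctness check for a future GPU kernel.

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 (integer vectors with even coordinate sum), per row.

    Assumes x has shape (n, 8).
    """
    r = np.rint(x)
    err = x - r
    # Rows whose rounded coordinates sum to an odd number need one fix-up.
    odd = (r.sum(axis=-1).astype(np.int64) % 2) != 0
    # Fix-up: re-round the coordinate with the largest rounding error the other way.
    k = np.argmax(np.abs(err), axis=-1)
    rows = np.arange(x.shape[0])
    step = np.where(err[rows, k] >= 0, 1.0, -1.0)
    r[rows, k] += np.where(odd, step, 0.0)
    return r

def nearest_e8(x):
    """Nearest E8 point for every 8-dim group (assumed decoder, see lead-in)."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, 8)
    a = nearest_d8(x)              # candidate in the integer coset D8
    b = nearest_d8(x - 0.5) + 0.5  # candidate in the half-integer coset D8 + 1/2
    da = ((x - a) ** 2).sum(axis=-1)
    db = ((x - b) ** 2).sum(axis=-1)
    return np.where((da <= db)[:, None], a, b)

# Example: quantize 4 random groups of 8 values each.
groups = np.random.default_rng(0).normal(size=(4, 8))
quantized = nearest_e8(groups)
```

Every operation above acts on one row at a time, with no communication between rows, so the same steps could in principle be assigned to independent program instances in a Triton kernel, each handling a block of 8-dimensional groups.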