GPU compression (NVIDIA nvCOMP)

Try with GPU compression (NVIDIA nvCOMP)

# Requires NVIDIA Container Toolkit + a CUDA-capable GPU
docker compose -f docker-compose.gpu.yml up -d
aws --endpoint-url http://localhost:8014 s3 cp parquet-file.parq s3://demo/

See docker-compose.gpu.yml for details.

GPU small-PUT batching (`--gpu-batch-small-puts`, opt-in)

Per-object GPU compression below --gpu-min-bytes (default 1 MiB) loses to CPU because each call pays a fixed kernel-launch + PCIe round-trip. --gpu-batch-small-puts (v1.1, off by default; requires the nvcomp-gpu build + a CUDA GPU at boot, refuses to start otherwise) coalesces concurrent small PUTs into a single nvcompBatchedZstd kernel launch:

s4 --endpoint-url ... --gpu-batch-small-puts \
   --gpu-batch-max-items 32 \      # flush at 32 pending bodies (default)
   --gpu-batch-window-ms 4 \       # ...or after 4 ms, whichever first (default)
   --gpu-batch-floor-bytes 4096    # bodies below 4 KiB stay on cpu-zstd (default)

Eligibility: dispatcher picked cpu-zstd, no --zstd-dict match, declared Content-Length in [--gpu-batch-floor-bytes, --gpu-min-bytes). Stored objects are standard nvcomp-zstd bodies — wire-format identical to the per-object GPU path; the GET path has zero batch awareness. Any decline (queue full, GPU error, batched output not smaller than the input) falls back to the unchanged cpu-zstd path; watch the split via the s4_gpu_batch_total{result="batched"|"fallback"} counter.

Trade-offs, measured on 1000 × 8 KiB log-like objects (RTX 4070 Ti SUPER + Ryzen 9 9950X, nvCOMP 5.2.0.10, cargo bench -p s4-codec --features nvcomp-gpu --bench gpu_small_batch, 2026-06-11):

Path	Wall time	Objects/s	Total compressed
cpu-zstd-3, sequential	15.7–19.5 ms	~52–64k	735,396 B (11.14×)
nvcomp-zstd per-object	702–707 ms	~1.4k	665,375 B (12.31×)
nvcomp-zstd batched (32/launch)	29.7–29.9 ms	~33.5k	665,375 B (12.31×)

Honest read: batching makes small-object GPU compression ~24× faster than per-object GPU and yields ~10% smaller output than cpu-zstd-3, but a single CPU core still finishes this 8 KiB workload ~1.5–1.9× sooner in wall time on this hardware. Enable the flag when (a) ingest CPU is the bottleneck and you want to offload small-object compression to an otherwise-idle GPU, or (b) the extra compression ratio matters at fleet scale; skip it if raw single-node PUT latency is what you optimise — each batched PUT also waits up to --gpu-batch-window-ms for its batch to fill.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU compression (NVIDIA nvCOMP)

Try with GPU compression (NVIDIA nvCOMP)

GPU small-PUT batching (`--gpu-batch-small-puts`, opt-in)

FilesExpand file tree

gpu.md

Latest commit

History

gpu.md

File metadata and controls

GPU compression (NVIDIA nvCOMP)

Try with GPU compression (NVIDIA nvCOMP)

GPU small-PUT batching (--gpu-batch-small-puts, opt-in)

GPU small-PUT batching (`--gpu-batch-small-puts`, opt-in)