diff --git a/.gitignore b/.gitignore
index 6d34b8d11..57402798a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -126,7 +126,7 @@ venv.bak/
 
 # mkdocs documentation
 /site
-docs/.cache/
+docs/.cache/*
 
 # mypy
 .mypy_cache/
diff --git a/docs/index.md b/docs/index.md
index a1504302a..8d63a85e5 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -20,7 +20,7 @@ Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project
 
 !!! info "Support for multiple modifiers in oneshot compression runs"
     LLM Compressor now supports using multiple modifiers in oneshot compression runs such as applying both AWQ and GPTQ in a single model.
-    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](examples/quantization_non_uniform/) for more detail and example usage.
+    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](/examples/quantization_non_uniform/) for more detail and example usage.
 
 !!! info "Quantization and calibration support for Qwen3 models"
     Quantization and calibration support for Qwen3 Next models has been added to LLM Compressor.
@@ -39,16 +39,16 @@ Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project
 ## Recent Updates
 
 !!! info "QuIP and SpinQuant-style Transforms"
-    The newly added [`QuIPModifier`](../examples/transform/quip_example.py) and [`SpinQuantModifier`](../examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
+    The newly added [`QuIPModifier` and `SpinQuantModifier`](/examples/transform) transforms allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
 
 !!! info "DeepSeekV3-style Block Quantization Support"
-    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](../examples/quantization_w8a8_fp8/fp8_block_example.py).
+    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](/examples/quantization_w8a8_fp8/).
 
 !!! info "FP4 Quantization - now with MoE and non-uniform support"
-    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](../examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](../examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](../examples/quantization_non_uniform/README.md) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
+    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](/examples/quantization_w4a4_fp4/), [MoE support](/examples/quantization_w4a4_fp4/), and [Non-uniform quantization support](/examples/quantization_non_uniform/) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
 
 !!! info "Llama4 Quantization Support"
-    Quantize a Llama4 model to [W4A16](../examples/quantization_w4a16) or [NVFP4](../examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
+    Quantize a Llama4 model to [W4A16](/examples/quantization_w4a16/) or [NVFP4](/examples/quantization_w4a4_fp4/). The checkpoint produced can seamlessly run in vLLM.
 
 For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).
 
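For context on the "multiple modifiers in oneshot compression runs" note touched by this diff, below is a minimal sketch of what such a run can look like. It uses the public `oneshot`, `AWQModifier`, and `GPTQModifier` entry points from llmcompressor, but the model, dataset, layer targets, and schemes are illustrative assumptions rather than the maintained recipe; see `examples/quantization_non_uniform/` in the repository for the authoritative example.

```python
# Sketch of a multi-modifier oneshot run (AWQ + GPTQ in one model).
# Layer targets, schemes, model, and dataset below are illustrative
# assumptions; consult examples/quantization_non_uniform/ for the
# maintained recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Each modifier is restricted to a disjoint set of layers so the two
# quantization schemes do not overlap: AWQ on the MLP projections,
# GPTQ on the attention projections.
recipe = [
    AWQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
    GPTQModifier(targets=["re:.*self_attn.*"], scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",  # built-in calibration dataset name
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in compressed-tensors format for loading in vLLM.
model.save_pretrained("TinyLlama-1.1B-nonuniform", save_compressed=True)
tokenizer.save_pretrained("TinyLlama-1.1B-nonuniform")
```

The resulting checkpoint is saved in compressed-tensors format, which is what allows it to be loaded directly in vLLM as the updated docs describe.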