diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index ed92c896b..7666aebdd 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -1,15 +1,162 @@
 # Quickstart
 
-## How does it work?
+## What is `bitsandbytes`?
 
-... work in progress ...
+`bitsandbytes` is a lightweight, open-source library that makes it possible to train and run **very large models** on consumer GPUs or other limited hardware by using **8-bit and 4-bit quantization** techniques.
 
-(Community contributions would we very welcome!)
+👉 Put simply:
 
-## Minimal examples
+* Most deep learning models store their weights as 16-bit (`float16`) or 32-bit (`float32`) numbers.
+* `bitsandbytes` compresses those weights into 8-bit or even 4-bit representations.
+* This shrinks the **memory footprint**, often **speeds up inference**, and preserves nearly the same accuracy.
 
-The following code illustrates the steps above.
+This unlocks the ability to run models like **Llama, Mistral, Falcon, or other GPT-style LLMs** on GPUs with as little as **8–16 GB of VRAM**.
 
-```py
-code examples will soon follow
-```
+---
+
+## How does it work? (Beginner-friendly)
+
+Let’s break it down with an analogy:
+
+* Imagine you have a library of books. Each book is written in **fancy calligraphy (32-bit precision)**: beautiful, but heavy.
+* Now you rewrite the same books in **compact handwriting (8-bit)**: still readable, and much lighter to carry.
+* That is what `bitsandbytes` does for model weights: it stores the same information in a compact, lower-precision format.
+
+**Key benefits for beginners:**
+
+* ✅ **Memory savings** → Run bigger models on smaller GPUs.
+* ✅ **Speedups** → Smaller weights mean less data to move around, which often translates into faster computation.
+* ✅ **Plug-and-play** → Works with PyTorch and Hugging Face Transformers without major code changes.
+
+So, as a beginner, you don’t need to understand all the math under the hood. Just know: it makes models lighter and usually faster, while keeping accuracy close to the full-precision original.
+
+---
+
+## How does it work? (Nerd edition)
+
+Now let’s peek under the hood 🔬:
+
+* **Quantization**:
+
+  * Floating-point weights (e.g., `float32`) are mapped to lower-precision representations: 8-bit integers, or 4-bit formats such as FP4 and NF4.
+  * Scaling factors are stored alongside the quantized values so that the reduced representation loses as little information as possible.
+
+* **Custom CUDA kernels**:
+
+  * `bitsandbytes` provides hand-optimized CUDA kernels that handle low-precision matrix multiplications efficiently.
+  * These kernels apply **dynamic range scaling** to reduce quantization error.
+
+* **8-bit optimizers**:
+
+  * Optimizers such as Adam, AdamW, and RMSprop are reimplemented with 8-bit state.
+  * Instead of storing massive optimizer states in 32-bit (for Adam this usually takes *more memory than the model itself*: roughly 7B params × 2 states × 4 bytes ≈ 56 GB for a 7B model, versus about 14 GB in 8-bit), these states are stored in 8-bit with careful scaling.
+
+* **Block-wise quantization**:
+
+  * Instead of using one scale for an entire tensor, `bitsandbytes` quantizes per block (e.g., per 64 values). A single outlier then only distorts its own block, which improves accuracy significantly (a toy version is sketched just below).
+
+* **Integrations**:
+
+  * Hugging Face Transformers can load models in 4-bit or 8-bit precision by passing a `BitsAndBytesConfig` (with `load_in_4bit=True` or `load_in_8bit=True`) to `from_pretrained`.
+  * Compatible with FSDP (Fully Sharded Data Parallel) and QLoRA fine-tuning.
+
+In short: it is not *just smaller numbers*. It is **mathematically careful quantization plus GPU-optimized kernels**, which is what makes it production-ready.
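+
+To make the block-wise idea concrete, here is a rough, self-contained sketch of block-wise absmax quantization in plain PyTorch. It only illustrates the principle and is not how the library is implemented: the real `bitsandbytes` kernels run optimized CUDA code and use more refined 8-bit and 4-bit data types, and the helper names below are invented for this example.
+
+```python
+import torch
+
+def blockwise_absmax_quantize(x: torch.Tensor, block_size: int = 64):
+    """Toy int8 quantization with one scale per block of `block_size` values."""
+    flat = x.flatten()
+    pad = (-flat.numel()) % block_size                  # pad so the blocks divide evenly
+    flat = torch.nn.functional.pad(flat, (0, pad))
+    blocks = flat.view(-1, block_size)
+    absmax = blocks.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)  # one scale per block
+    q = (blocks / absmax * 127).round().clamp(-127, 127).to(torch.int8)
+    return q, absmax
+
+def blockwise_dequantize(q, absmax, shape):
+    blocks = q.to(torch.float32) / 127 * absmax         # undo the per-block scaling
+    return blocks.flatten()[: shape.numel()].view(shape)
+
+w = torch.randn(128, 128)
+q, absmax = blockwise_absmax_quantize(w)
+w_hat = blockwise_dequantize(q, absmax, w.shape)
+print("mean abs error:", (w - w_hat).abs().mean().item())  # small, thanks to per-block scales
+```
+
+Because every block of 64 values gets its own scale, a single outlier only distorts its own block rather than the whole tensor; that is the intuition behind the accuracy gains mentioned above.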
+
+---
+
+## Minimal Examples
+
+### 1. Using a `bitsandbytes` layer as a drop-in replacement
+
+```python
+import torch
+import bitsandbytes as bnb
+
+# Embedding layer from bitsandbytes: a drop-in replacement for torch.nn.Embedding
+embedding = bnb.nn.Embedding(num_embeddings=1000, embedding_dim=128)
+x = torch.randint(0, 1000, (4,))
+y = embedding(x)
+print(y.shape)  # torch.Size([4, 128])
+```
+
+This shows that you can drop `bitsandbytes` layers into your code just like regular PyTorch ones. (`bnb.nn.Embedding` itself is not quantized; it is an embedding variant intended for training with the library’s 8-bit optimizers shown below. The quantized layers are `Linear8bitLt` and `Linear4bit`.)
+
+---
+
+### 2. Loading a 4-bit model with Hugging Face Transformers
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+model_id = "HuggingFaceTB/SmolLM3-3B"  # replace with a model you have access to
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Load in 4-bit precision; device_map="auto" (requires the accelerate package)
+# places the weights on the available GPU(s)
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=quantization_config,
+    device_map="auto",
+)
+
+# Verify the quantized layers
+print(model)
+
+# Generate text
+inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+
+When you print the model, you’ll see `Linear4bit` layers, confirming that the weights are stored in **4-bit precision**.
+
+---
+
+### 3. Training with 8-bit optimizers (and verifying)
+
+```python
+import torch
+import bitsandbytes as bnb
+
+# Simple model. Note: the 8-bit optimizers keep 32-bit state for tensors with fewer than
+# 4096 elements (the default `min_8bit_size`), so the layer must be large enough for the
+# 8-bit path to kick in.
+model = torch.nn.Linear(4096, 2).cuda()
+criterion = torch.nn.CrossEntropyLoss()
+
+# Use the 8-bit Adam optimizer
+optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
+
+x = torch.randn(16, 4096).cuda()
+y = torch.randint(0, 2, (16,)).cuda()
+
+optimizer.zero_grad()
+loss = criterion(model(x), y)
+loss.backward()
+optimizer.step()
+
+print(f"Loss: {loss.item():.4f}")
+
+# --- Inspect the optimizer state to confirm 8-bit usage ---
+print("Optimizer type:", type(optimizer))
+for group in optimizer.param_groups:
+    for i, p in enumerate(group["params"]):
+        state = optimizer.state[p]
+        print(f"Param {i} state keys: {list(state.keys())}")
+        print(f"Param {i} state1 dtype: {state['state1'].dtype}")
+```
+
+The optimizer type will show the `Adam8bit` class from `bitsandbytes.optim`, and for the large weight matrix the state tensors are stored in quantized form (`torch.uint8` values plus the scaling metadata needed to dequantize them), confirming that the optimizer states are kept in **8-bit precision**. The tiny bias keeps 32-bit state because it falls below `min_8bit_size`.
+
+---
+
+## What’s next?
+
+- [Get started](index.mdx)
+- [Installation](installation.mdx)
+- [8-bit optimizers](optimizers.mdx)
+
+---
+
+✨ **In summary:**
+
+* Beginners → `bitsandbytes` makes big models smaller and often faster.
+* Nerds → It achieves this through careful quantization, custom CUDA kernels, and 8-bit optimizer implementations.
+* Everyone → Can benefit by dropping it into their PyTorch or Hugging Face workflows with minimal code changes, and can **verify** the bit precision actually being used (a final layer-level sketch follows below).
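+
+As a closing illustration of that last bullet, here is a rough sketch that converts a single fp16 linear layer to 8-bit by hand with `bnb.nn.Linear8bitLt`. Treat it as an illustration rather than a recipe: the layer sizes and the outlier threshold are arbitrary, a CUDA GPU is assumed, and in everyday use the Transformers integration shown earlier performs this conversion for you.
+
+```python
+import torch
+import bitsandbytes as bnb
+
+# A regular fp16 linear layer, standing in for a layer taken from a pretrained model
+fp16_linear = torch.nn.Linear(1024, 1024, bias=False).half()
+
+# Its 8-bit counterpart: has_fp16_weights=False stores the weights as int8,
+# and threshold=6.0 keeps rare outlier features in fp16 (the LLM.int8() scheme)
+int8_linear = bnb.nn.Linear8bitLt(
+    1024, 1024, bias=False, has_fp16_weights=False, threshold=6.0
+)
+int8_linear.load_state_dict(fp16_linear.state_dict())
+int8_linear = int8_linear.cuda()  # the actual quantization happens when moving to the GPU
+
+x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
+with torch.no_grad():
+    y = int8_linear(x)
+
+print(y.shape)                   # torch.Size([4, 1024])
+print(int8_linear.weight.dtype)  # torch.int8 after the move to the GPU
+```
+
+The same pattern underlies the `Linear4bit` layers you saw when printing the Transformers model above; you rarely build them by hand, but it is useful to know what the integration is doing under the hood.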