Welcome to the repository for LLM (Large Language Model) engineers! This collection of Jupyter notebooks covers practical aspects of our job. I will keep adding notebooks and/or scripts for learning and experimentation.
Notebook | Description | Url |
---|---|---|
1_understanding_llms_benchmarks.ipynb | This notebook provides an explanation of the main benchmarks used in the openLLM leaderboard. It aims to help you grasp the key metrics and methodologies used in benchmarking LLMs. | Link |
2_quantization_base.ipynb | In this notebook, you'll learn how to load a Hugging Face model in 8-bit and 4-bit using the bitsandbytes library. Quantization is a crucial technique for optimizing model performance and resource usage, and this notebook guides you through the process. | Link |
3_quantization_gptq.ipynb | Explore quantization in GPTQ format using the auto-gptq library with this notebook. GPTQ format is gaining popularity for its effectiveness in compressing and quantizing large models like GPT. Learn how to leverage this format for your models. | Link |
4_quantization_exllamav2.ipynb | How to quantize a model from HF to exllamav2 | Link |
5_sharding_and_offloading.ipynb | How to shard a model into multiple chunks. This allows you to load it on different devices, or to load one chunk at a time to manage memory. Learn how to offload some layers to CPU or even disk. | Link |
6_gguf_quantization_and_inference.ipynb | Quantize a model into GGUF using the llama.cpp library, then run inference through an OpenAI-compatible server. | Link |
7_gguf_split_and_load.ipynb | Split a GGUF-quantized model into multiple parts, making it easy to share | Link |
8_hqq_quantization.ipynb | Explore quantization using Half-Quadratic Quantization (HQQ) | Link |
9_inference_big_model_cpu_plus_gpu.ipynb | This notebook shows how to calculate the RAM required by a quantized GGUF model and how to load it into memory using both RAM and VRAM, optimizing the number of layers that can be offloaded to the GPU. The notebook demonstrates loading Qwen/Qwen1.5-32B-Chat-GGUF as an example on a system with a T4 15GB VRAM and approximately 32GB of RAM | Link |
a10_inference_llama3.ipynb | LLama3 has been released. This notebook demonstrates how to run LLama3-8B-Instruct in half precision if you have access to a GPU with 24GB of VRAM, or quantized to 8 bits if you have 10GB of VRAM, and shows how to run the Q8 GGUF version to achieve maximum performance if you only have 10GB of VRAM. | Link |
a11_llm_guardrails_using_llama3_guard.ipynb | Protect your backend and your generative AI applications using LLama3-guard-2. In this notebook, I show you how to set up a server using 10GB of VRAM and how to perform inference through HTTP POST requests. | Link |
a12_speculative_decoding.ipynb | The notebook describes and demonstrates in practice the technique of 'speculative decoding', which increases the tokens/second generated by a Target Model through the use of a smaller and lighter Draft Model. The example uses LLama-3-70B-Instruct (Target) and LLama-3-8B-Instruct (Draft). | Link |
a13_inference_vision_llm.ipynb | The notebook demonstrates how to perform a simple inference using a vision LLM. For the example, I chose Microsoft's newly released Phi-3-vision-128k-instruct. The model is MIT licensed, so it can be used in your own applications without any restrictions. The model can run on one Nvidia L4. | Link |
a14_llm_as_evaluator.ipynb | The notebook demonstrates how to use an LLM as a judge with Prometheus 2. It shows how to evaluate an answer returned by any of our LLMs or application pipelines. | Link |
a15_llm_evaluation.ipynb | The notebook demonstrates how to use EleutherAI/lm-evaluation-harness to evaluate LLMs on the common benchmarks also used in the official leaderboards. The process is the same one that runs automatically when you submit a model to the leaderboard. | Link |
a16_synthetic_data_generation.ipynb | In this notebook, I created a custom class for generating a synthetic QA dataset from an input file using Llama-3-8B as the LLM. The script also demonstrates how to build and run the new version of llama-server on llama-3-8b-Q_8 GGUF | Link |
a17_sglan_serving_llm_multiusers.ipynb | In this notebook, I show all the steps on how to efficiently deploy LLama3.1-8B-FP8 on a custom server using SGLang and serve 64 potentially parallel users while maintaining good performance. | Link |
a18_jailbreak_control_using_promptguard.ipynb | Trying the new PromptGuard-86M for jailbreak detection. Spoiler: the model seems broken or really bad at the moment | Link |
a19_document_information_and_table_extraction.ipynb | This notebook demonstrates how to use a multimodal literate model (Kosmos 2.5) to accurately and efficiently extract text and tables without using paid cloud services. The model runs on your personal GPU, keeping your data private and secure. | Link |
a20_finetuning_llm_unsloth.ipynb | This notebook shows how to finetune Phi-3.5-mini-instruct using unsloth on an HF dataset with a chain of 'thinking' structure | Link |
a21_vllm_inference_llmcompressor.ipynb | This notebook shows how to use vLLM to serve your models and how to quantize them using LLMCompressor, gaining a 25% to 30% performance increase | Link |
a22_cache_augmented_generation.ipynb | This notebook shows my implementation of "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" | Link |
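
As a rough illustration of the arithmetic behind notebook 9 (RAM needed by a quantized model, and how many layers can go to the GPU), here is a minimal sketch. It is not the notebook's code: the function names, the 2 GiB reserve for KV cache and buffers, the 4.0 bits per weight, and the 64-layer count are all assumptions for illustration, and it treats every layer as the same size, which is only approximately true.

```python
def estimate_model_size_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights: parameters * bits / 8, converted to GiB."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 1024**3


def plan_gpu_offload(model_size_gib: float, n_layers: int,
                     vram_gib: float, reserve_gib: float = 2.0):
    """Return (gpu_layers, gib_on_gpu, gib_in_ram).

    Assumes equally sized layers and keeps `reserve_gib` of VRAM free
    for the KV cache and scratch buffers (a hypothetical safety margin).
    """
    per_layer_gib = model_size_gib / n_layers
    usable_vram = max(vram_gib - reserve_gib, 0.0)
    # Whole layers only: floor the ratio and never exceed the layer count.
    gpu_layers = min(n_layers, int(usable_vram // per_layer_gib))
    gib_on_gpu = gpu_layers * per_layer_gib
    return gpu_layers, gib_on_gpu, model_size_gib - gib_on_gpu


if __name__ == "__main__":
    # Illustrative numbers: a 32B-parameter model at an assumed 4.0 bits/weight,
    # an assumed 64 layers, and a 15 GiB GPU.
    size = estimate_model_size_gib(32, 4.0)
    layers, on_gpu, in_ram = plan_gpu_offload(size, n_layers=64, vram_gib=15)
    print(f"model ~{size:.1f} GiB, {layers} layers on GPU "
          f"(~{on_gpu:.1f} GiB), ~{in_ram:.1f} GiB left in RAM")
```

The resulting layer count is the kind of value you would pass to llama.cpp's `--n-gpu-layers` option (or `n_gpu_layers` in llama-cpp-python); in practice, measure real usage and lower it if you hit out-of-memory errors.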
For further resources and support, feel free to reach out to the community or refer to the following:
- bitsandbytes GitHub Repository: Learn more about the bitsandbytes library for quantization.
- Auto-GPTQ GitHub Repository: Access the auto-gptq library for GPTQ format quantization.
- ExLlamaV2 GitHub Repository: Learn more about the ExLlamaV2 library for quantization and fast inference.
- Accelerate GitHub Repository: Learn more about the Accelerate library from HF.
- llama.cpp GitHub Repository: Learn more about the llama.cpp library.
- HQQ GitHub Repository: Learn more about the HQQ library.
- EleutherAI LLM Evaluation Harness: Official repo of EleutherAI/lm-evaluation-harness
- SGLang official repo: Official repo of SGLang - fast serving framework
- vLLM docs: Official documentation of vLLM - Easy, fast, and cheap LLM serving for everyone
- LLM Compressor by Neural Magic: LLM Compressor is a unified library for creating compressed models for faster inference with vLLM
- Which GGUF is right for me?: Useful reference on GGUF and guide on how to choose the right quantization for your scenario.
- Interesting reddit thread on GGUF: Useful reference on GGUF.
- Half-Quadratic Quantization of Large Machine Learning Models: HQQ Blog post
- GPTQ vs AWQ vs EXL2 vs llamacpp: Quantization method performance (Memory, Speed and VRAM) comparison
- PROMETHEUS 2 Model: Prometheus 2 model optimized to evaluate the answers of LLMs
- SGLang Blog: Fast and Expressive LLM Inference with RadixAttention and SGLang
- KOSMOS 2.5 Model: Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images.
- Cache-Augmented Generation: Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Happy learning and experimenting with LLMs! 🚀