LLaMbA is a minimalistic cross-platform batching engine/server for LLMs, powered by ASP.NET Core and LLamaSharp.
The engine's goal is to be able to serve multiple requests with small models as quick as possible, and it was made while having in mind its primary purposes of Serving, Classifying, and Generating Synthetic Data, within a minimal and extensible environment.
LLaMbA introduces quick and customizable ways to sample, made possible by .NET's System.Numerics.Tensors
and threading. The Out-Of-The-Box sampling is arguably not as extensive as llama.cpp's, but it serves its purposes nicely and it's quite faster (up to ~10x increasing with smaller model sizes).
In addition, it hosts a python tokenizer, and utilizes llama.cpp's token grouping features to reduce the total amount of tokens in the batch, by reusing tokens that share the same position in multiple sequences, reducing the total amount of tokens the model sees. This can further be taken advantage of during multiple classifications of the same prompt, where most tokens are the same but the classification purposes change.
While LLaMbA contains a basic Web UI for chatting with the LLM, it wasn't made to contain rich features and single-user session efficiency, but with ease-of-testing in mind. That said, the primary use of the Web UI is testing any imposed changes, custom samplers, or systems.
It also isn't an all-in-one & one-for-all deliverable; the user is expected to get hands-on and adjust code parts to their needs.
Anyone can use LLaMbA for Synthetic Data generation locally as it is, but for more advanced purposes like Serving or Classifying, the primary target audience is Developers that should create safeguards (e.g. auth, limits for max_tokens, moderation) and other systems to compliment the backend and take advantage of the high speeds.
Developers are encouraged to experiment and customize the engine to their specs.
- CUDA 12 or the backend of your choice (CUDA11, CUDA12, Vulkan, OpenCL, Metal, CPU).
- .NET 8 SDK. Necessary for building and running the project.
- Python (+ packages). After installing python, install the necessary packages:
pip install tokenizers uvicorn fastapi asyncio requests
The model used in the videos is LLama3.1-Instruct-8B-Q8, on a single RTX 4080, utilizing ~12GB of VRAM.
Llamba.Tests.mp4
Llamba.Chat.mp4
Batches sent with Completion mode get passed without formatting, whereas Chat mode formats them to model's prompt format.
Llamba.Batch.Short.mp4
Llamba.Batch.Json.mp4
Check out the General Guide and Example Usage for example usage of the API and a quick code tour.
Context Size can be increased in Model.cs
to further increase throughput. The default parameters are for LLaMA3.1-8B-Q8 with ~12GB of VRAM.
Enabling Flash Attention will also increase generation throughput.
LLaMbA supports all language models currently supported by llama.cpp.
- see InferenceFormat.cs to add your own prompt format.
- and Tokenizer.cs for adding a tokenizer. It's easy!