This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.
Besides the usual FP32, it supports FP16, quantized INT4, INT5 and INT8 inference. This project is focused on CPU, but cuBLAS is also supported.
This project provides a C library rwkv.h and a convenient Python wrapper for it.
RWKV is a large language model architecture. In contrast to the Transformer with its O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
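As a rough illustration of that property, here is a toy sketch of recurrent inference. It is not the RWKV formula: `toy_step` and `STATE_SIZE` are hypothetical names, and the arithmetic is arbitrary. The point is only that each step consumes one token plus a fixed-size state, so per-token work and memory do not grow with context length.

```python
from typing import List, Tuple

# Toy illustration only: this is NOT the RWKV math, just the shape of
# recurrent inference where each step sees one token and a fixed-size state.
STATE_SIZE = 4  # hypothetical; real models use a much larger state


def toy_step(token: int, state: List[float]) -> Tuple[List[float], List[float]]:
    """Return (logits, new_state) from the current token and previous state only."""
    new_state = [0.5 * s + 0.1 * (token + i) for i, s in enumerate(state)]
    logits = [2.0 * s for s in new_state]
    return logits, new_state


def run(tokens: List[int]) -> Tuple[List[float], List[float]]:
    """Feed tokens one by one; the state never grows with len(tokens)."""
    state = [0.0] * STATE_SIZE
    logits: List[float] = []
    for token in tokens:
        logits, state = toy_step(token, state)
    return logits, state


logits, state = run(list(range(1000)))
print(len(state))  # still STATE_SIZE after 1000 tokens
```

A Transformer, by contrast, has to attend over all previous tokens (or a cache of them) at every step.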
This project supports RWKV v4, v5, v6 and the latest v7 architectures.
Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.
If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
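For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood that the model assigns to the correct next tokens. The following is a minimal sketch of that calculation, assuming you have already collected one logits vector per target token; the function names are illustrative and are not part of rwkv.cpp.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def perplexity(all_logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """exp of the mean negative log-likelihood of each target token."""
    if len(all_logits) != len(targets) or not targets:
        raise ValueError("need one logits vector per target token")
    nll = -sum(log_softmax(row)[t] for row, t in zip(all_logits, targets))
    return math.exp(nll / len(targets))


# Uniform logits over 4 tokens: the model is equally unsure between 4 options,
# so the result is approximately 4.
print(perplexity([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2]))
```

Lower is better; comparing the same dataset across formats shows how much quality a given quantization costs.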
The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads. The models are RWKV v4 Pile 169M and RWKV v4 Pile 1.5B.
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8 GB. The model is RWKV-4-Pile-169M, 12 layers were offloaded to GPU.
Latency per token in ms shown.
Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
---|---|---|---|---|---|
Q4_0 | 7.9 | 6.2 | 6.9 | 8.6 | 20 |
Q4_1 | 7.8 | 6.7 | 6.9 | 8.6 | 21 |
Q5_1 | 8.1 | 6.7 | 6.9 | 9.0 | 22 |
Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
---|---|---|---|---|---|
Q4_0 | 59 | 51 | 50 | 54 | 94 |
Q4_1 | 59 | 51 | 49 | 54 | 94 |
Q5_1 | 77 | 69 | 67 | 72 | 101 |
Note: since cuBLAS is supported only for ggml_mul_mat(), some CPU resources are still needed to execute the remaining operations.
Measurements were made on CPU AMD Ryzen 9 5900X & GPU AMD Radeon RX 7900 XTX. The model is RWKV-novel-4-World-7B-20230810-ctx128k, 32 layers were offloaded to GPU.
Latency per token in ms shown.
Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
---|---|---|---|---|---|
f16 | 94 | 91 | 94 | 106 | 944 |
Q4_0 | 83 | 77 | 75 | 110 | 1692 |
Q4_1 | 85 | 80 | 85 | 93 | 1691 |
Q5_1 | 83 | 78 | 83 | 90 | 1115 |
Note: as with cuBLAS, hipBLAS is supported only for ggml_mul_mat(), so some CPU resources are still needed to execute the remaining operations.
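The tables above suggest that, with layers offloaded to GPU, latency is lowest at a small thread count and rises sharply when too many threads are used. A minimal sketch for repeating this comparison on your own hardware follows. The `step` callable is a placeholder for whatever evaluates one token in your setup, and both function names are hypothetical.

```python
import time
from typing import Callable, Dict


def mean_latency_ms(step: Callable[[], None], n_tokens: int = 32) -> float:
    """Average wall-clock time of one call to `step`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    return (time.perf_counter() - start) * 1000.0 / n_tokens


def best_thread_count(latency_by_threads: Dict[int, float]) -> int:
    """Thread count with the lowest measured latency."""
    return min(latency_by_threads, key=latency_by_threads.get)


# Q4_0 row of the RWKV-4-Pile-169M table above, latency per token in ms.
q4_0 = {1: 7.9, 2: 6.2, 4: 6.9, 8: 8.6, 24: 20}
print(best_thread_count(q4_0))  # 2
```

In practice you would fill the dictionary by calling `mean_latency_ms` once per candidate thread count, with the model loaded using that number of threads.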
Requirements: git.
git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp
Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the rwkv library file into the repository directory.
On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.
This option is recommended for maximum performance, because the library would be built specifically for your CPU and OS.
Requirements: CMake or CMake from anaconda, Build Tools for Visual Studio 2019.
cmake .
cmake --build . --config Release
If everything went OK, the bin\Release\rwkv.dll file should appear.
Refer to docs/cuBLAS_on_Windows.md for a comprehensive guide.
Refer to docs/hipBLAS_on_Windows.md for a comprehensive guide.
Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, Anaconda: cmake package).
cmake .
cmake --build . --config Release
Anaconda & M1 users: after running cmake ., please verify that it reports CMAKE_SYSTEM_PROCESSOR: arm64. If it detects x86_64 instead, edit the CMakeLists.txt file and add set(CMAKE_SYSTEM_PROCESSOR "arm64") under the # Compile flags comment.
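As a quick hint before building, you can ask Python which architecture it is running under. This is only a sketch: what Python reports is not necessarily what CMake will detect, but an x86_64 answer on Apple Silicon usually means the interpreter (for example, an x86_64 Anaconda install) is running under Rosetta. The helper name `is_arm64` is hypothetical.

```python
import platform


def is_arm64(machine: str) -> bool:
    """True if the machine string names a 64-bit ARM CPU.

    macOS reports 'arm64'; Linux typically reports 'aarch64'.
    """
    return machine.lower() in ("arm64", "aarch64")


machine = platform.machine()
print(f"Python reports architecture: {machine}")
if not is_arm64(machine):
    print("Not arm64: on Apple Silicon this may indicate an x86_64 Python or toolchain.")
```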
If everything went OK, the librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
cmake . -DRWKV_CUBLAS=ON
cmake --build . --config Release
If everything went OK, the librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
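To check the build result from a script, the file locations mentioned above can be collected in one place. This is a small sketch; `expected_library_path` is a hypothetical helper, not something the project provides.

```python
import sys
from pathlib import Path


def expected_library_path(repo_dir: str, platform_name: str = sys.platform) -> Path:
    """Where the build steps above say the library should appear."""
    repo = Path(repo_dir)
    if platform_name.startswith("win"):
        return repo / "bin" / "Release" / "rwkv.dll"
    if platform_name == "darwin":
        return repo / "librwkv.dylib"
    return repo / "librwkv.so"


path = expected_library_path(".")
print(path, "found" if path.exists() else "missing")
```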
Requirements: Python 3.x with PyTorch.
First, download a model from Hugging Face like this one.
Second, convert it into the rwkv.cpp format using the following commands:
# Windows
python python\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin FP16
# Linux / MacOS
python python/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin FP16
Optionally, quantize the model into one of the quantized formats from the table above:
# Windows
python python\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q5_1.bin Q5_1
# Linux / MacOS
python python/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q5_1.bin Q5_1
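If you prefer a single script that works on both Windows and Linux / MacOS, the two commands above can be assembled with `pathlib`. This sketch only builds and prints the command lines; the function name and the `model_name` parameter are hypothetical, while the script names and arguments are the ones shown above.

```python
import sys
from pathlib import Path
from typing import List


def conversion_commands(pth_file: str, out_dir: str, quant_format: str = "Q5_1",
                        model_name: str = "rwkv.cpp-169M") -> List[List[str]]:
    """Build the convert and quantize command lines in an OS-independent way."""
    scripts = Path("python")  # relative to the repository root
    out = Path(out_dir)
    fp16_file = out / f"{model_name}.bin"
    quant_file = out / f"{model_name}-{quant_format}.bin"
    return [
        [sys.executable, str(scripts / "convert_pytorch_to_ggml.py"),
         pth_file, str(fp16_file), "FP16"],
        [sys.executable, str(scripts / "quantize.py"),
         str(fp16_file), str(quant_file), quant_format],
    ]


for command in conversion_commands("RWKV-4-Pile-169M-20220807-8023.pth", "."):
    print(" ".join(command))
    # To actually run it from the repository root:
    # subprocess.run(command, check=True)
```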
Requirements: Python 3.x with numpy. If using Pile or Raven models, tokenizers is also required.
To generate some text, run:
# Windows
python python\generate_completions.py C:\rwkv.cpp-169M-Q5_1.bin
# Linux / MacOS
python python/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin
To chat with a bot, run:
# Windows
python python\chat_with_bot.py C:\rwkv.cpp-169M-Q5_1.bin
# Linux / MacOS
python python/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin
Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
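For context on what sampling settings typically control, here is a generic sketch of temperature plus top-p (nucleus) sampling over a vector of logits. It is an assumption-laden illustration: the scripts' own sampling code may order or implement these steps differently, and the default values below are illustrative only.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(logits: Sequence[float], temperature: float = 0.8, top_p: float = 0.5,
                 rng: Optional[random.Random] = None) -> int:
    """Pick a token index using temperature scaling followed by top-p filtering."""
    if rng is None:
        rng = random.Random()
    if temperature <= 0.0:
        # Zero temperature degenerates to greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax of logits / temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens; weights are renormalized implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_token([0.0, 5.0, 1.0], temperature=0.0))  # 1 (greedy)
```

Lower temperature and lower top_p make output more deterministic; higher values make it more varied.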
The short and simple script inference_example.py demonstrates the use of rwkv.cpp in Python.
To use rwkv.cpp in C/C++, include the header rwkv.h.
To use rwkv.cpp in any other language, see the Bindings section below. If your language is missing, you can try to bind to the C API using the tooling provided by your language.
These projects wrap rwkv.cpp for easier use in other languages/frameworks.
- Golang: seasonjs/rwkv
- Node.js: Atome-FE/llama-node
ggml moves fast, and can occasionally break compatibility with older file formats.
rwkv.cpp will do its best to explain why a model file can't be loaded and what next steps are available to the user.
For reference only, here is a list of the latest versions of rwkv.cpp that supported older formats. No support will be provided for these versions.
- Q4_2, old layout of quantized formats
- Q4_3, Q4_1_O
See also docs/FILE_FORMAT.md for version numbers of rwkv.cpp model files and their changelog.
Please follow the code style described in docs/CODE_STYLE.md.