koboldcpp-1.13.1
- A multithreading bug fix has allowed CLBlast to greatly increase prompt processing speed. It should now be up to 50% faster than before, and just slightly slower than CuBLAS alternatives. Because of this, we probably will no longer need to integrate CuBLAS.
- Merged dequantization kernels for the q4_2 and q4_3 formats, allowing them to be used with CLBlast.
- Added a new flag `--unbantokens`. Normally, KoboldAI prevents certain tokens such as EOS and square brackets. This flag unbans them.
- Edit: Fixed compile errors, made mmap automatic when LoRA is selected, added updated quantizers, and added quantization handling for GPT-NeoX, GPT-2, and GPT-J.
To use, download and run koboldcpp.exe, which is a one-file PyInstaller build.
Alternatively, drag and drop a compatible ggml model onto the .exe, or run it and manually select the model in the popup dialog.
Once a model is loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).
For more information, be sure to run the program with the `--help` flag.
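Once the server is running, it speaks the KoboldAI HTTP API. Below is a minimal sketch of generating text from a script, assuming the standard `/api/v1/generate` endpoint on the default port; the prompt and `max_length` values are illustrative.

```python
# Minimal sketch: send a prompt to a running koboldcpp instance via the
# KoboldAI-compatible HTTP API. Assumes the default port 5001; adjust the
# URL if you launched with a different host or port.
import json
import urllib.request

payload = {"prompt": "Once upon a time,", "max_length": 80}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
    # The response wraps generated text in a "results" list.
    print(result["results"][0]["text"])
```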
Alternative Options:
- Non-AVX2 version is now included in the same .exe file; enable it with the `--noavx2` flag.
- Big context too slow? Try the `--smartcontext` flag to reduce prompt processing frequency.
- Run with your GPU using CLBlast via the `--useclblast` flag for a speedup (see the example below).
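For example, a combined launch might look like `koboldcpp.exe model.bin --smartcontext --useclblast 0 0`. This is a sketch, not from the release notes: `--useclblast` expects the OpenCL platform and device indices for your system, and the `0 0` here is an assumption, so check `--help` for the exact arguments.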
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17