Releases: LostRuins/koboldcpp
koboldcpp-1.17
- Removed Cloudflare Insights - this was previously in Kobold Lite and was included in KoboldCpp. For disclosure: Cloudflare Insights is a GDPR-compliant tool that Kobold Lite previously used to provide information on browser and platform distribution (e.g. the ratio of desktop to mobile users) and browser type (Chrome, Firefox, etc.), to determine which browser platforms I have to support for Kobold Lite. You can read more about it here: https://www.cloudflare.com/insights/ It did not track any personal information, and did not relay any data you load, use, enter or access within Kobold. It was not intended to be included in KoboldCpp; I originally removed it but forgot to for subsequent versions. As of this version, it is removed from both Kobold Lite and KoboldCpp by request.
- Added Token Unbanning to the UI, allowing generation of the EOS token, which is required for newer Pygmalion models. You can trigger it with the --unbantokens flag (see the example launch below).
- Pulled upstream fixes.
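For example, a typical Windows launch with the flag might look like this (a minimal sketch; the model filename is a placeholder for your own file, passed the same way as in the drag-and-drop case):

  koboldcpp.exe --unbantokens pygmalion-model.ggml.bin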
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
koboldcpp-1.16
- Integrated the overhauled Token Samplers. The whole sampling system has been reworked for Top-P, Top-K and Rep Pen; all model architectures and types now use the same sampling functions. Also added 2 new samplers - Tail Free Sampling (TFS) and Typical Sampling. As I did not test the new implementations for correctness, please let me know if you are experiencing weird results (or degradations in the previously available samplers). An example API request exercising these samplers is sketched after this list.
- Integrated CLBlast support for the q5_0 and q5_1 formats. Note: the upstream llama.cpp repo has completely removed support for the q4_3 format. For now I still plan to keep q4_3 support available within KoboldCpp, but you are strongly advised not to use q4_3 anymore. Please switch or reconvert any q4_3 models if you can.
- Fixed a few edge cases with GPT2 models going OOM with small batch sizes.
- Fixed a regression where older GPT-J models (e.g. the original model from Alpin's Pyg.cpp fork) failed to load due to some upstream changes in the GGML library. You are strongly advised not to use outdated formats - reconvert if you can, it will be faster.
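Here is a rough sketch of a request that sets the reworked sampler values over the API (the endpoint and field names follow the standard KoboldAI generate API, so treat them as assumptions if they are not yet exposed in your version; all values are illustrative):

  curl -X POST http://localhost:5001/api/v1/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello,", "max_length": 32, "top_k": 40, "top_p": 0.9, "rep_pen": 1.1, "tfs": 0.97, "typical": 1.0}'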
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.15
- Added a brand new "Easy Mode" GUI which triggers if no command line arguments are set. This is aimed to be a noob-friendly way to get into KoboldCpp, but for full functionality you are still advised to run it from the command line with customized arguments. You can skip it with any command line argument, or with the --skiplauncher flag, which does nothing else.
- Pulled the new quantization format support for q5_0 and q5_1 for llama.cpp from upstream. Also pulled the q5 changes for the GPT-2, GPT-J and GPT-NeoX formats. Note that these will not work with CLBlast yet, but OpenBLAS should work fine.
- Added a new flag, --debugmode, which shows the tokenized prompt being sent to the backend within the terminal window.
- Setting the --stream flag now automatically redirects the URL in the embedded Kobold Lite UI, so there is no need to type ?streaming=1 anymore (see the example launch after this list).
- Updated Kobold Lite, which now supports multiple custom stopping sequences; specify them in the UI separated with the ||$|| delimiter (e.g. stop1||$||stop2). Lite also now saves your custom stopping sequences into your save files and autosaves.
- Merged upstream fixes and improvements.
- Minor console fixes for Linux, and OSX compatibility.
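As an illustration (the model filename is a placeholder, and the flag combination is just an example):

  koboldcpp.exe --stream --debugmode mymodel.ggml.bin

With --stream set, the embedded Lite UI redirects itself, so browsing to plain http://localhost:5001 is enough; appending ?streaming=1 manually is no longer needed.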
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.14
- Added backwards compatibility for an older version of NeoX with different quantizations
- Fixed a few scenarios where users may encounter OOM crashes
- Pulled upstream updates
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
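For instance (model filenames are placeholders; pick whichever flag applies to your setup):

  koboldcpp.exe --noavx2 mymodel.ggml.bin
  koboldcpp.exe --smartcontext mymodel.ggml.bin
  koboldcpp.exe --useclblast 0 0 mymodel.ggml.bin

The 0 0 after --useclblast are example platform and device indices; the right values vary per machine.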
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.13.1
- A multithreading bug fix has allowed CLBlast to greatly increase prompt processing speed. It should now be up to 50% faster than before, and just slightly slower than CuBLAS alternatives. Because of this, we probably will no longer need to integrate CuBLAS.
- Merged the q4_2 and q4_3 CLBlast dequantization kernels, allowing them to be used with CLBlast.
- Added a new flag, --unbantokens. Normally, KoboldAI prevents certain tokens such as EOS and square brackets from being generated; this flag unbans them.
- Edit: Fixed compile errors, made mmap automatic when a LoRA is selected, added updated quantizers, and added quantization handling for GPT-NeoX, GPT-2 and GPT-J.
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.12
This is a bugfix release
- Fixed a few more scenarios where GPT2/GPTJ/GPTNeoX would go out of memory when using BLAS. Also, the max BLAS batch size for non-llama models is currently capped at 256.
- Minor CLBlast optimizations should slightly increase speed
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.11
- Now has GPT-NeoX / Pythia / StableLM support!
- Try my special model, Pythia-70m-ChatSalad here: https://huggingface.co/concedo/pythia-70m-chatsalad-ggml/tree/main
- Added upstream LoRA file support for llama; use the --lora parameter.
- Added limited fast-forwarding capabilities for RWKV; context can be reused if it is completely unmodified.
- Kobold Lite now supports using an additional custom stopping sequence, edit it in the Memory panel.
- Updated Kobold Lite, and pulled llama improvements from upstream.
- Improved OSX and Linux build support - it now automatically builds all libraries with the requested flags, and you can select which ones to use at runtime. Example: run make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 and it will build both the OpenBLAS and CLBlast libraries on your platform; you then select CLBlast with --useclblast at runtime. A build-and-run sketch is shown below.
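A minimal build-and-run sequence on Linux or OSX might look like this (the clone URL is inferred from the repo name, the model filename is a placeholder, 0 0 are example CLBlast platform/device indices, and koboldcpp.py is assumed to be the entry script):

  git clone https://github.com/LostRuins/koboldcpp
  cd koboldcpp
  make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1
  python koboldcpp.py --useclblast 0 0 mymodel.ggml.bin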
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
koboldcpp-1.10
- Now has RWKV support without needing pytorch or tokenizers or other external libraries!
- Try RWKV-v4-169m here: https://huggingface.co/concedo/rwkv-v4-169m-ggml/tree/main
- Now allows launching the browser directly with the --launch parameter. You can also do something like --stream --launch.
- Updated Kobold Lite, and pulled llama improvements from upstream.
- The API now reports the KoboldCpp version number via a new endpoint, /api/extra/version (see the example query below).
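Once the server is running, the new endpoint can be queried from the command line like this (the response format is not documented here, so treat the output as illustrative):

  curl http://localhost:5001/api/extra/version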
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
koboldcpp-1.9
This was such a good update that I had to make a new version, so there are 2 new releases today.
- Now has support for stopping sequences fully implemented in the API! They've been implemented in a similar and compatible way to my United PR one-some/KoboldAI-united#5, and they should shortly be usable in online Lite as well as (eventually) the main Kobold client when it gets merged. What this means is that the AI can now finish a response early even if not all the response tokens are consumed, and save time by sending the reply instead of generating excess unneeded tokens. This automatically integrates into the latest version of Kobold Lite, which sets the correct stop sequences for Chat and Instruct mode and is also updated here. An example API request is sketched after this list.
- GPT-J and GPT2 models now support BLAS mode! They will use a smaller batch size than llama models, but the effect should still be very noticeably faster!
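Here is a rough sketch of passing a stop sequence through the API (the stop_sequence field name follows the KoboldAI United API that this feature mirrors, so treat it as an assumption for your exact version; the prompt and values are illustrative):

  curl -X POST http://localhost:5001/api/v1/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "You: Hi there!\nBot:", "max_length": 80, "stop_sequence": ["You:"]}'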
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup! (Credits to Occam)
koboldcpp-1.8.1
- Another amazing improvement by @0cc4m: CLBlast now does the 4-bit dequantization on the GPU! That translates to about a 20% speed increase when using CLBlast for me, and should be a very welcome improvement. To use it, run with --useclblast [platform_id] [device_id] (you may have to figure out the values for your GPU through trial and error; see the example below).
- Merged fixes and optimizations from upstream
- Fixed a compile error in OSX
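For example (the model filename is a placeholder, and 0 0 are just the first indices to try; your GPU may sit on a different OpenCL platform or device, so other small combinations such as 1 0 or 0 1 may be needed):

  koboldcpp.exe --useclblast 0 0 mymodel.ggml.bin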
To use, download and run koboldcpp.exe, which is a one-file pyinstaller. Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client): http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
Non-AVX2 version now included in the same .exe file, enable with the --noavx2 flag