Releases: LostRuins/koboldcpp
koboldcpp-1.66.1
Phi guess that's the way the cookie crumbles edition
- NEW: Added custom SD LoRA support! Specify it with `--sdlora` and set the LoRA multiplier with `--sdloramult`. Note that SD LoRAs can only be used when loading in 16bit (e.g. with the `.safetensors` model) and will not work on quantized models (so incompatible with `--sdquant`). A launch sketch combining these flags follows this list.
- NEW: Added custom SD VAE support, which can be specified in the Image Gen tab of the GUI launcher, or using `--sdvae [vae_file.safetensors]`
- NEW: Added in-built support for TAE SD for SD1.5 and SDXL. This is a very small VAE replacement that can be used if a model has a broken VAE, and it also works faster than a regular VAE. To use it, select the "Fix Bad VAE" checkbox or use the flag `--sdvaeauto`
- Note: Do not use the above new flags with `--sdconfig`, which is deprecated and should not be used.
- NEW: Added experimental support for Rep Pen Slope. This is not a true slope, but the end result is it applies a slightly reduced rep pen for older tokens within the rep pen range, scaled by the slope value. Setting rep pen slope to 1 negates this effect. For compatibility reasons, rep pen slope defaults to 1 if unspecified (same behavior as before).
- NEW: You can now specify a http/https URL to a GGUF file when passing the `--model` parameter, or in the model selector UI. KoboldCpp will attempt to download the model file into your current working directory, and automatically load it when the download is done.
- Disable UI launcher scaling on MacOS due to display issues. Please report any further scaling issues.
- Improved EOT token handling, fixed a bug in token speed calculations.
- Default thread count will not exceed 8 unless overridden, this helps mitigate e-core issues.
- Merged improvements and fixes from upstream, including new Phi support and Vulkan fixes from @0cc4m
- Updated Kobold Lite:
- Now attempts to function correctly if hosted on a subdirectory URL path (e.g. using a reverse proxy), if that fails it defaults back to the root URL.
- Changed default chatmode player name from "You" to "User", which solves some wonky phrasing issues.
- Added viewport width controls in settings, including horizontal fullscreen.
- Minor bugfixes for markdown
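To tie the new image-gen flags together, here is a minimal, hypothetical launch sketch from Python. The binary name and file paths are placeholders, not files shipped with this release, and `--sdmodel` is one of the replacement flags introduced in 1.65.

```python
# Hypothetical launch sketch (not an official script): binary name and paths are placeholders.
import subprocess

cmd = [
    "./koboldcpp",                          # or koboldcpp.exe / python koboldcpp.py
    "--model", "my_text_model.gguf",        # may also be a http/https URL to a GGUF file
    "--sdmodel", "sd15_model.safetensors",  # 16bit safetensors, required for SD LoRA support
    "--sdlora", "my_style_lora.safetensors",
    "--sdloramult", "0.7",                  # LoRA strength multiplier
    "--sdvae", "vae_file.safetensors",      # custom VAE; or use --sdvaeauto for TAE SD instead
]
subprocess.run(cmd, check=True)             # do not combine these with the deprecated --sdconfig
```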
Fix for 1.66.1 - Fixed quant tools makefile, fixed sd seed parsing, updated lite
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.65
at least we have a shovel edition
- NEW: Added a new standalone UI for Image Generation, thanks to @ayunami2000 for porting StableUI (original by @aqualxx) to KoboldCpp! Now you have a powerful dedicated A1111 compatible GUI for generating images locally, with a similar look and feel to Automatic1111. And it runs in your browser, launching straight from KoboldCpp, simply load a Stable Diffusion model and visit http://localhost:5001/sdui/
- NEW: Added official CUDA 12 binaries. If you have a newer NVIDIA GPU and don't mind larger files, you may get increased speeds by using the CUDA 12 build koboldcpp_cuda12.exe
- Added a new API field `bypass_eos` to skip EOS tokens while still allowing them to be generated (see the example after this list).
- Hopefully fixed tk window resizing issues
- Increased interrogate mode token amount by 30%, and increased default chat completions token amount by 250%
- Merged improvements and fixes from upstream
- Updated Kobold Lite:
- Added option to insert Instruct System Prompt
- Added option to bypass (skip) EOS
- Added toggle to return special tokens
- Added Chat Names insertion for instruct mode
- Added button to launch StableUI
- Various minor fixes, support importing cards from CharacterHub urls.
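As a rough illustration of the new `bypass_eos` field, here is a hedged Python sketch against a running KoboldCpp instance. The endpoint path, the other payload fields and the response shape follow the usual KoboldAI generate API and are assumptions; check `/api` on your build.

```python
# Sketch: skip EOS while still allowing it to be generated. Assumes a KoboldCpp
# instance on the default port; endpoint path and response shape are assumptions.
import requests

payload = {
    "prompt": "Write a short poem about shovels.",
    "max_length": 120,
    "bypass_eos": True,  # new field: EOS may be produced but will not end the output
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```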
Important Deprecation Notice: The flags `--smartcontext`, `--hordeconfig` and `--sdconfig` are being deprecated.
`--smartcontext` is no longer as useful nowadays with context shifting, and just adds clutter and confusion. With its removal, if contextshift is enabled but unavailable (such as with old models), smartcontext will be used as a fallback. `--noshift` can still be used to turn both behaviors off.
`--hordeconfig` and `--sdconfig` are being replaced: as the number of configurations for these arguments grows, the order of the positional arguments confuses people and makes it very difficult to add new flags and toggles, since a misplaced new parameter breaks existing parameters. It also prevented me from properly validating each input for data type and range.
As this is a large change, these deprecated flags will remain functional for now. However, you are strongly advised to switch over to the new replacement flags below (an example invocation follows the list):
Replacement Flags:
- `--hordemodelname` Sets your AI Horde display model name.
- `--hordeworkername` Sets your AI Horde worker name.
- `--hordekey` Sets your AI Horde API key.
- `--hordemaxctx` Sets the maximum context length your worker will accept.
- `--hordegenlen` Sets the maximum number of tokens your worker will generate.
- `--sdmodel` Specify a stable diffusion model to enable image generation.
- `--sdthreads` Use a different number of threads for image generation if specified.
- `--sdquant` If specified, loads the model quantized to save memory.
- `--sdclamped` If specified, limit generation steps and resolution settings for shared use.
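To illustrate the switch, here is a hedged launch sketch using the replacement flags in place of the old positional `--hordeconfig`/`--sdconfig` arguments. The binary name and all values are placeholders.

```python
# Sketch: a horde worker + image gen setup expressed with the new named flags
# instead of order-sensitive positional arguments. All values are placeholders.
import subprocess

cmd = [
    "./koboldcpp", "--model", "my_model.gguf",
    "--hordemodelname", "koboldcpp/MyModel",
    "--hordeworkername", "my-worker",
    "--hordekey", "0000000000",          # your AI Horde API key
    "--hordemaxctx", "4096",
    "--hordegenlen", "256",
    "--sdmodel", "sd15_model.safetensors",
    "--sdthreads", "4",
    "--sdquant",                         # load the image model quantized to save memory
    "--sdclamped",                       # clamp steps/resolution for shared use
]
subprocess.run(cmd, check=True)
```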
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.64.1
- Added fixes for Llama 3 tokenization: Support updated Llama 3 GGUFs with pre-tokenizations.
- Note: In order to benefit from the tokenizer fix, the GGUF models need to be reconverted after this commit. A warning will be displayed if the model was created before this fix.
- Automatically support and apply both EOS and EOT tokens. EOT tokens are also correctly biased when EOS is banned.
- `finish_reason` is now correctly communicated in both sync and SSE streamed mode responses when token generation is stopped by EOS/EOT. Also, Kobold Lite no longer trims sentences if an EOS/EOT is detected as the stop reason in instruct mode.
- Added proper support for `trim_stop` in SSE streaming modes. Stop sequences will no longer be exposed even during streaming when `trim_stop` is enabled. Additionally, using the Chat Completions endpoint automatically applies trim stop to the instruct tag format used. This allows better out-of-box compatibility with third party clients like LibreChat.
- The `--bantokens` flag has been removed. Instead, you can now submit `banned_tokens` dynamically via the generate API, for each specific generation, and all matching tokens will be banned for that generation (see the example after this list).
- Added `render_special` to the generate API, which enables rendering of special tokens like `<|start_header_id|>` or `<|eot_id|>`.
- Added new experimental flag `--flashattention` to enable Flash Attention for compatible models.
- Added support for resizing the GUI launcher; all GUI elements will auto-scale to fit. This can be useful for high DPI screens.
- Improved speed of rep pen sampler.
- Added additional debug information in `--debugmode`.
- Added a button for starting the benchmark feature in GUI launcher mode.
- Fixed slow clip processing speed issue on Colab
- Fixed quantization tool compilation again
- Updated Kobold Lite:
- Improved stop sequence and EOS handling
- Fixed instruct tag dropdown
- Added token filter feature
- Added enhanced regex replacement (now also allowed for submitted text)
- Support custom `{{placeholder}}` tags.
- Better max context handling when used in Kcpp
- Support for Inverted world info secondary keys (triggers when NOT present)
- Language customization for XTTS
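A hedged sketch of the per-request token banning and special-token rendering mentioned above. The endpoint path, the other payload fields and the response shape follow the usual KoboldAI generate API and are assumptions; check `/api` on your build.

```python
# Sketch: ban specific strings for a single generation and render special tokens
# in the output. Endpoint path and response shape are assumptions.
import requests

payload = {
    "prompt": "Continue the story about the orchard.",
    "max_length": 80,
    "banned_tokens": ["apple", "Apple"],  # matching tokens are banned for this request only
    "render_special": True,               # show tokens like <|eot_id|> instead of hiding them
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```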
Hotfix 1.64.1: Fixed LLAVA being incoherent from the second generation onwards. Also, the GUI launcher has been tidied up: lowvram is now removed from the quick launch tab and only appears in the hardware tab. `--benchmark` now includes the version and gives clearer exit instructions in console output. Fixed some tkinter error outputs on quit.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.63
Enable Sound, Press Play
[Video: kobo_gif.mp4]
- Added support for special tokens in `stop_sequences`. Thus, if you set `<|eot_id|>` as a stop sequence and it can be tokenized into a single token, it will just work and function like the EOS token, allowing multiple EOS-like tokens (see the example after this list).
- Reworked the Automatic RoPE scaling calculations to support Llama3 (just specify the desired `--contextsize` and it will trigger automatically).
- Added a console warning if another program is already using the desired port.
- Improved server handling for bad or empty requests, which fixes a potential flooding vulnerability.
- Fixed a scenario where the BOS token could get lost, potentially resulting in lower quality especially during context-shifting.
- Pulled and merged new model support, improvements and fixes from upstream.
- Updated Kobold Lite: Fixed markdown, reworked memory layout, added a regex replacer feature, added aesthetic background color settings, added more save slots, added usermod saving, added Llama3 prompt template
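A hedged sketch of using a special token as an extra EOS-like stop sequence. The payload field is written here as `stop_sequence`, per the usual KoboldAI generate API; confirm the exact field name and response shape against `/api` on your build.

```python
# Sketch: a single-token special token used as a stop sequence behaves like EOS.
# Field name and response shape are assumptions based on the usual KoboldAI API.
import requests

payload = {
    "prompt": "<|start_header_id|>user<|end_header_id|>\n\nName three colors.<|eot_id|>"
              "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_length": 100,
    "stop_sequence": ["<|eot_id|>"],
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```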
Edit: Something seems to be flagging the CI built binary on windows defender. Replaced it with a locally built one until I can figure it out.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.62.2
There and back again edition
- NEW: Img2Img is now supported when generating images using KoboldCpp. An A1111 compatible endpoint `/sdapi/v1/img2img` is now emulated (see the example after this list). When using Kobold Lite, you can now click an existing image and generate a new image based off it with Img2Img.
- NEW: OpenAI Chat Completions adapters can now be specified on load with `--chatcompletionsadapter`. This allows you to use any instruct tag format you want via the Chat Completions API. Please refer to the wiki for documentation. The instruct tags should now also handle all stop sequences correctly and not overflow past them when using the OpenAI Chat Completions API.
- Added automatic cleanup of old orphaned koboldcpp pyinstaller temp directories.
- Added more usage statistics available in `/api/extra/perf/`
- Do not display the localhost URL if using a remote tunnel
- Added `/docs` endpoint, which is an alias for `/api`, containing API documentation
- Embedded Horde Worker job polling URL changed to aihorde.net
- Embedded Horde Workers will now give priority to the local user, and pause/unpause themselves briefly whenever generating on a local active client, and then returning to full speed when idle. This should allow you to comfortably run a busy horde worker, even when you want to use KoboldCpp locally at the same time.
- Try to fix SSL cert directory not found by specifying a default path.
- Fixed old quant tools not compiling
- Pulled and merged new model support, improvements and fixes from upstream.
- Updated Kobold Lite with some layout fixes, support for Cohere API, Claude Haiku and Gemini 1.5 API, and Img2Img features for local and horde.
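A hedged sketch of calling the emulated Img2Img endpoint. The request fields (`init_images`, `denoising_strength`, etc.) follow the standard A1111 API and are assumptions here; see the `/docs` endpoint for the schema actually emulated.

```python
# Sketch: emulated A1111 img2img call; field names are assumptions from the A1111 API.
import base64
import requests

with open("input.png", "rb") as f:
    init_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "a watercolor painting of a lighthouse",
    "init_images": [init_image],
    "denoising_strength": 0.6,
    "steps": 20,
}
r = requests.post("http://localhost:5001/sdapi/v1/img2img", json=payload, timeout=300)
with open("output.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```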
Hotfix 1.62.1 - Merged command R plus from upstream. I cannot confirm if it works correctly as CR+ is too big for me to run locally.
Hotfix 1.62.2 - CommandR lite template and fix for appending stop sequences in chat completions
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
CUDA 11 Cublas Libraries
This release is NOT a proper koboldcpp build!
It only contains the CUDA11 CuBLAS libraries to be packaged with KoboldCpp pyinstallers, intended for CI usage.
If you're looking for KoboldCpp, please get it from here: https://github.com/LostRuins/koboldcpp/releases/latest
koboldcpp-1.61.2
Finally multimodal edition
- NEW: KoboldCpp now supports Vision via Multimodal Projectors (aka LLaVA), allowing it to perceive and react to images! Load a suitable `--mmproj` file or select it in the GUI launcher to use vision capabilities. (Not working on Vulkan)
- Note: This is NOT limited to only LLaVA models; any compatible model of the same size and architecture can gain vision capabilities!
- Simply grab a 200mb mmproj file for your architecture here, load it with `--mmproj` and stick it into your favorite compatible model, and it will be able to see images as well!
- KoboldCpp supports passing up to 4 images; each one will consume about 600 tokens of context (LLaVA 1.5). Additionally, KoboldCpp token fast-forwarding and context-shifting work with images seamlessly, so you only need to process each image once!
- A compatible OpenAI GPT-4V API endpoint is emulated, so GPT-4-Vision applications should work out of the box (e.g. for SillyTavern in Chat Completions mode, just enable it). For the Kobold API and OpenAI Text-Completions API, passing an array of base64 encoded `images` in the submit payload will work as well (planned Aphrodite compatible format).
- An A1111 compatible `/sdapi/v1/interrogate` endpoint is also emulated, allowing easy captioning for other image-interrogation frontends.
- In Kobold Lite, click any image to select from available AI Vision options.
- NEW: Support for authentication via API Keys has been added; set it with `--password`. This key will be required for all text generation endpoints, using `Bearer` Authorization (see the combined example after this list). Image endpoints are not secured.
- Proper support for generating non-square images, scaling correctly based on aspect ratio
- `--benchmark` limit increased to 16k context
- Added aliases for the image sampler names for txt2img generation.
- Added the `clamped` option for `--sdconfig`, which prevents generating too large resolutions and potentially crashing due to OOM.
- Pulled and merged improvements and fixes from upstream
- Includes support for mamba models (CPU only). Note: mamba does not support context shifting
- Updated Kobold Lite:
- Added better support for displaying larger images, added support for generating portrait and landscape aspect ratios
- Increased max image resolution in HD mode, allow downloading non-square images properly
- Added ability to choose image samplers for image generation
- Added ability to upload images to KoboldCpp for LLaVA usage, with 4 selectable "AI Vision" modes
- Allow inserting images from files even when no image generation backend is selected
- Added support for password input and using API keys over KoboldAI API
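A hedged sketch combining the new `Bearer` authentication with a base64 image passed in the `images` field, so a loaded `--mmproj` can describe it. The endpoint path, instruct prompt format and response shape are assumptions based on the usual KoboldAI generate API.

```python
# Sketch: Bearer auth with the --password key, plus a base64 image for LLaVA vision.
# Endpoint path, prompt format and response shape are assumptions.
import base64
import requests

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

headers = {"Authorization": "Bearer mysecretpassword"}  # the value passed to --password
payload = {
    "prompt": "\n### Instruction:\nDescribe the attached image.\n### Response:\n",
    "max_length": 200,
    "images": [img_b64],  # up to 4 images, roughly 600 tokens of context each (LLaVA 1.5)
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload,
                  headers=headers, timeout=300)
print(r.json()["results"][0]["text"])
```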
Fix 1.61.1 - Fixed mamba (removed broken context shifting), merged other fixes from upstream, support uploading non-square images.
Fix 1.61.2 - Added new launch flag `--ignoremissing`, which deliberately ignores any optional missing files that were passed in (e.g. `--lora`, `--mmproj`), skipping them instead of exiting. Also, paste image from clipboard has been added to Lite.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.60.1
KoboldCpp is just a 'Dirty Fork' edition
- KoboldCpp now natively supports Local Image Generation, thanks to the phenomenal work done by @leejet in stable-diffusion.cpp! It provides an A1111 compatible `txt2img` endpoint which you can use within the embedded Kobold Lite, or in many other compatible frontends such as SillyTavern (see the example after this list).
- Just select a compatible SD1.5 or SDXL `.safetensors` fp16 model to load, either through the GUI launcher or with `--sdconfig`
- Enjoy zero install, portable, lightweight and hassle free image generation directly from KoboldCpp, without installing multi-GBs worth of ComfyUi, A1111, Fooocus or others.
- With just an 8GB VRAM GPU, you can run both a 7B q4 GGUF (lowvram) alongside any SD1.5 image model at the same time, as a single instance, fully offloaded. If you run out of VRAM, select `Compress Weights (quant)` to quantize the image model so it takes less memory.
- KoboldCpp allows you to run in text-gen-only, image-gen-only or hybrid modes; simply set the appropriate launcher configs.
- Known to not work correctly in Vulkan (for now).
- When running from the command line, `--contextsize` can now be set to any arbitrary number in range instead of being locked to fixed values. However, using a non-recommended value may result in incoherent output depending on your settings. The GUI launcher for this remains unchanged.
- Added new quant types, pulled and merged improvements and fixes from upstream.
- Fixed some issues loading older GGUFv1 models, they should be working again.
- Added cloudflare tunnel support for macOS (via `--remotetunnel`, however it probably won't work on M1, only amd64).
- Updated API docs and Colab for image gen.
- Updated Kobold Lite:
- Integrated support for AllTalk TTS
- Added "Auto Jailbreak" for instruct mode, useful to wrangle stubborn or censored models.
- Auto enable image gen button if KCPP loads image model
- Improved Autoscroll and layout, defaults to SSE streaming mode
- Added option to import and export story via clipboard
- Added option to set personal notes/comments in story
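A hedged sketch against the emulated A1111 `txt2img` endpoint. The field names mirror the standard A1111 API and are assumptions here; an SD1.5/SDXL model must already be loaded in KoboldCpp.

```python
# Sketch: emulated A1111 txt2img call; field names are assumptions from the A1111 API.
import base64
import requests

payload = {
    "prompt": "a cozy cabin in a snowy forest, warm lighting, detailed",
    "negative_prompt": "blurry, low quality",
    "width": 512,
    "height": 512,
    "steps": 20,
    "cfg_scale": 7,
}
r = requests.post("http://localhost:5001/sdapi/v1/txt2img", json=payload, timeout=300)
with open("cabin.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```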
Update v1.60.1: Port fix for CVE-2024-21836 for GGUFv1, enabled LCM sampler, allowed loading gguf SD models, fix SD for metal.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.59.1
This is mostly a bugfix release to resolve multiple minor issues.
- Added `--nocertify` mode which allows you to disable SSL certificate checking on your embedded Horde worker. This can help bypass some SSL certificate errors.
- Fixed pre-gguf models loading with incorrect thread counts. This issue affected the past 2 versions.
- Added build target for Old CPU (NoAVX2) Vulkan support.
- Fixed cloudflare remotetunnel URLs not displaying on runpod.
- Reverted CLBlast back to 1.6.0, pending CNugteren/CLBlast#533 and other correctness fixes.
- Smartcontext toggle is now hidden when contextshift toggle is on.
- Various improvements and bugfixes merged from upstream, which includes google gemma support.
- Bugfixes and updates for Kobold Lite
Fix for 1.59.1: Changed makefile build flags, fix for tooltips, merged IQ3_S support
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.
koboldcpp-1.58
- Added a toggle for row split mode with CUDA multigpu. Split mode changed to layer split by default. If using the command line, add `rowsplit` to `--usecublas` to enable row split mode. With the GUI launcher, it's a checkbox toggle.
- Multiple bugfixes: fixed benchmark command, fixed SSL streaming issues, fixed some SSE formatting with OAI endpoints.
- Make context shifting more forgiving when determining eligibility.
- Upgraded CLBlast to latest version, should result in a modest prompt processing speedup when using CL.
- Various improvements and bugfixes merged from upstream.
- Updated Kobold Lite with many improvements and new features:
- New: Integrated 'AI Vision' for images, this uses AI Horde or a local A1111 endpoint to perform image interrogation, allowing the AI to recognize and interpret uploaded or generated images. This should provide an option for multimodality similar to llava, although not as precise. Click on any image and you can enable it within Lite. This functionality is not provided by KCPP itself.
- New: Importing characters from Pygmalion.Chat is now supported in Lite, select it from scenarios.
- Added option to run Lite in the background. It plays a dynamically generated silent audio track, which should prevent the browser tab from hibernating.
- Fixed printable view, persist streaming text on error, fixed instruct timestamps
- Added "Auto" option for idle responses.
- Allow importing images into story from local disk
- Multiple minor formatting and bug fixes.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the `--help` flag.