Releases: LostRuins/koboldcpp
koboldcpp-1.32.3
- Ported the optimized K-Quant CUDA kernels to OpenCL! This speeds up K-Quant generation by about 15% with CL (Special thanks: @0cc4m)
- Implemented basic GPU offloading for MPT, GPT-2, GPT-J and GPT-NeoX via OpenCL! It still keeps a copy of the weights in RAM, but generation speed for these models should now be much faster! (50% speedup for GPT-J, and even WizardCoder is now 30% faster for me.)
- Implemented scratch buffers for the latest versions of all non-llama architectures except RWKV (MPT, GPT-2, NeoX, GPT-J). BLAS memory usage should be much lower on average, and larger BLAS batch sizes will be usable on these models.
- Merged GPT-Tokenizer improvements for non-llama models and added support for Starcoder's special added tokens. Coherence for non-llama models should be improved.
- Updated Lite, pulled updates from upstream, various minor bugfixes.
1.32.1 Hotfix:
- A number of bugs were fixed. These include memory allocation errors with OpenBLAS, and errors recognizing the new MPT-30B model correctly.
1.32.2 Hotfix:
- Solved an issue with the MPT-30B vocab having missing words due to a problem with wide-string tokenization.
- Solved an issue with LLAMA WizardLM-30B running out of memory near 2048 context at larger k-quants.
1.32.3 Hotfix:
- Reverted the wstring changes, as they negatively affected model coherency.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once the model has loaded, you can connect at the following address (or use the full KoboldAI client):
http://localhost:5001
For more information, run the program with the --help flag.
koboldcpp-1.31.2
This is mostly a bugfix build, with some new features to Lite.
- Better EOS token handling for Starcoder models.
- Major Kobold Lite update, including new scenarios, a variety of bug fixes, italics chat text, customized idle message counts, and improved sentence trimming behavior.
- Disabled RWKV sequence mode. Unfortunately, the speedups were too situational, and some users experienced speed regressions. Additionally, it was not compatible without modifying the ggml library to increase the max node counts, which had adverse impacts on other model architectures. Sequence mode will be disabled until it has been sufficiently improved upstream.
- Display token generation rate in console
Update 1.31.1:
- Cleaned up debug output, now only shows the server endpoint debugs if --debugmode is set. Also, no longer shows incoming horde prompts if --hordeconfig is set unless --debugmode is also enabled.
- Fixed markdown in Lite
Update 1.31.2:
- Allowed --hordeconfig to specify max context length allowed in horde too, which is separate from the real context length used to allocate memory.
koboldcpp-1.30.3
A.K.A The "Back from the dead" edition.
KoboldCpp Changes:
- Added full OpenCL / CLBlast support for K-Quants, both prompt processing and GPU offloading for all K-quant formats (credits: @0cc4m)
- Added RWKV Sequence Mode enhancements for over 3X FASTER prompt processing in RWKV (credits: @LoganDark)
- Added support for the RWKV World Tokenizer and associated RWKV-World models. It will be automatically detected and selected as necessary.
- Added a true SSE-streaming endpoint (Agnaistic compatible) that can stream tokens in realtime while generating. Integrators can find it at /api/extra/generate/stream. (Credits: @SammCheese)
- Added an enhanced polled-streaming endpoint to fetch in-progress results without disrupting generation, which is now the default for Kobold Lite when using streaming in KoboldCpp. Integrators can find it at /api/extra/generate/check. The old 8-token-chunked-streaming can still be enabled by setting the parameter streamamount=8 in the URL. Also, the original KoboldAI United compatible /api/v1/generate endpoint is still available.
- Added a new abort endpoint at /api/extra/abort which aborts any in-progress generation without stopping the server. It has been integrated into Lite, by pressing the "abort" button below the Submit button.
- Added support for lora base, which is now added as an optional second parameter e.g. --lora [lora_file] [base_model]
- Updated to latest Kobold Lite (required for new endpoints).
- Pulled various other enhancements from upstream, plus a few RWKV bugfixes.
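For integrators, the sketch below shows one way to read the output of the SSE endpoint listed above. It is a minimal example under stated assumptions, not KoboldCpp's own client code: it assumes each event carries a JSON object with a "token" field (that field name is not given in these notes), and the helper names iter_sse_data and collect_tokens are hypothetical. The sample lines are made up so the parser can be tried without a running server.

```python
import json

# Endpoint paths as listed in these release notes; the host and port are the
# default address given in the usage instructions.
BASE_URL = "http://localhost:5001"
STREAM_URL = BASE_URL + "/api/extra/generate/stream"
CHECK_URL = BASE_URL + "/api/extra/generate/check"
ABORT_URL = BASE_URL + "/api/extra/abort"


def iter_sse_data(lines):
    """Yield the data payload of each server-sent event in an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # In the SSE format, one optional space after the colon is dropped.
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Flush an event that was not followed by a blank line.
        yield "\n".join(buffer)


def collect_tokens(lines, field="token"):
    """Join the streamed text pieces. `field` is an ASSUMED JSON key name."""
    return "".join(json.loads(data).get(field, "") for data in iter_sse_data(lines))


# Made-up sample stream, shaped like a typical SSE response body.
sample = [
    "event: message\n",
    'data: {"token": "Hello"}\n',
    "\n",
    "event: message\n",
    'data: {"token": " world"}\n',
    "\n",
]
print(collect_tokens(sample))
```

Against a live server, the same parser could be fed the decoded lines of the HTTP response from STREAM_URL. If the server uses a different field name, pass it via the field argument.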
1.30.2 Hotfix - Added a fix for RWKV crashing in seq mode, pulled upstream bugfixes, and rebuilt the CUDA version. For those wondering why the CUDA exe version is not always included: apart from its size, its dependencies and the fact that it only supports Nvidia, it's also a pain for me to build, since that can only be done in a dev environment with the CUDA toolkit and Visual Studio on Windows.
1.30.3 Hotfix - Disabled RWKV seq mode for now, due to multiple complaints about speed and memory issues with bigger quantized models. I will keep a copy of 1.30.2 here in case anyone still wants it.
CUDA Bonus
Bonus: An alternative CUDA build has also been provided for this version, capable of running all latest formats including K-Quants. To use, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.
Extra Bonus: CUDA now also supports the older ggjtv2 models as well, as support has been back ported in! Note that CUDA builds will still not be generated by default, and support for them will be limited.
koboldcpp-1.29
KoboldCpp Changes:
- Added BLAS batch size to the KoboldCpp Easy Launcher GUI.
- Merged the upstream K-quantization implementations for OpenBLAS. Note that the new K-quants are not yet supported in CLBlast. Please remain on the regular quantization formats to use CLBlast for now.
- Fixed LLAMA 3B OOM errors and a few other OOMs.
- Multiple bugfixes and improvements in Lite, including streaming for aesthetic chat mode.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
koboldcpp-1.28
KoboldCpp Changes:
- NEW: Added support for MPT models! Note that to use larger context lengths, remember to set it with --contextsize. Values up to around 5000 context tokens have been tested successfully.
- The KoboldCpp Easy Launcher GUI has been enhanced! You can now set the number of CLBlast GPU layers in the GUI, as well as the number of threads to use. Additional toggles have also been added.
- Added a more efficient memory allocation to CLBlast! You should be able to offload more layers than before.
- The flag --renamemodel has been renamed (lol) to --hordeconfig and now accepts 2 parameters, the horde name to display, and the advertised max generation length on horde.
- Fixed memory issues with Starcoder models. They still don't work very well with BLAS especially for lower RAM devices, so you might want to use a smaller --blasbatchsize with them, 64 or 128.
- Added the option to use --blasbatchsize -1 which disables BLAS but still allows you to use GPU Layer offloading in CLBlast. This means if you don't use BLAS, you can offload EVEN MORE LAYERS and generate even faster (at the expense of slow prompt processing).
- Minor tweaks and adjustments to default settings.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
koboldcpp-1.27
KoboldCpp Changes:
- Integrated the CLBlast GPU offloading improvements from @0cc4m, which allow you to have a layer fully stored in VRAM instead of keeping a duplicate copy in RAM. As a result, offloading GPU layers will reduce overall RAM used.
- Pulled upstream support for OpenLlama 3B models.
- Added support for the new version of RWKV.cpp models (v101) from @saharNooby that uses the updated GGML library, and is smaller and faster. Both the older and newer quantization formats will still be supported automatically, backwards compatible.
- Added support for EOS tokens in RWKV
- Updated Kobold Lite. One new and exciting feature is AutoGenerated Memory, which performs a text summary on your story to generate a short memory with a single click. Works best on instruct models.
- Allowed users to rename their displayed model name now, intended for use in horde. Using --renamemodel lets you change the default name to any string, with an added koboldcpp/ prefix as suggested by Henky.
- Fixed some build errors on some versions of OSX and Linux
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
koboldcpp-1.26
KoboldCpp Changes:
- NEW! Now, you can view Token Probabilities when using --debugmode. When enabled, for every generated token, the console will display the probabilities of up to 4 alternative possible tokens. Good way to know how biased/confident/overtrained a model is. The probability percentage values shown are after all the samplers have been applied, so it's also a great way to test your sampler configurations to see how good they are. --debugmode also displays the contents of your input and context, as well as their token IDs. Note that using --debugmode has a slight performance hit, so it is off by default.
- NEW! The Top-A sampler has been added! This is my own implementation of a special Kobold-exclusive sampler that does not exist in the upstream llama.cpp repo. This sampler reduces the randomness of the AI whenever the probability of one token is much higher than all the others, proportional to the squared softmax probability of the most probable token. Higher values have a stronger effect. (Set this value to 0 to disable its effect.)
- Added support for the Starcoder and Starcoder Chat models.
- Cleaned up and slightly refactored the sampler code. EOS stop tokens should now work for all model types; use --unbantokens to enable them. Additionally, the left square bracket [ token is no longer banned by default as modern models don't really need it, and the token IDs were inconsistent across architectures.
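The Top-A description above can be read as a cut-off that scales with the square of the top token's probability. The sketch below shows one such reading. It is not the KoboldCpp implementation: it assumes that tokens whose probability falls below top_a * p_max^2 are dropped and the remainder renormalized, and the function names softmax and top_a_filter are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_a_filter(probs, top_a):
    """Drop tokens below top_a * p_max**2 and renormalize (ASSUMED rule).

    A value of 0 (or less) disables the filter, matching the note above.
    """
    if top_a <= 0:
        return list(probs)
    p_max = max(probs)
    threshold = top_a * p_max ** 2
    # Always keep the most probable token so the result is never empty.
    kept = [p if (p >= threshold or p == p_max) else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# One dominant token: the threshold is high, so the alternatives are removed.
print(top_a_filter([0.9, 0.05, 0.05], top_a=0.2))
# A flat distribution: the threshold is low, so every token survives.
print(top_a_filter([0.4, 0.3, 0.3], top_a=0.2))
# Starting from logits instead of probabilities.
print(top_a_filter(softmax([5.0, 1.0, 0.5]), top_a=0.2))
```

Because the threshold grows with the square of the top probability, a confident distribution is pruned hard while a flat one is left nearly untouched, which matches the behaviour described in the bullet.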
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
koboldcpp-1.25.1
KoboldCpp Changes:
- Added a new Failsafe mode, triggered by running --noavx2 --noblas --nommap which disables all CPU intrinsics, allowing even ancient devices with no AVX or SSE support to run KoboldCpp, though they will be extremely slow.
- Fixed a bug in the GUI that selected noavx2 mode incorrectly.
- Pulled new changes for other non-llama architectures. In particular, the GPT Tokenizer has been improved.
- Added support for setting the sampler_seed via the /generate API. Please refer to KoboldAI API documentation for details.
- Pulled upstream fixes and enhancements, and compile fixes for other architectures.
- Added more console logging in --debugmode which can now display the context token contents.
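As an illustration of the sampler_seed option above, the sketch below builds (but does not send) a generate request with a fixed seed. It is written under assumptions: the full path /api/v1/generate and the prompt and max_length fields follow the KoboldAI API as commonly documented, the address is the default http://localhost:5001, and the helper name build_generate_request and the sample values are hypothetical. Refer to the KoboldAI API documentation for the authoritative field list.

```python
import json
import urllib.request


def build_generate_request(prompt, sampler_seed, max_length=80,
                           base_url="http://localhost:5001"):
    """Build a POST request for the generate endpoint with a fixed sampler seed."""
    payload = {
        "prompt": prompt,
        "max_length": max_length,      # ASSUMED field name from the KoboldAI API
        "sampler_seed": sampler_seed,  # the option added in this release
    }
    return urllib.request.Request(
        base_url + "/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Once upon a time", sampler_seed=1234)
print(req.full_url)
print(req.data.decode("utf-8"))
# With a server running, the request could be sent with:
#   urllib.request.urlopen(req)
```

Reusing the same seed with the same prompt and sampler settings is the usual way to make generations repeatable when comparing configurations.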
Edit: v1.25.1
- Changed python for pyinstaller from 3.9 to 3.8. Combined with a change in failsafe mode that avoids PrefetchVirtualMemory, failsafe mode should now work in Windows 7! To use it, run with --noavx2 --noblas --nommap and failsafe mode will trigger.
- Upgraded CLBlast to 1.6
Kobold Lite UI Changes:
- Kobold Lite UI now supports variable streaming lengths (defaults to 8 tokens), which you can set by adding ?streamamount=[value] to the URL after launching with --stream
- Removed newlines from automatically being inserted into the very start of chat scenarios. The chat regex has been slightly adjusted.
- Above change was reverted as it was buggy.
- Remove default Alpaca instruction prompt as it was less useful on newer instruct models. You can still use it by adding it to Memory.
- Fixed an autosave bug which happened sometimes when disconnecting while using Lite.
- Greatly improved markdown support
- Added drag and drop load file functionality
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
koboldcpp-1.24
A.K.A The "He can't keep getting away with it!" edition.
KoboldCpp Changes:
- Added support for the new GGJT v3 (q4_0, q4_1 and q8_0) quantization format changes.
- Still retains backwards compatibility with every single historical GGML format (GGML, GGMF, GGJT v1,2,3 + all other formats from supported architectures).
- Fixed F16 format detection in NeoX, including a fix for use_parallel_residual.
- Various small fixes and improvements, sync to upstream and updated Kobold Lite.
Embedded Kobold Lite has also been updated, with the following changes:
- Improved the spinning circle waiting animation to use less processing.
- Fixed a bug with stopping sequences when in streaming mode.
- Added a toggle to avoid inserting newlines in Instruct mode (good for Pygmalion and OpenAssistant based instruct models).
- Added a toggle to enable basic markdown in instruct mode (off by default).
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
EDIT: An alternative CUDA build has been provided by Henky for this version, to allow access to the latest quantizations for CUDA users. Do note that it only supports the latest version of LLAMA based models. CUDA builds will still not be generated by default, and support for them will be limited.
koboldcpp-1.23.1
A.K.A The "Is Pepsi Okay?" edition.
Changes:
- Integrated support for the new quantization formats for GPT-2, GPT-J and GPT-NeoX
- Integrated Experimental OpenCL GPU Offloading via CLBlast (Credits to @0cc4m)
  - You can only use this in combination with --useclblast, combine with --gpulayers to pick number of layers to offload
  - Currently works for new quantization formats of LLAMA models only
  - Should work on all GPUs
- Still supports all older GGML models, however they will not be able to enjoy new features.
- Updated Lite, integrated various fixes and improvements from upstream.
1.23.1 Edit:
- Pulled Occam's fix for the q8 dequant kernels, so now q8 formats can enjoy GPU offloading as well.
- Disabled fp16 prompt processing as it appears to be slower. Please compare!
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
Please share your Performance Benchmarks for CLBlast GPU offloading or issues here: #179. Do include whether your GPU supports F16.