Changes
- Gemma 4 support with full tool-calling in the API and UI. 🆕
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new `textgen-portable-ik` portable builds and a new `--ik` flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add `echo` + `logprobs` for `/v1/completions`. The completions endpoint now supports the `echo` and `logprobs` parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new `top_logprobs_ids` field.
- Further optimize my custom Gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect `torch_dtype` from the model config instead of always forcing bfloat16/float16. The `--bf16` flag still works as an override.
- Remove the obsolete `models/config.yaml` file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
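To illustrate the new completions parameters, here is a minimal sketch of building a `/v1/completions` request with `echo` and `logprobs` and reading the per-token data back out. The host/port are assumptions (point it at your own API server), and the response values below are illustrative, not real model output:

```python
import json
import urllib.request

# Assumed local API address; adjust to your own server settings.
API_URL = "http://127.0.0.1:5000/v1/completions"

payload = {
    "prompt": "The quick brown",
    "max_tokens": 8,
    "echo": True,   # include prompt tokens (and their logprobs) in the output
    "logprobs": 5,  # request top-5 log probabilities per token
}

def extract_logprobs(response: dict) -> dict:
    """Collect per-token logprobs, plus the new top_logprobs_ids field."""
    lp = response["choices"][0].get("logprobs") or {}
    return {
        "tokens": lp.get("tokens", []),
        "token_logprobs": lp.get("token_logprobs", []),
        "top_logprobs_ids": lp.get("top_logprobs_ids", []),
    }

# To send the request for real (requires a running server):
#   req = urllib.request.Request(API_URL, json.dumps(payload).encode(),
#                                {"Content-Type": "application/json"})
#   response = json.load(urllib.request.urlopen(req))

# Illustrative response shape (invented values):
response = {
    "choices": [{
        "text": "The quick brown fox",
        "logprobs": {
            "tokens": ["The", " quick", " brown", " fox"],
            "token_logprobs": [None, -0.71, -0.18, -1.32],
            "top_logprobs_ids": [[464], [2068], [7586], [21831]],
        },
    }]
}
print(extract_logprobs(response)["tokens"])
```

With `echo` enabled, the prompt tokens appear first in the lists; their first logprob is `None` since there is no preceding context to condition on.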
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
Bug fixes
- Fix `--idle-timeout` failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (`<|return|>` vs `<|result|>`).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix `ban_eos_token` not working for ExLlamav3. EOS is now suppressed at the logit level.
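Logit-level suppression, as a general technique, can be sketched as follows (this is a generic toy example, not the ExLlamav3 implementation): banned token logits are set to negative infinity before softmax, so they receive zero probability regardless of sampler settings.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Set banned token logits to -inf so softmax assigns them probability 0."""
    out = list(logits)
    for tid in banned_ids:
        out[tid] = -math.inf
    return out

def softmax(logits):
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens, where id 3 plays the role of EOS:
logits = [1.0, 0.5, 0.2, 3.0]
probs = softmax(suppress_tokens(logits, banned_ids=[3]))
```

Because the ban happens before any sampling step, no temperature, top-p, or other sampler setting can resurrect the banned token.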
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/a1cfb645307edc61a89e41557f290f441043d3c2
  - Adds Gemma-4 support
  - Adds improved KV cache quantization via activation rotation, based on TurboQuant (https://github.com/ggml-org/llama.cpp/pull/21038)
- Update ExLlamaV3 to 0.0.28
- Update transformers to 5.5
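The intuition behind rotation-based cache quantization can be shown with a generic toy sketch (this is not the llama.cpp implementation): an orthonormal Hadamard rotation spreads outlier values across all dimensions, so an absmax quantizer needs a smaller scale and loses less precision on the remaining values.

```python
import math
import random

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2.
    H/sqrt(n) is its own inverse, so applying it twice recovers the input."""
    x = list(x)
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j] = a + b
                x[j + h] = a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in x]

def q8_roundtrip(v):
    """Symmetric absmax int8 quantize + dequantize."""
    scale = max(abs(t) for t in v) / 127.0
    return [round(t / scale) * scale for t in v]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

random.seed(0)
v = [random.gauss(0, 0.1) for _ in range(64)]
v[0] = 10.0  # a single outlier dominates the absmax scale

err_direct = mse(q8_roundtrip(v), v)
# Rotate, quantize, rotate back (the normalized transform is its own inverse):
err_rotated = mse(hadamard_rotate(q8_roundtrip(hadamard_rotate(v))), v)
```

Because the rotation is orthonormal, quantization error in the rotated domain maps back to the same error in the original domain, so the smaller post-rotation scale translates directly into a lower reconstruction error.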
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU:
    - Older driver: Use `cuda12.4`.
    - Newer driver (nvidia-smi reports CUDA Version >= 13.1): Use `cuda13.1`.
  - AMD/Intel GPU: Use `vulkan`.
  - AMD GPU (ROCm): Use `rocm`.
  - CPU only: Use `cpu`.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
The `textgen-portable-ik` builds are for ik_llama.cpp.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:
```txt
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/   <-- shared by both installs
```
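The one-time move amounts to the following (folder names are illustrative; substitute your actual install directory, and the `mkdir` only simulates an existing install for demonstration):

```shell
# Simulate an existing 4.0 portable install (illustrative only):
mkdir -p text-generation-webui-4.0/user_data
# Move user_data up one level so current and future installs share it:
mv text-generation-webui-4.0/user_data ./user_data
ls -d user_data
```

After this, new versions extracted next to `user_data/` pick it up automatically, with no per-update copying.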