

b8121
| Name | Modified | Size |
|------|----------|------|
| llama-b8121-xcframework.zip | < 12 hours ago | 168.5 MB |
| llama-b8121-bin-win-vulkan-x64.zip | < 12 hours ago | 47.7 MB |
| llama-b8121-bin-win-sycl-x64.zip | < 12 hours ago | 120.6 MB |
| llama-b8121-bin-win-opencl-adreno-arm64.zip | < 12 hours ago | 25.3 MB |
| llama-b8121-bin-win-hip-radeon-x64.zip | < 12 hours ago | 369.3 MB |
| llama-b8121-bin-win-cuda-13.1-x64.zip | < 12 hours ago | 148.5 MB |
| llama-b8121-bin-win-cuda-12.4-x64.zip | < 12 hours ago | 220.0 MB |
| llama-b8121-bin-win-cpu-x64.zip | < 12 hours ago | 31.0 MB |
| llama-b8121-bin-win-cpu-arm64.zip | < 12 hours ago | 24.4 MB |
| llama-b8121-bin-ubuntu-x64.tar.gz | < 12 hours ago | 24.6 MB |
| llama-b8121-bin-ubuntu-vulkan-x64.tar.gz | < 12 hours ago | 41.5 MB |
| llama-b8121-bin-ubuntu-s390x.tar.gz | < 12 hours ago | 25.6 MB |
| llama-b8121-bin-macos-x64.tar.gz | < 12 hours ago | 86.0 MB |
| llama-b8121-bin-macos-arm64.tar.gz | < 12 hours ago | 30.4 MB |
| llama-b8121-bin-910b-openEuler-x86-aclgraph.tar.gz | < 12 hours ago | 61.6 MB |
| llama-b8121-bin-910b-openEuler-aarch64-aclgraph.tar.gz | < 12 hours ago | 55.6 MB |
| llama-b8121-bin-310p-openEuler-x86.tar.gz | < 12 hours ago | 61.6 MB |
| llama-b8121-bin-310p-openEuler-aarch64.tar.gz | < 12 hours ago | 55.6 MB |
| cudart-llama-bin-win-cuda-13.1-x64.zip | < 12 hours ago | 402.6 MB |
| cudart-llama-bin-win-cuda-12.4-x64.zip | < 12 hours ago | 391.4 MB |
| b8121 source code.tar.gz | < 14 hours ago | 29.0 MB |
| b8121 source code.zip | < 14 hours ago | 30.1 MB |
| README.md | < 14 hours ago | 3.7 kB |

Totals: 23 items, 2.5 GB
Improve CUDA graph capture (#19754)

Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:

- The first call always incurs CUDA graph capture overhead even if the graph is unstable
- Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)

The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize. This also fixes issues such as https://github.com/ggml-org/llama.cpp/discussions/19708

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>

macOS/iOS:

- llama-b8121-xcframework.zip
- llama-b8121-bin-macos-arm64.tar.gz
- llama-b8121-bin-macos-x64.tar.gz

Linux:

- llama-b8121-bin-ubuntu-x64.tar.gz
- llama-b8121-bin-ubuntu-vulkan-x64.tar.gz
- llama-b8121-bin-ubuntu-s390x.tar.gz

Windows:

- llama-b8121-bin-win-cpu-x64.zip / llama-b8121-bin-win-cpu-arm64.zip
- llama-b8121-bin-win-cuda-12.4-x64.zip / llama-b8121-bin-win-cuda-13.1-x64.zip (CUDA runtime: cudart-llama-bin-win-cuda-*.zip)
- llama-b8121-bin-win-vulkan-x64.zip
- llama-b8121-bin-win-sycl-x64.zip
- llama-b8121-bin-win-hip-radeon-x64.zip
- llama-b8121-bin-win-opencl-adreno-arm64.zip

openEuler:

- llama-b8121-bin-910b-openEuler-x86-aclgraph.tar.gz / llama-b8121-bin-910b-openEuler-aarch64-aclgraph.tar.gz
- llama-b8121-bin-310p-openEuler-x86.tar.gz / llama-b8121-bin-310p-openEuler-aarch64.tar.gz

Source: README.md, updated 2026-02-21