Home / b9254
Name Modified Size Downloads / Week
llama-b9254-xcframework.zip < 15 hours ago 203.6 MB
llama-b9254-bin-win-vulkan-x64.zip < 15 hours ago 32.7 MB
llama-b9254-bin-win-sycl-x64.zip < 15 hours ago 111.6 MB
llama-b9254-bin-win-opencl-adreno-arm64.zip < 15 hours ago 10.1 MB
llama-b9254-bin-win-hip-radeon-x64.zip < 15 hours ago 319.6 MB
llama-b9254-bin-win-cuda-13.1-x64.zip < 15 hours ago 158.4 MB
llama-b9254-bin-win-cuda-12.4-x64.zip < 15 hours ago 259.9 MB
llama-b9254-bin-win-cpu-x64.zip < 15 hours ago 15.9 MB
llama-b9254-bin-win-cpu-arm64.zip < 15 hours ago 9.5 MB
llama-b9254-bin-ubuntu-x64.tar.gz < 15 hours ago 14.0 MB
llama-b9254-bin-ubuntu-vulkan-x64.tar.gz < 15 hours ago 31.5 MB
llama-b9254-bin-ubuntu-vulkan-arm64.tar.gz < 15 hours ago 24.8 MB
llama-b9254-bin-ubuntu-sycl-fp32-x64.tar.gz < 15 hours ago 44.7 MB
llama-b9254-bin-ubuntu-sycl-fp16-x64.tar.gz < 15 hours ago 44.8 MB
llama-b9254-bin-ubuntu-s390x.tar.gz < 15 hours ago 12.4 MB
llama-b9254-bin-ubuntu-rocm-7.2-x64.tar.gz < 15 hours ago 129.6 MB
llama-b9254-bin-ubuntu-openvino-2026.0-x64.tar.gz < 15 hours ago 12.4 MB
llama-b9254-bin-ubuntu-arm64.tar.gz < 15 hours ago 11.1 MB
llama-b9254-bin-macos-x64.tar.gz < 15 hours ago 8.5 MB
llama-b9254-bin-macos-arm64.tar.gz < 15 hours ago 8.5 MB
llama-b9254-bin-macos-arm64-kleidiai.tar.gz < 15 hours ago 8.5 MB
llama-b9254-bin-android-arm64.tar.gz < 15 hours ago 65.2 MB
llama-b9254-bin-910b-openEuler-x86-aclgraph.tar.gz < 15 hours ago 11.7 MB
llama-b9254-bin-910b-openEuler-aarch64-aclgraph.tar.gz < 15 hours ago 11.0 MB
llama-b9254-bin-310p-openEuler-x86.tar.gz < 15 hours ago 11.6 MB
llama-b9254-bin-310p-openEuler-aarch64.tar.gz < 15 hours ago 11.0 MB
cudart-llama-bin-win-cuda-13.1-x64.zip < 15 hours ago 402.6 MB
cudart-llama-bin-win-cuda-12.4-x64.zip < 15 hours ago 391.4 MB
b9254 source code.tar.gz 2026-05-20 33.9 MB
b9254 source code.zip 2026-05-20 35.3 MB
README.md 2026-05-20 7.0 kB
Totals: 31 Items   2.4 GB 0
Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522)

* Adds initial PDL setup.
* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves pdl logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect.
* Enrolls flash-attention kernels to pdl
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL.
* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via a template alias and template expansion
* Enrolls all remaining kernels for qwen3-coder-next into PDL
* Remove all PDL LC calls to create a baseline
* Added LC according to internal guidance and tested kernel performance.
* Enrolls missing qwen3.5 kernels passively into PDL.
* Kernel optimizations (LC signals) for qwen3.5
* Enrolls ssm-scan kernels into PDL
* Adds GGML_CUDA_PDL command line option to toggle PDL.
* Fix: Ada and lower compilation by guarding PDL calls correctly
* Cleanup: Removes commented out GGML_CUDA_PDL_LC
* Cleanup: Removes experimental comments
* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.
* Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions
* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1
* Enable PDL by default for Hopper+ devices
* Enrolls softcap_f32 and two flash_attn kernels into PDL.
* Improves flash attn PDL barrier placement
* Fix: Perf regression on ada; excludes ada and below from PDL launches
* Improves some sync barrier placements
* Drops superfluous constructor
* Adds #endif guard comments
* Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact.
* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled
* Revert "Drops superfluous constructor". Adds const to remaining arguments. This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.
* Cleanup: Removes and fixes some comments and whitespace
* Clarifies comment of sync-barrier position
* Relocates and refactors PDL launch functions and accessories
* Adds error checking to the regular kernel launch path
* Drops "auto" in favor of "ggml_cuda_kernel_params"
* Adds "const" to ggml_cuda_kernel_launch_params
* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
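
The end state of the toggling described in the notes above (PDL on by default for Hopper and newer, Ada and below excluded, disabled if and only if GGML_CUDA_PDL=0) can be summarized as a small decision table. The following is a minimal sketch of that table only, not the llama.cpp implementation: the function names are hypothetical, and it checks the compute capability on the host purely for illustration, whereas the notes state that the actual arch check is done device-side.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not a llama.cpp function): PDL stays enabled unless
// the GGML_CUDA_PDL value is exactly "0". An unset variable is passed as nullptr.
inline bool pdl_env_allows(const char * value) {
    return value == nullptr || std::strcmp(value, "0") != 0;
}

// Hypothetical helper: true for compute capability 9.0 (Hopper) or newer.
// Ada is 8.9, so it is excluded.
inline bool pdl_arch_supported(int cc_major, int cc_minor) {
    return cc_major * 10 + cc_minor >= 90;
}

// Combined decision: both the architecture and the environment must allow PDL.
inline bool pdl_should_launch(int cc_major, int cc_minor, const char * env_value) {
    return pdl_arch_supported(cc_major, cc_minor) && pdl_env_allows(env_value);
}

// Convenience wrapper that reads the variable named in the release notes.
inline bool pdl_should_launch_from_env(int cc_major, int cc_minor) {
    return pdl_should_launch(cc_major, cc_minor, std::getenv("GGML_CUDA_PDL"));
}
```

Under these assumptions, `pdl_should_launch(9, 0, nullptr)` is true, while `pdl_should_launch(8, 9, nullptr)` and `pdl_should_launch(9, 0, "0")` are both false.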

Source: README.md, updated 2026-05-20