llama.cpp - Browse /b9254 at SourceForge.net


Name	Modified	Size	Downloads / Week
llama-b9254-xcframework.zip	< 15 hours ago	203.6 MB	0
llama-b9254-bin-win-vulkan-x64.zip	< 15 hours ago	32.7 MB	0
llama-b9254-bin-win-sycl-x64.zip	< 15 hours ago	111.6 MB	0
llama-b9254-bin-win-opencl-adreno-arm64.zip	< 15 hours ago	10.1 MB	0
llama-b9254-bin-win-hip-radeon-x64.zip	< 15 hours ago	319.6 MB	0
llama-b9254-bin-win-cuda-13.1-x64.zip	< 15 hours ago	158.4 MB	0
llama-b9254-bin-win-cuda-12.4-x64.zip	< 15 hours ago	259.9 MB	0
llama-b9254-bin-win-cpu-x64.zip	< 15 hours ago	15.9 MB	0
llama-b9254-bin-win-cpu-arm64.zip	< 15 hours ago	9.5 MB	0
llama-b9254-bin-ubuntu-x64.tar.gz	< 15 hours ago	14.0 MB	0
llama-b9254-bin-ubuntu-vulkan-x64.tar.gz	< 15 hours ago	31.5 MB	0
llama-b9254-bin-ubuntu-vulkan-arm64.tar.gz	< 15 hours ago	24.8 MB	0
llama-b9254-bin-ubuntu-sycl-fp32-x64.tar.gz	< 15 hours ago	44.7 MB	0
llama-b9254-bin-ubuntu-sycl-fp16-x64.tar.gz	< 15 hours ago	44.8 MB	0
llama-b9254-bin-ubuntu-s390x.tar.gz	< 15 hours ago	12.4 MB	0
llama-b9254-bin-ubuntu-rocm-7.2-x64.tar.gz	< 15 hours ago	129.6 MB	0
llama-b9254-bin-ubuntu-openvino-2026.0-x64.tar.gz	< 15 hours ago	12.4 MB	0
llama-b9254-bin-ubuntu-arm64.tar.gz	< 15 hours ago	11.1 MB	0
llama-b9254-bin-macos-x64.tar.gz	< 15 hours ago	8.5 MB	0
llama-b9254-bin-macos-arm64.tar.gz	< 15 hours ago	8.5 MB	0
llama-b9254-bin-macos-arm64-kleidiai.tar.gz	< 15 hours ago	8.5 MB	0
llama-b9254-bin-android-arm64.tar.gz	< 15 hours ago	65.2 MB	0
llama-b9254-bin-910b-openEuler-x86-aclgraph.tar.gz	< 15 hours ago	11.7 MB	0
llama-b9254-bin-910b-openEuler-aarch64-aclgraph.tar.gz	< 15 hours ago	11.0 MB	0
llama-b9254-bin-310p-openEuler-x86.tar.gz	< 15 hours ago	11.6 MB	0
llama-b9254-bin-310p-openEuler-aarch64.tar.gz	< 15 hours ago	11.0 MB	0
cudart-llama-bin-win-cuda-13.1-x64.zip	< 15 hours ago	402.6 MB	0
cudart-llama-bin-win-cuda-12.4-x64.zip	< 15 hours ago	391.4 MB	0
b9254 source code.tar.gz	2026-05-20	33.9 MB	0
b9254 source code.zip	2026-05-20	35.3 MB	0
README.md	2026-05-20	7.0 kB	0
Totals: 31 Items		2.4 GB	0

Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522)

* Adds initial PDL setup.
* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves pdl logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect.
* Enrolls flash-attention kernels to pdl
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL.
* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via a template alias and template expansion
* Enrolls all remaining kernels for qwen3-coder-next into PDL
* Remove all PDL LC calls to create a baseline
* Added LC according to internal guidance and tested kernel performance.
* Enrolls missing qwen3-5 kernels passively into PDL.
* Kernel optimizations (LC signals) for qwen3.5
* Enrolls ssm-scan kernels into PDL
* Adds GGML_CUDA_PDL command line option to toggle PDL.
* Fix: Ada and lower compilation by guarding PDL calls correctly
* Cleanup: Removes commented out GGML_CUDA_PDL_LC
* Cleanup: Removes experimental comments
* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.
* Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions
* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1
* Enable PDL by default for Hopper+ devices
* Enrolls softcap_f32 and two flash_attn kernels into PDL.
* Improves flash attn PDL barrier placement
* Fix: Perf regression on ada; excludes ada and below from PDL launches
* Improves some sync barrier placements
* Drops superfluous constructor
* Adds #endif guard comments
* Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact.
* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled
* Revert "Drops superfluous constructor". Adds const to remaining arguments. This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.
* Cleanup: Removes and fixes some comments and whitespace
* Clarifies comment of sync-barrier position
* Relocates and refactors PDL launch functions and accessories
* Adds error checking to the regular kernel launch path
* Drops "auto" in favor of "ggml_cuda_kernel_params"
* Adds "const" to ggml_cuda_kernel_launch_params
* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
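
The notes above end up with PDL enabled by default on Hopper and newer devices, excluded on Ada and below, and switched off only when GGML_CUDA_PDL=0. A minimal C++ sketch of that decision logic follows. It is not the code from the commit: the helper names `pdl_wanted` and `pdl_wanted_from_env` are hypothetical, and mapping "Hopper+" to a compute capability major version of 9 or higher is an assumption made for illustration. The notes also say the actual enablement relies on a device-side arch check, which this host-only sketch does not model.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not llama.cpp's actual code): decides whether PDL
// launches should be attempted for a device.
//   env_value: value of GGML_CUDA_PDL, or nullptr if the variable is unset
//   cc_major:  CUDA compute capability major version (assumed: 9 = Hopper)
inline bool pdl_wanted(const char * env_value, int cc_major) {
    // Ada (compute capability 8.9) and older are excluded.
    if (cc_major < 9) {
        return false;
    }
    // On by default; only the exact value "0" turns PDL off.
    return env_value == nullptr || std::strcmp(env_value, "0") != 0;
}

// Convenience wrapper that reads the environment variable itself.
inline bool pdl_wanted_from_env(int cc_major) {
    return pdl_wanted(std::getenv("GGML_CUDA_PDL"), cc_major);
}
```

Under these assumptions, `pdl_wanted(nullptr, 9)` is true, `pdl_wanted("0", 9)` is false, and `pdl_wanted(nullptr, 8)` is false regardless of the environment. Passing the environment value in as a parameter keeps the rule testable without modifying the process environment.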

macOS/iOS:

Linux: