llama.cpp - Browse /b8333 at SourceForge.net


Name	Modified	Size	Downloads / Week
llama-b8333-xcframework.zip	< 14 hours ago	173.1 MB	0
llama-b8333-bin-win-vulkan-x64.zip	< 14 hours ago	52.1 MB	0
llama-b8333-bin-win-sycl-x64.zip	< 14 hours ago	130.9 MB	0
llama-b8333-bin-win-opencl-adreno-arm64.zip	< 14 hours ago	29.2 MB	0
llama-b8333-bin-win-hip-radeon-x64.zip	< 14 hours ago	349.0 MB	0
llama-b8333-bin-win-cuda-13.1-x64.zip	< 14 hours ago	152.8 MB	0
llama-b8333-bin-win-cuda-12.4-x64.zip	< 14 hours ago	224.5 MB	0
llama-b8333-bin-win-cpu-x64.zip	< 14 hours ago	35.1 MB	0
llama-b8333-bin-win-cpu-arm64.zip	< 14 hours ago	28.2 MB	0
llama-b8333-bin-ubuntu-x64.tar.gz	< 14 hours ago	28.7 MB	0
llama-b8333-bin-ubuntu-vulkan-x64.tar.gz	< 14 hours ago	45.8 MB	0
llama-b8333-bin-ubuntu-s390x.tar.gz	< 14 hours ago	30.3 MB	0
llama-b8333-bin-ubuntu-rocm-7.2-x64.tar.gz	< 14 hours ago	148.9 MB	0
llama-b8333-bin-macos-x64.tar.gz	< 14 hours ago	95.3 MB	0
llama-b8333-bin-macos-arm64.tar.gz	< 14 hours ago	36.0 MB	0
llama-b8333-bin-910b-openEuler-x86-aclgraph.tar.gz	< 14 hours ago	66.3 MB	0
llama-b8333-bin-910b-openEuler-aarch64-aclgraph.tar.gz	< 14 hours ago	59.5 MB	0
llama-b8333-bin-310p-openEuler-x86.tar.gz	< 14 hours ago	66.3 MB	0
llama-b8333-bin-310p-openEuler-aarch64.tar.gz	< 14 hours ago	59.5 MB	0
cudart-llama-bin-win-cuda-13.1-x64.zip	< 14 hours ago	402.6 MB	0
cudart-llama-bin-win-cuda-12.4-x64.zip	< 14 hours ago	391.4 MB	0
b8333 source code.tar.gz	2026-03-13	29.5 MB	0
b8333 source code.zip	2026-03-13	30.6 MB	0
README.md	2026-03-13	4.3 kB	0
Totals: 24 Items		2.7 GB	0

graph : remove redundant GDN state transposes (#20443)

* ggml : transpose fused GDN state access for coalesced memory reads (#20436)

  The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path.

  Transpose the state indexing so threads read contiguously:

  - Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
  - CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
  - CPU: restructured loops for row-wise transposed access

  Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection.

  All GATED_DELTA_NET backend-ops tests pass.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags

  - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
  - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* llama : revert fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
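
The stride figures quoted in the commit message above can be checked with a small index calculation. The following is only an illustrative sketch in plain Python, not the Metal, CUDA, or CPU kernel code from the release: the helper names (`column_walk`, `transposed_walk`, `stride_bytes`, `transpose_flat`) are hypothetical, and the only values taken from the source are S_v = 128, 32-bit floats, and the two CUDA index expressions.

```python
S_V = 128          # state dimension quoted in the commit message
FLOAT_BYTES = 4    # size of one 32-bit float


def column_walk(col, s_v=S_V):
    """Flat offsets visited by the old indexing, state[i * s_v + col]."""
    return [i * s_v + col for i in range(s_v)]


def transposed_walk(col, s_v=S_V):
    """Flat offsets visited by the new indexing, state[col * s_v + i]."""
    return [col * s_v + i for i in range(s_v)]


def stride_bytes(offsets):
    """Byte distance between consecutive reads (must be constant)."""
    steps = {b - a for a, b in zip(offsets, offsets[1:])}
    assert len(steps) == 1, "access pattern is not a constant stride"
    return steps.pop() * FLOAT_BYTES


def transpose_flat(state, s_v):
    """Row-major flat storage of the transposed [s_v, s_v] matrix."""
    return [state[r * s_v + c] for c in range(s_v) for r in range(s_v)]


if __name__ == "__main__":
    print(stride_bytes(column_walk(0)))      # 512 bytes between reads
    print(stride_bytes(transposed_walk(0)))  # 4 bytes between reads
```

Under these assumptions, the old indexing jumps 512 bytes between consecutive elements while the transposed indexing moves 4 bytes at a time, which is the contiguous pattern the commit describes. `transpose_flat` shows why the two layouts must not be mixed: the new index expression returns the same values as the old one only when the stored state has itself been transposed.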
