## b8179

| Name | Modified | Size |
|---|---|---|
| llama-b8179-xcframework.zip | 2026-02-27 | 169.4 MB |
| llama-b8179-bin-win-vulkan-x64.zip | 2026-02-27 | 48.3 MB |
| llama-b8179-bin-win-sycl-x64.zip | 2026-02-27 | 121.0 MB |
| llama-b8179-bin-win-opencl-adreno-arm64.zip | 2026-02-27 | 25.6 MB |
| llama-b8179-bin-win-hip-radeon-x64.zip | 2026-02-27 | 345.0 MB |
| llama-b8179-bin-win-cuda-13.1-x64.zip | 2026-02-27 | 148.9 MB |
| llama-b8179-bin-win-cuda-12.4-x64.zip | 2026-02-27 | 220.4 MB |
| llama-b8179-bin-win-cpu-x64.zip | 2026-02-27 | 31.4 MB |
| llama-b8179-bin-win-cpu-arm64.zip | 2026-02-27 | 24.7 MB |
| llama-b8179-bin-ubuntu-x64.tar.gz | 2026-02-27 | 25.1 MB |
| llama-b8179-bin-ubuntu-vulkan-x64.tar.gz | 2026-02-27 | 42.3 MB |
| llama-b8179-bin-ubuntu-s390x.tar.gz | 2026-02-27 | 26.2 MB |
| llama-b8179-bin-ubuntu-rocm-7.2-x64.tar.gz | 2026-02-27 | 145.2 MB |
| llama-b8179-bin-macos-x64.tar.gz | 2026-02-27 | 87.0 MB |
| llama-b8179-bin-macos-arm64.tar.gz | 2026-02-27 | 30.7 MB |
| llama-b8179-bin-910b-openEuler-x86-aclgraph.tar.gz | 2026-02-27 | 62.5 MB |
| llama-b8179-bin-910b-openEuler-aarch64-aclgraph.tar.gz | 2026-02-27 | 56.5 MB |
| llama-b8179-bin-310p-openEuler-x86.tar.gz | 2026-02-27 | 62.5 MB |
| llama-b8179-bin-310p-openEuler-aarch64.tar.gz | 2026-02-27 | 56.5 MB |
| cudart-llama-bin-win-cuda-13.1-x64.zip | 2026-02-27 | 402.6 MB |
| cudart-llama-bin-win-cuda-12.4-x64.zip | 2026-02-27 | 391.4 MB |
| b8179 source code.tar.gz | 2026-02-27 | 29.1 MB |
| b8179 source code.zip | 2026-02-27 | 30.1 MB |
| README.md | 2026-02-27 | 4.2 kB |

Totals: 24 items, 2.6 GB
CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)

* CUDA: add CDNA3 MFMA support for flash attention MMA kernel

  Add MI300X (gfx942) MFMA tensor-core flash attention using `v_mfma_f32_16x16x16_f16` (FP16 in, FP32 accumulate).

  - Add FATTN_WARP_SIZE=64 for CDNA wavefront64
  - Add CDNA config for head sizes 64, 80, 96, 112, 128
  - Add FP16 MFMA intrinsic path in mma.cuh
  - Add manual V transpose load for MFMA register layout
  - Route CDNA to MMA for prompt processing, VEC for token generation
  - Fix Q loading and combine stride granularity for non-power-of-2 heads

  Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X): pp512 +7%, pp1024 +13%, pp2048 +23%, pp4096 +39%; tg128 -10% (FA overhead, VEC used for both).

  All 2480 flash attention tests pass.

  Ref: https://github.com/ggml-org/llama.cpp/issues/17917

* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch

  - Replace `#define FATTN_WARP_SIZE` with `constexpr int warp_size = ggml_cuda_get_physical_warp_size()` in each device function
  - Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked crossover on MI300X @ d32768 with power-of-2 GQA models:
    - hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
    - hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
    - Unified threshold: eff_nq >= 128 for all head sizes
  - Remove VEC fallback; small batches fall through to tile kernel

* Update ggml/src/ggml-cuda/fattn.cu

* use ggml_cuda_info().devices warp_size instead of hardcoded check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

macOS/iOS:

- llama-b8179-xcframework.zip
- llama-b8179-bin-macos-arm64.tar.gz
- llama-b8179-bin-macos-x64.tar.gz

Linux:

- llama-b8179-bin-ubuntu-x64.tar.gz
- llama-b8179-bin-ubuntu-vulkan-x64.tar.gz
- llama-b8179-bin-ubuntu-rocm-7.2-x64.tar.gz
- llama-b8179-bin-ubuntu-s390x.tar.gz

Windows:

- llama-b8179-bin-win-cpu-x64.zip
- llama-b8179-bin-win-cpu-arm64.zip
- llama-b8179-bin-win-cuda-13.1-x64.zip
- llama-b8179-bin-win-cuda-12.4-x64.zip
- llama-b8179-bin-win-vulkan-x64.zip
- llama-b8179-bin-win-sycl-x64.zip
- llama-b8179-bin-win-hip-radeon-x64.zip
- llama-b8179-bin-win-opencl-adreno-arm64.zip
- cudart-llama-bin-win-cuda-13.1-x64.zip (CUDA runtime)
- cudart-llama-bin-win-cuda-12.4-x64.zip (CUDA runtime)

openEuler:

- llama-b8179-bin-910b-openEuler-x86-aclgraph.tar.gz
- llama-b8179-bin-910b-openEuler-aarch64-aclgraph.tar.gz
- llama-b8179-bin-310p-openEuler-x86.tar.gz
- llama-b8179-bin-310p-openEuler-aarch64.tar.gz

Source: README.md, updated 2026-02-27