ggml-hexagon: flash-attn opt (#19025)
* optimize flash attention kernel by improving score computation and online softmax update
* wip
* Refactor online softmax update in flash attention kernel for improved performance
* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation
* wip
macOS/iOS: - macOS Apple Silicon (arm64) - macOS Intel (x64) - iOS XCFramework
Linux: - Ubuntu x64 (CPU) - Ubuntu x64 (Vulkan) - Ubuntu s390x (CPU)
Windows: - Windows x64 (CPU) - Windows arm64 (CPU) - Windows x64 (CUDA 12) - CUDA 12.4 DLLs - Windows x64 (CUDA 13) - CUDA 13.1 DLLs - Windows x64 (Vulkan) - Windows x64 (SYCL) - Windows x64 (HIP)
openEuler: - openEuler x86 (310p) - openEuler x86 (910b, ACL Graph) - openEuler aarch64 (310p) - openEuler aarch64 (910b, ACL Graph)