HIP: add mmf for CDNA (#18896)
* refactor mmf rows_per_block
* speed up compile
* pass cdna compile
* fix cuda error
* clean up mmf
* f32 mmf
* clean float mma
* fix mmf error
* faster mmf
* extend tile k
* fix compile error
* Revert "extend tile k"
This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7.
* fix smem overflow
* speed up compiling mmf
* speed up compile for hip
* 512 block for cdna
* config pad size
* fix as comment
* update select logic
* move some code to cuh
* fix as comment
* correct cdna3 config
---------
Co-authored-by: zhang hui <you@example.com>
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Intel (x64)
- iOS XCFramework

Linux:
- Ubuntu x64 (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu s390x (CPU)

Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)

openEuler:
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)