What's Changed
- ci: add option to skip nvbench build by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/390
- ci: build devel image with CUDA 12.8 for Blackwell by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/391
- kernel: added query packing support for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/392
- refactor: rename attention to MHA to differentiate it from MLA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/393
- kernel: added Triton AOT compiler by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/394
- kernel: generate smaller kernel instantiations by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/395
- kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/397
- upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/398
- kernel: added simple MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/396
- kernel: added pipeline support for MLA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/399
- kernel: added ping-pong rmem support for MLA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/400
- kernel: revert experimental TiledMMA separation change by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/401
- kernel: always put query in registers for MHA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/402
- kernel: use 8 warps to avoid register spilling for MLA with head_dim=512 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/403
- kernel: revert MLA ping-pong rmem change by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/404
- kernel: refactor mask logic to avoid using hard-coded stride by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/405
- kernel: added causal mask for MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/406 (see the mask sketch after this list)
- kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100KB smem by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/407
- kernel: fix mask bugs for MLA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/408
- kernel: use different TiledMMAs for the QK and PV GEMMs by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/409
- kernel: added stage support for MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/410
- misc: upgrade CUDA version and add devcontainer for manylinux by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/412
- kernel: added Q and KV out-of-bounds (OOB) handling for MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/413
- kernel: optimize mask loop for MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/414
- kernel: added paged KV support for MLA kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/415 (see the paged-KV sketch after this list)
- kernel: fix KV OOB issue and add more unit tests for paged MLA by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/416
- kernel: use FastDivmod in attention kernels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/417 (see the FastDivmod sketch after this list)
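A few of the kernel changes above are easier to picture with short sketches. For the causal-mask work in #406 and #408, below is a minimal C++ sketch of the standard right-aligned causal mask used when cached KV tokens precede the query tokens; the function name and dense score layout are illustrative assumptions, not ScaleLLM's actual kernel code.

```cpp
#include <limits>
#include <vector>

// Illustrative sketch: apply a right-aligned causal mask to a q_len x kv_len
// score matrix. With a KV cache, the last query row is aligned with the last
// KV position, so row q_idx may attend to kv_idx <= q_idx + (kv_len - q_len).
void apply_causal_mask(std::vector<float>& scores, int q_len, int kv_len) {
  const float kNegInf = -std::numeric_limits<float>::infinity();
  const int diagonal_offset = kv_len - q_len;  // right-align the diagonal
  for (int q_idx = 0; q_idx < q_len; ++q_idx) {
    for (int kv_idx = 0; kv_idx < kv_len; ++kv_idx) {
      if (kv_idx > q_idx + diagonal_offset) {
        scores[q_idx * kv_len + kv_idx] = kNegInf;  // masked out before softmax
      }
    }
  }
}
```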
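For the paged-KV support in #415, the core idea is a block-table indirection: a sequence's logical KV positions are scattered across fixed-size physical blocks in a shared pool. Here is a minimal sketch of the usual address translation, with hypothetical names rather than ScaleLLM's own:

```cpp
#include <cstdint>

// Illustrative sketch: map a logical token position in one sequence's KV
// cache to a physical slot in the global KV-cache pool. block_table holds
// the sequence's logical-block -> physical-block mapping; block_size is the
// number of tokens per block.
inline int64_t kv_physical_slot(const int32_t* block_table,
                                int64_t token_idx,
                                int64_t block_size) {
  const int64_t logical_block = token_idx / block_size;  // which block
  const int64_t block_offset  = token_idx % block_size;  // offset within it
  return static_cast<int64_t>(block_table[logical_block]) * block_size +
         block_offset;
}
```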
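For #417, `cutlass::FastDivmod` (from `cutlass/fast_math.h`) precomputes a magic multiplier and shift for a fixed divisor so that later divide/modulo operations become cheap multiply-and-shift sequences, which matters on GPUs where hardware integer division is slow. A small usage sketch, assuming the common case of splitting a flattened (sequence, head) index; the surrounding names are hypothetical:

```cpp
#include <cutlass/fast_math.h>  // needs CUTLASS headers on the include path

// Illustrative sketch: split a flattened index into (seq_idx, head_idx)
// without a hardware integer division in the hot path.
void split_index(int flat_idx, int num_heads, int& seq_idx, int& head_idx) {
  cutlass::FastDivmod divmod(num_heads);  // one-time setup per divisor
  // seq_idx = flat_idx / num_heads, head_idx = flat_idx % num_heads
  divmod(seq_idx, head_idx, flat_idx);
}
```

In real kernels the `FastDivmod` object is typically constructed once on the host and passed to the kernel as an argument, so only the cheap multiply-and-shift runs per thread.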
Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.2.3...v0.2.4