What's Changed
- kernel: added FlashInfer attention impl by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/327
- refactor: flatten block tables into a 1D tensor (see the paged-KV sketch after this list) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/328
- kernel: added script to generate instantiations for FlashInfer kernels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/329
- refactor: move FlashAttention and FlashInfer into the attention folder by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/330
- kernel: port FlashInfer handler and wrapper logic by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/331
- ut: added unit tests for FlashInfer kernels by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/332
- refactor: replaced last_page_len with kv_indptr in the FlashInfer kernel (see the paged-KV sketch after this list) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/333
- feat: added support for passing in ALiBi slopes to the FlashInfer kernel (see the ALiBi sketch after this list) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/334
- refactor: move paged-KV logic into paged_kv_t by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/335
- ut: added fp8 KV unit tests for the FlashInfer kernel (see the fp8 sketch after this list) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/336
- ci: added a pip cache to avoid redownloading dependencies by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/337
- upgrade PyTorch to 2.4.1 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/341
- ci: run package tests in Docker by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/345
- ci: build CUDA 12.4 variants of the ScaleLLM C++ images by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/346
- upgrade PyTorch to 2.5.0 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/347
- ut: add more tests for different warp layouts by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/340
- misc: attention kernel refactoring by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/339
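
The paged-KV refactors (#328, #333, #335) all move toward a flattened, CSR-style layout: instead of a padded 2D block table plus a separately passed last_page_len, each sequence is described by offsets into one 1D tensor. Below is a minimal sketch of that layout; the tensor names are illustrative, not ScaleLLM's actual identifiers.

```python
import torch

page_size = 16

# Hypothetical flattened ("CSR-style") paged-KV metadata for a batch of 3
# sequences, mirroring the 1D block-table layout:
#   paged_kv_indptr  - offsets into the flat page-id array, one entry per seq + 1
#   paged_kv_indices - page ids of all sequences concatenated into one 1D tensor
#   kv_indptr        - cumulative KV token counts per sequence
paged_kv_indptr = torch.tensor([0, 2, 5, 6])         # seq i owns pages indptr[i]:indptr[i+1]
paged_kv_indices = torch.tensor([7, 3, 0, 4, 9, 2])  # flattened page ids, no padding
kv_indptr = torch.tensor([0, 30, 75, 80])            # per-seq KV lengths: 30, 45, 5

# With kv_indptr available, last_page_len is derivable rather than passed in:
kv_lens = kv_indptr[1:] - kv_indptr[:-1]
num_pages = paged_kv_indptr[1:] - paged_kv_indptr[:-1]
last_page_len = kv_lens - (num_pages - 1) * page_size
print(last_page_len)  # tensor([14, 13,  5])
```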
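For #334, the kernel accepts precomputed ALiBi slopes, one per attention head, and adds a bias to the attention logits proportional to the query/key distance. A sketch of the standard slope formula, assuming a power-of-two head count (the slopes actually passed in may come from anywhere):

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Standard ALiBi slopes for a power-of-two head count:
    # head h gets slope 2^(-8 * (h + 1) / num_heads).
    exponents = torch.arange(1, num_heads + 1, dtype=torch.float32)
    return torch.pow(2.0, -8.0 * exponents / num_heads)

# The kernel consumes one slope per head and biases the logits before softmax:
slopes = alibi_slopes(8)                        # shape [num_heads]
q_pos = torch.arange(4).view(-1, 1)             # [q_len, 1]
k_pos = torch.arange(4).view(1, -1)             # [1, kv_len]
bias = slopes.view(-1, 1, 1) * (k_pos - q_pos)  # [num_heads, q_len, kv_len]; <= 0 where k_pos <= q_pos
```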
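The fp8 KV tests in #336 exercise a quantized KV cache. A rough round-trip sketch of the kind of property such a test checks, assuming e4m3 storage with a per-tensor scale (ScaleLLM's actual scaling scheme may differ):

```python
import torch

# Quantize the KV cache to fp8 (e4m3) for storage, dequantize on read, and
# check that the quantization error stays bounded.
kv = torch.randn(2, 16, 8, 128)                 # [k/v, pages, heads, head_dim]
scale = kv.abs().amax() / 448.0                 # 448 is the e4m3 finite max
kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)   # quantized storage format
kv_ref = kv_fp8.to(torch.float32) * scale       # dequantized view for attention
print((kv - kv_ref).abs().max())                # small, bounded quantization error
```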
Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.2.1...v0.2.2