What's Changed
- misc: remove legacy logic to support quantization for other types by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/350
- upgrade pytorch to 2.5.1 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/351
- added cuda 12.6 build image by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/353
- fix cmake version issue for manylinux image by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/355
- ci: fix package test workflow by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/357
- kernel: refactor attention kernel for readability by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/358
- dev: config dev container with proper extensions by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/359
- kernel: added attention bench for profiling before optimization by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/360
- kernel: added logits soft cap support for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/362
- tools: added attention traits viewer by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/363
- kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/364
- kernel: added causal, alibi, sliding window mask for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/365
- kernel: refactor attention kernel and add more unittests by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/366
- kernel: added M/N OOB handling for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/367
- tools: update svg build to generate small file by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/368
- kernel: added attention params and tile for different input types by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/369
- kernel: added mqa and gqa support for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/370
- kernel: added var len and paged kv cache support for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/372
- kernel: added attention kernel launch by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/375
- kernel: added head_dim=96 support for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/376
- kernel: optimize attention kernel performance by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/377
- upgrade cutlass to 3.7.0 by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/379
- kernel: handle kv block range for attention kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/383
- kernel: separate oob iterations for better performance by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/384
- refactor: remove batch_prefill interface by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/385
- refactor: stop build flash_infer kernel by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/386
- feat: integrate in-house scale attention and use it by default by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/380
- kernel: only zfill k once to improve perf for attention by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/387
- refactor: skip flash_attn build by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/389
Full Changelog: https://github.com/vectorch-ai/ScaleLLM/compare/v0.2.2...v0.2.3