CTranslate2 4.4.0 (released 2024-09-09)
Removed: Flash Attention support in the Python package, due to the significant package size increase it caused for minimal performance gain.
Note: Flash Attention remains supported in the C++ library via the `WITH_FLASH_ATTN` build option. It may be re-added to the Python package in the future if substantial improvements are made.
New features
- Support Llama3 (#1751)
- Support Gemma2 (#1772)
- Add option to return log probs for all tokens in the vocabulary (#1755)
- Grouped conv1d (#1749, #1758)
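For context on the grouped conv1d feature: a grouped convolution splits the input channels into `groups` groups, and each output channel convolves only the input channels of its own group (with `groups` equal to the channel count, it reduces to a depthwise convolution). The following is a minimal pure-Python sketch of these semantics, for illustration only; it is not CTranslate2's implementation.

```python
def grouped_conv1d(x, w, groups):
    """x: [in_ch][time], w: [out_ch][in_ch // groups][kernel] -> [out_ch][time_out].

    Stride 1, no padding. Each output channel only sees its group's inputs.
    """
    in_ch, out_ch = len(x), len(w)
    icg, ocg = in_ch // groups, out_ch // groups  # input/output channels per group
    k = len(w[0][0])
    t_out = len(x[0]) - k + 1
    y = []
    for oc in range(out_ch):
        g = oc // ocg  # group this output channel belongs to
        row = []
        for t in range(t_out):
            acc = 0.0
            for ic in range(icg):  # only this group's input channels
                for j in range(k):
                    acc += x[g * icg + ic][t + j] * w[oc][ic][j]
            row.append(acc)
        y.append(row)
    return y

# With groups == in_ch == out_ch this is a depthwise conv1d:
x = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # 2 input channels, length 3
w = [[[1.0, 1.0]], [[0.5, 0.5]]]           # 2 output channels, kernel 2, 1 ch/group
grouped_conv1d(x, w, groups=2)             # [[3.0, 5.0], [15.0, 25.0]]
```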
Fixes and improvements
- Fix pipeline (#1723, #1747)
- Various improvements to Flash Attention (#1732)
- Fix crash when using `return_alternatives` on CUDA (#1733)
- Add AWQ quantization support (GEMM and GEMV kernels) (#1727)
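As background on the AWQ support: AWQ stores weights as 4-bit integers with per-group scales and dequantizes them inside the GEMM/GEMV kernels. The sketch below shows plain symmetric 4-bit quantize/dequantize only; the actual AWQ method additionally rescales weight channels based on activation statistics, which is omitted here.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7] with one scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [v * scale for v in q]

w = [0.7, -0.35, 0.1, -0.7]
q, s = quantize_int4(w)        # q = [7, -4, 1, -7], scale ~ 0.1
w_hat = dequantize_int4(q, s)  # approximately recovers w, within one scale step
```

The quantization error is bounded by half a scale step, which is why per-group (rather than per-tensor) scales are used in practice to keep that step small.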