| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| Parent folder | |||
| README.md | 2026-04-28 | 8.2 kB | |
| Release v5.7.0 source code.tar.gz | 2026-04-28 | 17.7 MB | |
| Release v5.7.0 source code.zip | 2026-04-28 | 23.5 MB | |
| Totals: 3 Items | 41.3 MB | 3 | |
Release v5.7.0
New Model additions
Laguna
Laguna is Poolside's mixture-of-experts language model family that extends standard SwiGLU MoE transformers with two key innovations. It features per-layer head counts allowing different decoder layers to have different query-head counts while sharing the same KV cache shape, and implements a sigmoid MoE router with auxiliary-loss-free load balancing that uses element-wise sigmoid of gate logits plus learned per-expert bias for router scoring.
Links: Documentation
- Laguna XS.2 implementation (#45673) by @joerowell in #45673
DEIMv2
DEIMv2 (DETR with Improved Matching v2) is a real-time object detection model that extends DEIM with DINOv3 features and spans eight model sizes from X to Atto for diverse deployment scenarios. It uses a Spatial Tuning Adapter (STA) for larger variants to convert DINOv3's single-scale output into multi-scale features, while ultra-lightweight models employ pruned HGNetv2 backbones. The unified design achieves superior performance-cost trade-offs, with DEIMv2-X reaching 57.8 AP with only 50.3M parameters and DEIMv2-S being the first sub-10M model to exceed 50 AP on COCO.
Links: Documentation | Paper
- model: Add DEIMv2 to Transformers (#44339) by @harshaljanjani in #44339
Attention
Several attention-related bugs were fixed across multiple models, including a cross-attention cache type error in T5Gemma2 for long inputs, incorrect cached forward behavior in Qwen3.5's gated-delta-net linear attention, and a crash in GraniteMoeHybrid when no Mamba layers are present. Attention function dispatch was also updated to align with the latest model implementations.
- Fix cross-attention cache layer type for T5Gemma2 long inputs (#45540) by @Beichen-Ma in [#45540]
- [Qwen3.5] Fix GDN linear attention multi-token cached forward (#45513) by @kashif in [#45513]
- Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models (#45514) by @tianhaocui in [#45514]
- Align latest model attention function dispatch (#45598) by @Cyrilvallez in [#45598]
Tokenizers
There was a bug in AutoTokenizer that caused the wrong tokenizer class to be initialized. This caused regressions in models like DeepSeek R1.
- change got reverted (#45680) by @itazap in [#45680]
Generation
Continuous batching generation received several fixes and improvements, including correcting KV deduplication and memory estimation for long sequences (16K+), and removing misleading warnings about num_return_sequences and other unsupported features that were incorrectly firing even when functionality worked correctly. Documentation for per-request sampling parameters was also added.
- generate: drop stale num_return_sequences warning on continuous batching path (#45582) by @joaquinhuigomez in [#45582]
- Remove unnecessary generate warnings (#45619) by @Cyrilvallez in [#45619]
- [CB] Changes for long generation (#45530) by @remi-or in [#45530]
- [docs] per-request sampling params (#45553) by @stevhliu in [#45553]
Kernels
Improved kernel support by fixing configuration reading and error handling for FP8 checkpoints (e.g., Qwen3.5-35B-A3B-FP8), enabling custom expert kernels registered from the HF Hub to be properly loaded, and resolving an incompatibility that prevented Gemma3n and Gemma4 from using the rotary kernel.
- Fix configuration reading and error handling for kernels (#45610) by @hmellor in [#45610]
- Allow for registered experts from kernels hub (#45577) by @winglian in [#45577]
- Gemma3n and Gemma4 cannot use rotary kernel (#45564) by @Cyrilvallez in [#45564]
Bugfixes and improvements
- fixing more typos (#45689) by @vasqu in [#45689]
- [docs] cb memory management (#45587) by @stevhliu in [#45587]
- [docs] cpu offloading (#45660) by @stevhliu in [#45660]
- docs(README_zh-hans): clarify conditions for not using Transformers (#45688) by @GuaiZai233 in [#45688]
- fix padding side issue for fast_vlm tests (#45592) by @kaixuanliu in [#45592]
- Fix
x_clip: 8 failed test cases (#45394) by @kaixuanliu in [#45394] - zero_shot_object_detection ValueError fix for python 3.13 (#45669) by @AnkitAhlawat7742 in [#45669]
- Fix pageable H2D copies in Gated DeltaNet PyTorch fallback (#45665) by @ruixiang63 in [#45665]
- Fix UnboundLocalError in shard_and_distribute_module for replicated parameters (#45675) by @Abdennacer-Badaoui in [#45675]
- [MistralCommonBackend] Soften validation mode and apply_chat_template arguments check (#45628) by @juliendenize in [#45628]
- Fix
NameError: PeftConfigLiketriggered byPreTrainedModel.__init_subclass__(#45658) by @qgallouedec in [#45658] - chore(typing): added modeling_utils to ty (#45425) by @tarekziade in [#45425]
- [gemma4] infer from config instead of hardcoding (#45606) by @eustlb in [#45606]
- Update quants tests (#45480) by @SunMarc in [#45480]
- 🔴🔴🔴 fix: skip
clean_up_tokenizationfor BPE tokenizers inPreTrainedTokenizerFast(#44915) by @maxsloef-goodfire in [#44915] - Fix colmodernvbert tests (#45652) by @Cyrilvallez in [#45652]
- [CB] [Major] Add CPU request offloading (#45184) by @remi-or in [#45184]
- Fix peft constructors (#45622) by @Cyrilvallez in [#45622]
- chore: speedup modular converter (~30%) (#45046) by @tarekziade in [#45046]
- Fix whisper return language (#42227) by @FredHaa in [#42227]
- Add
supports_gradient_checkpointingtoNemotronHPreTrainedModel(#45625) by @sergiopaniego in [#45625] - Raise clear error for
problem_type="single_label_classification"withnum_labels=1(#45611) by @gaurav0107 in [#45611] - CircleCI with torch 2.11 (#45633) by @ydshieh in [#45633]
- chore: bump doc-builder SHA for main doc build workflow (#45631) by @rtrompier in [#45631]
- Allow more artifacts to be download in CI (#45629) by @ydshieh in [#45629]
- chore(qa): split pipeline and add type checking (#45432) by @tarekziade in [#45432]
- Skip failing offloading tests (#45624) by @Cyrilvallez in [#45624]
- fix: compute auxiliary losses when denoising is disabled in D-FINE (#45601) by @Abineshabee in [#45601]
- qa: bumped mlinter and allow local override (#45585) by @tarekziade in [#45585]
- Processing Utils: continue when content is a string (#45605) by @RyanMullins in [#45605]
- SonicMoe (#45433) by @IlyasMoutawwakil in [#45433]
- fix transformers + torchao nvfp4 serialization (#45573) by @vkuzo in [#45573]
- [AMD CI] Fix expectations for Gemma3n (#45602) by @Abdennacer-Badaoui in [#45602]
- [docs] multi-turn tool calling (#45554) by @stevhliu in [#45554]
- Fix
AttributeErrorons_aux=Noneinflash_attention_forward(#45589) by @jamesbraza in [#45589] - do not index past decoded chars with special tokens (#45435) by @itazap in [#45435]
- Update dev version (#45583) by @vasqu in [#45583]
- Update torchao usage for XPU and CPU (#45560) by @jiqing-feng in [#45560]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @vasqu
- fixing more typos (#45689)
- Update dev version (#45583)
- @joerowell
- Laguna XS.2 implementation (#45673)
- @tarekziade
- chore(typing): added modeling_utils to ty (#45425)
- chore: speedup modular converter (~30%) (#45046)
- chore(qa): split pipeline and add type checking (#45432)
- qa: bumped mlinter and allow local override (#45585)
- @harshaljanjani
- model: Add DEIMv2 to Transformers (#44339)
- @remi-or
- [CB] [Major] Add CPU request offloading (#45184)
- [CB] Changes for long generation (#45530)