SWIFT LLM v4.2.0
Name                         Modified     Size
README.md                    2026-05-07   19.2 kB
v4.2.0 source code.tar.gz    2026-05-07   15.4 MB
v4.2.0 source code.zip       2026-05-07   16.1 MB
Totals: 3 items                           31.5 MB   (0 downloads/week)

New Features

  1. Megatron-SWIFT
     a. Added model_type support: kimi_k25, hy_v3, llava_onevision. (llava_onevision contributed by @randydl)
     b. Added support for GLM-5 shared-parameter MTP, which can be enabled via the --mtp_shared_weights argument (see the first sketch after this list).
     c. Added support for Qwen3.5 FP8 training. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
     d. Custom Megatron model documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Custom-Model.html
     e. Added support for controlling whether decoder_input is detached (gradient stopped) in the MTP branch, i.e., whether the MTP loss can backpropagate gradients through decoder_input to the Embedding/ViT, configurable via the --mtp_decoder_input_detach argument.
     f. mlp_padding_free is now compatible with sequence parallelism.
     g. Added support for FP8 weight quantization export via the megatron export command. Script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
     h. Removed dependency compatibility support for megatron-core versions 0.12 - 0.14.
  2. RL
     a. GKD/OPSD now supports the generation_batch_size/steps_per_generation parameters (see the GKD sketch after this list).
     b. GKD/OPSD teacher_server_api is now compatible with multimodal training.
     c. GKD/OPSD is now compatible with padding_free.
     d. Megatron GRPO/GKD weight synchronization now supports syncing LoRA weights only.
     e. Added exception handling to swift rollout to prevent silent process hangs.
     f. GRPO ref_sync_callback now supports layer-wise gather under ZeRO-3 to avoid OOM.
     g. GRPO TRL dependency upgraded to >= 0.26.
  3. Training
     a. Added support for Qwen3.5 sequence parallelism, controllable via the --sequence_parallel_size argument. (Contributed by @meichangsu1)
     b. Added support for specifying loss_scale directly in the dataset for more flexible loss control (see the dataset sketch after this list). Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
     c. The datasets dependency is now compatible with 4.x versions.
     d. cached_dataset is now compatible with the --truncation_strategy split strategy.
  4. Hardware
     a. NPU now supports Qwen3.5 training with the transformers/Megatron backends. When using the Megatron backend, the USE_MCORE_GDN=0 environment variable must be set (see the NPU sketch after this list). (Contributed by @addsubmuldiv, @hazelduan)
     b. Added AMD support documentation: https://swift.readthedocs.io/en/latest/BestPractices/AMD-support.html (Contributed by @Treemann)
     c. Added RL training support for MetaX hardware. (Contributed by @suenphey)
     d. NPU Megatron training is now compatible with megatron-core 0.15.3. (Contributed by @addsubmuldiv)
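
The following sketch illustrates items 1b/1e: combining the two new MTP switches in a Megatron-SWIFT launch. Only --mtp_shared_weights and --mtp_decoder_input_detach are the flags introduced in this release; the subcommand, checkpoint, dataset, and flag value format are illustrative assumptions rather than a verified recipe.

    # Hypothetical GLM-5 launch: enable shared-parameter MTP and let the MTP loss
    # backpropagate through decoder_input to the Embedding/ViT (detach disabled).
    megatron sft \
        --load GLM-5-mcore \
        --dataset my_dataset.jsonl \
        --mtp_shared_weights true \
        --mtp_decoder_input_detach false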
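
For item 2a, a minimal sketch of a GKD run that sets the newly supported generation scheduling parameters. The rlhf entry point and the student/teacher arguments are assumptions about the usual swift rlhf layout; only generation_batch_size and steps_per_generation are the parameters named above.

    # Hypothetical GKD launch; the two generation scheduling flags are the point here.
    swift rlhf \
        --rlhf_type gkd \
        --model my-student-model \
        --teacher_model my-teacher-model \
        --dataset my_dataset.jsonl \
        --generation_batch_size 64 \
        --steps_per_generation 4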
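
For item 3b, a dataset sketch showing where a per-message loss_scale might live. The "messages" layout follows the standard custom-dataset format; the placement and name of the loss_scale field here are an assumption, so check the linked Custom-dataset documentation for the authoritative schema.

    # Hypothetical JSONL row with an explicit loss_scale on the assistant turn.
    row='{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4", "loss_scale": 2.0}]}'
    printf '%s\n' "$row" > train.jsonl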
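
For item 4a, a sketch of launching Qwen3.5 training on NPU with the Megatron backend. The environment variable is the one named above; the subcommand and the checkpoint/dataset arguments are placeholders.

    # Disable the MCore GDN path on NPU, as required for Qwen3.5 with the Megatron backend.
    USE_MCORE_GDN=0 megatron sft \
        --load Qwen3.5-mcore \
        --dataset my_dataset.jsonl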

New Models

  1. Text-only Models
     a. ZhipuAI/GLM-5.1
     b. MiniMax/MiniMax-M2.7
     c. moonshotai/Kimi-K2.6 (text-only)
     d. Tencent-Hunyuan/Hy3-preview
     e. AIDC-AI/Marco-Nano-Instruct series
  2. Multimodal Models
     a. Qwen/Qwen3.6-35B-A3B, Qwen/Qwen3.6-27B
     b. Qwen3-ASR (Contributed by @xut806)
     c. Added mixed-modality dataset training support for Gemma4 series models.
     d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
     e. OpenBMB/MiniCPM-o-4_5 now supports the audio modality. (Contributed by @fanqiNO1)
     f. allenai/Molmo2-4B (Contributed by @Kagura-0001)

What's Changed

New Contributors

Full Changelog: https://github.com/modelscope/ms-swift/compare/v4.1.0...v4.2.0

Source: README.md, updated 2026-05-07