Name                                       Modified     Size      Downloads / Week
ESPnet version 202604 source code.tar.gz   2026-04-07   20.3 MB
ESPnet version 202604 source code.zip      2026-04-07   24.7 MB
README.md                                  2026-04-07   8.0 kB
Totals: 3 Items                                         45.0 MB   3

Summary

Overview

This release focuses on significant improvements to the Continuous Integration (CI) infrastructure, performance optimizations for TTS models, and the introduction of new recipe support for various languages and tasks. Key highlights include a major overhaul of the CI pipeline using Docker containers, a 90% performance boost in FastSpeech2 inference, and the addition of new ASR and TTS recipes for Kinyarwanda, Emilia, and other datasets.


Important PRs

🚀 Major CI Infrastructure Overhaul

  • PR [#6379] & [#6372]: Significant refactoring of the CI pipeline.
    • Introduced a new Docker-based build and test workflow to improve consistency, speed, and reproducibility.
    • Shifted from environment setup in individual jobs to using pre-built Docker images.
    • Modularized CI workflows for Ubuntu and macOS, splitting large jobs for better caching and parallelism.
    • Added a reusable composite GitHub Action for environment setup.
    • Improved Docker image publishing with a matrix strategy for CPU and GPU variants.
  • PR [#6394] & [#6371]: Enhanced robustness of installer scripts.
    • Added backup download URLs for FFmpeg installation and mwerSegmenter to handle primary source failures.
    • Updated macOS CI workflow to include FFmpeg installation via Homebrew and verification steps.
  • PR [#6321]: Updated PyTorch support to version 2.9.1, including improved installation script logic for CUDA compatibility.
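
The backup-URL behavior described for the installer scripts can be sketched as follows. This is a minimal illustration in Python, not the installers' actual code: the function names, the `fetch` callable, and the `example.invalid` URLs are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def download_with_fallback(
    urls: Sequence[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each URL in order; return (url, payload) for the first that succeeds."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # network errors such as URLError subclass OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all download sources failed:\n" + "\n".join(errors))


def fake_fetch(url: str) -> bytes:
    # Hypothetical stand-in for a real downloader: the primary source is "down".
    if "primary" in url:
        raise OSError("connection refused")
    return b"archive-bytes"


used_url, payload = download_with_fallback(
    [
        "https://primary.example.invalid/ffmpeg.tar.xz",  # hypothetical primary
        "https://backup.example.invalid/ffmpeg.tar.xz",   # hypothetical backup
    ],
    fake_fetch,
)
```

Passing the downloader in as a callable keeps the retry logic independent of any particular HTTP client, and collecting the per-URL errors makes a total failure easier to diagnose in CI logs.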

🚀 Performance Boost in FastSpeech2

  • PR [#6376]: Implemented shape-bucketing for XPU inference and torch.compile support.
    • Added ESPNET_BUCKET_INFER environment flag to round tensor sizes to fixed bucket boundaries, enabling efficient use of oneDNN/GEMM primitives.
    • Deferred encoder output trimming to just before the length regulator.
    • Fixed batch inference olens computation and added compiler directives to prevent recompilation.
    • Result: TTS inference time for Batch=8 dropped from 166 ms to 89 ms, about a 46% reduction in latency (roughly 1.9x throughput, reported as a 90% performance improvement).
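
The idea behind shape-bucketing can be sketched in a few lines: sequence lengths are rounded up to a fixed boundary so that the backend sees only a small set of distinct shapes. The sketch below is pure Python and only illustrative. The bucket width of 64, the helper names, and the convention that `ESPNET_BUCKET_INFER=1` turns the feature on are assumptions, not details taken from PR [#6376].

```python
import os

BUCKET = 64  # hypothetical bucket width; the real boundaries are not stated here


def bucketing_enabled(env=os.environ) -> bool:
    # ESPNET_BUCKET_INFER is the flag named in the notes; "1" meaning "on" is assumed.
    return env.get("ESPNET_BUCKET_INFER", "0") == "1"


def round_up_to_bucket(length: int, bucket: int = BUCKET) -> int:
    """Round a positive length up to the next multiple of bucket."""
    if length <= 0:
        raise ValueError("length must be positive")
    return -(-length // bucket) * bucket  # ceiling division, then scale back up


def pad_batch(batch, pad_id=0, bucket=BUCKET):
    """Pad every sequence to the bucketed maximum length; also return true lengths."""
    lengths = [len(seq) for seq in batch]
    target = round_up_to_bucket(max(lengths), bucket)
    padded = [list(seq) + [pad_id] * (target - len(seq)) for seq in batch]
    return padded, lengths


padded, lengths = pad_batch([[1, 2, 3], [4] * 70])
shapes = {len(row) for row in padded}
```

Here a batch whose longest sequence has 70 tokens is padded to 128, the same shape any batch with a maximum length between 65 and 128 would get. The true lengths are returned alongside the padded batch so that the padding can be trimmed or masked later.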

🚀 New Recipe Additions

  • PR [#6337]: Added a new TTS recipe for Kinyarwanda using Tacotron 2 with character-based tokenization.
  • PR [#6325]: Added an ASR recipe for the Tal-zh-adult-teach dataset (Mandarin Chinese educational data).
  • PR [#6295]: Added an ASR recipe for the kosp2e dataset (Korean Speech to English Translation Corpus).
  • PR [#6291]: Added a TTS recipe for the Emilia dataset using the VITS model.
  • PR [#6366]: Added a recipe for MS-SNSD (Speech Enhancement) as part of the ESPnet bootcamp.

๐Ÿ› Bug Fixes and Stability

  • PR [#6391]: Fixed inference artifact output support and added named multi-optimizer training support in espnet3.
  • PR [#6356]: Fixed Whisper tokenizer compatibility with Transformers v5 by switching to extra_special_tokens.
  • PR [#6309]: Added epsilon to standard deviation in normalization to prevent division by zero.
  • PR [#6302]: Fixed CategoryChunkIterFactory to use actual sample lengths instead of padded lengths, reducing silent chunks.
  • PR [#6306]: Fixed visinger2 inference to support phoneme duration inference.
  • PR [#6293]: Fixed data processing steps in the POWSM recipe.
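
The epsilon fix listed above guards against a zero standard deviation, which occurs whenever every value in a feature is identical. The following is a minimal sketch of the technique in plain Python. The function name and the value `eps=1e-8` are assumptions for illustration and are not taken from PR [#6309].

```python
import math


def normalize(values, eps=1e-8):
    """Mean-variance normalize a list of floats, adding eps to the std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    # Without eps, a constant input gives std == 0 and a division by zero.
    return [(v - mean) / (std + eps) for v in values]


constant = normalize([2.0, 2.0, 2.0])  # std is exactly 0 here
varied = normalize([1.0, 3.0])         # std is 1, so eps barely changes the result
```

For constant input the numerator is also zero, so the output is all zeros rather than an error or NaN; for ordinary input the epsilon is small enough to leave the result essentially unchanged.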

📚 Documentation and Refactoring

  • PR [#6335] & [#6386]: Updated documentation to reflect ESPnet1 EOL and standardized argument group creation across task modules.
  • PR [#6327]: Reorganized the espnet3 directory structure into components/, systems/, parallel/, and utils/.
  • PR [#6328] & [#6329]: Added ASR system and inference packages, along with logging utilities for espnet3.
  • PR [#6354], [#6353], [#6352]: Updated the SpeechLM module with improvements to the trainer, processor, model, data loading, and bin files.

Contributors

A total of 20 contributors participated in this release, including:

  • Fhrozen, Masao-Someki, jthakurH, whr-a, jctian98, chinjouli, Dahee96, osinkolu, HsunGong, South-Twilight, zheedong, NewGamezzz, LiChenda, HANJionghao, elnaske, sw005320, thecaptain789, popcornell, dependabot[bot], and pre-commit-ci[bot].

Full changelog

What's Changed

New Features

  • [SpeechLM] Multimodal IO (See [#6355], by @jctian98)
  • Add batch inference support for FastSpeech2 (See [#6333], by @jthakurH)

Enhancement

  • [espnet3-14.2] Bugfix on inference and multiple optimizer (See [#6391], by @Masao-Someki)
  • Add MS-SNSD enh1 recipe (See [#6366], by @Dahee96)

Recipe

  • [POWSM] POWSM-CTC recipe, and changes for s2t-ctc training (See [#6341], by @chinjouli)
  • feat(rw): add kinyarwanda tts recipe - Victor Olufemi (See [#6337], by @osinkolu)
  • Removal of egs folder (See [#6334], by @Fhrozen)
  • Add TAL_ZH_ADULT_TEACH ASR recipe (ESPnet Bootcamp) (See [#6325], by @HsunGong)
  • [SVS]: update recipe name (See [#6298], by @South-Twilight)
  • Add kosp2e asr recipe (Bootcamp Project) (See [#6295], by @zheedong)
  • Add Emilia TTS recipe (ESPnet Bootcamp) (See [#6291], by @NewGamezzz)

Bugfix

  • [espnet3-14.1] Add TEMPLATE and bugfix for integration test (See [#6390], by @Masao-Someki)
  • Fix Whisper tokenizer to use extra_special_tokens for Transformers v5 compatibility (See [#6356], by @Masao-Someki)
  • Add epsilon to standard deviation for normalization (See [#6309], by @LiChenda)
  • [SVS]: fix: fix visinger2 inference (See [#6306], by @South-Twilight)
  • fix CategoryChunkIterFactory (See [#6302], by @whr-a)
  • [POWSM] Fix data processing mentioned in issue [#6289] (See [#6293], by @chinjouli)

Documentation

  • [CI Fix] Documentation Update for CI (See [#6386], by @Fhrozen)
  • upgrade the Python version to 3.10 in README.md (See [#6369], by @sw005320)
  • Update Documentation with ESPnet1 EOL (See [#6335], by @Fhrozen)
  • [espnet3-13] Add logging utils (See [#6329], by @Masao-Someki)
  • [espnet3-12] Add ASR system and inference packages (See [#6328], by @Masao-Someki)
  • [espnet3-10] Merge espnet3 branch into master (See [#6304], by @Masao-Someki)

Refactoring

  • [espnet3-11] Directory update to espnet-3 (See [#6327], by @Masao-Someki)

Others

  • Add backup url for ffmpeg installation (See [#6394], by @Fhrozen)
  • Bump actions/upload-artifact from 6 to 7 (See [#6380], by @dependabot[bot])
  • CI Fix - Change Action/Cache (See [#6379], by @Fhrozen)
  • FastSpeech2: Add shape-bucketing for XPU inference and torch.compile (See [#6376], by @jthakurH)
  • CI test for Numpy fix (See [#6374], by @whr-a)
  • Refactoring of Actions (See [#6372], by @Fhrozen)
  • Backup Download link for mwerSegmenter (See [#6371], by @Fhrozen)
  • Update MacOS Github Action for FFMPEG (See [#6367], by @Fhrozen)
  • Fixed espeak-ng installation issue (See [#6365], by @Masao-Someki)
  • fix: correct typos 'seperated' and 'fuction' (See [#6359], by @thecaptain789)
  • [SpeechLM] Trainer, processor and model (See [#6354], by @jctian98)
  • [SpeechLM] Update data loading files (See [#6353], by @jctian98)
  • [SpeechLM] Update SpeechLM bin files (See [#6352], by @jctian98)
  • Small bug fix for GTSinger data processing (See [#6351], by @HANJionghao)
  • Fixed some minor stuff (See [#6349], by @popcornell)
  • [pre-commit.ci] pre-commit autoupdate (See [#6344], by @pre-commit-ci[bot])
  • [POWSM] Improve data prep for powsm (See [#6340], by @chinjouli)
  • Add unit tests for codec inference (See [#6332], by @elnaske)
  • Update Installation version to Pytorch 2.9.1 (See [#6321], by @Fhrozen)
  • 2025.11 Version pre-release (See [#6301], by @Fhrozen)

Acknowledgements

@Dahee96, @Fhrozen, @HANJionghao, @HsunGong, @LiChenda, @Masao-Someki, @NewGamezzz, @South-Twilight, @chinjouli, @dependabot[bot], @elnaske, @jctian98, @jthakurH, @osinkolu, @popcornell, @pre-commit-ci[bot], @sw005320, @thecaptain789, @whr-a, @zheedong.

Source: README.md, updated 2026-04-07