StreamSpeech is an “all-in-one” speech model that performs offline and simultaneous speech recognition, speech translation, and speech synthesis within a unified architecture. Introduced in an ACL 2024 paper, it targets streaming, low-latency scenarios where intermediate results and final translations or synthetic speech must be produced continuously while audio is still being received. The model supports eight tasks: offline ASR, speech-to-text translation (S2TT), speech-to-speech translation (S2ST), and TTS, plus their streaming or simultaneous counterparts, all handled by the same underlying system. During simultaneous translation, StreamSpeech can optionally emit intermediate ASR transcripts and text translations, giving users and downstream applications real-time visibility into what the system is hearing and how it is translating.
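This streaming behavior can be pictured as an event loop: the model interleaves partial text with synthesized audio, and a client renders each piece as it arrives. The sketch below is purely illustrative; `StreamingEvent`, `fake_event_stream`, and `consume` are hypothetical names standing in for the project's actual fairseq/SimulEval-based interface.

```python
"""Illustrative consumer for StreamSpeech-style simultaneous output.

All names here are hypothetical stand-ins; the real model is driven
through fairseq and SimulEval. The event flow mirrors the behavior
described above: intermediate ASR transcripts and translations arrive
alongside synthesized speech chunks.
"""
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StreamingEvent:
    kind: str       # "asr" | "translation" | "speech"
    payload: object # partial text, or a chunk of waveform samples


def fake_event_stream() -> Iterable[StreamingEvent]:
    # Stand-in for the model: interleaves partial text with audio chunks.
    yield StreamingEvent("asr", "bonjour")
    yield StreamingEvent("translation", "hello")
    yield StreamingEvent("speech", [0.0] * 320)  # 20 ms at 16 kHz
    yield StreamingEvent("asr", "bonjour tout le monde")
    yield StreamingEvent("translation", "hello everyone")
    yield StreamingEvent("speech", [0.0] * 320)


def consume(events: Iterable[StreamingEvent]) -> List[float]:
    """Print intermediate text as it arrives; buffer audio for playback."""
    audio: List[float] = []
    for ev in events:
        if ev.kind == "asr":
            print(f"[partial transcript]  {ev.payload}")
        elif ev.kind == "translation":
            print(f"[partial translation] {ev.payload}")
        else:  # "speech": accumulate waveform samples
            audio.extend(ev.payload)
    return audio


if __name__ == "__main__":
    consume(fake_event_stream())
```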
## Features
- Unified model for ASR, speech translation, and TTS in both offline and streaming modes
- Supports eight distinct tasks including simultaneous S2ST, S2TT, and real-time TTS
- Outputs intermediate transcripts and translations for richer low-latency interaction
- SimulEval integration and agent scripts for systematic streaming evaluation (see the agent sketch after this list)
- Web GUI demo and project page with audio samples and visualizations
- Achieves state-of-the-art performance on offline and simultaneous speech-to-speech translation
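Streaming evaluation runs through SimulEval agents that decide, at each step, whether to read more audio or to write output. The toy agent below shows the general shape of such an agent using SimulEval's documented `ReadAction`/`WriteAction` API, closely following the library's own speech-to-text example; the one-second wait policy and placeholder output are illustrative, not StreamSpeech's actual agent logic.

```python
from simuleval.utils import entrypoint
from simuleval.agents import SpeechToTextAgent
from simuleval.agents.actions import ReadAction, WriteAction


@entrypoint
class ChunkedPlaceholderAgent(SpeechToTextAgent):
    """Toy policy: wait until ~1 s of audio has arrived, then emit one
    placeholder token per subsequent chunk. A real StreamSpeech agent
    would decode with the trained model instead."""

    def policy(self):
        # self.states.source holds all audio samples received so far.
        # Fall back to 16 kHz in case no chunk has set the rate yet.
        rate = self.states.source_sample_rate or 16000
        length_in_seconds = len(self.states.source) / rate

        if not self.states.source_finished and length_in_seconds < 1.0:
            # Too little audio so far: ask SimulEval for the next segment.
            return ReadAction()

        # Emit a placeholder hypothesis token for this step.
        return WriteAction(
            content=f"token-{length_in_seconds:.1f}s",
            finished=self.states.source_finished,
        )
```

With SimulEval installed, an agent file like this is typically launched via `simuleval --agent agent.py --source <source list> --target <references>`, after which SimulEval reports quality and latency metrics such as BLEU and Average Lagging.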