...The repository hosts the code and tooling for training, fine-tuning, and serving high-quality TTS, while the current flagship models (OpenAudio-S1 and S1-mini) are distributed via Fish Audio’s playground and Hugging Face. The models are evaluated with Seed TTS metrics and achieve exceptionally low word and character error rates, indicating strong intelligibility and alignment between text and audio. Fish Speech emphasizes expressive and controllable voices: it supports a long list of emotion tags, tone markers, and special audio effect markers that can be embedded in the text to drive prosody and vocal style, from basic emotions to nuanced states like sarcastic, conciliative, or hysterical. The system is multilingual and cross-lingual, handling multiple languages in a single input without explicit phoneme markup, and is trained on large-scale datasets.