MiMo Audio is an open-source audio language model project focused on few-shot learning across speech and audio tasks. It explores how large-scale next-token prediction can help audio models generalize from a few examples or simple instructions. The project includes two models, MiMo-Audio-7B-Base and MiMo-Audio-7B-Instruct, along with a dedicated MiMo-Audio tokenizer. It supports audio understanding, speech intelligence, spoken dialogue, instruction-following audio generation, and text-to-speech-style tasks. The architecture chains four stages: audio tokenization, patch encoding, a language model, and patch decoding. Working on patches rather than individual audio tokens makes high-rate audio sequences more efficient to model. The project is useful for researchers and developers experimenting with advanced audio LLMs, speech generation, audio reasoning, and instruction-tuned multimodal systems.
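To illustrate why patching helps with high-rate audio sequences, here is a minimal sketch of the general idea: consecutive audio tokens are grouped into fixed-size patches, so the language model sees a shorter sequence, and the patches are flattened back afterwards. This is not the project's code. The function names `patchify` and `unpatchify`, the padding id, the patch size of 4, and the 25 Hz token rate are all assumptions chosen for the example.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, used only in this sketch


def patchify(tokens: List[int], patch_size: int) -> List[List[int]]:
    """Group consecutive tokens into fixed-size patches, padding the last one."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    # Pad so the sequence length is a multiple of patch_size.
    padded = tokens + [PAD_ID] * (-len(tokens) % patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches: List[List[int]], length: int) -> List[int]:
    """Flatten patches back into a token sequence and drop the padding."""
    flat = [token for patch in patches for token in patch]
    return flat[:length]


# Illustrative values, not taken from the project description.
token_rate_hz = 25
patch_size = 4

tokens = list(range(1, 51))           # 50 tokens, i.e. 2 seconds at 25 Hz
patches = patchify(tokens, patch_size)
restored = unpatchify(patches, len(tokens))

print(len(tokens), "tokens ->", len(patches), "patches")
print("effective rate:", token_rate_hz / patch_size, "Hz")
```

In a real model the patch encoder would map each patch to a single embedding and the patch decoder would generate the tokens inside each patch. The sketch only shows the change in sequence length that motivates the design.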
Features
- Audio language model for few-shot learning
- MiMo-Audio-7B-Base and MiMo-Audio-7B-Instruct model releases
- Dedicated MiMo-Audio tokenizer
- Audio understanding and speech intelligence support
- Instruction-following audio generation workflows
- Gradio demo and inference example scripts