SenseVoice is a speech foundation model that performs multiple voice-understanding tasks on audio input within a single system: automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It is trained on more than 400,000 hours of speech data and supports multilingual recognition in over 50 languages.

The model is built for high transcription accuracy with efficient inference, and it ships in variants optimized for either speed or accuracy, so developers can choose the configuration that suits their use case. Beyond transcription, SenseVoice detects emotional cues in speech and identifies common sound events such as applause, laughter, and coughing. The project also provides tooling for running inference, exporting models to formats such as ONNX and LibTorch, and deploying the system through APIs.
Features
- Multilingual automatic speech recognition supporting more than 50 languages
- Spoken language identification to determine the language in audio input
- Speech emotion recognition capable of detecting emotional tone in speech
- Audio event detection for identifying sounds such as applause or coughing
- Efficient inference with low latency using non-autoregressive model variants
- Deployment options including Python API, service endpoints, and web UI
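Models of this kind often return the language, emotion, and event predictions inline, as special tokens prefixed to the transcript (for example `<|en|><|HAPPY|><|Speech|>hello world`). The exact tag vocabulary and format here are illustrative assumptions, not the project's authoritative output spec; below is a minimal sketch of splitting such a tagged string into structured fields:

```python
import re

# Matches one leading special token of the form <|...|>, e.g. <|en|>.
# The token format is an assumption modeled on SenseVoice-style rich
# transcription output; consult the project docs for the real vocabulary.
TAG_RE = re.compile(r"^<\|([^|]+)\|>")

def parse_rich_transcript(raw: str) -> dict:
    """Split a tagged transcript into its leading tags and plain text.

    Leading <|...|> tokens are collected in order; everything after the
    last tag is treated as the transcript text.
    """
    tags = []
    rest = raw
    while (m := TAG_RE.match(rest)):
        tags.append(m.group(1))
        rest = rest[m.end():]
    return {"tags": tags, "text": rest.strip()}

# Example with a hypothetical tagged output string:
parsed = parse_rich_transcript("<|en|><|HAPPY|><|Speech|>hello world")
print(parsed)  # {'tags': ['en', 'HAPPY', 'Speech'], 'text': 'hello world'}
```

In practice the first tags would map onto the LID, SER, and AED predictions listed above, with the remainder being the ASR transcript.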