Qwen-Audio

Qwen-Audio is a large audio-language model developed by Alibaba Cloud. It accepts several types of audio input (speech, natural sounds, music, singing) together with text, and produces text output. An instruction-tuned version, Qwen-Audio-Chat, supports multi-round conversation over audio and text input, creative tasks, and reasoning over audio. The model is trained with a multi-task framework covering more than 30 audio tasks and performs strongly across multiple benchmarks without task-specific fine-tuning. Its capabilities include flexible multi-turn chat, audio understanding and reasoning, music appreciation, and tool usage (e.g. voice editing).

Features

Supports various audio types: speech, natural sounds, music, and singing
Multi-task training framework covering 30+ audio tasks, designed to allow transfer across tasks while avoiding interference between them
Audio and text input with text output; Qwen-Audio-Chat adds multi-round dialogue over audio and text
Strong zero- or few-shot performance: achieves state-of-the-art results on multiple audio benchmarks (AISHELL-1, CochlScene, ClothoAQA, VocalSound) without task-specific fine-tuning
Flexibility: supports multi-audio analysis, sound understanding and reasoning, creative tasks such as music appreciation, and external tool usage (e.g. voice editing)
Multilingual audio support across many languages and dialects; voice chat modes; designed for flexible real-world audio interaction scenarios
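
The multi-round audio and text interaction described above can be pictured with a small sketch. This is an illustration only, not the Qwen-Audio API: the AudioChatSession class, the segment dictionaries, the file name sample.wav, and the fake_model stand-in are all hypothetical names introduced here, and no real model is loaded or called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical types for illustration only; not the Qwen-Audio API.
Segment = Dict[str, str]  # {"audio": path_or_url} or {"text": prompt}


@dataclass
class AudioChatSession:
    """Tracks a multi-round conversation mixing audio and text inputs."""
    history: List[Tuple[List[Segment], str]] = field(default_factory=list)

    def build_query(self, text: str, audio: Optional[str] = None) -> List[Segment]:
        """Return one user turn: an optional audio segment followed by text."""
        query: List[Segment] = []
        if audio is not None:
            query.append({"audio": audio})
        query.append({"text": text})
        return query

    def record(self, query: List[Segment], reply: str) -> None:
        """Store a finished round so later turns can refer back to it."""
        self.history.append((query, reply))


def fake_model(query: List[Segment], history: list) -> str:
    """Stand-in for a real model call; only reports what it was given."""
    n_audio = sum(1 for seg in query if "audio" in seg)
    return f"round {len(history) + 1}: {n_audio} audio clip(s) received"


session = AudioChatSession()

# Round 1: audio plus a text question about it.
q1 = session.build_query("What is the speaker saying?", audio="sample.wav")
session.record(q1, fake_model(q1, session.history))

# Round 2: text only, relying on the stored history for context.
q2 = session.build_query("Summarize that in one sentence.")
session.record(q2, fake_model(q2, session.history))
```

The point of the sketch is the shape of the exchange: each round pairs an ordered list of audio and text segments with a text reply, and earlier rounds are passed along so that a follow-up question can refer to audio supplied previously.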

License

Apache License V2.0

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Large Language Models (LLM), Python AI Models

Registered

2025-09-23
