Capable of understanding text, audio, vision, video
Qwen3-Omni is a natively end-to-end, omni-modal LLM
A GPT-4o-level MLLM for Vision, Speech and Multimodal Live Streaming
Large Multimodal Models for Video Understanding and Editing
Python inference and LoRA trainer package for the LTX-2 audio–video
OCR expert VLM powered by Hunyuan's native multimodal architecture
Code for the paper "Hybrid Spectrogram and Waveform Source Separation"