Awesome multilingual OCR toolkits based on PaddlePaddle
Contexts Optical Compression
Open-source industrial-grade ASR models
Audio foundation model excelling in audio understanding
Robust Speech Recognition Across Languages, Dialects
Accurate × Fast × Comprehensive
OCR expert VLM powered by Hunyuan's native multimodal architecture
Repo of Qwen2-Audio chat & pretrained large audio language model
Video understanding codebase from FAIR for reproducing video models
Foundational Models for State-of-the-Art Speech and Text Translation
Visual Causal Flow
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Capable of understanding text, audio, vision, video
Qwen3-Coder is the code version of Qwen3
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Qwen3-ASR is an open-source series of ASR models
VMZ: Model Zoo for Video Modeling
Language modeling in a sentence representation space
Multi-modal large language model designed for audio understanding
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Chat & pretrained large vision language model
Detect faces in an image