Awesome multilingual OCR toolkits based on PaddlePaddle
Contexts Optical Compression
Audio foundation model excelling in audio understanding
Robust Speech Recognition Across Languages, Dialects
Accurate × Fast × Comprehensive
Repo of Qwen2-Audio chat & pretrained large audio language model
Video understanding codebase from FAIR for reproducing video models
Foundational Models for State-of-the-Art Speech and Text Translation
Visual Causal Flow
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Capable of understanding text, audio, vision, video
Qwen3-Coder is the code version of Qwen3
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Qwen3-ASR is an open-source series of ASR models
Language modeling in a sentence representation space
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Chat & pretrained large vision language model
Detect faces in an image
Blazeface is a lightweight model that detects faces in images
Code release for ConvNeXt V2 model
The official PyTorch implementation of our paper
Multimodal Transformer for document image understanding and layout
Portuguese ASR model fine-tuned on XLSR-53 for 16 kHz audio input