Contexts Optical Compression
Audio foundation model excelling in audio understanding
Accurate × Fast × Comprehensive
Video understanding codebase from FAIR for reproducing video models
Foundational Models for State-of-the-Art Speech and Text Translation
Visual Causal Flow
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Qwen3-ASR is an open-source series of ASR models
Language modeling in a sentence representation space
Detect faces in an image
Blazeface is a lightweight model that detects faces in images
Code release for ConvNeXt V2 model
The official PyTorch implementation of our paper
Multimodal Transformer for document image understanding and layout
Portuguese ASR model fine-tuned on XLSR-53 for 16kHz audio input
ClinicalBERT model trained on MIMIC notes for clinical NLP tasks
Russian ASR model fine-tuned on Common Voice and CSS10 datasets