A state-of-the-art open visual language model
Chinese and English multimodal conversational language model
Long-form streaming TTS system for multi-speaker dialogue generation
Open-source industrial-grade ASR models
OpenTinker is an RL-as-a-Service infrastructure for foundation models
Multimodal embedding and reranking models built on Qwen3-VL
Implementation of "MobileCLIP" (CVPR 2024)
VMZ: Model Zoo for Video Modeling
Official implementation of Watermark Anything with Localized Messages
Video understanding codebase from FAIR for reproducing video models
CLIP: predict the most relevant text snippet given an image
DeepMind model for tracking arbitrary points across videos & robotics
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
GPT-4V-level open-source multimodal model based on Llama3-8B
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Renderer for the harmony response format to be used with gpt-oss
The ChatGPT Retrieval Plugin lets you easily find personal documents
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Pretrained time-series foundation model developed by Google Research
General-purpose image editing model that delivers high-fidelity
Inference script for Oasis 500M
4M: Massively Multimodal Masked Modeling
This repository contains the official implementation of FastVLM
ICLR 2024 Spotlight: curation/training code, metadata, distribution
A PyTorch library for implementing flow matching algorithms