This repository contains the official implementation of FastVLM
Unified Multimodal Understanding and Generation Models
Industrial-level controllable zero-shot text-to-speech system
Collection of Gemma 3 variants that are trained for performance
Open-source industrial-grade ASR models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Visual Causal Flow
Towards Real-World Vision-Language Understanding
Moonshot's most powerful AI model
Accurate × Fast × Comprehensive
Official inference repo for FLUX.2 models
Qwen2.5-VL is the multimodal large language model series
Encoder of greater-than-word length text trained on a variety of data
Multimodal model achieving SOTA performance
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
Multi-modal large language model designed for audio understanding
OCR expert VLM powered by Hunyuan's native multimodal architecture
Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion
A latent text-to-image diffusion model
PyTorch implementation of MAE
Facebook AI Research Sequence-to-Sequence Toolkit
Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)
Code for "Image Generation from Scene Graphs", Johnson et al., CVPR 201
Dual LSTM Encoder for Dialog Response Generation
Compact 8B multimodal instruct model optimized for edge deployment