Bringing BERT into modernity via both architecture changes and scaling
This repository contains the official implementation of FastVLM
Unified Multimodal Understanding and Generation Models
Collection of Gemma 3 variants that are trained for performance
Industrial-level controllable zero-shot text-to-speech system
End-to-end speech processing toolkit
PyTorch code and models for V-JEPA self-supervised learning from video
PyTorch code and models for VJEPA2 self-supervised learning from video
Accurate × Fast × Comprehensive
Visual Causal Flow
Towards Real-World Vision-Language Understanding
Official inference repo for FLUX.2 models
Fast multimodal LLM for real-time voice interaction and AI apps
Retrieval and Retrieval-augmented LLMs
Provides code for running inference with the Segment Anything Model
Encoder of greater-than-word length text trained on a variety of data
Multimodal model achieving SOTA performance
[CVPR 2025 Best Paper Award] VGGT
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
Towards Human-Level Text-to-Speech through Style Diffusion
Generate 3D objects conditioned on text or images
Transformer-related optimization, including BERT, GPT
A latent text-to-image diffusion model
A High Performance Library for Sequence Processing and Generation
PyTorch implementation of MAE