CLIP: Predict the most relevant text snippet given an image
PyTorch code and models for VJEPA2 self-supervised learning from video
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Synchronized Translation for Videos
PyTorch code and models for V-JEPA self-supervised learning from video
Generate Any 3D Scene in Seconds
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Large Multimodal Models for Video Understanding and Editing
Implementation of the Surya Foundation Model for Heliophysics
SOTA discrete acoustic codec models with 40/75 tokens per second
Diffusion Transformer with Fine-Grained Chinese Understanding
Language modeling in a sentence representation space
Generate 3D objects conditioned on text or images
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
Code release for "Detecting Twenty-thousand Classes
Official PyTorch Implementation of "Scalable Diffusion Models"
PyBullet Gymnasium environments for multi-agent reinforcement
Deep Hough Voting for 3D Object Detection in Point Clouds
A library for Multilingual Unsupervised or Supervised word Embeddings
Code for reproducing key results in the paper
Efficient Approximate Nearest Neighbors for General Metric Spaces