[CVPR 2025 Best Paper Award] VGGT
Taming Stable Diffusion for Lip Sync
TorchMultimodal is a PyTorch library
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
Multi-modal large language model designed for audio understanding
OCR expert VLM powered by Hunyuan's native multimodal architecture
Python code that converts / compresses an image to PI (3.14, π) indexes
Towards Human-Level Text-to-Speech through Style Diffusion
VITS2 backbone with multilingual-bert
A high quality MP3 encoder
A deep learning toolkit for Text-to-Speech, battle-tested in research
Generate 3D objects conditioned on text or images
A C++ library for AVR and NodeMCU
An open-source framework for training large multimodal models
Basaran, an open-source alternative to the OpenAI text completion API
Meta-Transformer for Unified Multimodal Learning
Neural machine translation and sequence learning using TensorFlow
Official codebase for I-JEPA
Singing voice change based on Whisper, with LoRA for singing voice cloning
Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion
CPT: A Pre-Trained Unbalanced Transformer
Text-conditional image generation model based on OpenAI's unCLIP
A latent text-to-image diffusion model
State-of-the-art deep learning based audio codec
Simple video encoder