NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
Multi-modal large language model designed for audio understanding
OCR expert VLM powered by Hunyuan's native multimodal architecture
Towards Human-Level Text-to-Speech through Style Diffusion
VITS2 backbone with multilingual-bert
A deep learning toolkit for Text-to-Speech, battle-tested in research
Generate 3D objects conditioned on text or images
Basaran, an open-source alternative to the OpenAI text completion API
An open-source framework for training large multimodal models
Meta-Transformer for Unified Multimodal Learning
Neural machine translation and sequence learning using TensorFlow
Singing voice conversion based on Whisper, with LoRA for singing voice cloning
Text-to-3D, Image-to-3D, and mesh export with NeRF + Diffusion
CPT: A Pre-Trained Unbalanced Transformer
Text-conditional image generation model based on OpenAI's unCLIP
A latent text-to-image diffusion model
PyTorch implementation of MAE
Deep learning PyTorch library for time series forecasting
Reformer, the efficient Transformer, in PyTorch
Clone a voice in 5 seconds to generate arbitrary speech in real-time
ALIbaba's Collection of Encoder-decoders from MinD
Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020)
Adversarial Latent Autoencoders
An implementation of Tacotron 2 that supports multilingual experiments
End-to-end object detection with transformers