CLIP, Predict the most relevant text snippet given an image
An open source implementation of CLIP
Embed images and sentences into fixed-length vectors
Automatically translates the text of a video based on a subtitle file
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Instant voice cloning by MIT and MyShell. Audio foundation model
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences
The most powerful and modular diffusion model GUI, API and backend
TorchMultimodal is a PyTorch library
Stable Diffusion web UI
LTX-Video Support for ComfyUI
Generating Immersive, Explorable, and Interactive 3D Worlds
Tensor search for humans
Large Multimodal Models for Video Understanding and Editing
TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Interface for OuteTTS models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
A Python library for audio data augmentation
MARS5 speech model (TTS) from CAMB.AI
Multi-Modal Neural Networks for Semantic Search, based on Mid-Fusion
The data structure for multimodal data
Implementation of Imagen, Google's Text-to-Image Neural Network
Automate the creation of many trailer-like videos with a single click!
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis
Generate Harmonious Colors Freely.