CLIP: predict the most relevant text snippet given an image
Multimodal-Driven Architecture for Customized Video Generation
Video understanding codebase from FAIR for reproducing video models
Let us control diffusion models
Code for the paper "Hybrid Spectrogram and Waveform Source Separation"
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech)
Code release for the ConvNeXt V2 model
Code for reproducing key results in the paper