CLIP, Predict the most relevant text snippet given an image
Generate Any 3D Scene in Seconds
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Large Multimodal Models for Video Understanding and Editing
Implementation of the Surya Foundation Model for Heliophysics
Diffusion Transformer with Fine-Grained Chinese Understanding
Language modeling in a sentence representation space
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
Official PyTorch Implementation of "Scalable Diffusion Models"
A library for Multilingual Unsupervised or Supervised word Embeddings
Code for reproducing key results in the paper