Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Multimodal-Driven Architecture for Customized Video Generation
Chinese and English multimodal conversational language model
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Learning to Act by Watching Unlabeled Online Videos