Tencent Hunyuan Multimodal Diffusion Transformer (MM-DiT) model
Multimodal-Driven Architecture for Customized Video Generation
Chinese and English multimodal conversational language model
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
High-Fidelity and Controllable Generation of Textured 3D Assets
Learning to Act by Watching Unlabeled Online Videos