Tencent Hunyuan multimodal diffusion transformer (MM-DiT) model
Multimodal-Driven Architecture for Customized Video Generation
Clean and efficient FP8 GEMM kernels with fine-grained scaling
Learning to Act by Watching Unlabeled Online Videos
Metric monocular depth estimation (vision model)