Wan2.1: Open and Advanced Large-Scale Video Generative Model
Wan2.2: Open and Advanced Large-Scale Video Generative Model
Text- and image-to-video generation: CogVideoX and CogVideo (usage sketch at the end of this list)
Multimodal-Driven Architecture for Customized Video Generation
Official Python inference and LoRA trainer package
A Customizable Image-to-Video Model based on HunyuanVideo
State-of-the-art Image & Video CLIP and Multimodal Large Language Models
RGBD video generation model conditioned on camera input
Let's make video diffusion practical
GPT-4V-level open-source multi-modal model based on Llama3-8B
Capable of understanding text, audio, vision, and video
Tencent Hunyuan multimodal diffusion transformer (MM-DiT) model
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
Recovering the Visual Space from Any Views
Generating Immersive, Explorable, and Interactive 3D Worlds
Sharp Monocular Metric Depth in Less Than a Second
Qwen3-Omni is a natively end-to-end, omni-modal LLM
A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming
Advancing Open-source World Models
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Code for running inference and fine-tuning with the SAM 3 model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Multimodal embedding and reranking models built on Qwen3-VL
Project Lyra: Open Generative 3D World Models
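
For reference, a minimal text-to-video sketch using the Hugging Face diffusers integration of CogVideoX (from the CogVideoX entry above); the model id, prompt, and sampling settings are illustrative assumptions, not taken from this list:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load a CogVideoX checkpoint (assumed id; swap in the variant you use).
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)
pipe.to("cuda")

# Sample a short clip from a text prompt; 49 frames at 8 fps is about 6 seconds.
frames = pipe(
    prompt="a red panda climbing a snowy pine tree",  # illustrative prompt
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)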