Wan2.1: Open and Advanced Large-Scale Video Generative Model
Wan2.2: Open and Advanced Large-Scale Video Generative Model
Text- and image-to-video generation: CogVideoX and CogVideo
Multimodal-Driven Architecture for Customized Video Generation
A Customizable Image-to-Video Model based on HunyuanVideo
Official Python inference and LoRA trainer package (see the LoRA sketch after this list)
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
RGBD video generation model conditioned on camera input
Diffusion model (SD, Flux, Wan, Qwen Image, Z-Image, ...) inference (see the inference sketch after this list)
GPT-4V-level open-source multi-modal model based on Llama3-8B
Capable of understanding text, audio, vision, and video
Let's make video diffusion practical
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
Recovering the Visual Space from Any Views
Moonshot's most powerful AI model
Generating Immersive, Explorable, and Interactive 3D Worlds
Sharp Monocular Metric Depth in Less Than a Second
Qwen3-Omni is a natively end-to-end, omni-modal LLM
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Advancing Open-source World Models
Code for running inference and fine-tuning with the SAM 3 model
GLM-4.6V/4.5V/4.1V-Thinking: towards versatile multimodal reasoning
Multimodal embedding and reranking models built on Qwen3-VL
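
For the LoRA trainer entry above, here is a minimal sketch of what LoRA fine-tuning setup typically looks like, using the Hugging Face peft library rather than that package's own API; the base model, rank, and target modules are illustrative assumptions.

```python
# Minimal LoRA setup sketch using Hugging Face peft (illustrative only;
# not the API of the specific trainer package listed above).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

# LoRA injects trainable low-rank adapter matrices into selected weight
# matrices; the frozen base weights are left untouched during training.
config = LoraConfig(
    r=8,                        # adapter rank (assumed hyperparameter)
    lora_alpha=16,              # adapter scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```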
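
For the diffusion-inference entry, a minimal text-to-image sketch with the Hugging Face diffusers library shows the usual load-and-sample flow; the checkpoint id, step count, and device are assumptions, and the listed package's own API may differ.

```python
# Minimal diffusion inference sketch using Hugging Face diffusers
# (illustrative; the listed package may expose a different interface).
import torch
from diffusers import DiffusionPipeline

# Load a pretrained pipeline; the checkpoint id is an assumption.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")  # assumes a CUDA GPU is available

# Denoise from random noise to an image over a fixed number of steps.
image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("lighthouse.png")
```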