Code for running inference and fine-tuning with the SAM 3 model
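A hedged sketch of what image inference with such a repo might look like, modeled on the promptable-predictor pattern of earlier SAM releases; the `sam3` module path, `build_sam3` entry point, `SAM3ImagePredictor` class, and checkpoint name below are all assumptions for illustration, not the repo's confirmed API.

```python
import numpy as np

# Hypothetical imports following the SAM / SAM 2 predictor pattern;
# the actual SAM 3 module layout may differ.
from sam3.build_sam import build_sam3                 # assumed entry point
from sam3.image_predictor import SAM3ImagePredictor   # assumed class

# Assumed checkpoint filename.
predictor = SAM3ImagePredictor(build_sam3("sam3_base.pt"))

# Stand-in for a real RGB image (H, W, 3).
image = np.zeros((512, 512, 3), dtype=np.uint8)
predictor.set_image(image)

# Prompt with one foreground point and inspect the top mask.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
)
print(masks[0].shape, scores[0])
```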
LTX-Video Support for ComfyUI
Unified Multimodal Understanding and Generation Models
Tiny vision language model
GLM-Image: Auto-regressive Modeling for Dense-Knowledge and High-Fidelity Image Generation
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Chat & pretrained large vision language model
A state-of-the-art open visual language model
This repository contains the official implementation of FastVLM
Towards Real-World Vision-Language Understanding
Generating Immersive, Explorable, and Interactive 3D Worlds
VMZ: Model Zoo for Video Modeling
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
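A minimal sketch of chatting with a Qwen-VL-family checkpoint through Hugging Face `transformers`, using the generic `Auto*` classes in case the model-specific class name differs; the checkpoint id `Qwen/Qwen3-VL-8B-Instruct` and image URL are assumptions, and a recent `transformers` version with multimodal chat-template support is assumed.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One user turn mixing an image and a text question.
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image
    {"type": "text", "text": "Describe this image."},
]}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, dropping the prompt.
reply = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```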
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
Wan2.1: Open and Advanced Large-Scale Video Generative Model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Multimodal Diffusion with Representation Alignment
VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Reference PyTorch implementation and models for DINOv3
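A hedged sketch of loading a DINOv3 backbone for feature extraction via `torch.hub`, following the pattern the DINOv2 release used; the hub repo path `facebookresearch/dinov3` and the model name `dinov3_vits16` are assumptions, not confirmed identifiers.

```python
import torch

# Assumed hub repo and model name, mirroring the DINOv2 release convention.
model = torch.hub.load("facebookresearch/dinov3", "dinov3_vits16")
model.eval()

# Extract features from a dummy 3-channel 224x224 image batch.
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)
```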
Let's make video diffusion practical
Official implementation of Watermark Anything with Localized Messages
Python inference and LoRA trainer package for the LTX-2 audio-video model
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Multimodal model achieving SOTA performance