Code for running inference and fine-tuning with the SAM 3 model
LTX-Video Support for ComfyUI
GLM-Image: Auto-regressive Model for Dense-Knowledge and High-Fidelity Image Generation
Unified Multimodal Understanding and Generation Models
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
A state-of-the-art open visual language model
Chat & pretrained large vision language model
Tiny vision language model
This repository contains the official implementation of FastVLM
Generating Immersive, Explorable, and Interactive 3D Worlds
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
Towards Real-World Vision-Language Understanding
CogView4, CogView3-Plus, and CogView3 (ECCV 2024)
Wan2.1: Open and Advanced Large-Scale Video Generative Model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Multimodal Diffusion with Representation Alignment
Reference PyTorch implementation and models for DINOv3
Let's make video diffusion practical
Python inference and LoRA trainer package for the LTX-2 audio–video model
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Foundational Models for State-of-the-Art Speech and Text Translation
Contexts Optical Compression
Multimodal model achieving state-of-the-art performance
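
Most of the checkpoints listed above are published on the Hugging Face Hub, so a quick way to try one of the vision-language models is the transformers image-text-to-text pipeline. The sketch below is illustrative rather than taken from any of these repos: it assumes a recent transformers release that ships this pipeline task, and the model ID and image URL are placeholder assumptions; check each repo's README for its exact loading code.

```python
# Minimal sketch (assumptions: a recent transformers release with the
# "image-text-to-text" pipeline; the model ID and image URL below are
# placeholders, not taken from the repos listed above).
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Qwen/Qwen2-VL-2B-Instruct",  # stand-in checkpoint; swap per repo README
)

# Chat-style input: one user turn containing an image plus a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=64)
print(result[0]["generated_text"])
```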