Visual Causal Flow
LTX-Video Support for ComfyUI
Code for running inference and finetuning with SAM 3 model
Official Python inference and LoRA trainer package
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Tiny vision language model
Recovering the Visual Space from Any Views
Unified Multimodal Understanding and Generation Models
Python inference and LoRA trainer package for the LTX-2 audio–video
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Video Object and Interaction Deletion
Multimodal Diffusion with Representation Alignment
Contexts Optical Compression
This repository contains the official implementation of FastVLM
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Reference PyTorch implementation and models for DINOv3
Foundation model for image generation
Phi-3.5 for Mac: Locally-run Vision and Language Models
General-purpose image editing model that delivers high-fidelity
Multimodal embedding and reranking models built on Qwen3-VL
Inference script for Oasis 500M
ICLR 2024 Spotlight: curation/training code, metadata, distribution
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Towards Real-World Vision-Language Understanding