Tiny vision-language model
Code for running inference and fine-tuning with the SAM 3 model
Unified Multimodal Understanding and Generation Models
LTX-Video Support for ComfyUI
GLM-Image: Auto-regressive Model for Dense-Knowledge and High-Fidelity Image Generation
Towards Real-World Vision-Language Understanding
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Python inference and LoRA trainer package for the LTX-2 audio–video model
This repository contains the official implementation of FastVLM
Multimodal model achieving SOTA performance
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Multimodal Diffusion with Representation Alignment
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Reference PyTorch implementation and models for DINOv3
Contexts Optical Compression
Foundational Models for State-of-the-Art Speech and Text Translation
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Phi-3.5 for Mac: Locally-run Vision and Language Models
Multimodal embedding and reranking models built on Qwen3-VL
Inference script for Oasis 500M
ICLR2024 Spotlight: curation/training code, metadata, distribution
Large language model & vision-language model based on Linear Attention
Official code for Style Aligned Image Generation via Shared Attention
PyTorch implementation of MAE (Masked Autoencoders)