Tiny vision language model
Visual Causal Flow
Moonshot's most powerful AI model
Code for running inference and finetuning with the SAM 3 model
LTX-Video Support for ComfyUI
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Recovering the Visual Space from Any Views
Let's make video diffusion practical
Wan2.1: Open and Advanced Large-Scale Video Generative Model
Python inference and LoRA trainer package for the LTX-2 audio–video
Inference script for Oasis 500M
OCR expert VLM powered by Hunyuan's native multimodal architecture
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Official code for Style Aligned Image Generation via Shared Attention
GLIDE: a diffusion-based text-conditional image synthesis model
OpenAI’s compact 20B open model for fast, agentic, and local use
OpenAI’s open-weight 120B model optimized for reasoning and tooling
Vision-language-action model for robot control via images and text