Code for running inference and fine-tuning with the SAM 3 model
Tiny vision language model
LTX-Video Support for ComfyUI
Unified Multimodal Understanding and Generation Models
GLM-Image: Auto-regressive Model for Dense-Knowledge and High-Fidelity Image Generation
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
This repository contains the official implementation of FastVLM
Towards Real-World Vision-Language Understanding
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Reference PyTorch implementation and models for DINOv3 (a minimal feature-extraction sketch follows this list)
Multimodal model achieving state-of-the-art performance
Multimodal Diffusion with Representation Alignment
Python inference and LoRA trainer package for the LTX-2 audio-video model
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Contexts Optical Compression
Foundational Models for State-of-the-Art Speech and Text Translation
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Phi-3.5 for Mac: Locally-run Vision and Language Models
Inference script for Oasis 500M
ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP
Large language model & vision-language model based on Linear Attention
Official code for Style Aligned Image Generation via Shared Attention
PyTorch implementation of MAE (Masked Autoencoders; a minimal masking sketch follows this list)
Multimodal 7B model for image, video, and text understanding tasks
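
For the DINOv3 entry above, the sketch below shows generic feature extraction through `torch.hub`, assuming the repo exposes hub entry points the way DINOv2 does. The repo path (`facebookresearch/dinov3`) and model name (`dinov3_vits16`) are assumptions for illustration and may differ in the actual release.

```python
# Minimal DINOv3 feature-extraction sketch.
# ASSUMPTION: the torch.hub repo path and model name below mirror the
# DINOv2 convention ("facebookresearch/dinov2", "dinov2_vits14") and are
# not confirmed against the actual DINOv3 release.
import torch

model = torch.hub.load("facebookresearch/dinov3", "dinov3_vits16")  # assumed entry point
model.eval()

# DINO-family backbones expect normalized RGB batches.
x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    feats = model(x)  # one global embedding vector per image
print(feats.shape)
```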
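For the MAE entry, the core trick is reconstructing pixels from a small visible subset of patch tokens. Below is a minimal, self-contained sketch of MAE-style random masking, not the repo's actual code: per-sample noise is argsorted into a random permutation, the first `len_keep` indices select visible patches, and the inverse permutation recovers the binary mask in original patch order.

```python
# Minimal sketch of MAE-style random patch masking (not the repo's code).
import torch

def random_masking(x, mask_ratio=0.75):
    """x: (B, N, D) patch embeddings. Keep a random (1 - mask_ratio) subset."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                       # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    # Gather the visible (kept) patch embeddings.
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(B, N)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return x_visible, mask, ids_restore

# Example: 196 patch tokens (a 14x14 grid) with 768-dim embeddings.
tokens = torch.randn(2, 196, 768)
visible, mask, ids_restore = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))  # (2, 49, 768); 147 patches masked per sample
```

Only the 49 visible tokens pass through the encoder; the decoder later reinserts mask tokens using `ids_restore`, which is what makes MAE's high mask ratio cheap to train.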