This repo contains the code for a 1D tokenizer and generator
Code for running inference and finetuning with the SAM 3 model
Witness the aha moment of VLM with less than $3
LTX-Video Support for ComfyUI
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
A state-of-the-art open visual language model
Visual Instruction Tuning: Large Language-and-Vision Assistant
Chat & pretrained large vision language model
Unified Multimodal Understanding and Generation Models
A framework to enable multimodal models to operate a computer
Parse files for optimal RAG
This repository contains the official implementation of FastVLM
Generating Immersive, Explorable, and Interactive 3D Worlds
Towards Real-World Vision-Language Understanding
Wan2.1: Open and Advanced Large-Scale Video Generative Model
SAPIEN Manipulation Skill Framework
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Refer and Ground Anything Anywhere at Any Granularity
Self-supervised visual learning using momentum contrast in PyTorch
Reference PyTorch implementation and models for DINOv3
Let's make video diffusion practical
Multimodal Diffusion with Representation Alignment
Taming Stable Diffusion for Lip Sync
A neural network that transforms a design mock-up into a static website