Code for running inference with the SAM 3D Body Model (3DB)
A Customizable Image-to-Video Model based on HunyuanVideo
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Chinese and English multimodal conversational language model
Let's make video diffusion practical
Reference PyTorch implementation and models for DINOv3
GPT4V-level open-source multi-modal model based on Llama3-8B
Diffusion Transformer with Fine-Grained Chinese Understanding
Code for running inference and finetuning with the SAM 3 model
Contexts Optical Compression
Personalize Any Characters with a Scalable Diffusion Transformer
Unified Multimodal Understanding and Generation Models
Sharp Monocular Metric Depth in Less Than a Second
Capable of understanding text, audio, vision, and video
Official implementation of DreamCraft3D
A state-of-the-art open visual language model
Phi-3.5 for Mac: Locally-run Vision and Language Models
Qwen3-Omni is a natively end-to-end, omni-modal LLM
A Systematic Framework for Interactive World Modeling
Implementation of "MobileCLIP" CVPR 2024
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
PyTorch code and models for the DINOv2 self-supervised learning method
Tooling for the Common Objects In 3D dataset
Code for Mesh R-CNN, ICCV 2019