Reference PyTorch implementation and models for DINOv3
Capable of understanding text, audio, vision, and video
GPT-4V-level open-source multimodal model based on Llama3-8B
Personalize Any Characters with a Scalable Diffusion Transformer
Chinese and English multimodal conversational language model
Tencent Hunyuan multimodal diffusion transformer (MM-DiT) model
Phi-3.5 for Mac: Locally-run Vision and Language Models
Implementation of "MobileCLIP" (CVPR 2024)
Code for running inference and finetuning with the SAM 3 model
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Official implementation of DreamCraft3D
Unified Multimodal Understanding and Generation Models
Sharp Monocular Metric Depth in Less Than a Second
A state-of-the-art open visual language model
Code for Mesh R-CNN (ICCV 2019)
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Tooling for the Common Objects in 3D dataset
PyTorch code and models for the DINOv2 self-supervised learning method (loading sketch after this list)
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Language modeling in a sentence representation space
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
High-Resolution Image Synthesis with Latent Diffusion Models
Powerful open-source image generation model
Let us control diffusion models (ControlNet; usage sketch after this list)
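For the DINOv2 entry above, a minimal loading sketch in Python, assuming the torch.hub entry points published in the facebookresearch/dinov2 README (`dinov2_vits14` here); the random tensor stands in for a real preprocessed image:

```python
import torch

# Load the pretrained DINOv2 ViT-S/14 backbone via torch.hub
# (entry-point name per the facebookresearch/dinov2 README).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Input side lengths must be multiples of the 14-pixel patch size;
# this random tensor stands in for a normalized RGB image batch.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(image)  # global image embedding (384-dim for ViT-S/14)

print(embedding.shape)  # torch.Size([1, 384])
```

The hub call fetches both the model code and the weights, so no local clone of the repository is needed before extracting features.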
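And for the ControlNet entry, a hedged sketch that uses the Hugging Face diffusers port of ControlNet rather than the original repository's own scripts; the two model IDs are public Hub checkpoints, and the edge-map URL is a placeholder:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Canny-edge ControlNet checkpoint paired with Stable Diffusion v1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The conditioning image must already be a Canny edge map;
# this URL is a placeholder for your own control image.
edges = load_image("https://example.com/edge_map.png")
result = pipe("a robot sculpture in a garden", image=edges).images[0]
result.save("controlled.png")
```

The control image steers composition while the text prompt steers content; swapping the ControlNet checkpoint (depth, pose, scribble, etc.) changes which structure is enforced.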