Official DeiT repository
A Powerful Native Multimodal Model for Image Generation
Code for running inference with the SAM 3D Body model (3DB)
Contexts Optical Compression
Models for object and human mesh reconstruction
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
CLIP: predict the most relevant text snippet given an image
A Customizable Image-to-Video Model based on HunyuanVideo
Towards Real-World Vision-Language Understanding
A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
RGBD video generation model conditioned on camera input
Diffusion Transformer with Fine-Grained Chinese Understanding
Official code for Style Aligned Image Generation via Shared Attention
Reference PyTorch implementation and models for DINOv3
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Phi-3.5 for Mac: Locally-run Vision and Language Models
Code for running inference and finetuning with the SAM 3 model
Implementation of "MobileCLIP" (CVPR 2024)
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Official implementation of DreamCraft3D
Unified Multimodal Understanding and Generation Models
Sharp Monocular Metric Depth in Less Than a Second
PyTorch code and models for DINOv2 self-supervised learning
Language modeling in a sentence representation space