A Powerful Native Multimodal Model for Image Generation
Official DeiT repository
Easily turn large sets of image URLs into an image dataset
Contexts Optical Compression
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Code for running inference with the SAM 3D Body Model (3DB)
A neural network that transforms a design mock-up into static websites
Models for object and human mesh reconstruction
Guiding Instruction-based Image Editing via Multimodal Large Language Models
CLIP: predict the most relevant text snippet given an image
A Customizable Image-to-Video Model based on HunyuanVideo
Towards Real-World Vision-Language Understanding
A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
RGBD video generation model conditioned on camera input
Diffusion Transformer with Fine-Grained Chinese Understanding
Official code for Style Aligned Image Generation via Shared Attention
Provides code for running inference with the SegmentAnything Model
Reference PyTorch implementation and models for DINOv3
Open-Sora: Democratizing Efficient Video Production for All
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Phi-3.5 for Mac: Locally-run Vision and Language Models
Inference code for CodeLlama models
Implementation of "MobileCLIP" CVPR 2024
Code for running inference and finetuning with SAM 3 model
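The CLIP entry above describes predicting the most relevant text snippet for an image. A minimal sketch of that idea, using toy hand-made vectors in place of real CLIP encoder outputs (the function name and embeddings here are illustrative, not part of the CLIP API): rank candidate snippets by cosine similarity to the image embedding in a shared space.

```python
import numpy as np

def most_relevant_snippet(image_emb, text_embs, snippets):
    """Rank candidate text snippets against one image embedding,
    CLIP-style: cosine similarity in a shared embedding space."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                 # one cosine similarity per snippet
    return snippets[int(np.argmax(sims))], sims

# Toy embeddings; a real CLIP model would produce these from its
# image and text encoders.
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],   # "a photo of a dog"
    [0.0, 1.0, 0.0],   # "a photo of a cat"
    [0.0, 0.0, 1.0],   # "a diagram"
])
snippets = ["a photo of a dog", "a photo of a cat", "a diagram"]
best, sims = most_relevant_snippet(image_emb, text_embs, snippets)
```

In the real model the two encoders are trained jointly so that matching image/text pairs land close together; at inference time this same argmax-over-similarities step is what turns CLIP into a zero-shot classifier.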