Image generation model with single-stream diffusion transformer
GLM-Image: Auto-regressive for Dense-knowledge and High-fidelity Image
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Official inference repo for FLUX.2 models
A Powerful Native Multimodal Model for Image Generation
Easily turn large sets of image URLs into an image dataset
Official DeiT repository
Models for object and human mesh reconstruction
Official inference repo for FLUX.1 models
A neural network that transforms a design mock-up into static websites
A Unified Framework for Text-to-3D and Image-to-3D Generation
Diffusion Transformer with Fine-Grained Chinese Understanding
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Collection of Gemma 3 variants that are trained for performance
Guiding Instruction-based Image Editing via Multimodal Large Language
Flexible Photo Recrafting While Preserving Your Identity
CLIP: Predict the most relevant text snippet given an image
This repo contains the code for 1D tokenizer and generator
Code for running inference with the SAM 3D Body Model 3DB
Reference PyTorch implementation and models for DINOv3
Towards Real-World Vision-Language Understanding
Multimodal-Driven Architecture for Customized Video Generation
Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
RGBD video generation model conditioned on camera input
A Customizable Image-to-Video Model based on HunyuanVideo