HunyuanDiT is a high-capability text-to-image diffusion transformer with bilingual (Chinese/English) understanding and multi-turn dialogue capability. It trains a diffusion model in latent space on a transformer backbone and integrates a Multimodal Large Language Model (MLLM) to refine captions and support conversational image generation. Adapters such as ControlNet, IP-Adapter, and LoRA extend control over generation, and distilled versions allow inference under constrained VRAM.
Features
- Bilingual Chinese-English architecture for fine-grained understanding in both languages
- Supports multi-turn T2I (text-to-image) interactions so users can iteratively refine their images via dialogue
- Adapter support: LoRA, ControlNet (pose, depth, canny), and IP-Adapter to extend control over generation
- Distilled variants for lower-VRAM inference (e.g. running on 6 GB of GPU VRAM)
- Gradio integration for web demos, plus Diffusers and command-line compatibility
- Full-parameter training code released, including pre-processing, model definition, and captioning modules