Stable Diffusion 3 Medium is a next-generation text-to-image model from Stability AI, built on a Multimodal Diffusion Transformer (MMDiT) architecture. It offers notable improvements over previous versions in image quality, prompt comprehension, typography, and computational efficiency. To interpret complex prompts, the model combines three fixed, pretrained text encoders: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL. It was trained on 1 billion synthetic and filtered publicly available images, then fine-tuned on 30 million high-quality aesthetic images and 3 million preference-labeled samples.

SD3 Medium is optimized for both local deployment and cloud API use, with support in ComfyUI, Diffusers, and other tooling. It is distributed under the Stability AI Community License, which permits research and commercial use for organizations with under $1M in annual revenue. While the model ships with safety mitigations, developers are encouraged to apply additional safeguards appropriate to their own use cases.
## Features
- Built on MMDiT architecture for enhanced multimodal performance
- Uses three text encoders: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL
- Greatly improved handling of typography and complex prompts
- Efficient memory usage with FP8/FP16 model variants
- Trained on 1B images and fine-tuned with curated aesthetic datasets
- Compatible with ComfyUI, Diffusers, and Stability API platforms
- Includes safety mitigations and red-teaming evaluations
- Community license permits research and commercial use for organizations under a $1M annual-revenue threshold
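
The Diffusers support mentioned above can be sketched roughly as follows. This is a minimal example, not an official recipe: it assumes the `stabilityai/stable-diffusion-3-medium-diffusers` checkpoint on Hugging Face (which requires accepting the license), a recent `diffusers` release that includes `StableDiffusion3Pipeline`, and a CUDA GPU with enough VRAM for the FP16 weights.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load SD3 Medium in half precision to reduce VRAM use.
# Model id and generation settings below are illustrative assumptions.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# SD3's improved typography handling makes text-in-image prompts viable.
image = pipe(
    prompt='a photo of a cat holding a sign that says "hello world"',
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_output.png")
```

On GPUs with limited memory, `pipe.enable_model_cpu_offload()` can be used instead of `pipe.to("cuda")` to trade speed for a smaller footprint.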