Stable Diffusion v1-5 is a latent text-to-image diffusion model that produces high-quality, photo-realistic images from natural language prompts. It was initialized from the v1.2 checkpoint and fine-tuned for 595,000 additional steps at 512x512 resolution on the “laion-aesthetics v2 5+” dataset; during fine-tuning, 10% of the text conditioning was dropped to improve classifier-free guidance sampling. The model pairs a CLIP ViT-L/14 text encoder with a UNet-based diffusion backbone operating in latent space, enabling fast and efficient image synthesis. Stable Diffusion v1-5 is compatible with Diffusers, ComfyUI, AUTOMATIC1111, and other user interfaces. Its intended uses are research and creative applications such as digital art, design, and the exploration of generative models. While powerful, it has known limitations in photorealism, compositionality, and cultural representation, and it requires responsible use under the CreativeML OpenRAIL-M license.
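The 10% prompt dropout mentioned above amounts to replacing the text conditioning with an empty prompt for a random tenth of training examples, so the model also learns an unconditional noise prediction, which is what classifier-free guidance needs at sampling time. A minimal sketch of that idea (the helper name and values are illustrative, not the actual training code):

```python
import random

# Illustrative sketch of 10% prompt dropout (not the real training loop):
# a random ~10% of captions are swapped for the empty string so the model
# also learns an unconditional noise prediction.
DROPOUT_PROB = 0.10

def maybe_drop_prompt(prompt, rng):
    """Return "" for roughly 10% of calls, otherwise the original prompt."""
    return "" if rng.random() < DROPOUT_PROB else prompt

rng = random.Random(0)
prompts = ["a photo of an astronaut riding a horse"] * 10_000
n_dropped = sum(1 for p in prompts if maybe_drop_prompt(p, rng) == "")
print(n_dropped / len(prompts))  # close to 0.10
```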
Features
- Generates images from natural language prompts using latent diffusion
- Fine-tuned for 595k additional steps, improving aesthetic quality and prompt alignment
- Uses CLIP ViT-L/14 for text encoding and UNet for image generation
- Supports classifier-free guidance for stronger prompt adherence
- Optimized for generation at 512x512 resolution
- Compatible with Diffusers, ComfyUI, AUTOMATIC1111, SD.Next, and InvokeAI
- Licensed under CreativeML OpenRAIL-M for responsible open use
- Trained on laion-aesthetics v2 5+, optimized for visual appeal
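The classifier-free guidance listed above blends the model's unconditional and conditional noise predictions at each sampling step. A minimal numerical sketch of that blend, using made-up stand-in values rather than real UNet outputs:

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance blend:
    eps = eps_uncond + s * (eps_cond - eps_uncond).
    A scale of 1.0 reproduces the conditional prediction; larger values
    (7.5 is a common default) push the sample further toward the prompt.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up stand-ins for UNet noise predictions on a 3-element latent.
eps_uncond = [0.0, -0.2, 0.5]   # prediction with the empty prompt
eps_cond = [1.0, -0.1, 0.5]     # prediction with the text prompt

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # equals eps_cond
print(cfg_combine(eps_uncond, eps_cond, 7.5))
```

Higher guidance scales trade sample diversity for prompt adherence, which is why the 10% unconditional training mentioned above matters: without it, the model would have no meaningful `eps_uncond` to blend against.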