Stable Diffusion 2.1 is a text-to-image generation model developed by Stability AI, fine-tuned from the 768-v Stable Diffusion 2.0 checkpoint for improved safety and image quality. It uses a latent diffusion framework that operates in a compressed latent space rather than on raw pixels, enabling faster and more memory-efficient synthesis while preserving detail. The model is conditioned on text prompts through the OpenCLIP-ViT/H encoder and supports generation at resolutions up to 768×768.

Released under the CreativeML Open RAIL++-M license, it permits research and commercial use subject to specific content restrictions. Stable Diffusion 2.1 is designed for creative tasks such as digital art, design prototyping, and educational tools; it is not intended to produce factual or true representations of people or events, and it performs poorly on non-English prompts. The model was trained on filtered subsets of LAION-5B, with additional filtering to reduce NSFW content.
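As a quick orientation, below is a minimal text-to-image sketch using the Hugging Face diffusers library. It assumes the public `stabilityai/stable-diffusion-2-1` checkpoint on the Hub and a CUDA-capable GPU; the prompt and output file name are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the public SD 2.1 checkpoint from the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # half precision to reduce VRAM; use float32 on CPU
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"
# SD 2.1 was trained for 768x768 generation, so request that resolution.
image = pipe(prompt, height=768, width=768).images[0]
image.save("lighthouse.png")
```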
## Features
- Latent diffusion model trained for 768×768 resolution image generation
- Fine-tuned on NSFW-filtered training data to reduce harmful or explicit outputs
- Compatible with Hugging Face diffusers and drop-in schedulers such as DPM++ (see the scheduler sketch after this list)
- Uses OpenCLIP-ViT/H for prompt encoding with cross-attention
- Companion checkpoints available for specialized tasks such as inpainting, depth-guided generation, and upscaling (see the inpainting sketch below)
- Efficient sampling via classifier-free guidance scales and optional attention slicing (see the sampling sketch below)
- Trained on filtered subsets of LAION-5B with over 300k additional fine-tuning steps
- Released under the CreativeML Open RAIL++-M license for responsible open use with clear restrictions
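Scheduler sketch: the DPM++ compatibility noted above works by swapping the pipeline's scheduler in place. This is a minimal sketch using diffusers' `DPMSolverMultistepScheduler`; the prompt and step count are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with DPM++ (multistep DPM-Solver),
# which usually reaches comparable quality in fewer sampling steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("an isometric pixel-art castle", num_inference_steps=25).images[0]
image.save("castle.png")
```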
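Inpainting sketch: the specialized checkpoints mentioned above ship as separate pipelines. Below is a sketch using the public `stabilityai/stable-diffusion-2-inpainting` checkpoint; `photo.png` and `mask.png` are placeholder file names for an input image and its mask.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# The mask is white where content should be regenerated
# and black where the original image should be kept.
init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a vase of sunflowers on the table",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
```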
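Sampling sketch: guidance scale and attention slicing are both exposed on the standard diffusers pipeline. The values below are illustrative starting points, not recommendations from the model card.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Attention slicing computes attention in chunks, trading a little speed
# for a lower peak-memory footprint on smaller GPUs.
pipe.enable_attention_slicing()

image = pipe(
    "a macro photo of a dew-covered spiderweb",
    guidance_scale=9.0,      # higher values follow the prompt more literally
    num_inference_steps=50,
).images[0]
image.save("spiderweb.png")
```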