stable-diffusion-v1-4 is a text-to-image latent diffusion model developed by CompVis. It generates photo-realistic images from natural-language prompts using a pretrained CLIP ViT-L/14 text encoder and a UNet-based denoising architecture. This version resumes from v1-2 and was fine-tuned for 225,000 steps at 512×512 resolution on the “laion-aesthetics v2 5+” dataset, with 10% dropping of the text conditioning to improve classifier-free guidance sampling.

The model is designed for Hugging Face’s Diffusers library and supports both PyTorch and JAX/Flax, offering flexibility across GPUs and TPUs. It has known limitations: it struggles with compositional prompts (e.g., “a red cube on top of a blue sphere”), does not achieve full photorealism, degrades on non-English prompts, and cannot reliably render legible text or faithful faces. Intended for research and creative exploration, it ships with a safety checker that flags NSFW content, but it may still reflect biases in its training data. Users are advised to follow responsible-AI practices and avoid harmful, unethical, or out-of-scope applications.
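A minimal usage sketch with Diffusers in PyTorch, assuming the `diffusers`, `transformers`, and `torch` packages are installed and a CUDA GPU is available (the prompt and output filename are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the v1-4 weights in half precision to reduce GPU memory use.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photograph of an astronaut riding a horse"  # placeholder prompt
image = pipe(prompt).images[0]  # runs the full denoising loop
image.save("astronaut_rides_horse.png")
```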
Features
- Generates images from natural language prompts
- Fine-tuned on a high-aesthetic-score image dataset (LAION-Aesthetics v2 5+)
- Compatible with Hugging Face’s Diffusers library
- Supports PyTorch (float16) and JAX/Flax (bfloat16); see the Flax sketch after this list
- Uses CLIP ViT-L/14 for robust text understanding
- Allows customization of schedulers (e.g., Euler, PNDM), as shown below
- Includes safety checker for NSFW content filtering
- Trained with 10% text-conditioning dropout to improve classifier-free guidance sampling
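As referenced in the list above, the scheduler can be swapped without retraining. A sketch using Diffusers' `EulerDiscreteScheduler` in place of the checkpoint's default PNDM scheduler (the prompt and step count are illustrative choices, not recommendations):

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Replace the default scheduler, reusing its existing configuration.
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

# guidance_scale controls classifier-free guidance strength: higher values
# follow the prompt more closely, at some cost to sample diversity.
image = pipe(
    "a watercolor painting of a lighthouse",  # placeholder prompt
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
```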
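For TPUs or other multi-device setups, a sketch of the Flax pipeline in bfloat16, assuming the `jax` and `flax` packages are installed (the `bf16` revision hosts the half-precision weights; the prompt is a placeholder):

```python
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

# Load bfloat16 weights from the bf16 branch of the repository.
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="bf16",
    dtype=jax.numpy.bfloat16,
)

# One prompt per device; parameters are replicated and inputs sharded so
# that pmap-compiled inference runs on all devices in parallel.
num_devices = jax.device_count()
prompt_ids = pipeline.prepare_inputs(["a photo of a mountain lake"] * num_devices)

params = replicate(params)
rng = jax.random.split(jax.random.PRNGKey(0), num_devices)
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, rng, jit=True).images
images = pipeline.numpy_to_pil(
    np.asarray(images.reshape((num_devices,) + images.shape[-3:]))
)
```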