stable-diffusion-v1-4 is a text-to-image latent diffusion model developed by CompVis. It generates photo-realistic images from natural-language prompts using a frozen, pretrained CLIP ViT-L/14 text encoder and a UNet-based denoising architecture. This checkpoint resumes from v1-2 and was fine-tuned for 225,000 steps at 512×512 resolution on the “laion-aesthetics v2 5+” dataset, dropping the text conditioning 10% of the time to improve classifier-free guidance sampling. It is optimized for use with Hugging Face’s...