CogView is a large-scale pretrained text-to-image transformer, introduced in the NeurIPS 2021 paper *CogView: Mastering Text-to-Image Generation via Transformers*. With 4 billion parameters, it was one of the earliest transformer-based models to generate high-quality images from natural-language descriptions in Chinese, with partial English support via translation. To keep very deep transformers trainable without NaN losses, the model introduces PB-relax and Sandwich-LN (sketched below). Beyond text-to-image generation, CogView supports image captioning, post-selection (ranking candidate images by relevance to a prompt), and super-resolution (upscaling model-generated images). The repository provides pretrained models, inference scripts, and training examples, along with a Docker environment for reproducibility.
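Both stabilization tricks are simple to state: Sandwich-LN adds a second LayerNorm at the end of each residual branch, and PB-relax rescales attention logits so fp16 values never overflow before the softmax. Below is a minimal PyTorch sketch of both ideas, following the paper's description rather than the repository's actual code; the names `SandwichBlock`, `pb_relax_scores`, and `alpha` are illustrative.

```python
# Illustrative sketch of Sandwich-LN and PB-relax, not CogView's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SandwichBlock(nn.Module):
    """Residual block with Sandwich-LN: a LayerNorm before the sublayer
    (as in Pre-LN) and a second one after it, just before the residual add."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.input_ln = nn.LayerNorm(dim)
        self.output_ln = nn.LayerNorm(dim)  # the extra "sandwich" LayerNorm
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.output_ln(self.sublayer(self.input_ln(x)))

def pb_relax_scores(q: torch.Tensor, k: torch.Tensor, alpha: float = 32.0) -> torch.Tensor:
    """Attention weights with PB-relax: divide by alpha before the matmul,
    subtract the per-row max, then scale alpha back. Subtracting a per-row
    constant leaves the softmax unchanged, but intermediates stay small in fp16."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (alpha * d ** 0.5)
    scores = (scores - scores.amax(dim=-1, keepdim=True)) * alpha
    return F.softmax(scores, dim=-1)
```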
## Features
- Supports simplified Chinese prompts; English prompts work best when translated into Chinese first
- 4B-parameter transformer for text-to-image generation
- Pretrained models for text-to-image, captioning, and super-resolution
- Stable deep transformer training with PB-relax and Sandwich-LN techniques
- Includes post-selection scripts to rank image outputs by relevance (a conceptual sketch follows this list)
- Docker environment for easier setup and large-scale training reproduction
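Post-selection, as described in the paper, scores each candidate image with a caption loss: the better the image explains the prompt under the captioning model, the lower the loss. The sketch below shows only the ranking step; `caption_loss` is a hypothetical stand-in for a real scorer, not the repository's API.

```python
# Conceptual sketch of post-selection ranking; `caption_loss` is hypothetical.
from typing import Callable, List, TypeVar

Image = TypeVar("Image")

def rank_by_relevance(
    prompt: str,
    candidates: List[Image],
    caption_loss: Callable[[str, Image], float],
) -> List[Image]:
    """Sort candidate images best-first: lower caption loss = closer match."""
    return sorted(candidates, key=lambda image: caption_loss(prompt, image))

# Usage with any scorer of matching signature:
#   best_first = rank_by_relevance("a cat on a sofa", images, my_scorer)
```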