GLM-130B
GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
...The model supports efficient inference via INT8 and INT4 quantization, reducing hardware requirements from 8× A100 GPUs to as little as a single server with 4× RTX 3090s. Built on the SwissArmyTransformer (SAT) framework and compatible with DeepSpeed and FasterTransformer, it supports high-speed inference (up to 2.5× faster) and reproducible evaluation across 30+ benchmark tasks.