A high-throughput and memory-efficient inference and serving engine
Run 100B+ language models at home, BitTorrent-style
Framework for Accelerating LLM Generation with Multiple Decoding Heads
Implementation of model-parallel autoregressive transformers on GPUs