A high-throughput and memory-efficient inference and serving engine
Deep learning optimization library that makes distributed training easy
Low-latency REST API for serving text embeddings
MII makes low-latency and high-throughput inference possible
Large Language Model Text Generation Inference
Tensor search for humans
Framework for Accelerating LLM Generation with Multiple Decoding Heads