vLLM is a fast and easy-to-use library for LLM inference and serving. It provides high-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
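As a minimal sketch of how the library is typically used for offline inference, the snippet below requests several parallel samples per prompt through SamplingParams. The model name here is only a placeholder; any HuggingFace model supported by vLLM can be substituted.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]

# n=4 asks for four parallel samples per prompt; temperature and top_p
# control the sampling distribution, max_tokens bounds the output length.
sampling_params = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=64)

# "facebook/opt-125m" is a placeholder model used for illustration.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    for completion in output.outputs:
        print(completion.text)
```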
Features
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
- Seamless integration with popular HuggingFace models
- Tensor parallelism support for distributed inference (see the sketch after this list)
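The sketch below shows one way tensor parallelism can be enabled when loading a model, assuming a machine with multiple GPUs; the model name and GPU count are placeholders and must match the hardware actually available.

```python
from vllm import LLM

# tensor_parallel_size shards the model's weights across 4 GPUs;
# the model name is a placeholder for any larger HuggingFace model.
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=4)
```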
License
Apache License 2.0