How to optimize some algorithms in CUDA
A high-throughput and memory-efficient inference and serving engine
An RWKV management and startup tool, fully automated, only 8 MB
A lightweight vLLM implementation built from scratch
Universal LLM Deployment Engine with ML Compilation
LLM inference in C/C++
LLM training in simple, raw C/CUDA
Low-latency REST API for serving text embeddings
Serving multiple LoRA fine-tuned LLMs as one
Python bindings for the Transformer models implemented in C/C++
An ecosystem of Rust libraries for working with large language models
Code for the paper Fine-Tuning Language Models from Human Preferences
Implements a reference architecture for creating information systems