How to optimize algorithms in CUDA
A high-throughput and memory-efficient inference and serving engine
A lightweight vLLM implementation built from scratch
Universal LLM Deployment Engine with ML Compilation
Low-latency REST API for serving text-embeddings
LLM training in simple, raw C/CUDA
Serving multiple LoRA-finetuned LLMs as one
Code for the paper Fine-Tuning Language Models from Human Preferences
Implements a reference architecture for creating information systems