Real-time NVIDIA GPU dashboard
How to optimize algorithms in CUDA
157 models, 30 providers, one command to find what runs on your hardware
High-speed Large Language Model Serving for Local Deployment
A high-performance inference engine for AI models
Performance-optimized AI inference on your GPUs
Alibaba's high-performance LLM inference engine for diverse apps
UCCL is an efficient communication library for GPUs
Unified KV Cache Compression Methods for Auto-Regressive Models
Low-latency REST API for serving text embeddings
State-of-the-art Parameter-Efficient Fine-Tuning
ChatGLM-6B: An Open Bilingual Dialogue Language Model
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework
Run Local LLMs on Any Device. Open-source
Run AI models locally on your machine with Node.js bindings for llama.cpp
The official repo of Qwen chat & pretrained large language model
Mooncake is the serving platform for Kimi
A high-performance ML model serving framework, offers dynamic batching
The official repository for ERNIE 4.5 and ERNIEKit
TT-NN operator library and TT-Metalium low-level kernel programming model
Capable of understanding text, audio, vision, video
Recipes to train reward models for RLHF
A simple, performant and scalable Jax LLM
[NeurIPS 2025 Spotlight] Quantized Attention
Generate music based on natural language prompts using LLMs