Real-time NVIDIA GPU dashboard
How to optimize algorithms in CUDA (see the sketch after this list)
157 models, 30 providers, one command to find what runs on your hardware
High-speed Large Language Model Serving for Local Deployment
A high-performance inference engine for AI models
Performance-optimized AI inference on your GPUs
Alibaba's high-performance LLM inference engine for diverse applications
UCCL is an efficient communication library for GPUs
Unified KV Cache Compression Methods for Auto-Regressive Models
State-of-the-art Parameter-Efficient Fine-Tuning
Low-latency REST API for serving text embeddings
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Run local LLMs on any device; open source
The official repo for Qwen chat and pretrained large language models
Run AI models locally on your machine with Node.js bindings for llama.cpp
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework
A high-performance ML model serving framework that offers dynamic batching
The official repository for ERNIE 4.5 and ERNIEKit
Capable of understanding text, audio, vision, and video
Mooncake is the serving platform for Kimi
Recipes to train reward models for RLHF
A simple, performant, and scalable JAX LLM
Generate music based on natural language prompts using LLMs
[NeurIPS 2025 Spotlight] Quantized Attention
TT-NN operator library and TT-Metalium low-level kernel programming model
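
The CUDA optimization entry near the top of this list is its only how-to resource; as a flavor of the kind of technique such material covers, below is a minimal sketch of one classic optimization, replacing a per-thread global atomic with a shared-memory block reduction. The kernel, names, and launch parameters are illustrative assumptions, not code from any repository listed here.

// Minimal sketch: block-level sum reduction in shared memory.
// Illustrative only; not taken from any repo in the list above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduce_sum(const float *in, float *out, int n) {
    extern __shared__ float smem[];          // one partial sum per thread
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread loads two elements, halving the number of blocks needed.
    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    smem[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory: active threads stay contiguous,
    // which limits warp divergence.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, smem[0]);   // one atomic per block, not per thread
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    const int threads = 256;
    const int blocks = (n + threads * 2 - 1) / (threads * 2);
    reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}

Each block issues a single atomicAdd rather than one per element, and the tree reduction keeps active threads contiguous; both are standard first steps when optimizing a reduction kernel in CUDA.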