Real-time NVIDIA GPU dashboard
How to optimize an algorithm in CUDA
157 models, 30 providers, one command to find what runs on your hardware
High-speed Large Language Model Serving for Local Deployment
A high-performance inference engine for AI models
Performance-optimized AI inference on your GPUs
UCCL is an efficient communication library for GPUs
Unified KV Cache Compression Methods for Auto-Regressive Models
Alibaba's high-performance LLM inference engine for diverse apps
State-of-the-art Parameter-Efficient Fine-Tuning
Low-latency REST API for serving text embeddings
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Run Local LLMs on Any Device. Open-source
The official repo of Qwen chat & pretrained large language model
Run AI models locally on your machine with node.js bindings for llama
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework
Recipes to train reward models for RLHF
A high-performance ML model serving framework, offers dynamic batching
The official repository for ERNIE 4.5 and ERNIEKit
Capable of understanding text, audio, vision, video
A simple, performant and scalable Jax LLM
Mooncake is the serving platform for Kimi
Generate music based on natural language prompts using LLMs
[NeurIPS 2025 Spotlight] Quantized Attention
Chat & pretrained large vision language model