How to optimize algorithms in CUDA
Performance-optimized AI inference on your GPUs
Unified KV Cache Compression Methods for Auto-Regressive Models
State-of-the-art Parameter-Efficient Fine-Tuning
Low-latency REST API for serving text embeddings
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Run Local LLMs on Any Device. Open-source
The official repo of Qwen chat & pretrained large language model
LightLLM is a Python-based LLM (Large Language Model) inference framework
Recipes to train reward models for RLHF
A high-performance ML model serving framework, offers dynamic batching
The official repository for ERNIE 4.5 and ERNIEKit
Capable of understanding text, audio, vision, video
A simple, performant and scalable JAX LLM
[NeurIPS 2025 Spotlight] Quantized Attention
Chat & pretrained large vision language model
Traditional Mandarin LLMs for Taiwan
Open-source, high-performance Mixture-of-Experts large language model
Ray Aviary - evaluate multiple LLMs easily
Chinese LLaMA-2 & Alpaca-2 Large Model Phase II Project
Inference code and configs for the ReplitLM model family
Chinese LLaMA & Alpaca large language model + local CPU/GPU training