High-speed Large Language Model Serving for Local Deployment
Real-time NVIDIA GPU dashboard
LLM inference in C/C++
AirLLM: 70B inference with a single 4GB GPU
Accessible large language models via k-bit quantization for PyTorch
Run a 1-billion parameter LLM on a $10 board with 256MB RAM
Unified KV Cache Compression Methods for Auto-Regressive Models
Neural Network architecture based on ideas of the original LSTM
Open-source large language model family from Tencent Hunyuan
157 models, 30 providers, one command to find what runs on your hardware
C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)
ChatGLM2-6B: An Open Bilingual Chat LLM
Redundancy-aware KV Cache Compression for Reasoning Models
Drag & drop UI to build your customized LLM flow
LLM training in simple, raw C/CUDA
Tools for merging pretrained large language models
AI Agent Development Guide, LangGraph in Action, Advanced RAG
The official repo of Qwen chat & pretrained large language model
High-performance inference framework for large language models
On the Structural Pruning of Large Language Models
DepGraph: Towards Any Structural Pruning
LangChain4j is an open-source Java library
Fast and efficient unstructured data extraction
Tensor search for humans
Capable of understanding text, audio, vision, video