Real-time NVIDIA GPU dashboard
AirLLM 70B inference with a single 4GB GPU
Unified KV Cache Compression Methods for Auto-Regressive Models
Neural Network architecture based on ideas of the original LSTM
Run a 1-billion parameter LLM on a $10 board with 256MB RAM
Open-source large language model family from Tencent Hunyuan
Redundancy-aware KV Cache Compression for Reasoning Models
157 models, 30 providers, one command to find what runs on your hardware
Tools for merging pretrained large language models
AI Agent Development Guide, LangGraph in Action, Advanced RAG
High-performance inference framework for large language models
On the Structural Pruning of Large Language Models
DepGraph: Towards Any Structural Pruning
Fast and efficient unstructured data extraction
LangChain4j is an open-source Java library
Run Mixtral-8x7B models in Colab or consumer desktops
A training tool for large models
Serving multiple LoRA finetuned LLMs as one
Calculate tokens/s and GPU memory requirements for any LLM
Flagship MoE model for long-context agents and complex coding