tiny-llm is an educational open-source project that teaches systems engineers how large language model (LLM) inference and serving systems work by building them from scratch. It is structured as a guided course that walks developers through implementing the core components needed to run a modern language model: attention mechanisms, token generation, and optimization techniques. Rather than relying on high-level machine learning frameworks, the codebase mostly uses low-level array and matrix manipulation APIs, so developers can see exactly how model inference works internally.

The course shows how to load and run Qwen-style model architectures while progressively adding performance improvements such as KV caching, request batching, and optimized attention. Along the way, it introduces the concepts behind modern LLM serving systems, which resemble simplified versions of production inference engines such as vLLM.
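To give a flavor of the "low-level array manipulation" approach, here is a minimal sketch of scaled dot-product attention with a causal mask, written in plain NumPy. This is an illustration of the general technique only; tiny-llm's actual function names, array library, and signatures may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) arrays for a single attention head.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (seq_len, seq_len) similarity scores
    # Causal mask: each position attends only to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v  # weighted sum of value vectors

q = k = v = np.random.randn(4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Every step here is an explicit matrix operation, which is the point of the course: nothing is hidden behind a framework call.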
## Features
- Step-by-step implementation of LLM inference infrastructure
- Low-level matrix and tensor operations instead of high-level frameworks
- Hands-on implementation of transformer attention and RoPE mechanisms
- Support for serving Qwen-style language models
- Demonstrations of optimization techniques such as KV cache and batching
- Educational workflow explaining how modern LLM serving systems operate
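The KV cache listed above is the key decoding optimization: during autoregressive generation, the key and value projections of already-processed tokens never change, so they can be stored and reused instead of recomputed at every step. The sketch below shows the idea in plain NumPy; the class name and layout are hypothetical, not tiny-llm's actual data structure.

```python
import numpy as np

class KVCache:
    """Accumulates key/value vectors across decode steps so each step only
    projects the newest token, rather than re-projecting the whole prefix."""

    def __init__(self):
        self.keys = None    # (tokens_so_far, head_dim) once populated
        self.values = None

    def update(self, k_new, v_new):
        # k_new, v_new: (1, head_dim) projections for the latest token.
        if self.keys is None:
            self.keys, self.values = k_new, v_new
        else:
            self.keys = np.concatenate([self.keys, k_new], axis=0)
            self.values = np.concatenate([self.values, v_new], axis=0)
        # Attention for the new token is computed against the full cache.
        return self.keys, self.values

cache = KVCache()
for step in range(3):
    k = np.random.randn(1, 8)
    v = np.random.randn(1, 8)
    keys, values = cache.update(k, v)
print(keys.shape)  # (3, 8)
```

Without the cache, step *t* of decoding would redo O(t) projection work; with it, each step does O(1) new work plus one attention pass over the cached keys and values.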