llm.c
LLM training in simple, raw C/CUDA
...By stripping away heavy frameworks, it exposes the core math and memory flows of embeddings, attention, and feed-forward layers. The code illustrates how to wire forward passes, losses, and simple training or inference loops with direct control over arrays and buffers. Its compact design makes it easy to trace execution, profile hotspots, and understand the cost of each operation. Portability is a goal: it aims to compile with common toolchains and run on modest hardware for small experiments. Rather than delivering a production-grade stack, it serves as a reference and learning scaffold for people who want to “see the metal” behind LLMs.