... the C runtime at interactive speeds on a laptop. You can also export and run Meta's Llama 2 models, though fp32 inference and memory limits currently make this practical only up to the 7B model, and the Chat and Code Llama variants work as well once paired with their proper tokenizers. A quantized int8 path (`runq.c`) shrinks the checkpoint size (e.g., 26GB → 6.7GB for 7B) and speeds up inference (roughly 3× over fp32 in the author's notes), at a modest cost in quality.
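To give a feel for where those savings come from, here is a minimal sketch of group-wise symmetric int8 quantization in the spirit of `runq.c`. The group size, function names, and buffer layout are illustrative assumptions, not the file's exact implementation: each group of weights shares one fp32 scale, and every weight shrinks from 4 bytes to 1.

```c
#include <math.h>
#include <stdint.h>

// Illustrative group size; the actual value in runq.c may differ.
#define GROUP_SIZE 64

// Quantize n fp32 weights into int8, one fp32 scale per group.
// Symmetric scheme: scale = max(|w|) / 127, q = round(w / scale).
void quantize_q80(const float* w, int n, int8_t* q, float* scales) {
    for (int g = 0; g < n / GROUP_SIZE; g++) {
        const float* wg = w + g * GROUP_SIZE;
        // find the largest magnitude in this group
        float wmax = 0.0f;
        for (int i = 0; i < GROUP_SIZE; i++) {
            float a = fabsf(wg[i]);
            if (a > wmax) wmax = a;
        }
        float scale = wmax / 127.0f;
        scales[g] = scale;
        for (int i = 0; i < GROUP_SIZE; i++) {
            // guard against an all-zero group (scale == 0)
            q[g * GROUP_SIZE + i] =
                scale > 0.0f ? (int8_t)roundf(wg[i] / scale) : 0;
        }
    }
}

// Recover approximate fp32 weights: w ≈ q * scale.
void dequantize_q80(const int8_t* q, const float* scales, int n, float* w) {
    for (int i = 0; i < n; i++) {
        w[i] = q[i] * scales[i / GROUP_SIZE];
    }
}
```

The near-4× compression (1 byte per weight instead of 4, plus a small per-group scale overhead) lines up with the 26GB → 6.7GB figure above, and the per-group rounding error is the source of the modest quality tradeoff.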