llama.cpp is a high-performance C and C++ project for running large language models locally and in the cloud with minimal setup. It is built around efficient inference, broad hardware support, and the GGUF model format. The project supports many model families and has become a major foundation for local AI tools, model serving, and embedded inference workflows. It provides command-line tools, a server mode with an OpenAI-compatible API style, model conversion utilities, and extensive backend acceleration options. llama.cpp runs on CPUs and GPUs, with support for Apple silicon, x86, RISC-V, CUDA, HIP, Vulkan, SYCL, Metal, and hybrid CPU-GPU execution. Its main value is making practical LLM inference accessible across consumer machines, servers, and specialized deployment environments.

Features

  • C and C++ LLM inference engine
  • GGUF model format support
  • Command-line and server-based execution
  • Broad CPU and GPU acceleration support
  • Quantization for lower memory usage
  • Support for many text and multimodal model families

Project Samples

Project Activity

See All Activity >

License

MIT License

Follow llama.cpp

llama.cpp Web Site

Other Useful Business Software
MongoDB Atlas runs apps anywhere Icon
MongoDB Atlas runs apps anywhere

Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
Start Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of llama.cpp!

Additional Project Details

Operating Systems

Linux, Mac

Programming Language

C++

Related Categories

C++ Large Language Models (LLM)

Registered

15 hours ago