llama.cpp is a high-performance C and C++ project for running large language models locally and in the cloud with minimal setup. It is built around efficient inference, broad hardware support, and the GGUF model format. The project supports many model families and has become a major foundation for local AI tools, model serving, and embedded inference workflows. It provides command-line tools, a server mode with an OpenAI-compatible API style, model conversion utilities, and extensive backend acceleration options. llama.cpp runs on CPUs and GPUs, with support for Apple silicon, x86, RISC-V, CUDA, HIP, Vulkan, SYCL, Metal, and hybrid CPU-GPU execution. Its main value is making practical LLM inference accessible across consumer machines, servers, and specialized deployment environments.

Features

  • C and C++ LLM inference engine
  • GGUF model format support
  • Command-line and server-based execution
  • Broad CPU and GPU acceleration support
  • Quantization for lower memory usage
  • Support for many text and multimodal model families

Project Samples

Project Activity

See All Activity >

License

MIT License

Follow llama.cpp

llama.cpp Web Site

Other Useful Business Software
Build Agents and Models on One Platform Icon
Build Agents and Models on One Platform

Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.
Try It Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of llama.cpp!

Additional Project Details

Operating Systems

Linux, Mac

Programming Language

C++

Related Categories

C++ Large Language Models (LLM)

Registered

2026-05-20