LLM Inference Tools for Mac

View 17 business solutions

Browse free open source LLM Inference tools and projects for Mac below. Use the toggles on the left to filter open source LLM Inference tools by OS, license, language, programming language, and project status.

  • Go From AI Idea to AI App Fast Icon
    Go From AI Idea to AI App Fast

    One platform to build, fine-tune, and deploy ML models. No MLOps team required.

    Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
    Try Free
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    whisper.cpp

    whisper.cpp

    Port of OpenAI's Whisper model in C/C++

    whisper.cpp is a lightweight, C/C++ reimplementation of OpenAI’s Whisper automatic speech recognition (ASR) model—designed for efficient, standalone transcription without external dependencies. The entire high-level implementation of the model is contained in whisper.h and whisper.cpp. The rest of the code is part of the ggml machine learning library. The command downloads the base.en model converted to custom ggml format and runs the inference on all .wav samples in the folder samples. whisper.cpp supports integer quantization of the Whisper ggml models. Quantized models require less memory and disk space and depending on the hardware can be processed more efficiently.
    Downloads: 421 This Week
    Last Update:
    See Project
  • 2
    GPT4All

    GPT4All

    Run Local LLMs on Any Device. Open-source

    GPT4All is an open-source project that allows users to run large language models (LLMs) locally on their desktops or laptops, eliminating the need for API calls or GPUs. The software provides a simple, user-friendly application that can be downloaded and run on various platforms, including Windows, macOS, and Ubuntu, without requiring specialized hardware. It integrates with the llama.cpp implementation and supports multiple LLMs, allowing users to interact with AI models privately. This project also supports Python integrations for easy automation and customization. GPT4All is ideal for individuals and businesses seeking private, offline access to powerful LLMs.
    Downloads: 155 This Week
    Last Update:
    See Project
  • 3
    llama.cpp

    llama.cpp

    Port of Facebook's LLaMA model in C/C++

    The llama.cpp project enables the inference of Meta's LLaMA model (and other models) in pure C/C++ without requiring a Python runtime. It is designed for efficient and fast model execution, offering easy integration for applications needing LLM-based capabilities. The repository focuses on providing a highly optimized and portable implementation for running large language models directly within C/C++ environments.
    Downloads: 139 This Week
    Last Update:
    See Project
  • 4
    Open WebUI

    Open WebUI

    User-friendly AI Interface

    Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for Retrieval Augmented Generation (RAG), making it a powerful AI deployment solution. Key features include effortless setup via Docker or Kubernetes, seamless integration with OpenAI-compatible APIs, granular permissions and user groups for enhanced security, responsive design across devices, and full Markdown and LaTeX support for enriched interactions. Additionally, Open WebUI offers a Progressive Web App (PWA) for mobile devices, providing offline access and a native app-like experience. The platform also includes a Model Builder, allowing users to create custom models from base Ollama models directly within the interface. With over 156,000 users, Open WebUI is a versatile solution for deploying and managing AI models in a secure, offline environment.
    Downloads: 104 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 5
    ONNX Runtime

    ONNX Runtime

    ONNX Runtime: cross-platform, high performance ML inferencing

    ONNX Runtime is a cross-platform inference and training machine-learning accelerator. ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms. ONNX Runtime training can accelerate the model training time on multi-node NVIDIA GPUs for transformer models with a one-line addition for existing PyTorch training scripts. Support for a variety of frameworks, operating systems and hardware platforms. Built-in optimizations that deliver up to 17X faster inferencing and up to 1.4X faster training.
    Downloads: 58 This Week
    Last Update:
    See Project
  • 6
    ncnn

    ncnn

    High-performance neural network inference framework for mobile

    ncnn is a high-performance neural network inference computing framework designed specifically for mobile platforms. It brings artificial intelligence right at your fingertips with no third-party dependencies, and speeds faster than all other known open source frameworks for mobile phone cpu. ncnn allows developers to easily deploy deep learning algorithm models to the mobile platform and create intelligent APPs. It is cross-platform and supports most commonly used CNN networks, including Classical CNN (VGG AlexNet GoogleNet Inception), Face Detection (MTCNN RetinaFace), Segmentation (FCN PSPNet UNet YOLACT), and more. ncnn is currently being used in a number of Tencent applications, namely: QQ, Qzone, WeChat, and Pitu.
    Downloads: 39 This Week
    Last Update:
    See Project
  • 7
    Gitleaks

    Gitleaks

    Protect and discover secrets using Gitleaks

    Gitleaks is a fast, lightweight, portable, and open-source secret scanner for git repositories, files, and directories. With over 6.8 million docker downloads, 11.2k GitHub stars, 1.7 million GitHub Downloads, thousands of weekly clones, and over 400k homebrew installs, gitleaks is the most trusted secret scanner among security professionals, enterprises, and developers. Gitleaks-Action is our official GitHub Action. You can use it to automatically run a gitleaks scan on all your team's pull requests and commits, or run on-demand scans. If you are scanning repos that belong to a GitHub organization account, then you'll have to obtain a license. Gitleaks can be installed using Homebrew, Docker, or Go. Gitleaks is also available in binary form for many popular platforms and OS types on the releases page. In addition, Gitleaks can be implemented as a pre-commit hook directly in your repo or as a GitHub action using Gitleaks-Action.
    Downloads: 31 This Week
    Last Update:
    See Project
  • 8
    OpenVINO

    OpenVINO

    OpenVINO™ Toolkit repository

    OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. Boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common tasks. Use models trained with popular frameworks like TensorFlow, PyTorch and more. Reduce resource demands and efficiently deploy on a range of Intel® platforms from edge to cloud. This open-source version includes several components: namely Model Optimizer, OpenVINO™ Runtime, Post-Training Optimization Tool, as well as CPU, GPU, MYRIAD, multi device and heterogeneous plugins to accelerate deep learning inferencing on Intel® CPUs and Intel® Processor Graphics. It supports pre-trained models from the Open Model Zoo, along with 100+ open source and public models in popular formats such as TensorFlow, ONNX, PaddlePaddle, MXNet, Caffe, Kaldi.
    Downloads: 22 This Week
    Last Update:
    See Project
  • 9
    LocalAI

    LocalAI

    The free, Open Source alternative to OpenAI, Claude and others

    LocalAI is an open-source platform that allows users to run large language models and other AI systems locally on their own hardware. It acts as a drop-in replacement for APIs such as OpenAI, enabling developers to build AI-powered applications without relying on external cloud services. The platform supports a wide range of model types, including text generation, image creation, speech processing, and embeddings. LocalAI can run on consumer-grade hardware and does not necessarily require a GPU, making it accessible for local development and private deployments. It integrates with multiple backends like llama.cpp, transformers, and diffusers to support different AI workloads. With its self-hosted architecture and OpenAI-compatible API, LocalAI enables developers to build secure, local-first AI applications.
    Downloads: 20 This Week
    Last Update:
    See Project
  • Fully Managed MySQL, PostgreSQL, and SQL Server Icon
    Fully Managed MySQL, PostgreSQL, and SQL Server

    Automatic backups, patching, replication, and failover. Focus on your app, not your database.

    Cloud SQL handles your database ops end to end, so you can focus on your app.
    Try Free
  • 10
    vLLM

    vLLM

    A high-throughput and memory-efficient inference and serving engine

    vLLM is a fast and easy-to-use library for LLM inference and serving. High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.
    Downloads: 19 This Week
    Last Update:
    See Project
  • 11
    DeepCamera

    DeepCamera

    Open-Source AI Camera. Empower any camera/CCTV

    DeepCamera empowers your traditional surveillance cameras and CCTV/NVR with machine learning technologies. It provides open-source facial recognition-based intrusion detection, fall detection, and parking lot monitoring with the inference engine on your local device. SharpAI-hub is the cloud hosting for AI applications that helps you deploy AI applications with your CCTV camera on your edge device in minutes. SharpAI yolov7_reid is an open-source Python application that leverages AI technologies to detect intruders with traditional surveillance cameras. The source code is here It leverages Yolov7 as a person detector, FastReID for person feature extraction, Milvus the local vector database for self-supervised learning to identify unseen persons, Labelstudio to host images locally and for further usage such as label data and train your own classifier. It also integrates with Home-Assistant to empower smart homes with AI technology.
    Downloads: 17 This Week
    Last Update:
    See Project
  • 12
    LLM.swift

    LLM.swift

    LLM.swift is a simple and readable library

    LLM.swift is a Swift package that enables developers to run Large Language Models (LLMs) directly on Apple devices, including iOS, macOS, and watchOS. By leveraging Apple's hardware and software optimizations, LLM.swift facilitates on-device natural language processing tasks, ensuring user privacy and reducing latency associated with cloud-based solutions.​
    Downloads: 16 This Week
    Last Update:
    See Project
  • 13
    whisper-timestamped

    whisper-timestamped

    Multilingual Automatic Speech Recognition with word-level timestamps

    Multilingual Automatic Speech Recognition with word-level timestamps and confidence. Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot originally predict word timestamps. This repository proposes an implementation to predict word timestamps and provide a more accurate estimation of speech segments when transcribing with Whisper models. Besides, a confidence score is assigned to each word and each segment.
    Downloads: 13 This Week
    Last Update:
    See Project
  • 14
    OnnxStream

    OnnxStream

    Lightweight inference library for ONNX files, written in C++

    The challenge is to run Stable Diffusion 1.5, which includes a large transformer model with almost 1 billion parameters, on a Raspberry Pi Zero 2, which is a microcomputer with 512MB of RAM, without adding more swap space and without offloading intermediate results on disk. The recommended minimum RAM/VRAM for Stable Diffusion 1.5 is typically 8GB. Generally, major machine learning frameworks and libraries are focused on minimizing inference latency and/or maximizing throughput, all of which at the cost of RAM usage. So I decided to write a super small and hackable inference library specifically focused on minimizing memory consumption: OnnxStream. OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. A WeightsProvider specialization can implement any type of loading, caching, and prefetching of the model parameters.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 15
    Transformer Engine

    Transformer Engine

    A library for accelerating Transformer models on NVIDIA GPUs

    Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code. TE also includes a framework-agnostic C++ API that can be integrated with other deep-learning libraries to enable FP8 support for Transformers. As the number of parameters in Transformer models continues to grow, training and inference for architectures such as BERT, GPT, and T5 become very memory and compute-intensive. Most deep learning frameworks train with FP32 by default. This is not essential, however, to achieve full accuracy for many deep learning models.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 16
    ChatGLM.cpp

    ChatGLM.cpp

    C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)

    ChatGLM.cpp is a C++ implementation of the ChatGLM-6B model, enabling efficient local inference without requiring a Python environment. It is optimized for running on consumer hardware.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 17
    WhisperKit

    WhisperKit

    On-device Speech Recognition for Apple Silicon

    WhisperKit is a Swift package that integrates OpenAI's popular Whisper speech recognition model with Apple's CoreML framework for efficient, local inference on Apple devices. Whisper has pulled the future forward when fast, free and virtually error-free translation and transcription will be ubiquitous. It inspired numerous developers to improve and deploy it with minimal friction and maximum performance. We founded Argmax in November 2023 to empower developers and enterprises everywhere to deploy commercial-scale inference workloads on user devices. The fast-growing need for Whisper inference in production convinced us to take it on as our first project.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 18
    Arize Phoenix

    Arize Phoenix

    Uncover insights, surface problems, monitor, and fine tune your LLM

    Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality. Phoenix is an Open Source ML Observability library designed for the Notebook. The toolset is designed to ingest model inference data for LLMs, CV, NLP and tabular datasets. It allows Data Scientists to quickly visualize their model data, monitor performance, track down issues & insights, and easily export to improve. Deep Learning Models (CV, LLM, and Generative) are an amazing technology that will power many of future ML use cases. A large set of these technologies are being deployed into businesses (the real world) in what we consider a production setting.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 19
    Coqui STT

    Coqui STT

    The deep learning toolkit for speech-to-text

    Coqui STT is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. Coqui STT is battle-tested in both production and research. Multiple possible transcripts, each with an associated confidence score. Experience the immediacy of script-to-performance. With Coqui text-to-speech, production times go from months to minutes. With Coqui, the post is a pleasure. Effortlessly clone the voices of your talent and have the clone handle the problems in post. With Coqui, dubbing is a delight. Effortlessly clone the voice of your talent into another language and let the clone do the dub. With text-to-speech, experience the immediacy of script-to-performance. Cast from a wide selection of high-quality, directable, emotive voices or clone a voice to suit your needs. With Coqui text-to-speech, production times go from months to minutes.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 20
    Genv

    Genv

    GPU environment management and cluster orchestration

    Genv is an open-source environment and cluster management system for GPUs. Genv lets you easily control, configure, monitor and enforce the GPU resources that you are using in a GPU machine or cluster. It is intended to ease up the process of GPU allocation for data scientists without code changes.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 21
    Oumi

    Oumi

    Everything you need to build state-of-the-art foundation models

    Oumi is an open-source framework that provides everything needed to build state-of-the-art foundation models, end-to-end. It aims to simplify the development of large-scale machine-learning models.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 22
    Turing.jl

    Turing.jl

    Bayesian inference with probabilistic programming

    Bayesian inference with probabilistic programming.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 23
    Xorbits Inference

    Xorbits Inference

    Replace OpenAI GPT with another LLM in your app

    Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop. Xorbits Inference(Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 24
    huggingface_hub

    huggingface_hub

    The official Python client for the Huggingface Hub

    The huggingface_hub library allows you to interact with the Hugging Face Hub, a platform democratizing open-source Machine Learning for creators and collaborators. Discover pre-trained models and datasets for your projects or play with the thousands of machine-learning apps hosted on the Hub. You can also create and share your own models, datasets, and demos with the community. The huggingface_hub library provides a simple way to do all these things with Python.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 25
    spaGO

    spaGO

    Self-contained Machine Learning and Natural Language Processing lib

    A Machine Learning library written in pure Go designed to support relevant neural architectures in Natural Language Processing. Spago is self-contained, in that it uses its own lightweight computational graph both for training and inference, easy to understand from start to finish. The core module of Spago relies only on testify for unit testing. In other words, it has "zero dependencies", and we are committed to keeping it that way as much as possible. Spago uses a multi-module workspace to ensure that additional dependencies are downloaded only when specific features (e.g. persistent embeddings) are used. A good place to start is by looking at the implementation of built-in neural models, such as the LSTM. Except for a few linear algebra operations written in assembly for optimal performance (a bit of copying from Gonum), it's straightforward Go code, so you don't have to worry.
    Downloads: 5 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB