Showing 110 open source projects for "inference"

  • 1
    bitsandbytes

    Accessible large language models via k-bit quantization for PyTorch

    ...The project includes specialized optimizers and quantized matrix operations that significantly reduce the memory footprint of training and inference workloads. By lowering the hardware requirements needed to work with large models, bitsandbytes helps make modern AI development more accessible to researchers and engineers. The library has become widely used in machine learning pipelines that rely on parameter-efficient training techniques and low-precision inference.
    Downloads: 0 This Week
  • 2
    GLM-4.6

    Agentic, Reasoning, and Coding (ARC) foundation models

    ...The model achieves superior coding performance, excelling in benchmarks and practical coding assistants such as Claude Code, Cline, Roo Code, and Kilo Code. Its reasoning capabilities have been strengthened, including improved tool usage during inference and more effective integration within agent frameworks. GLM-4.6 also enhances writing quality, producing outputs that better align with human preferences and role-playing scenarios. Benchmark evaluations demonstrate that it not only outperforms GLM-4.5 but also rivals leading global models such as DeepSeek-V3.1-Terminus and Claude Sonnet 4.
    Downloads: 67 This Week
  • 3
    MLC LLM

    Universal LLM Deployment Engine with ML Compilation

    ...The project focuses on compiling models into optimized runtimes that can run natively on devices such as GPUs, mobile processors, browsers, and edge hardware. By leveraging machine learning compilation techniques, mlc-llm produces high-performance inference engines that maintain consistent APIs across platforms. The system supports deployment on environments including Linux, macOS, Windows, iOS, Android, and web browsers while utilizing different acceleration technologies such as CUDA, Vulkan, Metal, and WebGPU. It also provides OpenAI-compatible APIs that allow developers to integrate locally deployed models into existing AI applications without major code changes.
    Downloads: 28 This Week
  • 4
    LLaMA Models

    Utilities intended for use with Llama models

    ...It complements separate repos that carry code and demos (for example inference kernels or cookbook content) by keeping authoritative metadata and specs here. Model lineages and size variants are documented externally (e.g., Llama 3.x and beyond), with this repo providing the “single source of truth” links and utilities. In practice, teams use llama-models as a reference when selecting variants, aligning licenses, and wiring in helper scripts for deployment.
    Downloads: 2 This Week
  • 5
    Mosec

    A high-performance ML model serving framework with dynamic batching

    Mosec is a high-performance, flexible model-serving framework for building ML-enabled backends and microservices. It bridges the gap between a machine learning model you have just trained and an efficient online service API.
    Downloads: 1 This Week
  • 6
    Phi-3-MLX

    Phi-3.5 for Mac: Locally-run Vision and Language Models

    Phi-3-Vision-MLX is an Apple MLX (machine learning on Apple silicon) implementation of Phi-3 Vision, a lightweight multi-modal model designed for vision and language tasks. It focuses on running vision-language AI efficiently on Apple hardware like M1 and M2 chips.
    Downloads: 0 This Week
  • 7
    LLaMA 3

    The official Meta Llama 3 GitHub site

    ...Even as a deprecated repo, it documents the transition path and preserves references that clarify how Llama 3 releases map into the current ecosystem. Practically, it functioned as a bridge between Llama 2 and later Llama releases by standardizing distribution and starter code for inference and fine-tuning. Teams still treat it as historical reference material for version lineage and migration notes.
    Downloads: 15 This Week
  • 8
    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    ...Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.
    Downloads: 0 This Week
  • 9
    SageAttention

    [NeurIPS 2025 Spotlight] Quantized Attention

    ...The system achieves this by using low-precision numerical formats such as INT4, FP8, or INT8 to represent key matrices within the attention computation. These optimizations allow models to perform matrix operations faster and consume less memory during inference. SageAttention is designed to function as a plug-and-play replacement for standard attention implementations, enabling developers to accelerate existing models without modifying their architecture.
    Downloads: 0 This Week
  • 10
    Qwen

    The official repo of Qwen chat & pretrained large language model

    Qwen is a series of large language models developed by Alibaba Cloud, consisting of various pretrained versions like Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B. These models, which range from smaller to larger configurations, are designed for a wide range of natural language processing tasks. They are openly available for research and commercial use, with Qwen's code and model weights shared on GitHub. Qwen's capabilities include text generation, comprehension, and conversation, making it a...
    Downloads: 13 This Week
  • 11
    SWIFT LLM

    Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs

    ...The platform provides a full machine learning pipeline that supports tasks ranging from model pre-training to reinforcement learning alignment techniques. It integrates with popular inference engines such as vLLM and LMDeploy to accelerate deployment and runtime performance. The framework also includes support for many modern training strategies, including preference learning methods and parameter-efficient fine-tuning techniques. ms-swift is designed to work with hundreds of language and multimodal models, providing a unified environment for experimentation and production deployment.
    Downloads: 3 This Week
  • 12
    Code World Model (CWM)

    Research code artifacts for Code World Model (CWM)

    ...It is explicitly trained on execution traces, action-observation trajectories, and agentic interactions in controlled environments, and was developed to better capture how code, actions, and state interact over time. The repository provides inference code, reproducibility scripts, prompt guides, model cards, utilities, and demos. It includes inference scripts and utilities for code generation tasks, evaluation benchmarks on code, mathematics, and reasoning tasks, and serving code with evaluation pipelines.
    Downloads: 0 This Week
  • 13
    Tencent-Hunyuan-Large

    Open-source large language model family from Tencent Hunyuan

    ...It is designed with long-context capabilities, quantization support, and high performance on benchmarks across general reasoning, mathematics, language understanding, and Chinese and multilingual tasks. It aims to provide competitive capability with efficient deployment and inference. FP8 quantization support reduces memory usage by roughly 50% while maintaining precision, and the model reports strong results on benchmarks such as MMLU, MATH, CMMLU, and C-Eval.
    Downloads: 0 This Week
  • 14
    LLM Foundry

    LLM training code for MosaicML foundation models

    Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Large language models (LLMs) are changing the world, but for those outside well-resourced industry labs, it can be extremely difficult to train and deploy...
    Downloads: 0 This Week
  • 15
    marqo

    Tensor search for humans

    A tensor-based search and analytics engine that seamlessly integrates with your applications, websites, and workflows. Marqo is a versatile and robust search and analytics engine that can be integrated into any website or application. Due to horizontal scalability, Marqo provides lightning-fast query times, even with millions of documents. Marqo helps you configure deep-learning models like CLIP to pull semantic meaning from images. It can seamlessly handle image-to-image, image-to-text and...
    Downloads: 0 This Week
  • 16
    llmware

    Unified framework for building enterprise RAG pipelines

    ...One of the framework’s defining characteristics is its collection of small specialized language models optimized for specific tasks such as summarization, classification, and document analysis. The system supports a wide range of inference backends including PyTorch, OpenVINO, ONNX Runtime, and other optimized runtimes, allowing developers to choose the most efficient execution environment for their hardware.
    Downloads: 1 This Week
  • 17
    GLM-4-Voice

    GLM-4-Voice | End-to-End Chinese-English Conversational Model

    ...GLM-4-Voice builds upon the bilingual strengths of the GLM architecture, supporting both Chinese and English, and is designed to handle long-form conversations with context retention. The repository provides model weights, inference demos, and setup instructions for deploying speech-enabled AI systems.
    Downloads: 2 This Week
  • 18
    MiniMax-01

    Large-language-model & vision-language-model based on Linear Attention

    ...It has 456 billion total parameters with 45.9 billion activated per token and is trained with advanced parallel strategies such as LASP+, varlen ring attention, and Expert Tensor Parallelism, enabling a training context of 1 million tokens and up to 4 million tokens at inference. MiniMax-VL-01 extends this core by adding a 303M-parameter Vision Transformer and a two-layer MLP projector in a ViT–MLP–LLM framework, allowing the model to process images at dynamic resolutions up to 2016×2016.
    Downloads: 0 This Week
  • 19
    ModernBERT

    Bringing BERT into modernity via both architecture changes and scaling

    ...The goal of the project is to bring BERT-style models up to date with the capabilities of modern large language models while preserving the strengths of bidirectional encoder architectures used for tasks such as classification, retrieval, and semantic search. ModernBERT introduces architectural improvements that enhance both training efficiency and inference performance, making the model more suitable for modern large-scale machine learning pipelines. The repository also includes FlexBERT, a modular framework that allows developers to experiment with different encoder building blocks and configurations when constructing new models.
    Downloads: 1 This Week
  • 20
    Purple Llama

    Set of tools to assess and improve LLM security

    Purple Llama is an umbrella safety initiative that aggregates tools, benchmarks, and mitigations to help developers build responsibly with open generative AI. Its scope spans input and output safeguards, cybersecurity-focused evaluations, and reference shields that can be inserted at inference time. The project evolves as a hub for safety research artifacts like Llama Guard and Code Shield, along with dataset specs and how-to guides for integrating checks into applications. CyberSecEval, one of its flagship components, provides repeatable evaluations for security risk, including agent-oriented tasks such as automated patching benchmarks. ...
    Downloads: 1 This Week
  • 21
    MING

    A large-scale model of medical consultation in Chinese

    ...This interactive capability makes it suitable for conversational health applications, patient triage scenarios, and educational demonstrations. The model is built on transformer-based architectures using frameworks such as PyTorch and integrates with Hugging Face tooling for training and inference workflows.
    Downloads: 0 This Week
  • 22
    CAG

    Cache-Augmented Generation: A Simple, Efficient Alternative to RAG

    CAG, or Cache-Augmented Generation, is an experimental framework that explores an alternative architecture for integrating external knowledge into large language model responses. Traditional retrieval-augmented generation systems rely on real-time retrieval of documents from databases or vector stores during inference. CAG proposes a different approach by preloading relevant knowledge into the model’s context window and precomputing the model’s key-value cache before queries are processed. This strategy allows the model to generate responses using the cached context directly, eliminating the need for repeated retrieval operations during runtime. ...
    Downloads: 0 This Week
  • 23
    MatMul-Free LM

    Implementation for MatMul-free LM

    ...The architecture relies on quantization-aware training and lightweight operations to replace conventional dense matrix multiplications with more efficient alternatives. These optimizations can significantly reduce memory consumption and potentially improve computational efficiency during both training and inference. The repository provides implementations of models at several parameter scales and includes tools for experimenting with the architecture using modern machine learning frameworks.
    Downloads: 0 This Week
  • 24
    Torch Pruning

    DepGraph: Towards Any Structural Pruning

    ...This dependency analysis makes it possible to prune large networks such as transformers, convolutional networks, and diffusion models without breaking the computational graph. Torch-Pruning physically removes parameters rather than masking them, which results in smaller and faster models during both training and inference. The toolkit supports a wide variety of architectures used in computer vision and large language models, making it a flexible solution for model compression tasks.
    Downloads: 0 This Week
  • 25
    Ring

    Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI

    ...It is built on, or derived from, Ling. Its design emphasizes reasoning, efficiency, and modular expert activation. In its “flash” variant (Ring-flash-2.0), it optimizes inference by activating only a subset of experts. It applies reinforcement learning and reasoning optimization techniques, and its architecture and training approach are tuned for efficient, capable reasoning. Key features include a reasoning-optimized model with reinforcement learning enhancements and an efficient architecture and memory design for large-scale reasoning. ...
    Downloads: 0 This Week
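
The Ring entry above mentions that Ring-flash-2.0 optimizes inference by activating only a subset of experts. As a rough illustration of that general idea, and not Ring's actual routing code, here is a minimal Python sketch of top-k expert routing in a mixture-of-experts layer. The softmax top-k gate, the function names `route_top_k` and `moe_forward`, and the toy scalar experts are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    (Top-k softmax gating is an assumption, not a documented Ring detail.)
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    selected = route_top_k(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    return output, [i for i, _ in selected]


# Hypothetical toy experts: each one is just a scalar function.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * x,
]

if __name__ == "__main__":
    y, used = moe_forward(3.0, EXPERTS, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
    print("experts used:", used, "output:", y)
```

Only `k` of the experts are evaluated per input, so compute per token scales with `k` and not with the total number of experts. That is the efficiency argument behind sparse expert activation in general.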