20 projects for "cpu memory usage" with 2 filters applied:

  • 1
    GPU Hot

    Real-time NVIDIA GPU dashboard

    ...The project offers a self-hosted web interface that streams hardware metrics directly from GPU servers, enabling developers, ML engineers, and system administrators to observe GPU utilization and system behavior in real time through a browser. The dashboard collects and displays a wide range of performance metrics including temperature, memory usage, power consumption, clock speeds, fan speed, and active processes. It can scale from monitoring a single GPU workstation to large distributed environments with dozens or even hundreds of GPUs by running lightweight containers on each node and aggregating the data centrally.
    Downloads: 3 This Week
  • 2
    AirLLM

    AirLLM 70B inference with single 4GB GPU

    AirLLM is an open source Python library that enables extremely large language models to run on consumer hardware with very limited GPU memory. The project addresses one of the main barriers to local LLM experimentation by introducing a memory-efficient inference technique that loads model layers sequentially rather than storing the entire model in GPU memory. This layer-wise inference approach allows models with tens of billions of parameters to run on devices with only a few gigabytes of...
    Downloads: 6 This Week
  • 3
    KVCache-Factory

    Unified KV Cache Compression Methods for Auto-Regressive Models

    ...In large language models, the key-value cache stores intermediate attention states that enable efficient token generation during inference, but these caches can consume large amounts of GPU memory when handling long contexts. KVCache-Factory provides a platform for implementing and evaluating multiple compression strategies that reduce memory usage while preserving model performance. The framework integrates several state-of-the-art methods such as PyramidKV, SnapKV, H2O, and StreamingLLM, allowing researchers to compare and experiment with different approaches within the same environment. ...
    Downloads: 0 This Week
  • 4
    xLSTM

    Neural Network architecture based on ideas of the original LSTM

    ...The architecture aims to provide competitive performance with transformer-based models while maintaining advantages such as linear computational scaling and efficient memory usage for long sequences. Researchers have demonstrated that xLSTM models can scale to billions of parameters and large training datasets while maintaining efficient inference speeds.
    Downloads: 0 This Week
  • 5
    PicoLM

    Run a 1-billion parameter LLM on a $10 board with 256MB RAM

    PicoLM is an open-source inference framework designed to run large language models on extremely constrained hardware environments such as inexpensive single-board computers and embedded systems. The project focuses on enabling efficient local inference by optimizing memory usage, computation, and system dependencies so that relatively large models can operate on devices with minimal RAM. It is written primarily in C and designed with a minimalist architecture that removes unnecessary dependencies and external libraries. The runtime is capable of running language models with billions of parameters on devices with only a few hundred megabytes of memory, which is significantly lower than typical LLM infrastructure requirements. ...
    Downloads: 3 This Week
  • 6
    Tencent-Hunyuan-Large

    Open-source large language model family from Tencent Hunyuan

    ...It is designed with long-context capabilities, quantization support, and high performance on benchmarks across general reasoning, mathematics, language understanding, and Chinese / multilingual tasks. It aims to provide competitive capability with efficient deployment and inference. FP8 quantization support reduces memory usage by roughly 50% while maintaining precision, and the model scores well on benchmarks such as MMLU, MATH, CMMLU, and C-Eval.
    Downloads: 2 This Week
  • 7
    R-KV

    Redundancy-aware KV Cache Compression for Reasoning Models

    R-KV is an open-source research project that focuses on improving the efficiency of large language model inference through key-value cache compression techniques. Modern transformer models rely heavily on KV caches during autoregressive decoding, which store intermediate attention states to accelerate generation. However, these caches can consume large amounts of memory, especially in reasoning-oriented models with long context windows. R-KV introduces a method for compressing the KV cache...
    Downloads: 0 This Week
  • 8
    llmfit

    157 models, 30 providers, one command to find what runs on hardware

    llmfit is a terminal-based utility that helps developers determine which large language models can realistically run on their local hardware by analyzing system resources and model requirements. The tool automatically detects CPU, RAM, GPU, and VRAM specifications, then ranks available models based on performance factors such as speed, quality, and memory fit. It provides both an interactive terminal user interface and a traditional CLI mode, enabling flexible workflows for different user preferences. llmfit also supports advanced configurations including multi-GPU setups, mixture-of-experts architectures, and dynamic quantization recommendations. ...
    Downloads: 10 This Week
  • 9
    mergekit

    Tools for merging pretrained large language models

    mergekit is an open-source toolkit designed to combine multiple pretrained language models into a single unified model through parameter merging techniques. The framework enables developers to merge model checkpoints so that the resulting model inherits capabilities from several source models without requiring additional training. This approach allows researchers to combine specialized models into a more versatile system capable of performing multiple tasks. mergekit implements a variety of...
    Downloads: 0 This Week
  • 10
    AgentGuide

    AI Agent Development Guide, LangGraph in Action, Advanced RAG

    ...Instead of presenting scattered resources, the repository organizes them into a systematic learning roadmap that guides learners from foundational concepts to advanced AI agent systems. The guide covers topics such as agent frameworks, retrieval-augmented generation systems, multi-agent collaboration, memory management, and tool usage. It also includes practical projects, interview preparation materials, and curated research papers related to AI agents and LLM engineering. The project is designed not only for learning but also for career preparation, helping developers understand how to build portfolio projects and prepare for AI engineering roles.
    Downloads: 2 This Week
  • 11
    Chitu

    High-performance inference framework for large language models

    ...Chitu is designed to scale from small single-machine deployments to large distributed clusters that handle high volumes of concurrent inference requests. The system also includes performance optimizations for large models, including support for quantized formats and efficient computation operators that reduce memory usage and latency. Its architecture aims to support enterprise adoption by ensuring stable long-term operation under production workloads.
    Downloads: 2 This Week
  • 12
    LLM-Pruner

    On the Structural Pruning of Large Language Models

    LLM-Pruner is an open-source framework designed to compress large language models through structured pruning techniques while maintaining their general capabilities. Large language models often require enormous computational resources, making them expensive to deploy and inefficient for many practical applications. LLM-Pruner addresses this issue by identifying and removing non-essential components within transformer architectures, such as redundant attention heads or feed-forward...
    Downloads: 1 This Week
  • 13
    Torch Pruning

    DepGraph: Towards Any Structural Pruning

    Torch-Pruning is an open-source toolkit designed to optimize deep neural networks by performing structural pruning directly within PyTorch models. The library focuses on reducing the size and computational cost of neural networks by removing redundant parameters and channels while maintaining model performance. It introduces a graph-based algorithm called DepGraph that automatically identifies dependencies between layers, allowing parameters to be pruned safely across complex architectures....
    Downloads: 1 This Week
  • 14
    Extractous

    Fast and efficient unstructured data extraction

    ...Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. For broader format support, the system combines its Rust core with ahead-of-time compiled Apache Tika shared libraries, which allows it to extend parsing coverage while still avoiding traditional server-based overhead. It also supports OCR for images and scanned documents through Tesseract, making it useful for document ingestion pipelines that include image-based or scanned inputs.
    Downloads: 0 This Week
  • 15
    LangChain for Java

    LangChain4j is an open-source Java library

    LangChain for Java is an open-source Java framework designed to simplify the development of applications powered by large language models. The library provides a unified API that allows developers to connect Java applications to multiple AI providers and embedding databases without having to implement separate integrations for each service. Its architecture includes abstractions for prompts, chat interactions, document processing, embeddings, and vector storage, enabling developers to build...
    Downloads: 0 This Week
  • 16
    Mixtral offloading

    Run Mixtral-8x7B models in Colab or consumer desktops

    Mixtral-Offloading is an open-source project designed to enable efficient inference of large Mixture-of-Experts language models such as Mixtral-8x7B on hardware with limited GPU memory. The project implements techniques that allow model components to be dynamically moved between CPU memory and GPU memory during inference, significantly reducing the amount of GPU VRAM required to run the model. This approach takes advantage of the sparse activation properties of mixture-of-experts architectures, where only a subset of expert networks is used for each token during generation. ...
    Downloads: 0 This Week
  • 17
    Firefly LLM

    A large model training tool that supports training large models

    Firefly is an open-source framework designed to simplify the training and fine-tuning of large language models through a unified and configurable workflow. The project provides a comprehensive environment where developers can perform tasks such as model pre-training, instruction tuning, and preference optimization using widely adopted machine learning techniques. Its architecture supports both full-parameter training and parameter-efficient strategies like LoRA and QLoRA, making it suitable...
    Downloads: 0 This Week
  • 18
    Punica

    Serving multiple LoRA finetuned LLM as one

    Punica is a system designed to efficiently serve multiple LoRA-fine-tuned large language models within a shared GPU environment. LoRA is a parameter-efficient fine-tuning method that allows developers to adapt large pretrained models to specific tasks by adding lightweight adapter layers rather than retraining the entire model. Punica introduces a serving architecture that allows multiple LoRA adapters to share the same base model during inference, significantly reducing memory consumption...
    Downloads: 0 This Week
  • 19
    gpu_poor

    Calculate token/s & GPU memory requirement for any LLM

    gpu_poor is an open-source tool designed to help developers determine whether their hardware is capable of running a specific large language model and to estimate the performance they can expect from it. The project focuses on calculating GPU memory requirements and predicted inference speed for different models, hardware configurations, and quantization strategies. By analyzing factors such as model size, context length, batch size, and GPU specifications, the system estimates how much VRAM...
    Downloads: 0 This Week
  • 20
    MiMo-V2.5-Pro

    Flagship MoE model for long-context agents and complex coding

    ...The model supports a 1 million token context window, enabling it to maintain coherence across long workflows involving thousands of tool calls and multi-step reasoning chains. Architecturally, it uses a hybrid attention system combining Sliding Window Attention and Global Attention to significantly reduce memory usage while preserving long-context performance. It also integrates multi-token prediction modules that accelerate inference and improve reinforcement learning efficiency. It was trained on around 27 trillion tokens with FP8 mixed precision and refined through supervised fine-tuning, large-scale agentic reinforcement learning, and distillation.
    Downloads: 0 This Week