
ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model

ChatGLM-6B is an open bilingual (Chinese + English) conversational language model based on the GLM architecture, with approximately 6.2 billion parameters. The project provides inference code, demos (command line, web, API), quantization support for lower-memory deployment, and tools for fine-tuning (e.g., via P-Tuning v2). It is optimized for dialogue and question answering, balancing performance against deployability on consumer hardware. Support for quantized inference...

Downloads: 4 This Week

Last Update: 2025-09-26


wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference

wllama is a WebAssembly-based library that enables large language model inference directly inside a web browser. Built as a binding for the llama.cpp inference engine, the project allows developers to run LLMs locally without requiring a server backend or dedicated GPU hardware. The library leverages WebAssembly SIMD capabilities to achieve efficient execution within modern browsers while maintaining compatibility across platforms. By running models locally on the user’s device, wllama...

Downloads: 1 This Week

Last Update: 2026-03-10


VLMEvalKit

Open-source evaluation toolkit of large multi-modality models (LMMs)

...Instead of requiring complex data preparation pipelines or multiple repositories for each benchmark, the system enables evaluation through simple commands that automatically handle dataset loading, model inference, and metric computation. VLMEvalKit supports generation-based evaluation methods, allowing models to produce textual responses to visual inputs while measuring performance through techniques such as exact matching or language-model-assisted answer extraction.

Downloads: 1 This Week

Last Update: 2026-03-05


AirLLM

AirLLM 70B inference with single 4GB GPU

AirLLM is an open source Python library that enables extremely large language models to run on consumer hardware with very limited GPU memory. The project addresses one of the main barriers to local LLM experimentation by introducing a memory-efficient inference technique that loads model layers sequentially rather than storing the entire model in GPU memory. This layer-wise inference approach allows models with tens of billions of parameters to run on devices with only a few gigabytes of...

Downloads: 0 This Week

Last Update: 2026-03-10


DevDocs by CyberAGI

Completely free, private, UI based Tech Documentation MCP server

DevDocs is an open-source documentation server designed to provide developers with a private, structured interface for browsing and interacting with technical documentation using AI tools. The system functions as a Model Context Protocol (MCP) server that allows large language models and developer assistants to access technical documentation in a structured and efficient way. Instead of sending entire documents to a language model, DevDocs organizes documentation into sections so that only...

Downloads: 0 This Week

Last Update: 2026-03-06


PowerInfer

High-speed Large Language Model Serving for Local Deployment

PowerInfer is a high-performance inference engine designed to run large language models efficiently on personal computers equipped with consumer-grade GPUs. The project focuses on improving the performance of local AI inference by optimizing how neural network computations are distributed between CPU and GPU resources. Its architecture exploits the observation that only a subset of neurons in large models are frequently activated, allowing the system to preload frequently used neurons into...

Downloads: 0 This Week

Last Update: 2026-03-04


RAGxplorer

Open-source tool to visualise your RAG

RAGxplorer is an open-source visualization tool designed to help developers analyze and understand Retrieval-Augmented Generation (RAG) pipelines. Retrieval-augmented generation combines language models with external document retrieval systems in order to produce more accurate and grounded responses. However, RAG systems can be complex because they involve multiple components such as embedding models, vector databases, and retrieval algorithms. RAGxplorer provides visual tools that allow...

Downloads: 0 This Week

Last Update: 2026-03-09


Mixtral offloading

Run Mixtral-8x7B models in Colab or consumer desktops

...This approach takes advantage of the sparse activation properties of mixture-of-experts architectures, where only a subset of expert networks are used for each token during generation. By selectively loading and caching the required experts, the system avoids keeping the entire model in GPU memory at once. The repository includes notebooks and code examples that demonstrate how to run large language models on consumer hardware such as personal GPUs or cloud notebook environments.
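The expert-caching idea described above can be illustrated with a toy least-recently-used (LRU) cache. This is only a minimal sketch under stated assumptions, not the repository's actual code: the `ExpertCache` class, the `fake_load` loader, and the routing sequence are all hypothetical, and strings stand in for real expert weights.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache standing in for the set of experts resident in GPU memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity   # maximum number of experts kept "on the GPU"
        self.load_fn = load_fn     # hypothetical loader: expert_id -> weights
        self.cache = OrderedDict()
        self.loads = 0             # number of fetches from slower memory

    def get(self, expert_id):
        if expert_id in self.cache:
            # Cache hit: mark this expert as most recently used.
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: load the expert, then evict the least recently used one
        # if the cache is over capacity.
        weights = self.load_fn(expert_id)
        self.loads += 1
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return weights


def fake_load(expert_id):
    # Placeholder for copying expert weights from host memory (assumption).
    return f"weights-{expert_id}"


cache = ExpertCache(capacity=2, load_fn=fake_load)
# Hypothetical routing decisions: two experts selected per token.
for routed_experts in [(0, 3), (0, 3), (1, 3)]:
    for expert_id in routed_experts:
        cache.get(expert_id)

print(cache.loads, list(cache.cache))
```

With only two cache slots, the six expert lookups in this sequence trigger three loads, because consecutive tokens reuse experts that are already resident.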

Downloads: 1 This Week

Last Update: 2026-03-06


Grok-1

Open-source, high-performance Mixture-of-Experts large language model

...In March 2024, xAI released Grok-1's model weights and architecture under the Apache 2.0 license, making them openly accessible to developers. The accompanying GitHub repository provides JAX example code for loading and running the model. Due to its substantial size, utilizing Grok-1 requires a machine with significant GPU memory. The repository's MoE layer implementation prioritizes correctness over efficiency, avoiding the need for custom kernels. This is a full repo snapshot ZIP file of the Grok-1 code.

1 Review

Downloads: 34 This Week

Last Update: 2025-02-27


LLaMA-MoE

Building Mixture-of-Experts from LLaMA with Continual Pre-training

...The project is not just a model release, but also a research framework that includes multiple expert construction methods, several gating strategies, and tooling for continual pre-training on filtered SlimPajama-based datasets. It also emphasizes training efficiency through features such as FlashAttention-v2 integration and fast streaming dataset loading, which are important for large-scale experimentation.

Downloads: 0 This Week

Last Update: 2026-03-10


react-llm

Easy-to-use headless React Hooks to run LLMs in the browser with WebGPU

Easy-to-use headless React Hooks to run LLMs in the browser with WebGPU. As simple as useLLM().

Downloads: 0 This Week

Last Update: 2023-08-25


LLaMA.go

llama.go is like llama.cpp in pure Golang

llama.go is like llama.cpp in pure Golang. The code of the project is based on the legendary ggml.cpp framework of Georgi Gerganov written in C++ with the same attitude to performance and elegance. Both models store FP32 weights, so you'll need at least 32 GB of RAM (not VRAM or GPU RAM) for LLaMA-7B. Double that to 64 GB for LLaMA-13B.
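The RAM figures above follow from simple arithmetic: an FP32 weight occupies 4 bytes. The sketch below is a rough estimate, assuming nominal parameter counts of 7 and 13 billion (the exact counts differ slightly) and counting weights only, not activations or other runtime buffers. The helper name `fp32_weight_gb` is illustrative.

```python
BYTES_PER_FP32 = 4  # a 32-bit float takes 4 bytes


def fp32_weight_gb(n_params):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_FP32 / 1e9


# Nominal parameter counts (assumption; real counts differ slightly).
for name, n_params in [("LLaMA-7B", 7e9), ("LLaMA-13B", 13e9)]:
    print(f"{name}: ~{fp32_weight_gb(n_params):.0f} GB of FP32 weights")
```

Roughly 28 GB and 52 GB of weights are consistent with the stated 32 GB and 64 GB requirements once some headroom is left for the rest of the process.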

Downloads: 1 This Week

Last Update: 2023-08-25


Search Results for "loading"

Showing 12 open source projects for "loading"

