speed-dreeams free download

LightLLM

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework

LightLLM is a high-performance inference and serving framework designed specifically for large language models, focusing on lightweight architecture, scalability, and efficient deployment. The framework enables developers to run and serve modern language models with significantly improved speed and resource efficiency compared to many traditional inference systems. Built primarily in Python, the project integrates optimization techniques and ideas from several leading open-source implementations, including FasterTransformer, vLLM, and FlashAttention, to accelerate token generation and reduce latency. LightLLM is designed to handle large-scale model workloads in production environments, supporting efficient batching and GPU utilization for fast inference across multiple requests. ...

Downloads: 0 This Week

Last Update: 2026-03-05

Guidance

A guidance language for controlling large language models

Guidance is an efficient programming paradigm for steering language models. With Guidance, you can control how output is structured and get high-quality output for your use case—while reducing latency and cost vs. conventional prompting or fine-tuning. It allows users to constrain generation (e.g. with regex and CFGs) as well as to interleave control (conditionals, loops, tool use) and generation seamlessly.

Downloads: 1 This Week

Last Update: 2026-03-18
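
To make the idea of constrained generation in the Guidance entry above more concrete, here is a minimal sketch that only lets a decoder emit characters that keep the output a prefix of one of a fixed set of allowed answers. It does not use the Guidance library or its API; `allowed_next`, `constrained_decode`, the toy scoring function, and the option list are all hypothetical names chosen for illustration.

```python
def allowed_next(prefix, options):
    """Return the characters that keep `prefix` a prefix of some option."""
    return {opt[len(prefix)] for opt in options
            if opt.startswith(prefix) and len(opt) > len(prefix)}


def constrained_decode(score_fn, options, max_steps=32):
    """Greedy character-level decoding restricted to a fixed set of options.

    `score_fn(prefix, ch)` is a hypothetical stand-in for a model's score of
    character `ch` given the text generated so far.
    """
    out = ""
    for _ in range(max_steps):
        if out in options:
            # Stop as soon as a complete option has been produced.
            return out
        candidates = allowed_next(out, options)
        if not candidates:
            break
        # Pick the best-scoring character among those the constraint allows.
        out += max(sorted(candidates), key=lambda ch: score_fn(out, ch))
    return out


# Toy scores (assumption): the "model" likes "z" most, which no option allows.
def toy_score(prefix, ch):
    return {"z": 9.0, "m": 2.0, "y": 1.0}.get(ch, 0.0)


print(constrained_decode(toy_score, ["yes", "no", "maybe"]))  # prints "maybe"
```

Because the sketch stops at the first complete option, an option that is a prefix of another (for example "no" and "none") would always win; a real grammar-based decoder would need an explicit end-of-output decision.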

SentenceTransformers

Multilingual sentence & image embeddings with BERT

...Further, it is easy to fine-tune your own models. Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. The code is also tuned for speed.

Downloads: 7 This Week

Last Update: 6 days ago

how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA

how-to-optim-algorithm-in-cuda is an open educational repository focused on teaching developers how to optimize algorithms for high-performance execution on GPUs using CUDA. The project combines technical notes, code examples, and practical experiments that demonstrate how common computational kernels can be optimized to improve speed and memory efficiency. Instead of presenting only theoretical explanations, the repository includes hand-written CUDA implementations of fundamental operations such as reductions, element-wise computations, softmax, and attention mechanisms. These examples show how different optimization techniques influence performance on modern GPU hardware and allow readers to experiment with real implementations. ...

Downloads: 1 This Week

Last Update: 4 days ago

ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM

...It upgrades the base model with GLM's hybrid pretraining objective, 1.4T tokens of bilingual data, and preference alignment, delivering big gains on MMLU, C-Eval, GSM8K, and BBH. The context window extends up to 32K (FlashAttention), and Multi-Query Attention improves speed and memory use. The repo includes Python APIs, CLI and web demos, OpenAI-style FastAPI servers, and quantized checkpoints for lightweight local deployment on GPUs or CPU/MPS.

Downloads: 1 This Week

Last Update: 2 days ago

Mosec

A high-performance ML model serving framework that offers dynamic batching

Mosec is a high-performance and flexible model-serving framework for building ML model-enabled backends and microservices. It bridges the gap between the machine learning models you have just trained and an efficient online service API.

Downloads: 1 This Week

Last Update: 5 days ago
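
The dynamic batching mentioned in the Mosec entry above can be sketched in a few lines: wait for one request, then keep collecting until the batch is full or a short deadline passes. This is a generic illustration built on the standard library, not Mosec's API; `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until the batch is full
    or `max_wait_s` seconds have passed since the first one arrived."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            # Deadline reached with no new request: send what we have.
            break
    return batch


# Usage: five queued requests with a batch limit of three.
pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=3, max_wait_s=0.05))  # prints [0, 1, 2]
print(collect_batch(pending, max_batch_size=3, max_wait_s=0.05))  # prints [3, 4]
```

The two limits trade throughput against latency: a larger batch keeps the accelerator busier, while the wait bound caps how long the first request in a batch can be delayed.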

MobileLLM

MobileLLM: Optimizing Sub-billion Parameter Language Models

...The framework integrates several architectural innovations—SwiGLU activation, deep and thin network design, embedding sharing, and grouped-query attention (GQA)—to achieve a superior trade-off between model size, inference speed, and accuracy. MobileLLM demonstrates remarkable performance, with the 125M and 350M variants outperforming previous state-of-the-art models of the same scale by up to 4.3% on zero-shot commonsense reasoning tasks.

Downloads: 1 This Week

Last Update: 2 days ago

OpenAI Forward

An efficient forwarding service designed for LLMs

OpenAI Forward is an open-source forwarding and reverse proxy service for large language model APIs, designed to sit between client applications and model providers. Its main purpose is to make model access more manageable and efficient by adding operational controls such as request rate limiting, token rate limiting, caching, logging, routing, and key management around existing LLM endpoints. The project can proxy both local and cloud-hosted language model services, which makes it useful...

Downloads: 0 This Week

Last Update: 2026-03-10
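
One common way to implement the request and token rate limiting named in the OpenAI Forward entry above is a token bucket. The sketch below is a generic version under that assumption and is not taken from the OpenAI Forward code base; the `TokenBucket` class, its parameters, and the injectable `clock` are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` units per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and deduct `cost` if enough budget is available."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: a burst of two requests, then one per second.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # prints True True False
fake_now[0] = 1.0
print(bucket.allow())  # prints True
```

With `cost=1` per call the bucket limits requests; passing the number of LLM tokens in a request as `cost` turns the same structure into a token rate limit.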

Engram

A New Axis of Sparsity for Large Language Models

...It provides utilities to generate embeddings from text or other structured data, index them using efficient approximate nearest neighbor algorithms, and perform real-time similarity queries even on large corpora. Engineered with speed and memory efficiency in mind, Engram supports batched indexing, incremental updates, and custom distance metrics so developers can tailor search behaviors to their domain’s needs. In addition to raw similarity search, the project includes tools for clustering, ranking, and filtering results, enabling richer user experiences like “related content”, semantic auto-completion, and contextual filtering.

Downloads: 0 This Week

Last Update: 2026-01-28

Ring

Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI

...Reasoning-optimized model with reinforcement learning enhancements. Efficient architecture and memory design for large-scale reasoning. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

Downloads: 0 This Week

Last Update: 2025-09-30

GLM-V

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning

...GLM-4.5V builds on the flagship GLM-4.5-Air foundation (106B parameters, 12B active), achieving state-of-the-art results on 42 benchmarks across image, video, document, GUI, and grounding tasks. It introduces hybrid training for broad-spectrum reasoning and a Thinking Mode switch to balance speed and depth of reasoning. GLM-4.1V-9B-Thinking incorporates reinforcement learning with curriculum sampling (RLCS) and Chain-of-Thought reasoning, outperforming models much larger in scale (e.g., Qwen-2.5-VL-72B) across many benchmarks.

Downloads: 0 This Week

Last Update: 2 days ago

GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

...The model supports efficient inference via INT8 and INT4 quantization, reducing hardware requirements from 8× A100 GPUs to as little as a single server with 4× RTX 3090s. Built on the SwissArmyTransformer (SAT) framework and compatible with DeepSpeed and FasterTransformer, it supports high-speed inference (up to 2.5× faster) and reproducible evaluation across 30+ benchmark tasks.

Downloads: 2 This Week

Last Update: 8 hours ago
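
The hardware savings in the GLM-130B entry above come from storing each weight in fewer bits: 130B parameters need roughly 260 GB at 16 bits each but only about 65 GB at 4 bits, before any overhead. The sketch below shows the basic idea with symmetric absmax INT8 quantization of a small weight list. It is a simplified illustration, not GLM-130B's actual quantization scheme, and `quantize_int8` and `dequantize_int8` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    # One shared scale per list; an all-zero list falls back to a scale of 1.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


# Usage: the round trip is lossy, with error bounded by half a quantization step.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(max_error <= scale / 2 + 1e-9)  # prints True
```

Real implementations usually keep one scale per row, column, or block of weights rather than one per tensor, which reduces the error caused by a single large outlier.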

Search Results for "speed-dreeams"

Showing 12 open source projects for "speed-dreeams"
