cuda free download - SourceForge

how-to-optim-algorithm-in-cuda

How to optimize algorithms in CUDA

how-to-optim-algorithm-in-cuda is an open educational repository focused on teaching developers how to optimize algorithms for high-performance execution on GPUs using CUDA. The project combines technical notes, code examples, and practical experiments that demonstrate how common computational kernels can be optimized to improve speed and memory efficiency. Instead of presenting only theoretical explanations, the repository includes hand-written CUDA implementations of fundamental operations such as reductions, element-wise computations, softmax, and attention mechanisms. ...

Downloads: 0 This Week

Last Update: 2026-06-08


vLLM

A high-throughput and memory-efficient inference and serving engine

vLLM is a fast and easy-to-use library for LLM inference and serving. High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.

Downloads: 16 This Week

Last Update: 5 days ago


RWKV Runner

An RWKV management and startup tool, fully automated, only 8MB

...So it combines the best of RNNs and transformers: great performance, fast inference, fast training, low VRAM use, "infinite" context length, and free text embedding. Moreover, it is 100% attention-free. The default configs enable custom CUDA kernel acceleration, which is much faster and consumes much less VRAM. If you encounter compatibility issues, go to the Configs page and turn off "Use Custom CUDA kernel to Accelerate".

Downloads: 0 This Week

Last Update: 2026-05-08


Nano-vLLM

A lightweight vLLM implementation built from scratch

...The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work. Despite its compact design, nano-vllm incorporates advanced optimization techniques such as prefix caching, tensor parallelism, and CUDA graph execution to achieve high performance during model inference. The engine is intended primarily for educational use, experimentation, and lightweight deployments where a full production-grade inference stack may be unnecessary. Its API closely mirrors that of the original vLLM framework, allowing developers familiar with vLLM to adopt the tool with minimal changes.

Downloads: 0 This Week

Last Update: 2026-04-26


MLC LLM

Universal LLM Deployment Engine with ML Compilation

...The system supports deployment on environments including Linux, macOS, Windows, iOS, Android, and web browsers while utilizing different acceleration technologies such as CUDA, Vulkan, Metal, and WebGPU. It also provides OpenAI-compatible APIs that allow developers to integrate locally deployed models into existing AI applications without major code changes.

Downloads: 19 This Week

Last Update: 2026-03-09


llama.cpp

LLM inference in C/C++

...It provides command-line tools, a server mode with an OpenAI-compatible API style, model conversion utilities, and extensive backend acceleration options. llama.cpp runs on CPUs and GPUs, with support for Apple silicon, x86, RISC-V, CUDA, HIP, Vulkan, SYCL, Metal, and hybrid CPU-GPU execution. Its main value is making practical LLM inference accessible across consumer machines, servers, and specialized deployment environments.

Downloads: 8 This Week

Last Update: 1 hour ago


llm.c

LLM training in simple, raw C/CUDA

llm.c is a minimalist, systems-level implementation of a small transformer-based language model in C that prioritizes clarity and educational value. By stripping away heavy frameworks, it exposes the core math and memory flows of embeddings, attention, and feed-forward layers. The code illustrates how to wire forward passes, losses, and simple training or inference loops with direct control over arrays and buffers. Its compact design makes it easy to trace execution, profile hotspots, and...

Downloads: 0 This Week

Last Update: 2025-10-15


Infinity

Low-latency REST API for serving text-embeddings

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under the MIT License. Infinity powers inference behind Gradient.ai and other embedding API providers.

Downloads: 0 This Week

Last Update: 2025-08-22


Punica

Serving multiple LoRA fine-tuned LLMs as one

...Punica introduces a serving architecture that allows multiple LoRA adapters to share the same base model during inference, significantly reducing memory consumption and computational overhead. The system includes specialized CUDA kernels that enable batched GPU operations across different LoRA models simultaneously. This design allows a single GPU cluster to host many task-specific models while maintaining high throughput and minimal latency. The architecture also includes scheduling mechanisms that coordinate requests from multiple tenants and distribute workloads efficiently across available resources.
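The idea of many LoRA adapters sharing one base model can be illustrated with a small sketch. This is not Punica's code or API: the names `matvec`, `lora_forward`, `tenant_a`, and `tenant_b` are hypothetical, the weights are toy values, and a plain Python loop stands in for the batched CUDA kernels the description mentions. It assumes the usual LoRA formulation, where each request computes `y = W x + B (A x)` with a shared `W` and small per-adapter factors `A` and `B`.

```python
# Hypothetical sketch: one shared base weight, several per-request LoRA adapters.
# All names here are illustrative assumptions, not Punica's API.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(base_w, adapters, batch):
    """Apply y = W x + B (A x) per request, reusing the same base_w.

    batch is a list of (adapter_name, x) pairs, so requests that target
    different adapters can be handled in the same pass.
    """
    outputs = []
    for name, x in batch:
        a, b = adapters[name]            # low-rank factors: A (r x d_in), B (d_out x r)
        base_out = matvec(base_w, x)     # work that uses the shared base weight
        delta = matvec(b, matvec(a, x))  # adapter-specific low-rank update
        outputs.append([p + q for p, q in zip(base_out, delta)])
    return outputs


base_w = [[1.0, 0.0], [0.0, 1.0]]        # 2x2 identity as a toy base weight
adapters = {
    "tenant_a": ([[1.0, 1.0]], [[1.0], [0.0]]),  # rank-1 adapter
    "tenant_b": ([[0.0, 2.0]], [[0.0], [1.0]]),  # rank-1 adapter
}
batch = [("tenant_a", [1.0, 2.0]), ("tenant_b", [1.0, 2.0])]
results = lora_forward(base_w, adapters, batch)
print(results)
```

Only the small `A` and `B` factors are stored per tenant, while `base_w` is held once, which is where the memory saving in this kind of design comes from.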

Downloads: 0 This Week

Last Update: 2026-03-09


CTransformers

Python bindings for the Transformer models implemented in C/C++

Python bindings for the Transformer models implemented in C/C++ using the GGML library.

Downloads: 0 This Week

Last Update: 2023-09-10


llm

An ecosystem of Rust libraries for working with large language models

llm is an ecosystem of Rust libraries for working with large language models - it's built on top of the fast, efficient GGML library for machine learning. The primary entry point for developers is the llm crate, which wraps the llm-base and the supported model crates. Documentation for the released version is available on Docs.rs. For end-users, there is a CLI application, llm-cli, which provides a convenient interface for interacting with supported models. Text generation can be done as a...

Downloads: 1 This Week

Last Update: 2023-08-21


LM Human Preferences

Code for the paper Fine-Tuning Language Models from Human Preferences

...The code is provided “as is”, with an explicit note that it may no longer run out of the box due to dependencies or dataset migrations. It was tested on the smallest GPT-2 (124M parameters) under a specific environment (TensorFlow 1.x, specific CUDA / cuDNN combinations). It includes utilities for launching experiments, sampling from policies, and simple experiment orchestration.

Downloads: 0 This Week

Last Update: 2025-10-03


DomE

Implements a reference architecture for creating information systems

DomE Experiment is an implementation of a reference architecture for creating information systems from the automated evolution of the domain model. The architecture comprises elements that guarantee user access through automatically generated interfaces for various devices, integration with external information sources, data and operations security, automatic generation of analytical information, and automatic control of business processes. All these features are generated from the domain...

Downloads: 0 This Week

Last Update: 2023-03-22


Search Results for "cuda"

Showing 13 open source projects for "cuda"


