cuda free download - SourceForge

Showing 95 open source projects for "cuda"

1

CUDA Agent

Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

CUDA Agent is a research-driven agentic reinforcement learning system designed to automatically generate and optimize high-performance CUDA kernels for GPU workloads. The project addresses the long-standing challenge that efficient CUDA programming typically requires deep hardware expertise by training an autonomous coding agent capable of iterative improvement through execution feedback.

Downloads: 2 This Week

Last Update: 2026-03-03
See Project
2

CUDA Python

Performance meets Productivity

CUDA Python is a unified Python interface for accessing and working with the NVIDIA CUDA platform, enabling developers to build GPU-accelerated applications entirely in Python. It acts as a metapackage composed of multiple submodules that provide both high-level and low-level access to CUDA functionality, including runtime APIs, driver APIs, and JIT compilation tools.

Downloads: 1 This Week

Last Update: 2026-07-21
See Project
3

Numba CUDA Target

The CUDA target for Numba

Numba CUDA Target is NVIDIA’s maintained CUDA backend for the Numba JIT compiler, enabling developers to write GPU-accelerated code directly in Python. It allows users to define CUDA kernels using Python syntax, which are then compiled into efficient GPU code at runtime using LLVM-based toolchains. This approach significantly lowers the barrier to entry for GPU programming by eliminating the need to write CUDA C++ while still delivering high performance. ...

Downloads: 0 This Week

Last Update: 2026-07-03
See Project
4

how-to-optim-algorithm-in-cuda

How to optimize some algorithm in cuda

how-to-optim-algorithm-in-cuda is an open educational repository focused on teaching developers how to optimize algorithms for high-performance execution on GPUs using CUDA. The project combines technical notes, code examples, and practical experiments that demonstrate how common computational kernels can be optimized to improve speed and memory efficiency. Instead of presenting only theoretical explanations, the repository includes hand-written CUDA implementations of fundamental operations such as reductions, element-wise computations, softmax, and attention mechanisms. ...

Downloads: 2 This Week

Last Update: 6 days ago
See Project
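
The how-to-optim-algorithm-in-cuda entry above names reductions and softmax among its hand-written kernels. As a rough illustration of what those operations involve, here is a CPU-side sketch in plain Python, not code from the repository. The function names `tree_reduce_sum` and `stable_softmax` are hypothetical. The first mimics the pairwise (tree) access pattern commonly used for parallel sums; the second shows the usual max-subtraction trick that keeps softmax from overflowing.

```python
import math


def tree_reduce_sum(values):
    """Sum values using a pairwise (tree) pattern, simulated sequentially."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0.0
    stride = 1
    while stride < n:
        # On a GPU, each iteration of this inner loop would be handled by a
        # separate thread, with a barrier before the stride doubles.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]


def stable_softmax(row):
    """Softmax that subtracts the row maximum before exponentiating."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = tree_reduce_sum(exps)
    return [e / total for e in exps]


total = tree_reduce_sum([1, 2, 3, 4, 5])
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(total, probs)
```

The tree pattern needs about log2(n) passes instead of n sequential additions, which is why it suits hardware that runs many threads at once.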
5

CuPy

A NumPy-compatible array library accelerated by CUDA

CuPy is an open source implementation of a NumPy-compatible multi-dimensional array, accelerated with NVIDIA CUDA. It consists of cupy.ndarray, a core multi-dimensional array class, and many functions that operate on it. CuPy offers GPU-accelerated computing with Python, using CUDA-related libraries to fully utilize the GPU architecture. According to benchmarks, it can speed up some operations by more than 100x. CuPy is highly compatible with NumPy, serving as a drop-in replacement in most cases. ...

Downloads: 1 This Week

Last Update: 2026-06-01
See Project
6

CUDA Containers for Edge AI & Robotics

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

CUDA Containers for Edge AI & Robotics is an open-source project that provides a modular container build system designed for running machine learning and AI workloads on NVIDIA Jetson devices. The repository contains container configurations that package the latest AI frameworks and dependencies optimized for Jetson hardware. These containers simplify the deployment of complex machine learning environments by bundling libraries such as CUDA, TensorRT, and deep learning frameworks into reproducible container images. ...

Downloads: 0 This Week

Last Update: 2026-06-25
See Project
7

AIMr

The best AI Aimbot for Fortnite, Valorant, CS2, R6, COD, Apex, & more

...AIMr also provides visual customization options like field-of-view displays and detection indicators, allowing players to tailor their interface. The system is compatible with games that use human-shaped models, and although it functions effectively out of the box, optimizing it with CUDA-accelerated OpenCV is recommended for maximum performance.

2 Reviews

Downloads: 385 This Week

Last Update: 2025-08-31
See Project
8

FlashKDA

High-performance Kimi Delta Attention kernels

...Builds can target the detected GPU architecture or multiple supported architectures for wheels and CI pipelines. The repository also provides correctness tests, benchmark material, a direct Python kernel API, and development helpers for CUDA and C++ tooling.

Downloads: 5 This Week

Last Update: 4 days ago
See Project
9

GPU Puzzles

Solve puzzles. Learn CUDA

...Instead of presenting traditional lecture-style explanations, the project immerses learners directly in hands-on programming tasks that demonstrate how GPU computation works. The exercises are implemented using Python with the Numba CUDA interface, which allows Python code to compile into GPU kernels that run on CUDA-enabled hardware. By solving progressively more complex puzzles, learners gain a practical understanding of how parallel algorithms operate on graphics processing units. The project emphasizes experimentation and problem solving, encouraging learners to discover GPU programming techniques through trial and exploration. ...

Downloads: 0 This Week

Last Update: 2026-03-10
See Project
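
The GPU Puzzles entry above describes Python functions compiled into kernels that run once per GPU thread. The indexing idea behind such kernels can be tried without a GPU. The sketch below is a plain-Python simulation and does not use the Numba API. `launch` and `add_ten_kernel` are hypothetical names, and the threads run one after another rather than in parallel.

```python
def launch(kernel, blocks, threads_per_block, *args):
    """Call `kernel` once per simulated thread, sequentially on the CPU."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def add_ten_kernel(block_idx, thread_idx, block_dim, out, a, size):
    """Each simulated thread handles at most one element of `a`."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < size:  # guard: the launch may have more threads than elements
        out[i] = a[i] + 10


a = [0, 1, 2, 3, 4]
out = [0] * len(a)
threads_per_block = 4
# Round up so that every element is covered by some thread.
blocks = (len(a) + threads_per_block - 1) // threads_per_block
launch(add_ten_kernel, blocks, threads_per_block, out, a, len(a))
print(out)
```

Two blocks of four threads give eight simulated threads for five elements, so the bounds guard is what keeps the last three threads from writing past the end of the array.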
10

NVIDIA Warp

A Python framework for accelerated simulation, data generation

NVIDIA Warp is a high-performance Python framework developed by NVIDIA for building and accelerating simulation, graphics, and physics-based workloads using GPU computing. It enables developers to write kernel-level code in Python that is automatically compiled into efficient CUDA kernels, combining ease of use with near-native performance. The framework is designed for applications such as robotics, reinforcement learning, physical simulation, and differentiable computing, where performance and flexibility are critical. Warp provides a set of primitives for working with arrays, geometry, and physics operations, allowing users to implement complex simulations without writing low-level CUDA code directly. ...

Downloads: 1 This Week

Last Update: 2026-07-07
See Project
11

vLLM

A high-throughput and memory-efficient inference and serving engine

vLLM is a fast and easy-to-use library for LLM inference and serving. High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more.

Downloads: 73 This Week

Last Update: 7 days ago
See Project
12

Triton

Development repository for the Triton language and compiler

Triton is a programming language and compiler framework specifically designed for writing highly efficient custom deep learning operations, particularly for GPUs. It aims to bridge the gap between low-level GPU programming, such as CUDA, and higher-level abstractions by providing a more productive and flexible environment for developers. Triton enables users to write optimized kernels for machine learning workloads while maintaining readability and control over performance-critical aspects like memory access patterns and parallel execution. The project leverages LLVM and MLIR to compile code into efficient GPU instructions, supporting both NVIDIA and AMD hardware. ...

Downloads: 4 This Week

Last Update: 2026-06-18
See Project
13

VibeTensor

Our first fully AI generated deep learning system

VibeTensor is an open-source research software stack for deep learning that was generated almost entirely by AI coding agents under guided human supervision. It implements a PyTorch-style eager tensor library with a modern C++20 core that supports both CPU and CUDA backends, and it manages tensors, automatic differentiation (autograd), and complex computation flows in a way similar to mainstream frameworks. Every major component, from core libraries and dispatch systems to CUDA runtime support, caching allocators, and language bindings, was created and validated by coding agents using automated builds and tests rather than manual line-by-line human coding. ...

Downloads: 0 This Week

Last Update: 2026-02-06
See Project
14

rembg

Rembg is a tool to remove image backgrounds

Rembg is a powerful tool that utilizes AI (specifically U^2-Net) to automatically remove backgrounds from images, offering a streamlined command-line interface and Docker support. It's ideal for batch processing and integrates smoothly into workflows.

Downloads: 22 This Week

Last Update: 2026-07-18
See Project
15

Koila

Prevent PyTorch's `CUDA error: out of memory` in just 1 line of code

...The system acts as a thin wrapper around PyTorch tensors and operations, meaning that it integrates easily into existing PyTorch code without requiring major changes to model implementations. It is particularly useful in environments where GPU resources are limited or where models frequently encounter CUDA memory errors.

Downloads: 0 This Week

Last Update: 2026-07-09
See Project
16

Chatterbox TTS Server

Self-host the powerful Chatterbox TTS model

...It also includes OpenAI-compatible API behavior, which helps developers connect it to existing tools that already expect that style of endpoint. The server can run on NVIDIA CUDA, AMD ROCm, or CPU, giving it flexibility across different hardware setups. Its main value is packaging a powerful TTS workflow into a practical service that can be accessed through a browser or integrated into other software.

Downloads: 4 This Week

Last Update: 2026-06-08
See Project
17

Dream Textures

Stable Diffusion built-in to Blender

...Outpaint to increase the size of an image by extending it in any direction. Perform style transfer and create novel animations with Stable Diffusion as a post processing step. Dream Textures has been tested with CUDA and Apple Silicon GPUs. Over 4GB of VRAM is recommended.

Downloads: 16 This Week

Last Update: 2024-08-26
See Project
18

Stable Diffusion Version 2

High-Resolution Image Synthesis with Latent Diffusion Models

...The project sits within a larger ecosystem of Stability AI repositories (including inference-only reference implementations like SD3.5 and web UI projects), and the README points users toward compatible components and recommended CUDA/PyTorch versions.

Downloads: 5 This Week

Last Update: 2025-10-02
See Project
19

Nano-vLLM

A lightweight vLLM implementation built from scratch

...The project recreates the core functionality of vLLM in a simplified architecture written in approximately a thousand lines of Python, making it easier for developers and researchers to understand how modern LLM inference systems work. Despite its compact design, nano-vllm incorporates advanced optimization techniques such as prefix caching, tensor parallelism, and CUDA graph execution to achieve high performance during model inference. The engine is intended primarily for educational use, experimentation, and lightweight deployments where a full production-grade inference stack may be unnecessary. Its API closely mirrors that of the original vLLM framework, allowing developers familiar with vLLM to adopt the tool with minimal changes.

Downloads: 2 This Week

Last Update: 2026-04-26
See Project
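
The Nano-vLLM entry above lists prefix caching among its optimizations. The general idea is to reuse work for requests that share a leading run of tokens. The sketch below shows one common way to do this and is not taken from the nano-vllm source: tokens are grouped into fixed-size blocks, and each block is keyed by a hash chained to the blocks before it. `BLOCK_SIZE`, `block_keys`, and `PrefixCache` are hypothetical names chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # assumed block size, chosen only for this illustration


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens."""
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Chaining in the parent hash makes a key depend on the whole prefix.
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained hash -> cache block id

    def lookup_and_insert(self, token_ids):
        """Return (reused_blocks, new_blocks) for one request."""
        reused = new = 0
        for key in block_keys(token_ids):
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = len(self.blocks)
                new += 1
        return reused, new


cache = PrefixCache()
first = cache.lookup_and_insert([1, 2, 3, 4, 5, 6, 7, 8, 9])
second = cache.lookup_and_insert([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13])
print(first, second)
```

Because each key covers the entire prefix, a block with identical tokens but a different history gets a different key and is never reused by mistake.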
20

autoresearch-mlx

Apple Silicon (MLX) port of Karpathy's autoresearch

autoresearch-mlx is an Apple Silicon–optimized implementation of the autoresearch framework that enables autonomous AI research loops to run natively on MLX without requiring PyTorch or CUDA dependencies. It maintains the core autoresearch structure, where an AI agent iteratively edits a training script, executes experiments under a fixed time budget, and evaluates results based on a defined metric such as validation bits per byte. The system is tailored for Apple hardware, leveraging unified memory and MLX capabilities to achieve efficient training on Mac devices. ...

Downloads: 0 This Week

Last Update: 2026-07-02
See Project
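
The autoresearch-mlx entry above mentions validation bits per byte as an example metric. The listing does not show how the project computes it, so the sketch below uses the usual definition: the summed token-level negative log-likelihood, converted from nats to bits, divided by the number of UTF-8 bytes in the evaluated text. The function name `bits_per_byte` and its arguments are hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert summed per-token loss (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Eight tokens, each costing exactly one bit, spread over a four-byte string.
example = bits_per_byte([math.log(2)] * 8, "abcd")
print(example)
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers, since a tokenizer that produces fewer tokens does not change the byte count of the text.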
21

Bonsai 27B

Run Bonsai (1-bit) and Ternary-Bonsai language models locally

...It supports the 1-bit Bonsai and higher-quality Ternary-Bonsai families in 1.7B, 4B, 8B, and 27B sizes. The models can run on macOS, Linux, and Windows through CPU, Metal, CUDA, Vulkan, ROCm, llama.cpp, or MLX backends. Its 27B models process text, images, screenshots, and PDFs while supporting reasoning and long-context conversations. They also provide OpenAI-compatible tool calling and optional MCP server integration for agentic workflows. Automated setup scripts install dependencies, download models, obtain binaries, and configure interactive interfaces. ...

Downloads: 109 This Week

Last Update: 4 days ago
See Project
22

Nexa SDK

Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML

...Additionally, it offers an OpenAI-compatible API server with JSON schema mode for function calling and streaming support, and a user-friendly Streamlit UI. Users can run Nexa SDK on any device with a Python environment, and GPU acceleration is supported, including CUDA, Metal, and ROCm. An executable version is also available.

Downloads: 20 This Week

Last Update: 2 days ago
See Project
23

PyTorch Geometric

Geometric deep learning extension library for PyTorch

...We have outsourced a lot of PyTorch Geometric's functionality to other packages, which need to be installed separately. These packages come with their own CPU and GPU kernel implementations based on C++/CUDA extensions. We do not recommend installing as the root user on your system Python. Please set up an Anaconda/Miniconda environment or create a Docker image. We provide pip wheels for all major OS/PyTorch/CUDA combinations.

Downloads: 1 This Week

Last Update: 2026-06-05
See Project
24

Cog

Package and deploy machine learning models using Docker containers

...Developers can define the runtime environment, dependencies, and Python versions required for their models, allowing Cog to build a consistent container environment that follows best practices. Cog also resolves compatibility issues between frameworks and GPU libraries by automatically selecting compatible combinations of CUDA, cuDNN, and machine learning frameworks such as PyTorch or TensorFlow. Cog automatically generates a RESTful HTTP API for running predictions, enabling models to be accessed programmatically through a built-in prediction server.

Downloads: 0 This Week

Last Update: 2026-06-17
See Project
25

Jittor

Jittor is a high-performance deep learning framework

...The front-end uses module design and dynamic graph execution, the most popular design for deep learning framework interfaces. The back-end is implemented in high-performance languages such as CUDA and C++. Jittor's ops are similar to NumPy's. Let's try some operations. We create Var a and b via the operation jt.float32 and add them. Printing those variables shows they have the same shape and dtype.

Downloads: 1 This Week

Last Update: 2025-07-28
See Project