Showing 51 open source projects for "gpu speed"

View related business solutions
  • Build Agents and Models on One Platform Icon
    Build Agents and Models on One Platform

    Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

    Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.
    Try It Free
  • AI-powered service management for IT and enterprise teams Icon
    AI-powered service management for IT and enterprise teams

    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity. Maximize operational efficiency with refreshingly simple, AI-powered Freshservice.
    Try it Free
  • 1
    ort

    ort

    Fast ML inference & training for ONNX models in Rust

    ort is a high-performance Rust library that provides bindings to ONNX Runtime, enabling developers to run machine learning inference and training workflows directly within Rust applications using the standardized ONNX model format. It is designed to bridge the gap between modern machine learning frameworks and systems programming by offering a safe, ergonomic API for executing models originally built in ecosystems like PyTorch, TensorFlow, or scikit-learn. The library emphasizes speed and...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    CUDA Agent

    CUDA Agent

    Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

    CUDA Agent is a research-driven agentic reinforcement learning system designed to automatically generate and optimize high-performance CUDA kernels for GPU workloads. The project addresses the long-standing challenge that efficient CUDA programming typically requires deep hardware expertise by training an autonomous coding agent capable of iterative improvement through execution feedback. Its architecture combines large-scale data synthesis, a skill-augmented CUDA development environment,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    nanochat

    nanochat

    The best ChatGPT that $100 can buy

    ...The repository stitches together every stage of the lifecycle: tokenizer training, pretraining a Transformer on a large web corpus, mid-training on dialogue and multiple-choice tasks, supervised fine-tuning, optional reinforcement learning for alignment, and finally efficient inference with caching. Its north star is approachability and speed: you can boot a fresh GPU box and drive the whole pipeline via a single script, producing a usable chat model in hours and a clear markdown report of what happened. The code is written to be read—concise training loops, transparent configs, and minimal wrappers—so you can audit each step, tweak it, and rerun without getting lost in framework indirection.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    DeepSpeed

    DeepSpeed

    Deep learning optimization library: makes distributed training easy

    DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can: 1. Train/Inference dense or sparse models with billions or trillions of parameters 2. Achieve excellent system throughput and efficiently scale to thousands of GPUs 3. Train/Inference on resource constrained GPU systems 4. Achieve unprecedented low latency and high throughput for inference 5. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 5
    Mooncake

    Mooncake

    Mooncake is the serving platform for Kimi

    ...Its architecture centers on a high-performance transfer engine that provides unified data transfer across different storage and networking technologies. This engine enables efficient movement of tensors and model data across heterogeneous environments such as GPU memory, system memory, and distributed storage systems. Mooncake also introduces distributed key-value cache storage that allows inference systems to reuse previously computed attention states, significantly improving throughput in large-scale deployments. The system supports advanced networking technologies such as RDMA and NVMe over Fabric, enabling high-speed communication across clusters.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    MegEngine

    MegEngine

    Easy-to-use deep learning framework with 3 key features

    ...You can represent quantization/dynamic shape/image pre-processing and even derivation in one model. After training, just put everything into your model and inference it on any platform at ease. Speed and precision problems won't bother you anymore due to the same core inside. In training, GPU memory usage could go down to one-third at the cost of only one additional line, which enables the DTR algorithm. Gain the lowest memory usage when inferencing a model by leveraging our unique pushdown memory planner. NOTE: MegEngine now supports Python installation on Linux-64bit/Windows-64bit/MacOS(CPU-Only)-10.14+/Android 7+(CPU-Only) platforms with Python from 3.5 to 3.8. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 7
    AnyTXT Searcher

    AnyTXT Searcher

    A Powerful Desktop Full-Text Search Engine, Just Like Local Google.

    AnyTXT Searcher is a powerful file full-text search engine, a desktop search application for fast document retrieval. Just like a local disk Google search engine, much faster than Windows Search, it is your ideal desktop file content full-text search engine. It has a powerful document parsing engine built in, which extracts the text of commonly used file formats without installing any other software, and combines the built-in high-speed indexing system to store the metadata of the...
    Leader badge
    Downloads: 9,174 This Week
    Last Update:
    See Project
  • 8
    Bandicoot

    Bandicoot

    fast C++ library for GPU linear algebra & scientific computing

    * Fast GPU linear algebra library (matrix maths) for the C++ language, aiming towards a good balance between speed and ease of use * Provides high-level syntax and functionality deliberately similar to Matlab * Provides an API that is aiming to be compatible with Armadillo for easy transition between CPU and GPU linear algebra code * Useful for algorithm development directly in C++, or quick conversion of research code into production environments * Distributed under the permissive Apache 2.0 license, useful for both open-source and proprietary (closed-source) software * Can be used for machine learning, pattern recognition, computer vision, signal processing, bioinformatics, statistics, finance, etc * Downloads: http://coot.sourceforge.io/download.html * Documentation: http://coot.sourceforge.io/docs.html * Bug reports: http://coot.sourceforge.io/faq.html * Git repo: https://gitlab.com/conradsnicta/bandicoot-code
    Downloads: 4 This Week
    Last Update:
    See Project
  • 9
    gpu_poor

    gpu_poor

    Calculate token/s & GPU memory requirement for any LLM

    gpu_poor is an open-source tool designed to help developers determine whether their hardware is capable of running a specific large language model and to estimate the performance they can expect from it. The project focuses on calculating GPU memory requirements and predicted inference speed for different models, hardware configurations, and quantization strategies. By analyzing factors such as model size, context length, batch size, and GPU specifications, the system estimates how much VRAM will be required and how fast tokens can be generated during inference. The tool also provides a detailed breakdown of where GPU memory is allocated, including model weights, KV cache, activations, and other runtime overhead. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • $300 Free Credits for Your Google Cloud Projects Icon
    $300 Free Credits for Your Google Cloud Projects

    Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

    Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.
    Start Free Trial
  • 10
    OptiMate

    OptiMate

    Libraries for optimizing AI models, inference speed, and GPU usage

    ...Another component, Nos, targets infrastructure optimization by improving GPU utilization in Kubernetes clusters through dynamic partitioning and elastic resource quotas.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    Point-E

    Point-E

    Point cloud diffusion for 3D model synthesis

    point-e is the official repository for Point-E, a generative model developed by OpenAI that produces 3D point clouds from textual (or image) prompts. Its principal advantage is speed: it can generate 3D assets in just 1–2 minutes on a single GPU, which is significantly faster than many competing text-to-3D models. The model works via a two-stage diffusion approach: first, it uses a text → image diffusion network to produce a synthetic 2D view consistent with the prompt; then a second diffusion model converts that image into a 3D point cloud. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    MXNet

    MXNet

    Lightweight, Portable, Flexible Distributed/Mobile Deep Learning

    Apache MXNet is a scalable, efficient open-source deep learning framework—offering a flexible hybrid programming model (symbolic + imperative) and supporting a wide array of languages—designed for training and deploying neural networks across heterogeneous systems. Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    MACE

    MACE

    Deep learning inference framework optimized for mobile platforms

    Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux and Windows devices. Runtime is optimized with NEON, OpenCL and Hexagon, and Winograd algorithm is introduced to speed up convolution operations. The initialization is also optimized to be faster. Chip-dependent power options like big.LITTLE scheduling, Adreno GPU hints are included as advanced APIs. UI responsiveness guarantee is sometimes obligatory when running a model. Mechanism like automatically breaking OpenCL kernel into small units is introduced to allow better preemption for the UI rendering task. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    YOLOv4-large

    YOLOv4-large

    Scaled-YOLOv4: Scaling Cross Stage Partial Network

    YOLOv4-large is an open-source implementation of the Scaled-YOLOv4 object detection architecture, designed to improve both the accuracy and scalability of real-time computer vision models. The project provides a PyTorch implementation of the Scaled-YOLOv4 framework, which extends the original YOLOv4 architecture using Cross Stage Partial (CSP) networks and new scaling techniques. Unlike earlier object detection systems that only scale depth or width, this architecture scales multiple aspects...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    Minkowski Engine

    Minkowski Engine

    Auto-diff neural network library for high-dimensional sparse tensors

    The Minkowski Engine is an auto-differentiation library for sparse tensors. It supports all standard neural network layers such as convolution, pooling, unspooling, and broadcasting operations for sparse tensors. The Minkowski Engine supports various functions that can be built on a sparse tensor. We list a few popular network architectures and applications here. To run the examples, please install the package and run the command in the package root directory. Compressing a neural network to...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    HiFi-GAN

    HiFi-GAN

    Generative Adversarial Networks for Efficient and High Fidelity Speech

    ...It introduces a generator architecture tailored to model the periodic structure of speech and a set of discriminators that focus on different scales and periods of the waveform to better capture naturalness. The model targets a sweet spot between sample quality and generation speed, outperforming many previous GAN vocoders while being far faster than typical autoregressive models. In experiments on LJSpeech, HiFi-GAN was shown to achieve mean opinion scores close to human recordings while synthesizing 22.05 kHz audio up to ~168× faster than real time on an NVIDIA V100 GPU. A smaller configuration trades a bit of quality for even higher speed and can run more than 13× faster than real time on CPU, making it suitable for deployment scenarios without powerful GPUs.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    TFLearn

    TFLearn

    Deep learning library featuring a higher-level API for TensorFlow

    ...Powerful helper functions to train any TensorFlow graph, with support of multiple inputs, outputs, and optimizers. Easy and beautiful graph visualization, with details about weights, gradients, activations, and more. Effortless device placement for using multiple CPU/GPU. The high-level API currently supports the most of the recent deep learning models, such as Convolutions, LSTM, BiRNN, BatchNorm, etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    BytePS

    BytePS

    A high performance and generic framework for distributed DNN training

    ...For example, on BERT-large training, BytePS can achieve ~90% scaling efficiency with 256 GPUs (see below), which is much higher than Horovod+NCCL. In certain scenarios, BytePS can double the training speed compared with Horovod+NCCL. We show our experiment on BERT-large training, which is based on GluonNLP toolkit. The model uses mixed precision. We use Tesla V100 32GB GPUs and set batch size equal to 64 per GPU. Each machine has 8 V100 GPUs (32GB memory) with NVLink-enabled. Machines are inter-connected with 100 Gbps RDMA network. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    maskrcnn-benchmark

    maskrcnn-benchmark

    Fast, modular reference implementation of Instance Segmentation

    ...The framework integrates critical components—region proposal networks (RPNs), RoIAlign layers, mask heads, and backbone architectures such as ResNet and FPN—optimized for both accuracy and speed. It supports multi-GPU distributed training, mixed precision, and custom data loaders for new datasets. Built as a reference implementation, it became a foundation for the next-generation Detectron2, yet remains widely used for research needing a stable, reproducible environment. Visualization tools, model zoo checkpoints, and benchmark scripts make it easy to replicate state-of-the-art results or fine-tune models for custom tasks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Tensorpack

    Tensorpack

    A Neural Net Training Interface on TensorFlow, with focus on speed

    Tensorpack is a neural network training interface based on TensorFlow v1. Uses TensorFlow in the efficient way with no extra overhead. On common CNNs, it runs training 1.2~5x faster than the equivalent Keras code. Your training can probably gets faster if written with Tensorpack. Scalable data-parallel multi-GPU / distributed training strategy is off-the-shelf to use. Squeeze the best data loading performance of Python with tensorpack.dataflow. Symbolic programming (e.g. tf.data) does not...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    OpenSeq2Seq

    OpenSeq2Seq

    Toolkit for efficient experimentation with Speech Recognition

    OpenSeq2Seq is a TensorFlow-based toolkit for efficient experimentation with sequence-to-sequence models across speech and NLP tasks. Its core goal is to give researchers a flexible, modular framework for building and training encoder–decoder architectures while fully leveraging distributed and mixed-precision training. The toolkit includes ready-made models for neural machine translation, automatic speech recognition, speech synthesis, language modeling, and additional NLP tasks such as...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Caffe2

    Caffe2

    Caffe2 is a lightweight, modular, and scalable deep learning framework

    Caffe2 is a lightweight, modular, and scalable deep learning framework. Building on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind. Caffe2 is a deep learning framework that provides an easy and straightforward way for you to experiment with deep learning and leverage community contributions of new models and algorithms. You can bring your creations to scale using the power of GPUs in the cloud or to the masses on mobile with Caffe2’s cross-platform...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Caffe

    Caffe

    A fast open framework for deep learning

    Caffe is an open source deep learning framework that’s focused on expression, speed and modularity. It’s got an expressive architecture that encourages application and innovation, and extensible code that’s great for active development. Caffe also offers great speed, capable of processing over 60M images per day with a single NVIDIA K40 GPU. It’s arguably one of the fastest convnet implementations around. Caffe is developed by the Berkeley AI Research (BAIR)/The Berkeley Vision and Learning Center (BVLC) and a great community of contributors that continue to make Caffe state-of-the-art in both code and models. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    gpt-oss-120b

    gpt-oss-120b

    OpenAI’s open-weight 120B model optimized for reasoning and tooling

    GPT-OSS-120B is a powerful open-weight language model by OpenAI, optimized for high-level reasoning, tool use, and agentic tasks. With 117B total parameters and 5.1B active parameters, it’s designed to fit on a single H100 GPU using native MXFP4 quantization. The model supports fine-tuning, chain-of-thought reasoning, and structured outputs, making it ideal for complex workflows. It operates in OpenAI’s Harmony response format and can be deployed via Transformers, vLLM, Ollama, LM Studio, and PyTorch. Developers can control the reasoning level (low, medium, high) to balance speed and depth depending on the task. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    DiffusionGemma

    DiffusionGemma

    NVFP4 DiffusionGemma model for fast multimodal text generation

    ...Built on the Gemma 4 26B A4B Mixture-of-Experts architecture, it has 25.2B total parameters and 3.8B active parameters, balancing capability with efficient inference. Its diffusion-based generation produces tokens in parallel 256-token blocks, enabling very high-speed output, with reported generation above 1,100 tokens per second on NVIDIA Hopper H100 in FP8. The model supports a 256K-token context window, configurable thinking mode, native function calling, structured JSON output, and multilingual inference across 35+ languages. The NVFP4 quantization reduces weights and activations from 16-bit to 4-bit, lowering disk size and GPU memory needs for vLLM deployment.
    Downloads: 0 This Week
    Last Update:
    See Project
Auth0 Logo