A library for accelerating Transformer models on NVIDIA GPUs
A high-throughput and memory-efficient inference and serving engine
950-line, minimal, extensible LLM inference engine built from scratch
TokenSpeed is a speed-of-light LLM inference engine
A lightweight vLLM implementation built from scratch
High-performance inference framework for large language models
Code for running inference and finetuning with the SAM 3 model
Pruna is a model optimization framework built for developers
Offline inference engine for art, real-time voice conversations
Low-latency AI inference engine optimized for mobile devices
RGBD video generation model conditioned on camera input
Universal LLM Deployment Engine with ML Compilation
LightLLM is a Python-based LLM (Large Language Model) inference
Parallax is a distributed model serving framework
Tensor search for humans
Inference Llama 2 in one file of pure C
Multi-Agent daTa geneRation Infra and eXperimentation framework
Effortless data labeling with AI support from Segment Anything
Enables the best performance on NVIDIA RTX Graphics Cards
Superduper: Integrate AI models and machine learning workflows
Supercharge Your LLM with the Fastest KV Cache Layer
Running large language models on a single GPU
Toolbox of models, callbacks, and datasets for AI/ML researchers
A Strong and Easy-to-use Single View 3D Hand+Body Pose Estimator
Real-Time State-of-the-art Speech Synthesis for TensorFlow 2