How to optimize an algorithm in CUDA
OpenLIT is an open-source LLM observability tool
Fast and memory-efficient exact attention
AI agents running research on single-GPU nanochat training
GPU accelerated decision optimization
Performance-optimized AI inference on your GPUs
Python inference and LoRA trainer package for the LTX-2 audio–video model
Supercharge Your LLM with the Fastest KV Cache Layer
The Modular Platform (includes MAX & Mojo)
Large Language Model Text Generation Inference
Unified KV Cache Compression Methods for Auto-Regressive Models
An opinionated CLI to transcribe audio files with Whisper on-device
Bridging Reasoning and Action Prediction
Pruna is a model optimization framework built for developers
State-of-the-art Parameter-Efficient Fine-Tuning
Faster Whisper transcription with CTranslate2
Easily compute CLIP embeddings and build a CLIP retrieval system
Making large AI models cheaper, faster, and more accessible
RL implementations
Lemonade helps users run local LLMs with the highest performance
Low-latency REST API for serving text embeddings
ChatGLM-6B: An Open Bilingual Dialogue Language Model
A Python package for extending the official PyTorch
YOLOv5 is the world's most loved vision AI
Run Local LLMs on Any Device. Open-source