Bridging Reasoning and Action Prediction
Faster Whisper transcription with CTranslate2
Pruna is a model optimization framework built for developers
State-of-the-art Parameter-Efficient Fine-Tuning
Easily compute CLIP embeddings and build a CLIP retrieval system
C++ library for high-performance inference on NVIDIA GPUs
Distributed AI Model Training and LLM Fine-Tuning on Kubernetes
Low-latency REST API for serving text-embeddings
FlashMLA: Efficient Multi-head Latent Attention Kernels
Lemonade helps users run local LLMs with the highest performance
Making large AI models cheaper, faster and more accessible
A GPU-accelerated library containing highly optimized building blocks
ChatGLM-6B: An Open Bilingual Dialogue Language Model
Reinforcement learning (RL) implementations
Voice recognition (speech-to-text) tool
Lightweight, standalone C++ inference engine for Google's Gemma models
Run Local LLMs on Any Device. Open-source
A Python package for extending the official PyTorch framework
3D reconstruction software
Unified web UI for training and running open models locally
A nearly-live implementation of OpenAI's Whisper
The official repo of the Qwen chat and pretrained large language models
Run AI models locally on your machine with Node.js bindings for llama
Elegant and Performant Deep Learning
Large-Scale Agentic RL for High-Performance CUDA Kernel Generation