Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
How to optimize some algorithms in CUDA
Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
Solve puzzles. Learn CUDA
Please do not feed the models
Our first fully AI-generated deep learning system
Prevent PyTorch's `CUDA error: out of memory` in just 1 line of code
Clean and efficient FP8 GEMM kernels with fine-grained scaling
Fast LLM speculative inference server for consumer hardware
A lightweight vLLM implementation built from scratch
User-friendly AI Interface
Interface for OuteTTS models
Universal LLM Deployment Engine with ML Compilation
C++ inference library for multiple SVC/TTS
An experimental version of the DeepSeek model
Instant neural graphics primitives: lightning fast NeRF and more
4M: Massively Multimodal Masked Modeling
Synchronized Translation for Videos
A Conversational Speech Generation Model
Fast C++ library for GPU linear algebra & scientific computing
Serving multiple LoRA fine-tuned LLMs as one
FAIR's research platform for object detection research
Rust language bindings for TensorFlow
Code for the paper Fine-Tuning Language Models from Human Preferences
Transformer-related optimization, including BERT and GPT