FastKoko is a self-hosted text-to-speech server built around the Kokoro-82M model and exposed through a FastAPI backend. It is designed for easy deployment via Docker, with separate CPU and GPU images so users can choose between pure CPU inference and NVIDIA GPU acceleration. The project exposes an OpenAI-compatible speech endpoint, so existing code that talks to the OpenAI audio API can often be pointed at a Kokoro-FastAPI instance with minimal changes. It supports multiple languages and voicepacks and allows phoneme-based generation for more accurate pronunciation and prosody. The server also produces per-word timestamped captions, which makes it useful for creating subtitles or aligning audio with text. A built-in web UI, interactive API documentation, and debug endpoints for monitoring system status help users explore voices, test requests, and integrate the service into larger systems.
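Because the endpoint mirrors the OpenAI audio API, a plain HTTP client is enough to drive it. The sketch below builds such a request with only the standard library; the model name (`kokoro`), voice name (`af_bella`), port `8880`, and the `/v1/audio/speech` path are assumptions drawn from the OpenAI API shape, so check your own deployment's API docs for the exact values.

```python
import json
import urllib.request

def build_speech_request(base_url: str, text: str,
                         voice: str = "af_bella", fmt: str = "mp3"):
    """Build an OpenAI-style POST to /v1/audio/speech on a local FastKoko server.

    The model name, default voice, and endpoint path are assumptions;
    verify them against your instance's interactive API docs.
    """
    payload = {
        "model": "kokoro",          # assumed model identifier
        "input": text,
        "voice": voice,
        "response_format": fmt,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually synthesize audio (requires a running server):
#   req = build_speech_request("http://localhost:8880", "Hello from FastKoko")
#   with urllib.request.urlopen(req) as resp:
#       open("hello.mp3", "wb").write(resp.read())
```

Existing OpenAI SDK clients can typically be reused the same way by overriding their base URL to point at the local server instead of the cloud endpoint.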
## Features
- OpenAI-compatible speech API for drop-in replacement of cloud TTS
- Docker images for both CPU and NVIDIA GPU, with multi-arch support
- Web UI and interactive API docs hosted on the same FastAPI server
- Multi-language voicepacks with phoneme-based synthesis and captions
- Voice mixing with weighted combinations and reusable custom voicepacks
- Simple quick-start scripts and Docker Compose setups for local hosting
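The weighted voice mixing above can be sketched as a request payload. The `"name(weight)+name(weight)"` voice string below, the voice names, and the `kokoro` model name are assumptions about how the combination feature is expressed over the API; confirm the exact syntax against your server before relying on it.

```python
import json

def mix_voices(weights: dict) -> str:
    """Format a weighted voice-mix string, e.g. "af_bella(2)+af_sky(1)".

    The "name(weight)" syntax is an assumption based on FastKoko's
    voice-combination feature, not a documented guarantee.
    """
    return "+".join(f"{name}({weight})" for name, weight in weights.items())

def speech_payload(text: str, weights: dict) -> bytes:
    """JSON body for /v1/audio/speech using a mixed voice."""
    return json.dumps({
        "model": "kokoro",              # assumed model identifier
        "input": text,
        "voice": mix_voices(weights),   # e.g. "af_bella(2)+af_sky(1)"
        "response_format": "wav",
    }).encode("utf-8")
```

A mix that works well can then be saved as a reusable custom voicepack rather than re-specified on every request.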