Bailing is an open-source voice-dialogue assistant designed to deliver natural voice-based conversations by combining automatic speech recognition (ASR), voice activity detection (VAD), a large language model (LLM), and text-to-speech (TTS) in a single pipeline. Its goal is to offer a “voice-first” chat experience similar to what one might expect from a system like GPT-4o, but fully open and deployable by users. The project is modular: each core function — ASR, VAD, LLM, TTS — exists as a separately replaceable component, which allows flexibility in picking your preferred models depending on resources or languages. It aims to be light enough to run without a GPU, making it usable on modest hardware or edge devices, while still maintaining low latency and smooth interaction. Bailing includes a memory system, giving the assistant the ability to remember user preferences and context across sessions, which enables more personalized and context-aware conversations.
Features
- Full voice-dialogue stack combining ASR, VAD, LLM, and TTS for natural spoken conversation
- Modular architecture allowing substitution of any component (e.g. different ASR or TTS)
- Memory and context tracking to remember user preferences and past interactions
- Tool invocation support for actions, reminders, information retrieval via voice commands
- Designed to run without GPU — suitable for low-resource machines or edge deployment
- Low end-to-end latency (≈ 800 ms) aimed at conversational responsiveness