Official inference repo for FLUX.1 models
Audiocraft is a library for audio processing and generation
Automatic Speech Recognition with Word-level Timestamps
Library for OCR-related tasks powered by Deep Learning
Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model
Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML
State-of-the-art TTS model under 25MB
CLIP: predict the most relevant text snippet given an image
Python library and CLI tool to interface with Google Translate
Easy-to-use and powerful NLP library with an awesome model zoo
A TTS that fits in your CPU (and pocket)
Framework for building realtime multimodal voice AI agent apps
Official MiniMax Model Context Protocol (MCP) server
A fast TTS architecture with conditional flow matching
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Generating Immersive, Explorable, and Interactive 3D Worlds
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Clone a voice in 5 seconds to generate arbitrary speech in real-time
Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles
Voice Recognition to Text Tool
An open-source toolkit for monitoring Large Language Models (LLMs)
Accurate × Fast × Comprehensive
Qwen-Image is a powerful image generation foundation model
The simplest, fastest repository for training/finetuning models
Implementation of Phenaki Video, which uses MaskGIT