Streaming Real-time Audio-Driven Avatar Generation
Capable of understanding text, audio, vision, video
The Triton Inference Server provides an optimized cloud
Document Image Parsing via Heterogeneous Anchor Prompting
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Dumb downloader that scrapes the web
Open-Source Low-Latency Accelerated Linux WebRTC HTML5 Remote Desktop
Easy-to-use Speech Toolkit including Self-Supervised Learning model
Video editing with Python
Swing Music is a beautiful, self-hosted music player
HunyuanVideo: A Systematic Framework For Large Video Generation Model
Free, high-quality text-to-speech API endpoint to replace OpenAI
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
WhatsApp MCP server enabling AI access to chats and messaging
Taming Stable Diffusion for Lip Sync
A lightweight text-to-speech model with zero-shot voice cloning
Build Vision Agents quickly with any model or video provider
Large Multimodal Models for Video Understanding and Editing
Automated Music Discovery and Collection Manager
Generate high-definition story short videos with one click using AI
Official MiniMax Model Context Protocol (MCP) server
Code and models for ICML 2024 paper, NExT-GPT
Python inference and LoRA trainer package for the LTX-2 audio–video
A TTS model capable of generating ultra-realistic dialogue
A feature-packed DJ console and internet radio client for Linux users