Capable of understanding text, audio, vision, video
The Triton Inference Server provides an optimized cloud and edge inferencing solution
Document Image Parsing via Heterogeneous Anchor Prompting
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Easy-to-use Speech Toolkit including Self-Supervised Learning model
HunyuanVideo: A Systematic Framework For Large Video Generation Model
Free, high-quality text-to-speech API endpoint to replace OpenAI
Taming Stable Diffusion for Lip Sync
WhatsApp MCP server enabling AI access to chats and messaging
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
A lightweight text-to-speech model with zero-shot voice cloning
Build Vision Agents quickly with any model or video provider
Large Multimodal Models for Video Understanding and Editing
Generate high-definition story short videos with one click using AI
Official MiniMax Model Context Protocol (MCP) server
Code and models for ICML 2024 paper, NExT-GPT
Python inference and LoRA trainer package for the LTX-2 audio-video model
A TTS model capable of generating ultra-realistic dialogue
Voice Recognition to Text Tool
The data structure for multimodal data
Unofficial Python API and agentic skill for Google NotebookLM
The most powerful and modular diffusion model GUI, API, and backend
Instant voice cloning by MIT and MyShell. Audio foundation model
Build cross-modal and multimodal applications on the cloud
Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model
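Several of the entries above (the free OpenAI-replacement TTS endpoint and the Kokoro-82M FastAPI wrapper) expose OpenAI-compatible speech APIs, so a client only needs to point a standard `/v1/audio/speech` request at a different base URL. A minimal sketch, assuming a self-hosted server at `http://localhost:8880` with a model named `kokoro` and voice `af_bella` (all deployment-specific assumptions; check your server's docs):

```python
import json
from urllib import request

# Hypothetical base URL of a self-hosted, OpenAI-compatible TTS server
# (e.g. a Kokoro-82M FastAPI wrapper); adjust host/port for your deployment.
BASE_URL = "http://localhost:8880/v1"

def build_speech_request(text, voice="af_bella", fmt="mp3"):
    """Build an OpenAI-style /audio/speech POST request (not yet sent)."""
    payload = {
        "model": "kokoro",       # model name is server-specific (assumption)
        "input": text,           # the text to synthesize
        "voice": voice,          # available voice IDs vary by server
        "response_format": fmt,  # e.g. "mp3" or "wav"
    }
    req = request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return req, payload

# Sending with request.urlopen(req) would return raw audio bytes;
# here we only construct the request so the sketch runs offline.
req, payload = build_speech_request("Hello, world.")
print(payload["input"])
```

Because the request shape matches OpenAI's speech endpoint, existing OpenAI client code can usually be reused by overriding only the base URL, which is the point of these drop-in replacements.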