State-of-the-art (SoTA) text-to-video pre-trained model
Capable of understanding text, audio, vision, and video
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Generate short videos with one click using an AI LLM
RGBD video generation model conditioned on camera input
Document Image Parsing via Heterogeneous Anchor Prompting
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Python inference and LoRA trainer package for the LTX-2 audio–video model
Tokenizer-Free TTS for Multilingual Speech Generation
A Python tool that uses GPT-4, FFmpeg, and OpenCV
GPT-4V-level open-source multi-modal model based on Llama3-8B
Taming Stable Diffusion for Lip Sync
Large Audio Language Model built for natural interactions
Multimodal-Driven Architecture for Customized Video Generation
Topic Modelling for Humans
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
Build Vision Agents quickly with any model or video provider
ComfyUI wrapper nodes for HunyuanVideo
Code and models for the ICML 2024 paper NExT-GPT
Python Stream Processing
NVR with real-time local object detection for IP cameras
A web UI for easy subtitle generation using the Whisper model
Recovering the Visual Space from Any Views
Free, high-quality text-to-speech API endpoint to replace OpenAI
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning