HunyuanVideo: A Systematic Framework For Large Video Generative Models
AI framework for automated short video creation and editing tools
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Code for running inference and finetuning with SAM 3 model
Code and models for ICML 2024 paper, NExT-GPT
Multimodal embedding and reranking models built on Qwen3-VL
Label Studio is a multi-type data labeling and annotation tool
Build multimodal language agents for fast prototype and production
Use Microsoft Edge's online text-to-speech service from Python
Data infrastructure for multimodal AI workloads
Qwen2.5-VL is the multimodal large language model series
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
A web UI for easy subtitle generation using the Whisper model
Public opinion analysis system
Uses Qwen3-ASR, a local LLM, Whisper, and TEN-VAD
Generating Immersive, Explorable, and Interactive 3D Worlds
Multimodal Diffusion with Representation Alignment
OCR expert VLM powered by Hunyuan's native multimodal architecture
A Pioneering Open-Source Alternative to GPT-4o
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Automatically translates the text of a video based on a subtitle file
AI Slack bot for reading, summarizing, and chatting with content
Build AI-powered semantic search applications
The data structure for multimodal data
Handles all kinds of unstructured data, e.g. reverse image search