Visual intelligence for your home.
Benchmark LLMs by fighting in Street Fighter 3
VideoRAG: Chat with Your Videos
Uses Qwen3-ASR, local LLM, Whisper, TEN-VAD
Generate short videos with one click using an LLM
Generate blog articles from video or audio
Text- and image-to-video generation: CogVideoX (2024) and CogVideo
All-in-one WebUI for AI generative image and video creation
Search all of YouTube from the command line
Capable of understanding text, audio, vision, and video
GPT4V-level open-source multi-modal model based on Llama3-8B
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Lightweight Python library for adding real-time multi-object tracking
From zero to large language model (LLM) hero
Code and models for the ICML 2024 paper NExT-GPT
Data Infrastructure providing an approach to multimodal AI workloads
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Build multimodal language agents for fast prototype and production
Adversarial Robustness Toolbox (ART) - Python Library for ML security
Official repo for Sa2VA: Marrying SAM2 with LLaVA
Qwen2.5-VL is the multimodal large language model series
Multi-Modal Neural Networks for Semantic Search, based on Mid-Fusion
A lightweight vision library for performing large object detection
The Cradle framework is a first attempt at General Computer Control
A Pioneering Open-Source Alternative to GPT-4o