Tencent Hunyuan Multimodal diffusion transformer (MM-DiT) model
Let's make video diffusion practical
GPT4V-level open-source multi-modal model based on Llama3-8B
An unsupervised and free tool for image and video dataset analysis
Implementation of a U-net complete with efficient attention
Label Studio is a multi-type data labeling and annotation tool
Recovering the Visual Space from Any Views
Generating Immersive, Explorable, and Interactive 3D Worlds
InvokeAI is a leading creative engine for Stable Diffusion models
A general fine-tuning kit geared toward image/video/audio diffusion
The most powerful and modular diffusion model GUI, API, and backend
Sharp Monocular Metric Depth in Less Than a Second
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
Deals with all unstructured data, such as reverse image search
Official repo for "Sa2VA: Marrying SAM2 with LLaVA"
Qwen3-omni is a natively end-to-end, omni-modal LLM
We write your reusable computer vision tools
A Pioneering Open-Source Alternative to GPT-4o
21 Lessons: Get Started Building with Generative AI
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
A Telegram bot that integrates with OpenAI's official ChatGPT APIs
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Convert AI papers to GUI
Code for running inference and finetuning with SAM 3 model
The data structure for multimodal data