VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Reference PyTorch implementation and models for DINOv3
The most powerful Android RPA agent framework
Official implementation of Watermark Anything with Localized Messages
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Python inference and LoRA trainer package for the LTX-2 audio–video
Let's make video diffusion practical
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Label Studio is a multi-type data labeling and annotation tool
Contexts Optical Compression
The library to build & auto-optimize LLM applications
Modular quant framework
Data manipulation and transformation for audio signal processing
An open phone agent model & framework
PyTorch3D is FAIR's library of reusable components for deep learning
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Generate audiobooks from e-books
InvokeAI is a leading creative engine for Stable Diffusion models
Chinese and English multimodal conversational language model
Benchmarking Multimodal Agents for Open-Ended Tasks
OCR expert VLM powered by Hunyuan's native multimodal architecture
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Browse the web, directly from Cursor etc.