VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Reference PyTorch implementation and models for DINOv3
Let's make video diffusion practical
The most powerful Android RPA agent framework
Official implementation of Watermark Anything with Localized Messages
Python inference and LoRA trainer package for the LTX-2 audio–video
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Official Repo For Sa2VA: Marrying SAM2 with LLaVA
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Label Studio is a multi-type data labeling and annotation tool
Contexts Optical Compression
The library to build & auto-optimize LLM applications
Modular quant framework
Data manipulation and transformation for audio signal processing
Generate audiobooks from e-books
PyTorch3D is FAIR's library of reusable components for deep learning
Qwen3-Omni is a natively end-to-end, omni-modal LLM
InvokeAI is a leading creative engine for Stable Diffusion models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Chinese and English multimodal conversational language model
Benchmarking Multimodal Agents for Open-Ended Tasks
An open phone agent model & framework
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
OCR expert VLM powered by Hunyuan's native multimodal architecture
PaddlePaddle End-to-End Development Toolkit