GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Generating Immersive, Explorable, and Interactive 3D Worlds
Weaving the Digital Agent Galaxy
The most powerful Android RPA agent framework
Multimodal Diffusion with Representation Alignment
Automate native Android apps with AI using accessibility APIs
VGGSfM: Visual Geometry Grounded Deep Structure From Motion
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
Wan2.1: Open and Advanced Large-Scale Video Generative Model
Let's make video diffusion practical
Reference PyTorch implementation and models for DINOv3
The library to build & auto-optimize LLM applications
Foundation model for image generation
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
All-in-one AI productivity platform with agents, workflows, and IM
Agent S: an open agentic framework that uses computers like a human
Video Object and Interaction Deletion
Master the fundamentals of machine learning, deep learning
Open-source evaluation toolkit of large multi-modality models (LMMs)
VMZ: Model Zoo for Video Modeling
Gemma open-weight LLM library, from Google DeepMind
Generate audiobooks from e-books
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences
Benchmarking Multimodal Agents for Open-Ended Tasks
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA