Refer and Ground Anything Anywhere at Any Granularity
Self-supervised visual learning using momentum contrast in PyTorch
Automate native Android apps with AI using accessibility APIs
Weaving the Digital Agent Galaxy
Multimodal Diffusion with Representation Alignment
Let's make video diffusion practical
Generating Immersive, Explorable, and Interactive 3D Worlds
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
The library to build & auto-optimize LLM applications
Wan2.1: Open and Advanced Large-Scale Video Generative Model
All-in-one AI productivity platform with agents, workflows, and IM
Taming Stable Diffusion for Lip Sync
Agent S: an open agentic framework that uses computers like a human
Video Object and Interaction Deletion
Master the fundamentals of machine learning, deep learning
Open-source evaluation toolkit of large multi-modality models (LMMs)
The most powerful Android RPA agent framework
Official implementation of Watermark Anything with Localized Messages
Foundation model for image generation
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences
Official Repo For Sa2VA: Marrying SAM2 with LLaVA
Browse the web, directly from Cursor etc.
Phi-3.5 for Mac: Locally-run Vision and Language Models
Extension of Google Research’s PaperBanana
Multimodal Agents as Smartphone Users, an LLM-based multimodal agent