A state-of-the-art open visual language model
Sample code and notebooks for Generative AI on Google Cloud
Marrying Grounding DINO with Segment Anything & Stable Diffusion
Examples and guides for using the Gemini API
A framework to enable multimodal models to operate a computer
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
Run Claude Code, Gemini, Codex in a clean, isolated sandbox
Refer and Ground Anything Anywhere at Any Granularity
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale
Foundational Models for State-of-the-Art Speech and Text Translation
Qwen2.5-VL is the multimodal large language model series
Agent S: an open agentic framework that uses computers like a human
Real-World Centric Foundation GUI Agents
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Large Multimodal Models for Video Understanding and Editing
Chat & pretrained large vision language model
Towards Real-World Vision-Language Understanding
Harness LLMs with Multi-Agent Programming
Moonshot's most powerful AI model
A Python library for extracting structured information
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
LLM framework for document understanding and semantic retrieval
Foundation Models for Time Series
A Pragmatic VLA Foundation Model
Learn Go with test-driven development