Open-AutoGLM is an open-source framework and model for building autonomous mobile assistants: AI agents that understand phone screens and act on them multimodally, combining vision and language capabilities to control real devices. The aim is an “AI phone agent” that perceives on-screen content, reasons about the user's goal, and executes sequences of taps, swipes, and text input through device automation interfaces such as ADB, completing multi-step tasks like navigating apps and filling in forms without manual intervention.

Unlike traditional automation scripts that rely on brittle heuristics, Open-AutoGLM uses pretrained large language and vision-language models to interpret visual context and natural-language instructions, giving the agent robust adaptability across apps and interfaces.
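The control flow behind this can be pictured as perceive → reason → act. The sketch below is illustrative rather than part of the shipped API: `choose_action` is a hypothetical stand-in for whatever model interface the framework exposes, while the ADB commands (`exec-out screencap`, `shell input tap/swipe/text`) are standard Android tooling.

```python
# A minimal sketch of the perceive -> reason -> act loop, assuming adb is on
# PATH and a device is connected. `choose_action` is a hypothetical stand-in
# for the model call; the ADB commands themselves are standard Android tooling.
import subprocess

def capture_screen() -> bytes:
    """Perception: grab the current screen as PNG bytes via ADB."""
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        check=True, capture_output=True,
    ).stdout

def execute(action: dict) -> None:
    """Action: translate an abstract action into an `adb shell input` call."""
    if action["type"] == "tap":
        cmd = ["input", "tap", str(action["x"]), str(action["y"])]
    elif action["type"] == "swipe":
        cmd = ["input", "swipe", *(str(v) for v in action["coords"])]
    elif action["type"] == "text":
        # `input text` requires spaces to be escaped as %s
        cmd = ["input", "text", action["text"].replace(" ", "%s")]
    else:
        raise ValueError(f"unsupported action: {action['type']}")
    subprocess.run(["adb", "shell", *cmd], check=True)

def run_task(goal: str, choose_action, max_steps: int = 20) -> None:
    """Reasoning loop: screenshot -> model decision -> device action."""
    for _ in range(max_steps):
        action = choose_action(capture_screen(), goal)
        if action["type"] == "done":
            break
        execute(action)
```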
Features
- Multimodal phone screen understanding (vision + language); see the sketch after this list
- Autonomous control of smartphone actions (tap, swipe, type)
- Framework for scripting and deploying mobile AI agents
- Integration with device automation layers like ADB
- Demos on real apps for quickly prototyping agents
- Open framework for research and custom workflows
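The multimodal screen-understanding piece can be wired up in several ways. As one hedged example, the sketch below fills in the `choose_action` placeholder from the loop above, assuming the vision-language model is served behind an OpenAI-compatible endpoint (a common way to host open-weight VLMs); the endpoint URL, model name, and JSON action schema are illustrative assumptions, not Open-AutoGLM's actual interface.

```python
# A possible implementation of the `choose_action` placeholder, assuming the
# vision-language model is served behind an OpenAI-compatible endpoint.
# Endpoint URL, model name, and action schema are illustrative assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM = (
    "You control an Android phone. Given a screenshot and a goal, reply with "
    'a single JSON action: {"type": "tap", "x": ..., "y": ...}, '
    '{"type": "swipe", "coords": [x1, y1, x2, y2]}, '
    '{"type": "text", "text": "..."}, or {"type": "done"}.'
)

def choose_action(screenshot_png: bytes, goal: str) -> dict:
    """Send the screenshot and goal to the VLM and parse its JSON reply."""
    image_url = "data:image/png;base64," + base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="open-autoglm",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": goal},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Keeping the action schema as plain JSON decouples the model-side prompt from the device-side executor, so either can be swapped out independently.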