The Self-Operating Computer Framework lets multimodal models operate a computer autonomously: the model views the screen, then issues mouse and keyboard actions to reach a specified objective. The framework is designed to be compatible with various multimodal models and currently integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa. It was the first known project to give a multimodal model the ability to view and control a computer screen. Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting are supported to improve visual grounding. The framework runs on macOS, Windows, and Linux (with an X server installed) and is released under the MIT license.
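The view-then-act cycle described above can be sketched as a small loop. This is only an illustrative sketch, not the framework's actual code: the names `Action`, `parse_action`, and `run_objective`, the JSON reply format, and the operation names are all assumptions made for the example, and the screenshot, model, and input-device steps are passed in as plain callables so the loop itself stays self-contained.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema for this sketch)."""
    operation: str  # assumed names: "click", "write", "press", "done"
    params: Dict[str, object] = field(default_factory=dict)


def parse_action(model_reply: str) -> Action:
    """Parse a JSON reply such as {"operation": "click", "x": 0.5, "y": 0.5}."""
    data = json.loads(model_reply)
    operation = data.pop("operation")
    return Action(operation=operation, params=data)


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    ask_model: Callable[[str, bytes, List[Action]], str],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for one action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                  # observe
        reply = ask_model(objective, screenshot, history)  # decide
        action = parse_action(reply)
        history.append(action)
        if action.operation == "done":                 # model reports completion
            break
        execute(action)                                # act (mouse or keyboard)
    return history
```

In a real setup, `capture_screen` would take a screenshot, `ask_model` would call a multimodal model, and `execute` would drive the mouse and keyboard; the `max_steps` cap is a simple guard against a model that never reports completion.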
Features
- Autonomous Computer Control: Enables multimodal models to operate a computer by interpreting the screen and executing mouse and keyboard actions to achieve specific tasks.
- Multimodal Model Compatibility: Supports models such as GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa.
- Optical Character Recognition (OCR): Uses OCR to extract text from the screen for enhanced visual processing.
- Set-of-Mark (SoM) Prompting: Utilizes SoM prompting to improve visual grounding and contextual understanding during interactions.
- Cross-Platform Support: Runs on macOS, Windows, and Linux (with an X server installed).
- Open Source and Flexible Licensing: Released under the MIT license, encouraging community contributions and customizable use cases.
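To illustrate the idea behind Set-of-Mark prompting: detected screen regions are given short labels, the labeled screenshot is shown to the model, and the model answers with a label instead of raw coordinates. The sketch below covers only the bookkeeping side (assigning labels and turning a chosen label back into a click position). How the framework itself detects regions or formats labels is not described here, so `Box`, `assign_marks`, and `resolve_mark` are hypothetical names used purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Pixel bounding box of a detected on-screen element."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Label boxes "1", "2", ... in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def resolve_mark(marks: Dict[str, Box], label: str) -> Tuple[int, int]:
    """Turn the label chosen by the model into a pixel position to click."""
    if label not in marks:
        raise KeyError(f"model referred to unknown mark {label!r}")
    return marks[label].center()
```

Rejecting unknown labels explicitly keeps a mistaken model answer from turning into a click at an arbitrary position.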
License
MIT License
User Reviews
- Really awesome to use an AI agent and get it to operate your computer