The Self-Operating Computer Framework enables multimodal models to operate a computer autonomously: the model interprets the screen and issues mouse and keyboard actions to achieve a specified objective. The framework is designed to work with a range of multimodal models and currently integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa. It was the first known project to implement a multimodal model capable of viewing and controlling a computer screen. It supports Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting to improve visual grounding. It runs on macOS, Windows, and Linux (with an X server installed) and is released under the MIT license.
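The observe, decide, act cycle described above can be illustrated with a minimal sketch. This is not the framework's own code or API: the names `Action`, `run_objective`, `capture_screen`, `choose_action`, and `handlers` are hypothetical, and the screen capture, model call, and input handlers are passed in as plain callables so the example runs without a display or an API key.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, not the project's)."""
    kind: str  # e.g. "click", "write", "press", or "done"
    payload: Dict[str, Any] = field(default_factory=dict)


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    handlers: Dict[str, Callable[..., None]],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for an action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()  # observe
        action = choose_action(objective, screenshot, history)  # decide
        history.append(action)
        if action.kind == "done":  # the model reports the objective is met
            break
        if action.kind not in handlers:
            raise ValueError(f"unsupported action: {action.kind}")
        handlers[action.kind](**action.payload)  # act
    return history
```

In a real setup, `capture_screen` would take a screenshot, `choose_action` would send it to a multimodal model, and `handlers` would drive the mouse and keyboard through an input-automation library. The `max_steps` cap is a design choice of this sketch to keep a confused model from looping indefinitely.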

Features

  • Autonomous Computer Control: Enables multimodal models to operate a computer by interpreting the screen and executing mouse and keyboard actions to achieve specific tasks.
  • Multimodal Model Compatibility: Currently integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa.
  • Optical Character Recognition (OCR): Extracts text from the computer screen so the model can locate on-screen elements by their labels.
  • Set-of-Mark (SoM) Prompting: Utilizes SoM prompting to improve visual grounding and contextual understanding during interactions.
  • Cross-Platform Support: Runs on macOS, Windows, and Linux (with an X server installed).
  • Open Source and Flexible Licensing: Released under the MIT license, encouraging community contributions and customizable use cases.
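
As a rough illustration of how OCR output can support visual grounding, the sketch below maps a text label to a click target. It assumes a hypothetical OCR result format (`OcrBox`, with pixel bounding boxes) and a hypothetical helper `locate_text`; neither is taken from the framework, and the framework's actual OCR and SoM handling may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrBox:
    """A piece of recognized text and its pixel bounding box (assumed format)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def locate_text(
    boxes: List[OcrBox], label: str, screen_size: Tuple[int, int]
) -> Optional[Tuple[float, float]]:
    """Return the center of the first box containing `label`,
    as fractions of screen width and height, or None if absent."""
    width, height = screen_size
    wanted = label.strip().lower()
    for box in boxes:
        if wanted in box.text.lower():
            center_x = (box.left + box.right) / 2 / width
            center_y = (box.top + box.bottom) / 2 / height
            return (center_x, center_y)
    return None


boxes = [
    OcrBox("File", 0, 0, 40, 20),
    OcrBox("Submit", 900, 520, 1020, 560),
]
target = locate_text(boxes, "submit", (1920, 1080))
print(target)  # (0.5, 0.5)
```

Returning fractions rather than pixels is a choice made here so the same target stays valid if the screenshot is resized before being sent to a model.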

License

MIT License

User Ratings

  • 5 stars: 1
  • 4 stars: 0
  • 3 stars: 0
  • 2 stars: 0
  • 1 star: 0

  • Ease: 5 / 5
  • Features: 5 / 5
  • Design: 5 / 5
  • Support: 5 / 5

User Reviews

  • Really awesome to use an AI agent and get it to operate your computer

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Intelligent Agents, Python Agentic AI Tool, Python AI Agent Frameworks, Python AI Agents

Registered

2025-01-27