The Self-Operating Computer Framework lets multimodal models operate a computer autonomously: the model interprets the screen and issues mouse and keyboard actions to reach a stated objective. The framework is designed to work with various multimodal models and currently integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa. It was the first known project to give a multimodal model the ability to view and control a computer screen. Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting are supported to improve visual grounding. The framework runs on macOS, Windows, and Linux (with X server installed) and is released under the MIT license.
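The observe, decide, act cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: `Action`, `run_agent`, `fake_capture`, and `scripted_model` are hypothetical names, the model is replaced by a scripted stub, and actions are recorded in a list instead of being sent to a real mouse or keyboard.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical action record; not the framework's real schema."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def run_agent(objective, capture_screen, decide, execute, max_steps=10):
    """Capture the screen, ask the model for the next action, execute it,
    and repeat until the model reports "done" or max_steps is reached."""
    history = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


def fake_capture():
    # Stand-in for a real screenshot; a real agent would grab the display.
    return b"fake-png-bytes"


def scripted_model(objective, screenshot, history):
    # Stand-in for a multimodal model call: replays a fixed plan.
    script = [
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[len(history)]


performed = []  # actions that would have been sent to the mouse/keyboard
history = run_agent(
    "focus a text field and type hello",
    fake_capture,
    scripted_model,
    performed.append,
)
print([a.kind for a in history])
```

Passing the capture, decision, and execution steps in as callables keeps the loop independent of any particular model or input library, which mirrors the multi-model compatibility the description mentions.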

Features

  • Autonomous Computer Control: A multimodal model operates the computer by interpreting the screen and executing mouse and keyboard actions to complete a given task.
  • Multimodal Model Compatibility: Integrates with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa.
  • Optical Character Recognition (OCR): Extracts on-screen text to support the model's visual processing.
  • Set-of-Mark (SoM) Prompting: Uses SoM prompting to improve visual grounding and contextual understanding during interactions.
  • Cross-Platform Support: Runs on macOS, Windows, and Linux (with X server installed).
  • Open Source and Flexible Licensing: Released under the MIT license, which permits community contributions and customized use.
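
To make the Set-of-Mark idea concrete: candidate screen regions are given short labels, the model answers with a label, and the label is mapped back to screen coordinates. The sketch below illustrates only that mapping step and is not taken from the project; `Box`, `assign_marks`, and `resolve_mark` are hypothetical names, and the bounding boxes are made up rather than produced by a real detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a detected on-screen element (pixels)."""
    left: int
    top: int
    width: int
    height: int

    def center(self):
        # Point a click would target for this element.
        return (self.left + self.width // 2, self.top + self.height // 2)


def assign_marks(boxes):
    """Give each box a numeric label ("1", "2", ...) the model can refer to."""
    return {str(i): box for i, box in enumerate(boxes, start=1)}


def resolve_mark(marks, reply):
    """Map the model's reply (a label) back to click coordinates."""
    label = reply.strip()
    if label not in marks:
        raise KeyError(f"unknown mark: {label!r}")
    return marks[label].center()


# Made-up detections; a real system would obtain these from a detector.
boxes = [Box(10, 10, 80, 20), Box(10, 50, 120, 30)]
marks = assign_marks(boxes)
target = resolve_mark(marks, " 2 ")
print(target)  # (70, 65)
```

Asking the model for a label rather than raw pixel coordinates is the design point here: the model only has to choose among marked regions, and the coordinate arithmetic stays in ordinary code.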

License

MIT License

User Ratings

5 stars: 1
4 stars: 0
3 stars: 0
2 stars: 0
1 star: 0

Ease: 5 / 5
Features: 5 / 5
Design: 5 / 5
Support: 5 / 5

User Reviews

  • Really awesome to use an AI agent and get it to operate your computer

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Intelligent Agents, Python Agentic AI Tool, Python AI Agent Frameworks, Python AI Agents

Registered

2025-01-27