A framework to enable multimodal models to operate a computer
...Notably, it was the first known project to implement a multimodal model capable of viewing and controlling a computer screen. The framework supports features such as Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting to improve visual grounding. It is designed to be compatible with macOS, Windows, and Linux (with an X server installed), and is released under the MIT license.
Python SDK for the Computer Use model Lux, developed by OpenAGI
...Multiple installation flavors let you choose between a minimal oagi-core package and variants that bundle desktop automation and FastAPI/Socket.IO server capabilities.