Midscene.js is an open-source, AI-driven UI automation framework that controls user interfaces across multiple platforms using natural language instructions. Instead of relying on traditional selectors, DOM structures, or accessibility attributes, it takes a vision-first approach: screenshots are analyzed by vision-language models to identify interface elements and decide which actions to perform.

This lets developers automate interactions on web applications, desktop software, and mobile devices without writing platform-specific automation logic. A developer describes a task such as clicking a button, filling a form, or extracting information, and the system interprets the instruction and interacts with the interface accordingly. Midscene.js includes SDKs, scripting options, and integrations so that automation workflows can be written in JavaScript, TypeScript, or YAML, and it also ships debugging and development tools.
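As a rough illustration of the JavaScript/TypeScript SDK described above, the sketch below drives a Puppeteer page through Midscene's agent. It is a minimal sketch, assuming Midscene's documented Puppeteer integration (`PuppeteerAgent`, `aiAction`, `aiQuery`, `aiAssert`); the URL and prompts are placeholders, and a vision-language model must be configured separately (typically via environment variables) for the calls to work.

```typescript
import puppeteer from "puppeteer";
import { PuppeteerAgent } from "@midscene/web/puppeteer";

// Launch an ordinary Puppeteer page, then hand it to Midscene.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example-shop.test"); // placeholder URL

const agent = new PuppeteerAgent(page);

// Describe the interaction in natural language; the model locates
// elements from a screenshot rather than from DOM selectors.
await agent.aiAction('type "Headphones" in the search box, then press Enter');

// Extract structured data by describing the desired shape.
const items = await agent.aiQuery(
  "{ title: string, price: number }[], the product results on the page",
);

// Assert on the visible UI state in natural language.
await agent.aiAssert("the result list shows headphone products");

await browser.close();
```

The same agent API is exposed for other drivers (e.g. Playwright), so the natural-language steps stay the same while only the page setup changes.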
Features
- Vision-based UI element detection using screenshots instead of DOM selectors
- Natural language automation that interprets user instructions into UI actions
- Cross-platform support for web browsers, desktop applications, and mobile devices
- JavaScript/TypeScript and YAML scripting support for building automation workflows
- Built-in debugging tools including visual replay reports and playground tools
- Integration with automation tools and frameworks for browser and device control
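For the YAML scripting option listed above, an automation workflow can be expressed declaratively and run from the command line. The sketch below follows the field names in Midscene's documented YAML schema (`web`, `tasks`, `flow`, `ai`, `aiQuery`, `sleep`), though exact keys may vary by version; the URL and prompts are placeholders.

```yaml
# Illustrative sketch; run with the Midscene CLI.
web:
  url: https://example-shop.test  # placeholder URL

tasks:
  - name: search for headphones
    flow:
      - ai: type "Headphones" in the search box, then press Enter
      - sleep: 2000

  - name: extract results
    flow:
      - aiQuery: >
          { title: string, price: number }[],
          the product results on the page
```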