UI-TARS is an open-source multimodal “GUI agent” created by ByteDance: a model designed to perceive raw screenshots (or rendered UI frames), reason about what needs to be done, and then perform real interactions with graphical user interfaces (GUIs), such as clicking, typing, and navigating menus, across desktop, browser, mobile, and game environments. Rather than relying on rigid, manually scripted UI automation, UI-TARS uses a unified vision-language model (VLM) that integrates perception, reasoning, grounding, and action into one end-to-end framework: it “thinks before acting,” enabling flexible, general-purpose automation. This allows it to perform complex, multi-step tasks such as filling forms, downloading files, navigating applications, and even controlling in-game actions, all by understanding the UI as a human would. The project is open source, supports local or remote deployment, and offers a foundation for building GUI automation agents that are more robust and adaptable than per-UI scripts.
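
To make the perceive-then-act idea concrete, here is a minimal sketch of how one might query a locally served model with a screenshot and a task instruction. It assumes the model is exposed behind an OpenAI-compatible chat endpoint (for example via a local inference server); the URL, model name, and prompt wording are illustrative assumptions, not the project's documented interface.

```python
# Hypothetical sketch: send a raw screenshot plus an instruction to a locally
# served UI-TARS-style model and get back its "think-then-act" text response.
# The endpoint URL and model name below are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")  # assumed local endpoint

def propose_action(screenshot_path: str, instruction: str) -> str:
    """Return the model's reasoning + proposed action for one screenshot."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="ui-tars",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    )
    return response.choices[0].message.content

print(propose_action("screen.png", "Open the downloads folder."))
```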
Features
- Vision-language model-based GUI agent: perceives raw screenshots and reasons about UI context
- Unified action space: supports clicks, typing, gestures, hotkeys across desktop, browser, mobile, and games
- “Think-then-act” decision-making: performs internal reasoning (task decomposition, planning, reflection) before executing actions (see the sketch after this list)
- Cross-platform GUI control: works across different operating systems, browsers, and application contexts
- End-to-end automation: capable of carrying out full workflows (forms, downloads, navigation, game controls) without custom scripts per UI
- Open-source with published inference scripts and models — enabling reproducibility and customization
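
The think-then-act loop can be imagined as: capture a screenshot, ask the model for the next step, parse the proposed action, and dispatch it to the GUI. The sketch below illustrates that loop under stated assumptions: the response format (a reasoning part followed by a single action on the last line), the action names, and the `propose_action()` helper from the previous sketch are hypothetical, and `pyautogui` is used here only as one possible way to execute clicks and keystrokes locally.

```python
# Hypothetical perceive -> reason -> act loop. The action grammar below
# ("click(x=..., y=...)", "type(text='...')", "finished") is assumed for
# illustration and is not the project's actual output protocol.
import re
import pyautogui

def execute(action: str) -> None:
    """Dispatch one parsed action string as a real GUI event."""
    if m := re.match(r"click\(x=(\d+), y=(\d+)\)", action):
        pyautogui.click(int(m.group(1)), int(m.group(2)))
    elif m := re.match(r"type\(text='(.*)'\)", action):
        pyautogui.typewrite(m.group(1))
    else:
        raise ValueError(f"Unrecognized action: {action}")

def run_task(instruction: str, max_steps: int = 10) -> None:
    """Repeat: screenshot the UI, ask the model for the next step, act on it."""
    for _ in range(max_steps):
        pyautogui.screenshot("screen.png")                  # perceive current UI state
        reply = propose_action("screen.png", instruction)   # reason about next step
        action = reply.splitlines()[-1].strip()             # assume last line is the action
        if action.startswith("finished"):
            break
        execute(action)                                      # act on the GUI
```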