OmniParser

OmniParser is a method for parsing user interface screenshots into structured elements, which improves the ability of multimodal models such as GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface. It detects interactable icons in a screenshot and captures the semantics of the detected elements, so that an intended action can be associated with the correct screen region. To support this, OmniParser relies on a curated interactable icon detection dataset of 67,000 unique screenshot images, labeled with bounding boxes of interactable icons derived from DOM trees. A further collection of 7,000 icon-description pairs is used to fine-tune a caption model that extracts the functional semantics of detected elements. In evaluations on benchmarks such as SeeClick, Mind2Web, and AITW, OmniParser outperforms GPT-4V baselines even when it uses only screenshot input, with no additional information.
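To make the idea of "structured elements" concrete, here is a minimal sketch of how parsed output could be represented and used to ground an action. The `ParsedElement` schema, its field names, and the helper functions are hypothetical illustrations, not OmniParser's actual API or output format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema: one UI element recovered from a screenshot.
@dataclass
class ParsedElement:
    element_id: int
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    description: str                         # functional caption of the element
    interactable: bool

def to_prompt(elements: List[ParsedElement]) -> str:
    """Render interactable elements as numbered text lines for a multimodal model."""
    return "\n".join(
        f"[{el.element_id}] {el.description}"
        for el in elements
        if el.interactable
    )

def ground_action(elements: List[ParsedElement], element_id: int) -> Tuple[float, float]:
    """Map an element id chosen by the model to a click point (the box center)."""
    for el in elements:
        if el.element_id == element_id:
            x_min, y_min, x_max, y_max = el.bbox
            return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    raise KeyError(f"no element with id {element_id}")

# Example data, invented for illustration only.
elements = [
    ParsedElement(0, (10, 20, 110, 40), "Search button", True),
    ParsedElement(1, (0, 0, 200, 15), "Page title text", False),
]
click_point = ground_action(elements, 0)
```

In this sketch the model only has to choose an element id from the text listing; the click coordinates then come from the detected bounding box rather than from the model's own estimate of pixel positions.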

Features

Parses user interface screenshots into structured, easy-to-understand elements
Examples are available
Enhances the ability of GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface
Requires the V2 weights to be downloaded into the weights folder
Model Weights License
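The note about the V2 weights can be checked before running anything else. The sketch below assumes only that the weights live in a directory named `weights` relative to the working directory; the function name is hypothetical, and no specific file names are assumed because the page does not list them.

```python
from pathlib import Path

def weights_ready(weights_dir: str = "weights") -> bool:
    """Return True if the directory exists and contains at least one file."""
    root = Path(weights_dir)
    # rglob("*") walks subdirectories too, in case weights are nested per model.
    return root.is_dir() and any(p.is_file() for p in root.rglob("*"))

if __name__ == "__main__":
    if not weights_ready():
        print("weights folder is missing or empty; download the V2 weights first")
```

This only confirms that some file is present; it does not verify that the files are the V2 weights or that they are complete.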

License

Creative Commons Attribution License

Additional Project Details

Operating Systems: Windows
Programming Language: Python
Related Categories: Python Agentic AI Tool, Python AI Agent Frameworks, Python AI Agents
Registered: 2025-02-18

OmniParser

A simple screen-parsing tool toward a pure vision-based GUI agent
