Name	Modified	Size	Downloads / Week
README.md	2023-11-11	3.8 kB	0
v0.3.1 - Initial Release source code.tar.gz	2023-11-11	559.9 kB	0
v0.3.1 - Initial Release source code.zip	2023-11-11	568.9 kB	0
Totals: 3 Items		1.1 MB	0

Tarsier Monkey

🙈 Vision utilities for web interaction agents 🙈

🔗 Main site • 🐦 Twitter • 📢 Discord

Announcing Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

- How do you map LLM responses back into web elements?
- How can you mark up a page for an LLM to better understand its action space?
- How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects, so we're now open-sourcing them as a simple utility library for multimodal web agents: Tarsier! The video below demonstrates Tarsier by feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between ids and elements that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
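
To make the tagging idea concrete, here is a minimal sketch of assigning bracketed ids to a list of elements. It is not Tarsier's implementation: the function name `assign_tags` and the example XPaths are hypothetical, and it only illustrates the id-to-element mapping described above.

```python
from typing import Dict, List, Tuple


def assign_tags(xpaths: List[str]) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Assign a sequential id to each interactable element.

    Returns two dicts keyed by id: the bracketed label to draw on the page,
    and the XPath of the element the id refers to.
    """
    tag_to_label: Dict[int, str] = {}
    tag_to_xpath: Dict[int, str] = {}
    for tag_id, xpath in enumerate(xpaths):
        tag_to_label[tag_id] = f"[{tag_id}]"
        tag_to_xpath[tag_id] = xpath
    return tag_to_label, tag_to_xpath


# Hypothetical XPaths for a search box, a submit button, and a link.
example_xpaths = [
    "//input[@name='q']",
    "//button[@type='submit']",
    "//a[@href='/about']",
]
labels, mapping = assign_tags(example_xpaths)
print(labels[1], "->", mapping[1])
```

An LLM that sees the label `[1]` next to the submit button can then refer to it by id, and the agent can look up the matching XPath.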

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision language models. To support this, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
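
As a rough illustration of what a "whitespace-structured string" could look like, the sketch below places OCR words on a character grid according to their pixel positions. This is an assumption about the general technique, not Tarsier's actual algorithm: `words_to_text`, the `(text, x, y)` word format, and the `char_width` and `line_height` values are all hypothetical.

```python
from typing import Dict, List, Tuple

# (text, x, y) in pixels: the top-left corner of each recognized word.
Word = Tuple[str, int, int]


def words_to_text(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid that mirrors its pixel position."""
    if not words:
        return ""

    # Group words by text row, remembering the target column of each word.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous word already reaches this column: keep one space.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical OCR output for a tiny page.
page_words = [("Login", 0, 0), ("[1]", 200, 0), ("Search", 0, 40)]
layout = words_to_text(page_words)
print(layout)
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, so the rough layout of the page survives in plain text.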

Usage

Visit our cookbook for agent examples using Tarsier:

- An autonomous LangChain web agent 🦜⛓️
- An autonomous LlamaIndex web agent 🦙

Otherwise, basic Tarsier usage might look like the following:

:::python
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    google_cloud_credentials = {}  # Placeholder: fill in your Google Cloud credentials

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
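
Once an LLM has answered with a tag, the mapping can be used to find the element it meant. The following is a minimal sketch, assuming the keys of `tag_to_xpath` are integers and that the model replies with text such as "CLICK [2]". Both the reply format and the helper `xpath_for_response` are hypothetical and not part of Tarsier.

```python
import re
from typing import Dict, Optional

# Matches a bracketed numeric tag such as "[12]".
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def xpath_for_response(response: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first tag mentioned in an LLM response, if any."""
    match = TAG_PATTERN.search(response)
    if match is None:
        return None
    # Unknown tags yield None rather than raising.
    return tag_to_xpath.get(int(match.group(1)))


# Hypothetical mapping, shaped like the one printed in the example above.
sample_mapping = {0: "//input[@name='q']", 2: "//a[@href='/login']"}
print(xpath_for_response("CLICK [2]", sample_mapping))
```

With Playwright, the returned XPath could then be passed to something like `page.locator(f"xpath={xpath}")` to act on the element.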

Supported OCR Services

[x] Google Cloud Vision
[ ] Amazon Textract (Coming Soon)
[ ] Microsoft Azure Computer Vision (Coming Soon)

Special shoutout to @KhoomeiK for making this happen! ❤️

Source: README.md, updated 2023-11-11
