LLM Scraper is a TypeScript library designed to extract structured data from webpages using large language models. Instead of relying on fragile HTML selectors or manual parsing rules, the tool interprets webpage content with language models and converts it into structured data according to a defined schema. Developers can specify the data structure using tools such as Zod or JSON Schema, enabling the model to extract relevant information directly into typed objects. LLM Scraper integrates browser automation through Playwright, allowing it to load webpages and process their content before sending it to a language model for interpretation. Multiple content processing modes are supported, including raw HTML, cleaned HTML, Markdown, extracted text, screenshots, and custom inputs, making it adaptable to a wide range of scraping scenarios. LLM Scraper also provides streaming output and code generation capabilities that help developers build reusable scraping workflows.
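The schema-driven flow described above (define a schema, let a model read the page, get back a typed object) can be sketched roughly as follows. This is a dependency-free illustration and not LLM Scraper's actual API: the `Schema`, `Infer`, `Model`, and `extract` names are hypothetical, a tiny hand-rolled schema stands in for Zod or JSON Schema, and a canned function stands in for the language model.

```typescript
// Hypothetical types for illustration only; this is not LLM Scraper's API.
type FieldType = "string" | "number";
type Schema = Record<string, FieldType>;

// Maps a schema description to the matching TypeScript object type.
type Infer<S extends Schema> = {
  [K in keyof S]: S[K] extends "string" ? string : number;
};

// Stand-in for a language model call: takes page content, returns JSON text.
type Model = (pageContent: string, schema: Schema) => string;

// Parses the model's JSON reply and checks every schema field's type.
// Fields that are not named in the schema are dropped.
function extract<S extends Schema>(
  model: Model,
  pageContent: string,
  schema: S
): Infer<S> {
  const raw: unknown = JSON.parse(model(pageContent, schema));
  if (typeof raw !== "object" || raw === null) {
    throw new Error("model reply is not a JSON object");
  }
  const record = raw as Record<string, unknown>;
  const fields: Schema = schema;
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    const expected = fields[key];
    const value = record[key];
    if (typeof value !== expected) {
      throw new Error(`field "${key}" should be a ${expected}`);
    }
    result[key] = value;
  }
  return result as Infer<S>;
}

// Usage with a canned reply instead of a real model (assumed sample data).
const storySchema = { title: "string", points: "number" } as const;
const fakeModel: Model = () =>
  '{"title": "Example story", "points": 120, "extra": true}';
const story = extract(fakeModel, "<html>...</html>", storySchema);
```

Here `story` is typed with a string `title` and a numeric `points`, so downstream code gets compile-time checking without any HTML selectors. In a real setup the schema library would do the validation and the page content would come from a Playwright-loaded page.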
Features
- Extracts structured data from webpages using large language models
- Supports models from multiple LLM providers, including GPT, Gemini, Llama, Qwen, and Sonnet
- Schema-based data extraction using Zod or JSON Schema
- Built on Playwright for automated webpage loading and interaction
- Streaming mode for receiving partial structured outputs in real time
- Code generation for creating reusable Playwright scraping scripts
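As a rough illustration of the streaming feature listed above, the sketch below consumes partial structured outputs as they arrive and keeps a running snapshot. It does not use LLM Scraper's streaming API: `Story`, `fakePartialStream`, and `collectStream` are hypothetical names, and an async generator with made-up values stands in for a model that emits fields incrementally.

```typescript
// Hypothetical target shape for the extracted data.
type Story = { title: string; points: number };

// Stand-in for a streaming model: yields fields as they become available.
async function* fakePartialStream(): AsyncGenerator<Partial<Story>> {
  yield { title: "Example story" };
  yield { points: 42 };
}

// Merges each partial result into a running snapshot and reports every
// intermediate state to the caller, then resolves with the last snapshot.
async function collectStream(
  stream: AsyncIterable<Partial<Story>>,
  onUpdate: (snapshot: Partial<Story>) => void
): Promise<Partial<Story>> {
  let snapshot: Partial<Story> = {};
  for await (const part of stream) {
    snapshot = { ...snapshot, ...part };
    onUpdate(snapshot);
  }
  return snapshot;
}

// Usage: log each intermediate snapshot as it arrives.
void collectStream(fakePartialStream(), (snapshot) => console.log(snapshot));
```

The result type stays `Partial<Story>` because a stream can end before every field has arrived. A caller that needs a complete object would validate the final snapshot against the schema.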