data extraction free download

Showing 19 open source projects for "data extraction"

Language filter: TypeScript

1

Firecrawl

Turn entire websites into LLM-ready markdown or structured data

Firecrawl is an API service that takes a URL, crawls it, and converts it into clean, LLM-ready markdown or structured data. It crawls all accessible subpages and returns clean data for each; no sitemap is required. Built by Mendable.ai and the Firecrawl community, it includes scraping, crawling, and data extraction capabilities.

Downloads: 5 This Week

Last Update: 2026-02-02
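
The "no sitemap" idea in the description above can be illustrated with a small link-following crawl. This is only a sketch of the general technique, not Firecrawl's implementation or API: the `LinkGraph` type, the `crawlSameOrigin` function, and the `demoSite` data are hypothetical names introduced here, and an in-memory map stands in for real HTTP fetching and HTML parsing.

```typescript
// Hypothetical in-memory "site": each page URL maps to the links found on it.
// A real crawler would fetch each page and parse its anchors instead.
type LinkGraph = Record<string, string[]>;

// Breadth-first crawl from a start URL, restricted to the start URL's origin.
// Subpages are discovered by following links, so no sitemap is needed.
function crawlSameOrigin(start: string, site: LinkGraph, maxPages = 100): string[] {
  const origin = new URL(start).origin;
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const current = queue.shift()!;
    order.push(current);
    for (const href of site[current] ?? []) {
      // Resolve relative links against the page they were found on.
      const next = new URL(href, current).toString();
      if (new URL(next).origin === origin && !visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// Hypothetical demo data: three same-origin pages and one external link.
const demoSite: LinkGraph = {
  "https://example.com/": ["/docs", "https://other.example/x"],
  "https://example.com/docs": ["/docs/api", "/"],
  "https://example.com/docs/api": [],
};

const crawled = crawlSameOrigin("https://example.com/", demoSite);
console.log(crawled);
```

The external link is skipped because its origin differs, and the link back to the home page is ignored because it was already visited.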
2

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...

Downloads: 0 This Week

Last Update: 2026-02-27
3

X-Crawl

Flexible Node.js AI-assisted crawler library

A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.

Downloads: 4 This Week

Last Update: 2025-04-06
4

watercrawl

AI-ready web crawler that extracts and structures website content

...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.

Downloads: 4 This Week

Last Update: 4 days ago
5

DocStrange

Extract and convert data from any document: images, PDFs, Word docs

DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...

Downloads: 3 This Week

Last Update: 6 days ago
6

Documind

Open-source platform for extracting structured data from documents

Documind is an advanced document processing tool that leverages AI to extract structured data from PDFs. It is built to handle PDF conversions, extract relevant information, and format results as specified by customizable schemas.

Downloads: 4 This Week

Last Update: 2025-02-21
7

KaraKeep

A self-hostable bookmark-everything app

...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.

Downloads: 2 This Week

Last Update: 2026-02-22
8

nw_wrld

nw_wrld is an event-driven sequencer for triggering visuals

nw_wrld is a procedurally generated world-building engine tailored for game developers and interactive storytellers who want to craft rich, random yet coherent environments without hand-crafting every detail. It uses noise functions and modular terrain algorithms to generate expansive maps, diverse biomes, and layered features like rivers, mountain ranges, forests, and resource nodes. The system is designed to be extensible, letting developers plug in new generation rules or tweak parameters...

Downloads: 3 This Week

Last Update: 2026-02-17
9

GenAIScript

Automatable GenAI Scripting

GenAIScript is a Microsoft tool that provides a JavaScript-ish scripting environment with convenient tooling for file ingestion, prompt development, and structured data extraction. It generates AI-powered text from prompts, which is useful for content creation and automation.

Downloads: 0 This Week

Last Update: 2025-09-26
10

HeadlessX

The undetected self-hosted browser automation platform

...The tool can perform tasks such as HTML extraction, screenshot generation, content parsing, and search result scraping while appearing like a normal user browser. Because it is self-hosted, organizations can run the platform on their own infrastructure to maintain privacy and control over automation workflows.

Downloads: 1 This Week

Last Update: 2 days ago
11

HyperAgent

AI Browser Automation

...Built on top of Playwright, the framework allows developers to automate complex browser interactions using natural language commands rather than fragile selectors or hard-coded scripts. Instead of manually writing logic for clicking elements, extracting data, or navigating web pages, developers can instruct the agent in plain language and allow the AI layer to interpret and execute the task. This approach reduces the brittleness commonly associated with traditional automation scripts that break when the DOM structure changes. HyperAgent includes APIs such as page.ai() and page.extract() that allow structured data extraction and dynamic task execution through AI reasoning.

Downloads: 4 This Week

Last Update: 6 days ago
12

Actors MCP Server

Model Context Protocol (MCP) Server for Apify's Actors

The Apify Actors MCP Server is a Model Context Protocol (MCP) server that enables AI assistants to interact with Apify Actors. This integration allows AI models to utilize various web scraping and automation tools provided by Apify, facilitating tasks such as data extraction and web automation.

Downloads: 2 This Week

Last Update: 2 days ago
13

GPT Crawler

Crawl a site to generate knowledge files to create your own custom GPT

GPT Crawler is an open-source tool designed to automatically crawl websites and generate structured knowledge that can be used to build AI assistants and retrieval systems. It focuses on extracting high-quality textual content from web pages and preparing it in formats suitable for embedding, indexing, or fine-tuning workflows. The project is especially useful for teams that want to turn documentation sites or knowledge bases into conversational AI backends without building custom scrapers...

Downloads: 5 This Week

Last Update: 2026-03-02
14

nhentai

A library for interacting with the nhentai API

nhentai is a JavaScript and TypeScript library designed to interact with the nhentai API and retrieve doujinshi metadata and content information. It enables developers to programmatically access galleries, titles, tags, covers, and page URLs from the nhentai platform. The library supports both CommonJS and ES6 module imports, making it easy to integrate into different Node.js projects. Developers can use it to fetch specific doujin entries, explore associated metadata, and process gallery...

Downloads: 0 This Week

Last Update: 19 hours ago
15

BrowserNode

Make websites accessible for AI agents. Automate tasks online

Browsernode is an open-source TypeScript framework that allows AI agents to interact directly with web browsers in order to automate tasks and gather information from websites. The project acts as a bridge between AI models and browser automation tools, enabling language models to control web pages programmatically. Built as an implementation compatible with the Browser-use ecosystem, Browsernode allows agents to perform actions such as navigating pages, extracting information, filling...

Downloads: 1 This Week

Last Update: 6 days ago
16

Memori

SQL-native memory layer enabling persistent context for AI agents

Memori is an open source SQL-native memory engine designed to add persistent memory capabilities to AI applications, large language models, and multi-agent systems. It provides a memory layer that automatically captures conversations and interactions between users and AI models, allowing systems to retain knowledge across sessions instead of operating statelessly. It extracts structured information such as facts, preferences, rules, and summaries from interactions and stores them in standard...

Downloads: 1 This Week

Last Update: 3 days ago
17

Reader LLM

Convert any URL to an LLM-friendly input with a simple prefix

Reader LLM is an open-source tool designed to convert web content into formats that are easier for large language models to process. The system works by transforming a webpage into a clean text or Markdown representation that removes unnecessary formatting and highlights the core information within the page. Developers can use a simple URL prefix to retrieve a version of a webpage that has been optimized for machine consumption, making it suitable for use in AI agents or retrieval-augmented...

Downloads: 1 This Week

Last Update: 2026-03-04
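
The URL-prefix approach described above amounts to prepending a service address to the page you want converted. A minimal sketch follows; the `READER_PREFIX` value and the `toReaderUrl` helper are hypothetical placeholders introduced for illustration, and the actual service address should be taken from the project's documentation.

```typescript
// Hypothetical prefix used for illustration only; substitute the address
// documented by the project you are actually using.
const READER_PREFIX = "https://reader.example/";

// Build the prefixed URL for a target page.
function toReaderUrl(target: string, prefix: string = READER_PREFIX): string {
  // Parse first so malformed input fails early (new URL throws on bad input).
  const parsed = new URL(target);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return prefix + parsed.toString();
}

const readerUrl = toReaderUrl("https://example.com/docs?page=2");
console.log(readerUrl);
// prints "https://reader.example/https://example.com/docs?page=2"
```

Requesting the resulting URL with any HTTP client would then return the converted page, assuming the service behaves as the description says.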
18

Ayakashi

The next generation web scraping framework

...Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props package domQL expressions as reusable structures that can be passed to actions or used as models for data extraction.

Downloads: 0 This Week

Last Update: 2023-06-29
19

Scylla

Intelligent proxy pool for collecting and managing public proxies

Scylla is an open source proxy pool system designed to collect, validate, and manage large numbers of public proxy servers for use in web scraping and data extraction workflows. It automatically crawls the internet to discover proxy IP addresses and evaluates their availability and reliability before adding them to a usable pool. It includes a JSON API that allows developers and applications to retrieve proxy information programmatically, making it easier to integrate proxy rotation into scraping tools or automation scripts. ...

Downloads: 6 This Week

Last Update: 5 days ago
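
Proxy rotation, as mentioned in the description above, can be sketched as a round-robin pass over a list of validated proxies. This is not Scylla's code or its API schema: the `ProxyRecord` fields, the `ProxyRotator` class, and the sample addresses are all assumptions made for illustration, and the records would normally come from the pool's JSON API rather than a hard-coded array.

```typescript
// Hypothetical shape of one proxy record; the real API's field names may differ.
interface ProxyRecord {
  ip: string;
  port: number;
  isValid: boolean;
}

// Round-robin rotation over the proxies that passed validation.
class ProxyRotator {
  private index = 0;
  private readonly pool: ProxyRecord[];

  constructor(records: ProxyRecord[]) {
    // Keep only proxies marked as valid.
    this.pool = records.filter((r) => r.isValid);
    if (this.pool.length === 0) {
      throw new Error("No valid proxies in pool");
    }
  }

  // Return the next proxy address, wrapping around at the end of the pool.
  next(): string {
    const proxy = this.pool[this.index];
    this.index = (this.index + 1) % this.pool.length;
    return `${proxy.ip}:${proxy.port}`;
  }
}

// Sample records using documentation-reserved IP ranges.
const rotator = new ProxyRotator([
  { ip: "192.0.2.1", port: 8080, isValid: true },
  { ip: "198.51.100.7", port: 3128, isValid: false },
  { ip: "192.0.2.2", port: 8080, isValid: true },
]);

const picks = [rotator.next(), rotator.next(), rotator.next()];
console.log(picks);
```

A scraper would call `next()` before each request so that consecutive requests leave through different proxies.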