data extraction free download

Showing 463 open source projects for "data extraction"

View related business solutions

Build Agents and Models on One Platform
Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.

Try It Free
Our Free Plans just got better! | Auth0
With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.

Try free now
1

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 1 This Week

Last Update: 2025-10-13
See Project
2

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 0 This Week

Last Update: 2025-06-09
See Project
3

webclaw

Fast, local-first web content extraction for LLMs

webclaw is a high-performance web content extraction tool designed specifically for AI agents and large language models, focusing on delivering clean, structured data instead of raw HTML. It is built in Rust and operates without a headless browser, using advanced techniques such as TLS fingerprinting to bypass common scraping barriers and mimic real browser behavior. The tool addresses a major inefficiency in AI workflows by removing irrelevant elements like navigation menus, ads, and scripts, significantly reducing token usage when feeding data into language models. ...

Downloads: 1 This Week

Last Update: 2026-07-22
See Project
4

Sparrow

Structured data extraction and instruction calling with ML, LLM

...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.

Downloads: 0 This Week

Last Update: 2026-06-05
See Project
Custom VMs From 1 to 96 vCPUs With 99.95% Uptime
General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.

Try Free
5

Scanopy

Clean network diagrams, One-time setup, zero upkeep

Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...

Downloads: 18 This Week

Last Update: 20 hours ago
See Project
6

FModel

Unreal Engine Archives Explorer

FModel is a freeware application designed to explore Unreal Engine games archives, allowing users to delve into the assets and structures of games developed with Unreal Engine.

Downloads: 106 This Week

Last Update: 2 hours ago
See Project
7

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...

Downloads: 0 This Week

Last Update: 2026-04-16
See Project
8

Firecrawl

Turn entire websites into LLM-ready markdown or structured data

Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap is required.

Downloads: 1 This Week

Last Update: 2026-06-18
See Project
9

Unstract

No-code LLM Platform to launch APIs and ETL Pipelines

Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
$300 Free Credits to Build on Google Cloud
New to Google Cloud? Get $300 in credits to explore Compute Engine, BigQuery, Cloud Run, Gemini Enterprise Agent Platform, and more.

Start your next project with $300 in free Google Cloud credit. Spin up VMs, run containers, query petabytes in BigQuery, or build agents with Gemini Enterprise Agent Platform. Once your credits are used, keep building with 20+ always-free tier products including Compute Engine, Cloud Storage, GKE, and Cloud Run functions. No commitment required—just sign up and start building.

Claim $300 Free
10

AI-Crawler

Crawl a website starting from a URL, find relevant pages

AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.

Downloads: 0 This Week

Last Update: 2026-04-02
See Project
11

kimuraframework

AI-first Ruby framework for building fast, flexible web scraping spide

Kimurai is an open source web scraping framework written in Ruby that simplifies the process of building automated data extraction tools. It provides a clean domain-specific language that allows developers to define scraping logic and data schemas with minimal boilerplate code. Kimurai can use AI-assisted extraction to identify where data resides in HTML pages, automatically generating selectors that are cached for future use so subsequent scraping runs operate with pure Ruby performance. ...

Downloads: 2 This Week

Last Update: 2026-07-21
See Project
12

OpenDataLoader PDF

PDF Parser for AI-ready data. Automate PDF accessibility

OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing...

Downloads: 4 This Week

Last Update: 2026-07-14
See Project
13

AgentQL MCP

Model Context Protocol server that integrates AgentQL's data

The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
14

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
15

X-Crawl

Flexible Node.js AI-assisted crawler library

A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.

Downloads: 0 This Week

Last Update: 2025-04-06
See Project
16

book-to-skill

Turn any technical book PDF into a Claude Code skill

book-to-skill is a Claude Code skill that turns technical books and documents into reusable AI reference skills. It extracts content from PDFs and EPUBs, then organizes the material so an assistant can study, reference, and apply it while working. The project is useful for transforming dense manuals, textbooks, internal documentation, or technical guides into practical agent-accessible knowledge. It includes an extraction script and a SKILL.md workflow that guides how the resulting content...

Downloads: 9 This Week

Last Update: 3 days ago
See Project
17

tsfresh

Automatic extraction of relevant features from time series

tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python....

Downloads: 0 This Week

Last Update: 2026-05-31
See Project
18

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 0 This Week

Last Update: 5 days ago
See Project
19

watercrawl

AI-ready web crawler that extracts and structures website content

...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.

Downloads: 0 This Week

Last Update: 2026-05-20
See Project
20

BlockArrays.jl

BlockArrays for Julia

A block array is a partition of an array into blocks or subarrays, see Wikipedia for a more extensive description. This package has two purposes. Firstly, it defines an interface for an AbstractBlockArray block arrays that can be shared among types representing different types of block arrays. The advantage to this is that it provides a consistent API for block arrays. Secondly, it also implements two different types of block arrays that follow the AbstractBlockArray interface. The type...

Downloads: 1 This Week

Last Update: 2026-07-24
See Project
21

Ferret

Declarative web scraping

A web scraping system aiming to simplify data extraction from the web. ferret has a declarative query language that makes it easy to focus on the data that you need to get. ferret has the ability to scrape JS rendered pages, handle all page events, and emulate user interactions. the ferret was designed as a library from the ground up. it can be easily embedded into any Go application. ferret helps you to focus on the data you need using an easy-to-learn declarative language. ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages. ferret is extremely extensible, and creating custom functions and types is super easy. ferret allows users to focus on the data. ...

Downloads: 1 This Week

Last Update: 2026-03-05
See Project
22

MinerU

A high-quality tool for convert PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 11 This Week

Last Update: 2026-07-10
See Project
23

DocStrange

Extract and convert data from any document, images, pdfs, word doc

DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
24

Web RPA

Web Robotics Process Automation Tool

Web RPA is a browser automation framework designed to perform robotic process automation tasks directly within web environments. It enables users to automate repetitive actions such as form filling, data extraction, and workflow execution through programmable scripts. The system focuses on simplicity and flexibility, allowing automation without requiring complex infrastructure. It supports interaction with web elements, navigation flows, and dynamic content handling, making it suitable for scraping and automation scenarios. WebRPA can be integrated into larger systems or used as a standalone tool for automating browser-based operations. ...

Downloads: 6 This Week

Last Update: 4 days ago
See Project
25

newpipeextractor

Library for extracting streaming site data without official APIs

...It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. Each supported service is implemented through its own extractor components that conform to a common interface, enabling consistent access to data across different platforms.

Downloads: 0 This Week

Last Update: 2026-07-21
See Project