data extraction free download

Showing 377 open source projects for "data extraction"

View related business solutions

Linux Clear Filters & Widen Search

Build Agents and Models on One Platform
Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.

Try It Free
$300 Free Credits for Your Google Cloud Projects
Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.

Start Free Trial
1

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 1 This Week

Last Update: 2025-10-13
See Project
2

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 0 This Week

Last Update: 2025-06-09
See Project
3

webclaw

Fast, local-first web content extraction for LLMs

webclaw is a high-performance web content extraction tool designed specifically for AI agents and large language models, focusing on delivering clean, structured data instead of raw HTML. It is built in Rust and operates without a headless browser, using advanced techniques such as TLS fingerprinting to bypass common scraping barriers and mimic real browser behavior. The tool addresses a major inefficiency in AI workflows by removing irrelevant elements like navigation menus, ads, and scripts, significantly reducing token usage when feeding data into language models. ...

Downloads: 1 This Week

Last Update: 2026-07-22
See Project
4

Sparrow

Structured data extraction and instruction calling with ML, LLM

...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.

Downloads: 0 This Week

Last Update: 2026-06-05
See Project
Custom VMs From 1 to 96 vCPUs With 99.95% Uptime
General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.

Try Free
5

Scanopy

Clean network diagrams, One-time setup, zero upkeep

Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...

Downloads: 18 This Week

Last Update: 21 hours ago
See Project
6

FModel

Unreal Engine Archives Explorer

FModel is a freeware application designed to explore Unreal Engine games archives, allowing users to delve into the assets and structures of games developed with Unreal Engine.

Downloads: 106 This Week

Last Update: 3 hours ago
See Project
7

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...

Downloads: 0 This Week

Last Update: 2026-04-16
See Project
8

Firecrawl

Turn entire websites into LLM-ready markdown or structured data

Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap is required.

Downloads: 1 This Week

Last Update: 2026-06-18
See Project
9

Unstract

No-code LLM Platform to launch APIs and ETL Pipelines

Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...

Downloads: 0 This Week

Last Update: 6 days ago
See Project
Our Free Plans just got better! | Auth0
With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.

Try free now
10

AI-Crawler

Crawl a website starting from a URL, find relevant pages

AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.

Downloads: 0 This Week

Last Update: 2026-04-02
See Project
11

kimuraframework

AI-first Ruby framework for building fast, flexible web scraping spide

Kimurai is an open source web scraping framework written in Ruby that simplifies the process of building automated data extraction tools. It provides a clean domain-specific language that allows developers to define scraping logic and data schemas with minimal boilerplate code. Kimurai can use AI-assisted extraction to identify where data resides in HTML pages, automatically generating selectors that are cached for future use so subsequent scraping runs operate with pure Ruby performance. ...

Downloads: 2 This Week

Last Update: 2026-07-21
See Project
12

AgentQL MCP

Model Context Protocol server that integrates AgentQL's data

The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
13

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
14

X-Crawl

Flexible Node.js AI-assisted crawler library

A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.

Downloads: 0 This Week

Last Update: 2025-04-06
See Project
15

book-to-skill

Turn any technical book PDF into a Claude Code skill

book-to-skill is a Claude Code skill that turns technical books and documents into reusable AI reference skills. It extracts content from PDFs and EPUBs, then organizes the material so an assistant can study, reference, and apply it while working. The project is useful for transforming dense manuals, textbooks, internal documentation, or technical guides into practical agent-accessible knowledge. It includes an extraction script and a SKILL.md workflow that guides how the resulting content...

Downloads: 9 This Week

Last Update: 3 days ago
See Project
16

tsfresh

Automatic extraction of relevant features from time series

tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python....

Downloads: 0 This Week

Last Update: 2026-05-31
See Project
17

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 0 This Week

Last Update: 5 days ago
See Project
18

watercrawl

AI-ready web crawler that extracts and structures website content

...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.

Downloads: 0 This Week

Last Update: 2026-05-20
See Project
19

BlockArrays.jl

BlockArrays for Julia

A block array is a partition of an array into blocks or subarrays, see Wikipedia for a more extensive description. This package has two purposes. Firstly, it defines an interface for an AbstractBlockArray block arrays that can be shared among types representing different types of block arrays. The advantage to this is that it provides a consistent API for block arrays. Secondly, it also implements two different types of block arrays that follow the AbstractBlockArray interface. The type...

Downloads: 1 This Week

Last Update: 2026-07-24
See Project
20

Ferret

Declarative web scraping

A web scraping system aiming to simplify data extraction from the web. ferret has a declarative query language that makes it easy to focus on the data that you need to get. ferret has the ability to scrape JS rendered pages, handle all page events, and emulate user interactions. the ferret was designed as a library from the ground up. it can be easily embedded into any Go application. ferret helps you to focus on the data you need using an easy-to-learn declarative language. ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages. ferret is extremely extensible, and creating custom functions and types is super easy. ferret allows users to focus on the data. ...

Downloads: 1 This Week

Last Update: 2026-03-05
See Project
21

MinerU

A high-quality tool for convert PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 11 This Week

Last Update: 2026-07-10
See Project
22

DocStrange

Extract and convert data from any document, images, pdfs, word doc

DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
23

mtail

Extract internal monitoring data from application logs

Extract internal monitoring data from application logs for collection in a time-series database. mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding. It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application. ...

Downloads: 3 This Week

Last Update: 2024-08-08
See Project
24

NeMo Retriever Library

Document content and metadata extraction microservice

...It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient handling of large datasets. It supports multiple extraction strategies for different document formats, balancing accuracy and throughput depending on the use case. Additionally, it can generate embeddings for extracted content and integrate with vector databases like Milvus, making it well-suited for retrieval-augmented generation pipelines.

Downloads: 0 This Week

Last Update: 2026-05-29
See Project
25

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API

...The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models. Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data.

Downloads: 2 This Week

Last Update: 2026-03-05
See Project