extraction free download

Showing 26 open source projects for "extraction"

1

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput.

Downloads: 1 This Week

Last Update: 2026-04-16
See Project
2

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 0 This Week

Last Update: 2025-10-13
See Project
3

warp

A super-easy, composable, web server framework for warp speeds

The fundamental building block of warp is the Filter. Filters can be combined and composed to express rich requirements on requests. A Filter in warp is essentially a function that operates on some input, either something from a request or something from a previous Filter, and returns some output, which could be an app-specific type you wish to pass around or a reply to send back as an HTTP response. That might sound simple, but the exciting part is the combinators that exist...

Downloads: 7 This Week

Last Update: 2025-08-06
See Project
4

Nano PDF Editor

Edit PDF files with Nano Banana

Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...

Downloads: 17 This Week

Last Update: 2026-02-05
See Project
5

Unredact

A simple tool for reading in poorly redacted documents

Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...

Downloads: 15 This Week

Last Update: 2026-02-03
See Project
6

py-pdf-parser

A Python tool to help extract information from structured PDFs

py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.

Downloads: 0 This Week

Last Update: 2025-04-28
See Project
7

WebHarvest - web data extraction tool

Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.

14 Reviews

Downloads: 4 This Week

Last Update: 2025-10-27
See Project
8

ldif-extract

Extract selected entries from LDIF files, like grep

ldif-extract is a small grep-like tool to extract and convert data from LDIF files. It can be used standalone or in a pipe together with other tools such as ldapsearch.

Downloads: 0 This Week

Last Update: 2026-01-10
See Project
9

QXmlEdit

Simple XML editor and XSD viewer

QXmlEdit is a simple, multi-platform XML editor written in Qt. Its main features are unusual data visualization modes and convenient XML manipulation and presentation. It can split very big XML files into fragments, compare XML and XSD files, and has a graphical XSD viewer. Project site: http://qxmledit.org Source code hosted at GitHub (moved from Google Code): https://github.com/lbellonda/qxmledit Report issues at: https://github.com/lbellonda/qxmledit/issues Discussion...

4 Reviews

Downloads: 84 This Week

Last Update: 2023-02-09
See Project
10

cantools

Access and convert ASC, BLF, DBC, and MDF files

cantools is a set of libraries and command line tools for handling ASC, BLF, CLG, VSB, MDF, and DBC files. The tools can be used to analyze and convert the data to other formats. Shared libraries for parsing and accessing these files are also provided.

5 Reviews

Downloads: 1 This Week

Last Update: 2019-06-05
See Project
11

iText®, a JAVA PDF library

PDF Library for Developers

iText is an open-source PDF library available for Java and .NET (C#). iText allows you to effortlessly generate and manipulate standards-compliant PDF documents with a powerful and feature-rich SDK. With iText, you can create archivable and accessible PDFs, split and merge documents, fill and flatten forms, digitally sign documents, and more. iText add-ons enable additional functionality, such as PDF creation from HTML templates, secure redaction, OCR, and much more. The latest...

Downloads: 163 This Week

Last Update: 2024-06-01
See Project
12

sMeta

Simple general-purpose metadata extraction API with support for popular multimedia metadata formats such as EXIF and ID3.

Downloads: 0 This Week

Last Update: 2015-01-05
See Project
13

Row-Bean

CSV reader writer - bean mapping - easy bean extraction from CSV file

Row-Bean is a CSV-to-bean Java API. It provides a CSV reader and writer, as well as a mechanism to map CSV file content to Java beans and back. The desired mapping is described either in an XML description or with annotations. Use under Maven: <!-- row bean with annotations...

Downloads: 0 This Week

Last Update: 2015-09-13
See Project
14

Metadata Extraction Tool

The National Library of New Zealand's Metadata Extraction Tool automatically extracts preservation-related metadata from digital files, then outputs that metadata in XML format. It can be used through a graphical user interface or a command-line interface. Please take the latest code from 'https://github.com/DIA-NZ/Metadata-Extraction-Tool.git'. The code on SourceForge will no longer be updated, as the project has moved to GitHub.

19 Reviews

Downloads: 6 This Week

Last Update: 2016-02-11
See Project
15

Detexter

Detexter is an app designed to extract text from PDF files.

Detexter lets you extract text from multiple PDF files. Detexter uses the PDFBox library for its text extraction.

Downloads: 0 This Week

Last Update: 2015-09-01
See Project
16

Large Text File converter

Java-based heavy-duty utility to process large delimited text files

...Another strength of this tool is its configurability: its design allows it to generate as many output files as required from one input file, and validation, extraction, and conversion can be applied at every row of the input file. Use case example: a legacy system is to be replaced with a new system that has a different DB schema, and the data is provided as 100 GB of delimited text that must be inserted into 10 different tables of the new system's DB after validation, date format conversion, rearrangement, and MD5 hashing.

Downloads: 0 This Week

Last Update: 2015-05-31
See Project
17

bint

Converts intensity text files to binary for fast subsetting

...Extracting the data for individual SNP/CNV markers or individual samples was slow when grep/awk'ing the text files exported from the genotyping run (e.g. Illumina final report files). bint converts the text representation of the intensity float data into an IEEE 754 indexed binary file for rapid extraction of subsets of the data. In theory, bint could be used for any large table of float data.

Downloads: 0 This Week

Last Update: 2015-03-29
See Project
18

PDF Extraction Toolkit

This project provides a toolkit and framework based on PDFBox for document analysis of PDF files and performing custom conversion tasks and is published under the Apache licence. A GUI is also included, and is published using the GPL licence.

Downloads: 0 This Week

Last Update: 2013-04-25
See Project
19

EMET

EMET is an image metadata extraction tool intended to facilitate the management and preservation of digital images and their incorporation into external databases and applications. EMET was created by ARTstor through funding from NDIIPP.

1 Review

Downloads: 0 This Week

Last Update: 2015-10-17
See Project
20

JBiblex

Cross-platform explorer of ZIP archives with FB2 books.

Downloads: 0 This Week

Last Update: 2013-04-19
See Project
21

Cairo tool

Cairo (Complex Archive Ingest for Repository Objects) is a tool for processing digital archives prior to submitting them to archival storage for long-term preservation; among other features, this includes format identification and metadata extraction.

Downloads: 0 This Week

Last Update: 2013-04-12
See Project
22

Scan

Scan, the Semantic Content ANnotator, is a semantic pipeline that helps connect information extraction tools to semantic databases. It is UIMA-based and allows easy plugin writing for information extraction, ontology control, and storage in RDF repositories.

Downloads: 0 This Week

Last Update: 2014-03-20
See Project
23

Wex

Software for web page data extraction.

Downloads: 0 This Week

Last Update: 2013-04-24
See Project
24

jumbles

jumbles (Java Unified Metadata Basic Library for Extracting and Storing) is a library that enables the extraction and storage of multimedia metadata. It currently wraps "jaudiotagger" (MP3 ID3 tags) and "metadata extractor" (EXIF, et al.).

Downloads: 0 This Week

Last Update: 2012-07-19
See Project
25

FOXY

FOXY is a filtering web proxy. Originally designed to provide device-independent access to the World Wide Web, it may also be used for HTTP filtering, extraction and reauthoring of existing web content, or as a security device against web-based attacks.

1 Review

Downloads: 6 This Week

Last Update: 2013-03-12
See Project