Showing 41 open source projects for "extraction"

  • 1
    pdfly

    CLI tool to extract (meta)data from PDF and manipulate PDF files

    A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.
    Downloads: 5 This Week
  • 2
    OCRBase

    MD/.JSON Document OCR and structured data extraction API

    OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput.
    Downloads: 0 This Week
  • 3
    OpenDataLoader PDF

    PDF Parser for AI-ready data. Automate PDF accessibility

    OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing methods with an optional hybrid AI-powered mode that improves extraction quality for difficult layouts such as multi-column documents, scanned files, and scientific papers. ...
    Downloads: 6 This Week
  • 4
    py-pdf-parser

    A Python tool to help extract information from structured PDFs

    py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.
    Downloads: 8 This Week
  • 5
    Unredact

    A simple tool for reading in poorly redacted documents

    Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...
    Downloads: 14 This Week
  • 6
    Nano PDF Editor

    Edit PDF files with Nano Banana

    Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...
    Downloads: 9 This Week
  • 7
    wal2json

    JSON output plugin for changeset extraction

    wal2json is an output plugin for logical decoding. This means that the plugin has access to tuples produced by INSERT and UPDATE. Also, UPDATE/DELETE old row versions can be accessed depending on the configured replica identity. Changes can be consumed using the streaming protocol (logical replication slots) or by a special SQL API. Format version 1 produces a JSON object per transaction. All of the new/old tuples are available in the JSON object. Also, there are options to include properties...
    Downloads: 0 This Week
  • 8
    WebHarvest - web data extraction tool
    Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text-processing technologies to easily extract useful data from arbitrary web pages.
    Downloads: 3 This Week
  • 9
    TextExtractor

    Extracts plain text from a variety of different file types

    TextExtractor extracts plain text from hundreds of different file types, storing the extracted text in suitably named text files. TextExtractor 1.10 works in six different modes: Instant Mode - just select any file and extract the text from it. Batch Mode - select a group of files and extract the text from all of them in one go. Polling Mode - watch a folder location, processing new files as they appear there. Hierarchical Mode - extract text from files in a directory...
    Downloads: 8 This Week
  • 10

    ldif-extract

    Extract selected entries from LDIF files, like grep

    ldif-extract is a small grep-like tool to extract and convert data from LDIF files. It can be used standalone or in a pipe together with other tools such as ldapsearch.
    Downloads: 0 This Week
  • 11
    QXmlEdit

    Simple XML editor and XSD viewer

    QXmlEdit is a simple, multi-platform XML editor written in Qt. Its main features are unusual data visualization modes and convenient XML manipulation and presentation. It can split very big XML files into fragments, compare XML and XSD files, and has a graphical XSD viewer. Project site: http://qxmledit.org Source code hosted at GitHub (moved from Google Code): https://github.com/lbellonda/qxmledit Report issues at: https://github.com/lbellonda/qxmledit/issues Discussion...
    Downloads: 118 This Week
  • 12
    MBZ Moodle Restore

    Restore the original file names from an MBZ Moodle backup file

    MBZ Restore is an application that extracts an MBZ Moodle backup file and restores the original names of the files in it. Newer versions of Moodle backup generate file names that are not easily matched to the original files. Files can be renamed by hand, but you have to find out what their original names are. This application does this for you.
    Downloads: 6 This Week
  • 13

    PDFtk Bookmarks Editor

    GUI for updating PDF bookmarks using PDF Toolkit (PDFtk) on Windows

    Free and open source GUI application for updating bookmarks in a PDF document using the PDF Toolkit command line tool, PDFtk Server. User selects the PDF via drag and drop and then edits the bookmark entries in a text file using a simple, 1-line data format. Program handles everything else in response to a few user button clicks. OS: Windows. Author: David King. License: GPLv3.
    Downloads: 26 This Week
  • 14

    cantools

    Access and convert ASC, BLF, DBC, and MDF files

    cantools is a set of libraries and command line tools for handling ASC, BLF, CLG, VSB, MDF, and DBC files. The tools can be used to analyze and convert the data to other formats. Shared libraries for parsing and accessing these files are also provided.
    Downloads: 8 This Week
  • 15
    iText®, a JAVA PDF library

    PDF Library for Developers

    iText is an open-source PDF library available for Java and .NET (C#). iText allows you to effortlessly generate and manipulate standards-compliant PDF documents with a powerful and feature-rich SDK. With iText, you can create archivable and accessible PDFs, split and merge documents, fill and flatten forms, digitally sign documents, and more. iText add-ons enable additional functionality, such as PDF creation from HTML templates, secure redaction, OCR, and much more. The latest...
    Downloads: 131 This Week
  • 16
    I present to you BSA extractor. I know there are a lot of them, but I wanted to write my own to study the structure of *.BSA files. Supported games: TES3: Morrowind; TES4: Oblivion; TES5: Skyrim; Fallout 3; Fallout: New Vegas. The program is simple and should not cause any problems in use.
    Downloads: 6 This Week
  • 17
    unfit

    Extract useful information from FIT files.

    Extract heart rate, speed, distance, etc. from FIT files as CSV.
    Downloads: 0 This Week
  • 18
    OpenSearchServer Extractor

    A RESTful/JSON web service for text and metadata extraction

    An open source RESTful web service for text and metadata extraction and analysis. oss-text-extractor supports various binary formats: word processor (doc, docx, odt, rtf), spreadsheet (xls, xlsx, ods), presentation (ppt, pptx, odp), publishing (pdf, pub), web (rss, html/xhtml), media (audio, images), and others (vsd, text).
    Downloads: 0 This Week
  • 19
    Simple general-purpose metadata extraction API with support for popular multimedia metadata formats such as EXIF and ID3.
    Downloads: 1 This Week
  • 20
    Row-Bean

    CSV reader writer - bean mapping - easy bean extraction from CSV file

    Row-Bean is a CSV-Bean Java API. It provides a CSV reader and writer, as well as a mechanism to map CSV file content to Java beans and back. For each use, an XML description must describe the desired mapping. Alternatively, annotations can be used. Use under Maven: <!-- row bean with annotations...
    Downloads: 0 This Week
  • 21
    The National Library of New Zealand's Metadata Extraction Tool automatically extracts preservation-related metadata from digital files, then outputs that metadata in XML format. It can be used through a graphical user interface or a command-line interface. Please take the latest code from 'https://github.com/DIA-NZ/Metadata-Extraction-Tool.git'. The code on SourceForge will no longer be updated, as it has moved to GitHub.
    Downloads: 11 This Week
  • 22
    pdf2xml is a converter based on the Xpdf library (http://www.foolabs.com/xpdf/home.html). It converts information contained in a PDF file into XML. First, you need to install xpdf and libxml2 (see documentation). Author: Hervé Déjean, Xerox Research Centre Europe, http://www.xrce.xerox.com/About-XRCE/People/Herve-Dejean
    Downloads: 10 This Week
  • 23

    Detexter

    Detexter is an app designed to extract text from PDF files.

    Detexter lets you extract text from multiple PDF files. Detexter uses the PDFBox library for its text extraction.
    Downloads: 0 This Week
  • 24

    Large Text File converter

    Java-based heavy-duty utility to process large delimited text files

    ...Another strength of this tool is its configurability: its design allows it to generate as many output files as required from one input file, and validation, extraction, and conversion can be applied to every row of the input file. Use case example: a legacy system is to be replaced with a new, advanced system with a different DB schema, and the data is provided as 100 GB of delimited text that is to be inserted into 10 different tables of the new system's DB after validation, date format conversion, rearrangement, and MD5 hashing.
    Downloads: 0 This Week
  • 25

    bint

    Converts intensity text files to binary for fast subsetting

    ...Extracting the data for individual SNP/CNV markers or individual samples was slow when grep/awk'ing the text files exported from the genotyping run (e.g. Illumina final report files). bint converts the text representation of the intensity float data into an indexed IEEE 754 binary file for rapid extraction of subsets of the data. In theory, bint could be used for any large tables of float data.
    Downloads: 0 This Week