Search Results for "pdf data mining" - Page 2

Showing 893 open source projects for "pdf data mining"

View related business solutions
  • Ship Agents Faster Icon
    Ship Agents Faster

    Transform your applications and workflows into powerful agentic systems at global scale.

    Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
    Get Started Free
  • Compliant and Reliable File Transfers Backed by Top Security Certifications Icon
    Compliant and Reliable File Transfers Backed by Top Security Certifications

    Cerberus FTP Server delivers SOC 2 Type II certified security and FIPS 140-2 validated encryption.

    Stop relying on non-certified, legacy file transfer tools that creak under the weight of modern security demands. Get full audit trails, advanced access controls and more supported by an award-winning team of experts. Start your free 25-day trial today.
    Start Free Trial
  • 1
    Pandoc

    Pandoc

    The universal markup converter

    Pandoc is a universal document converter able to convert files from a multitude of markup formats into another. With Pandoc, you have a swiss-army knife of a converter, able to convert practically any markup format into any other. Pandoc contains a Haskell library for conversions as well as a command-line tool that uses this library. It can convert to and from just about anything-- lightweight markup formats, HTML formats, documentation formats, ebooks, TeX formats, word processor formats...
    Downloads: 236 This Week
    Last Update:
    See Project
  • 2
    PDFPatcher

    PDFPatcher

    A versatile toolkit for PDF manipulation

    PDFPatcher (aka “PDF补丁丁”) is a versatile toolkit for PDF manipulation—editing document metadata, bookmarks, page layout, content restrictions, rotation, compression, merging/splitting, image extraction, and more, all within an intuitive interface. Merge/split PDFs or images, preserve or add bookmarks, and set page dimensions. Batch style/color/target changes, regex/XPath search/replace, mid‑page positioning. Modify PDF metadata, page numbers, links, initial view mode, and remove open actions.
    Downloads: 36 This Week
    Last Update:
    See Project
  • 3
    pdfmake

    pdfmake

    Client/server side PDF printing in pure JavaScript

    Print PDFs directly in the browser or delegate it to your NodeJS backend. Use the same document definition in both cases. Forget about manual x, y calculations. Declare document structure and let pdfmake do the rest. Use paragraphs, columns, lists, tables, canvas, etc. Declare your own styles, use custom fonts, build a DSL and extend the framework. Provides a set of options to disable font layout cache and to control when pages are flushed to the output file. Pdfmake is runnable in browser...
    Downloads: 13 This Week
    Last Update:
    See Project
  • 4
    jsPDF

    jsPDF

    HTML5 client solution for generating PDFs

    The leading HTML5 client solution for generating PDFs. Perfect for event tickets, reports, certificates, you name it! PDFs are ubiquitous across the web, with virtually every enterprise relying on them to share documents. We created jsPDF to solve a major problem with how pdf files were being generated. We decided to make it open-source to allow a community of developers to expand on it.
    Downloads: 26 This Week
    Last Update:
    See Project
  • Error to trace to log to deploy. One click. No SSH. Icon
    Error to trace to log to deploy. One click. No SSH.

    Catch the cause before the pager goes off.

    AppSignal links every error to the trace, the trace to the log, the log to the deploy that shipped it.
    Free 30 days.
  • 5
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 6
    Vanilla.PDF

    Vanilla.PDF

    Cross-platform SDK for creating and modifying PDF documents

    Vanilla.PDF is a modern, high-performance, open-source C++17 SDK designed for creating, editing, signing, and analyzing PDF documents across multiple platforms. It requires no external runtime dependencies, making it lightweight and ideal for embedding into desktop applications, servers, or automation pipelines. The SDK offers full cross-platform support including Windows, Linux, macOS, and Android, with builds available for major compilers and architectures. Vanilla.PDF supports advanced...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 7
    PyPDF

    PyPDF

    A pure-python PDF library capable of splitting, merging, cropping

    pypdf is a pure Python library for working with PDF files, allowing developers to split, merge, rotate, encrypt, and extract content from PDFs. It’s an actively maintained fork of PyPDF2, improving performance, compatibility, and support for modern PDF standards. Suitable for both automation scripts and full-featured applications, pypdf handles PDFs without requiring external dependencies.
    Downloads: 12 This Week
    Last Update:
    See Project
  • 8
    tinypdf

    tinypdf

    Minimal PDF creation library

    tinypdf is a minimal, zero-dependency PDF generation library that focuses on the core “put content on a page” use case while intentionally skipping heavyweight features. It is designed to be extremely small and approachable, making it a good fit when you want to generate real PDFs in Node/TypeScript without pulling in a large toolkit. The library supports essential primitives like writing text, drawing basic shapes, and placing JPEG images, which covers common needs such as invoices,...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 9
    pandoc-crossref filter

    pandoc-crossref filter

    Pandoc filter for cross-references

    pandoc-crossref is a pandoc filter for numbering figures, equations, tables and cross-references to them. The input file (like demo.md) can be converted into HTML, LaTeX, PDF, Markdown or other formats. Optionally, you can use cleveref for LaTeX/PDF output, e.g. cleveref PDF, cleveref LaTeX, and listings package, e.g. listings PDF, listings LaTeX. This package tries to use LaTeX labels and references if output type is LaTeX. It also tries to supplement rudimentary LaTeX configuration that...
    Downloads: 10 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 10
    ShredOS

    ShredOS

    ShredOS Disk Eraser 64 bit for all Intel 64 bit processors

    ShredOS is a lightweight, bootable Linux-based operating system designed specifically for secure disk erasure and data destruction. It enables users to permanently wipe hard drives, SSDs, and NVMe devices using the powerful nwipe utility and multiple industry-recognized wiping methods. Compatible with both BIOS and UEFI systems, ShredOS supports PCs, servers, and Intel-based Macs running on 32-bit and 64-bit processors. The platform can erase multiple drives simultaneously while generating detailed PDF certificates and logs for compliance and auditing purposes. ...
    Downloads: 457 This Week
    Last Update:
    See Project
  • 11
    WebViewer UI

    WebViewer UI

    WebViewer UI built in React

    WebViewer UI sits on top of WebViewer, a powerful JavaScript-based PDF Library that's part of the PDFTron PDF SDK. Built in React, WebViewer UI provides a slick out-of-the-box responsive UI that interacts with the core library to view, annotate and manipulate PDFs that can be embedded into any web project. This repo is specifically designed for any users interested in advanced customizations. With the source code access, it gives developers full control to customize & style the UI, build...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 12
    TikZ

    TikZ

    TikZ figures for concepts in physics/chemistry/ML

    Collection of 111 standalone TikZ figures for illustrating concepts in physics, chemistry, and machine learning. Check out janosh.github.io to search, sort, open in Overleaf, and download figures (PDF/SVG/PNG) from this collection.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 13
    pdfme

    pdfme

    A TypeScript based PDF generator library, made with React

    TypeScript base PDF generator and React-based UI. Open source, developed by the community, and completely free to use under the MIT license. No complex operations are required. Just bring your favorite template and generate all the PDFs you need. Works on node and the browser. Anyone can easily create and modify templates using Designer (UI template editor). Templates have a JSON document representation, which makes theme easy to understand and easy to work with.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 14
    Gaston.jl

    Gaston.jl

    A julia front-end for gnuplot

    Gaston is a Julia package for plotting. It provides an interface to gnuplot, a powerful plotting package available on all major platforms. The current stable release is v1.1.0, and it has been tested with Julia LTS (1.6) and stable (1.8), on Linux. Gaston should work on any platform that runs gnuplot.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 15
    book-to-skill

    book-to-skill

    Turn any technical book PDF into a Claude Code skill

    book-to-skill is a Claude Code skill that turns technical books and documents into reusable AI reference skills. It extracts content from PDFs and EPUBs, then organizes the material so an assistant can study, reference, and apply it while working. The project is useful for transforming dense manuals, textbooks, internal documentation, or technical guides into practical agent-accessible knowledge. It includes an extraction script and a SKILL.md workflow that guides how the resulting content...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    Gotenberg

    Gotenberg

    A Docker-powered stateless API for PDF files

    Gotenberg provides a developer-friendly API to interact with powerful tools like Chromium and LibreOffice for converting numerous document formats (HTML, Markdown, Word, Excel, etc.) into PDF files, and more! Thanks to Docker, you don't have to install each tool in your environments; drop the Docker image in your stack, and you're good to go! The webhook feature allows you to upload the output file to the destination of your choice. There are many options to fit your requirements, from the...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    carbone

    carbone

    Fast and simple report generator, from JSON to pdf, xslx, docx, odt

    Turn your JSON into PDF, DOCX, XLSX, PPTX, ODS and many more. Fast, Simple and Powerful report generator in any format PDF, DOCX, XLSX, ODT, PPTX, ODS, XML, CSV using templates and your JSON data as input.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 18
    Distributions.jl

    Distributions.jl

    A Julia package for probability distributions and associated functions

    A Julia package for probability distributions and associated functions.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 19
    Crowbook

    Crowbook

    Converts books written in Markdown to HTML, LaTeX/PDF and EPUB

    Crowbook's aim is to allow you to write a book in Markdown without worrying about formatting or typography and let the program generate HTML, PDF and EPUB output for you. Its focus is novels and fiction, and the default settings should (hopefully) generate readable books with correct typography without requiring you to worry about it. To see what Crowbook's output looks like, you can read the Crowbook guide rendered in HTML, PDF or EPUB. Crowbook will parse this file and generate HTML, EPUB,...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    TeXShop

    TeXShop

    TeX previewer for Mac OS X

    TeXShop is a TeX previewer for Mac OS X, written in Cocoa. Since pdf is a native file format on OS X, TeXShop uses "pdftex" and "pdflatex" rather than "tex" and "latex" to typeset in its default configuration; these programs in the standard TeX Live distribution of TeX produce pdf output instead of dvi output. TeXShop uses TeX Live, a standard distribution of Tex programs maintained by the TeX Users Group (TUG) for Mac OS X, Windows, Linux, and various other Unix machines. The distribution...
    Downloads: 11 This Week
    Last Update:
    See Project
  • 21
    pagedown

    pagedown

    Paginate the HTML Output of R Markdown with CSS for Print

    Paginate the HTML Output of R Markdown with CSS for Print. You only need a modern web browser (e.g., Google Chrome or Microsoft Edge) to generate PDF. No need to install LaTeX to get beautiful PDFs. This R package stands on the shoulders of two giants to support typesetting with CSS for R Markdown documents: Paged.js and ReLaXed (we only borrowed some CSS from the ReLaXed repo and didn't really use the Node package).
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    Unredact

    Unredact

    A simple tool for reading in poorly redacted documents

    Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...
    Downloads: 32 This Week
    Last Update:
    See Project
  • 23
    Open Semantic Search

    Open Semantic Search

    Open source semantic search and text analytics for large document sets

    ...It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and enrich the data with extracted information such as named entities and metadata. It also supports optical character recognition to extract text from images and scanned documents, including images embedded inside PDF files. It integrates text mining and analytics capabilities that allow users to examine relationships, topics, and structured data within document collections.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    beautiful-mermaid

    beautiful-mermaid

    Render Mermaid diagrams as beautiful SVGs or ASCII art

    beautiful-mermaid is a styling and rendering toolkit built to produce visually enhanced diagrams from Mermaid syntax, aiming to bridge the gap between simple technical diagrams and rich, presentation-ready visualizations, all while preserving the lightweight text-to-diagram workflow that Mermaid offers. Instead of plain, utilitarian shapes and lines, Beautiful Mermaid applies themes, typography enhancements, color palettes, and layout optimizations so diagrams look polished and professional...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 25
    RenderCV

    RenderCV

    LaTeX CV generator from a YAML/JSON input file

    RenderCV is a LaTeX CV/resume framework. It allows you to create a high-quality CV as a PDF from a YAML file with full Markdown syntax support and complete control over the LaTeX code. RenderCV offers built-in LaTeX and Markdown templates ready to produce high-quality CVs. However, the templates are entirely arbitrary and can easily be updated to leverage RenderCV's capabilities with your custom CV themes.
    Downloads: 9 This Week
    Last Update:
    See Project
Auth0 Logo