Search Results for "pdf data mining" - Page 3

Showing 893 open source projects for "pdf data mining"

  • 1
    Umbrel

    A beautiful personal server OS for Raspberry Pi or any Linux distro

    ...They’re a part of your private life, and now they can all be stored by you, in your home, on your Umbrel. The Bitcoin network is made up of thousands of nodes that verify every single transaction in the blockchain. Some of them mine Bitcoin too, but unlike a mining node, running a non-mining node doesn’t require expensive hardware. Achieve unparalleled privacy by connecting your wallet directly to the Bitcoin node on your Umbrel.
    Downloads: 25 This Week
  • 2
    tableExport.jquery.plugin

    jQuery plugin to export an HTML table to JSON, XML, CSV, TSV, TXT, SQL

    jQuery plugin to export an HTML table to JSON, XML, CSV, TSV, TXT, SQL, Word, Excel, PNG, and PDF.
    Downloads: 0 This Week
  • 3
    dvisvgm

    A fast DVI, EPS, and PDF to SVG converter

    The command-line utility dvisvgm is a tool for TeX/LaTeX users. It converts DVI, EPS, and PDF files to the XML-based vector graphics format SVG. In contrast to bitmap graphics, vector graphics are arbitrarily scalable without loss of quality. All modern web browsers support a large part of the current SVG 1.1 standard. SVG files can also be displayed with the Java-based Squiggle SVG browser, which is part of the Apache Batik project, and with the free vector graphics editor Inkscape.
    Downloads: 1 This Week
  • 4
    Compose.jl

    Declarative vector graphics

    Compose is a declarative vector graphics library written in Julia. It is designed to simplify the creation of complex graphics and serves as the basis of Gadfly, the statistical graphics and data visualization package.
    Downloads: 9 This Week
  • 5
    OmniTools

    Self-hosted collection of powerful web-based tools for everyday tasks

    ...It’s designed to replace the random assortment of “free online tools” people use for quick tasks, while avoiding ads, tracking, and the need to upload sensitive files to unknown servers. A key design choice is that file processing happens entirely on the client side, meaning your data stays in your browser instead of being sent to the backend. The tool catalog spans both technical and non-technical needs, including image, video, audio, PDF, text, date/time, math, and data format utilities like JSON/CSV/XML helpers. It’s also packaged for straightforward self-hosting, with a lightweight Docker image and simple run commands, so it can be deployed quickly on a homelab or internal network.
    Downloads: 14 This Week
  • 6
    Shower Presentation Template

    Shower HTML presentation engine

    Shower Presentation Template is an HTML presentation engine built on HTML, CSS, and vanilla JavaScript that works in all modern browsers. Themes are separate from the engine, and presentations are fully keyboard accessible and printable to PDF. It includes the Ribbon and Material themes and a core with plugins. You’ll need Node.js installed on your computer. The latest stable versions of Chrome, Edge, Firefox, and Safari are supported.
    Downloads: 2 This Week
  • 7
    Colly

    Elegant Scraper and Crawler Framework for Golang

    Colly provides a clean interface for writing any kind of crawler, scraper, or spider. With Colly you can easily extract structured data from websites for a wide range of applications, such as data mining, data processing, or archiving. Features: a clean API; speed (>1k requests/sec on a single core); management of request delays and maximum concurrency per domain; automatic cookie and session handling; sync/async/parallel scraping; distributed scraping; caching; automatic encoding of non-Unicode responses. ...
    Downloads: 6 This Week
  • 8
    zpdf

    Zero-copy PDF text extraction library written in Zig

    zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches.
    Downloads: 2 This Week
  • 9
    Crowbook LaTeX

    Converts books written in Markdown to HTML, LaTeX/PDF and EPUB

    Crowbook's aim is to allow you to write a book in Markdown without worrying about formatting or typography and let the program generate HTML, PDF and EPUB output for you. Its focus is novels and fiction, and the default settings should (hopefully) generate readable books with correct typography without requiring you to worry about it.
    Downloads: 1 This Week
  • 10
    Luxor

    Simple drawings using vector graphics; Cairo "for tourists!"

    Luxor is a Julia package for drawing simple static 2D vector graphics. It provides basic drawing functions and utilities for working with shapes, polygons, clipping masks, PNG and SVG images, turtle graphics, and simple animations. The focus of Luxor is on simplicity and ease of use: it should be easier to use than plain Cairo.jl, with shorter names, fewer underscores, default contexts, and simplified functions. For more complex and sophisticated graphics in 2D and 3D, Makie.jl is the best...
    Downloads: 2 This Week
  • 11
    Docling

    Get your documents ready for gen AI

    Docling is an open-source document processing toolkit built to prepare diverse content types for modern generative AI and data workflows. The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines.
    Downloads: 10 This Week
  • 12
    Scrapy

    A fast, high-level web crawling and web scraping framework

    ...It can be used for data mining, monitoring and automated testing.
    Downloads: 20 This Week
  • 13
    KOReader

    An ebook reader application supporting PDF, DjVu, EPUB, FB2, etc.

    KOReader is a document viewer for E Ink devices. Supported file formats include EPUB, PDF, DjVu, XPS, CBT, CBZ, FB2, PDB, TXT, HTML, RTF, CHM, DOC, MOBI and ZIP files. It runs on embedded devices (Cervantes, Kindle, Kobo, PocketBook, reMarkable), Android, and desktop Linux. Developers can run a KOReader emulator on Linux and macOS. Multilingual user interface with a highly customizable reader view and many typesetting...
    Downloads: 106 This Week
  • 14
    circuitikz

    CircuiTikZ TeX/LaTeX package for drawing circuits

    This package provides a set of macros on top of TikZ for naturally typesetting electrical and electronic networks. It was born mainly for writing Massimo Redaelli's exercise book and exam sheets for the Elettrotecnica courses at Politecnico di Milano, Italy. He wanted a tool that was easy to use, with a lean syntax, native to LaTeX, and supporting direct PDF output format. circuitikz is included with the most common LaTeX systems, so it should work out of the box. Anyway, the main dependency...
    Downloads: 15 This Week
  • 15
    node-canvas

    Node canvas is a Cairo backed Canvas implementation for NodeJS

    ...For API documentation, please visit Mozilla Web Canvas API. (See Compatibility Status for the current API compliance.) All utility methods and non-standard APIs are documented. When MIME data is tracked, PDF canvases can embed JPEGs directly into the output, rather than re-encoding into PNG. This can drastically reduce filesize and speed up rendering. If working with a non-PDF canvas, image data must be tracked, otherwise the output will be junk.
    Downloads: 10 This Week
  • 16
    OCRBase

    MD/.JSON Document OCR and structured data extraction API

    OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...
    Downloads: 4 This Week
  • 17
    Career-Ops

    AI-powered job search system built on Claude Code

    Career Ops is an open-source platform designed to help individuals manage their job search process with a structured, operations-style approach that treats career development like a pipeline. It provides a system for organizing job applications, tracking progress across different stages, and maintaining visibility into opportunities, much like a lightweight CRM tailored for job seekers. The project emphasizes clarity and accountability, enabling users to monitor applications, follow-ups, and...
    Downloads: 4 This Week
  • 18
    PGFPlotsX.jl

    Plots in Julia using the PGFPlots LaTeX package

    PGFPlotsX is a Julia package to generate publication quality figures using the LaTeX library PGFPlots. It is similar in spirit to the package PGFPlots.jl but it tries to have a very close mapping to the PGFPlots API as well as minimize the number of dependencies. The fact that the syntax is similar to the TeX version means that examples from Stack Overflow and the PGFPlots manual can easily be incorporated in the Julia code.
    Downloads: 7 This Week
  • 19
    Geziyor

    Blazing fast Go framework for web crawling and data scraping tasks

    ...It is designed to help developers crawl websites and extract structured information from web pages efficiently. It focuses on speed and scalability, allowing large numbers of requests to be processed concurrently. Geziyor supports use cases such as data mining, monitoring web content, and automated testing workflows. It provides a flexible architecture where developers define parsing functions that process responses and extract the desired data. Geziyor includes features for managing requests, handling cookies, respecting robots rules, and exporting collected data in multiple formats. ...
    Downloads: 6 This Week
  • 20
    Holochain

    The current, performant & industrial strength version of Holochain

    Holochain is a post-blockchain framework for building agent-centric, distributed applications. Instead of using global consensus, Holochain enables each agent (user) to maintain their own local state while validating actions with a shared set of rules. This allows for scalable, secure, and resilient apps where data is owned and controlled by users. Ideal for social apps, cooperatives, and data sovereignty platforms, Holochain focuses on enabling collaboration without central servers or...
    Downloads: 5 This Week
  • 21
    Dawarich

    Self-hostable alternative to Google Timeline

    Dawarich is a self-hostable web application for recording and visualizing your location history, offered as an alternative to Google Timeline.
    Downloads: 4 This Week
  • 22
    changedetection.io

    The best free open source website change detection and restock service

    Loved by smart shoppers, data journalists, research engineers, data scientists, security researchers, and more. From simply monitoring website pages that have a change (such as watching prices, and restocking notifications), to deep inspection such as PDF text support, JSON and XML monitoring, and extensive text triggers. Monitor out-of-stock products and get alerts when those products are back in stock, get restock alerts via Discord, Slack, email, and many other platforms. ...
    Downloads: 5 This Week
  • 23
    Income Tax Portal

    An automated tool to fetch data from income tax websites

    ...Basically, anyone who wants to view all the information about multiple PANs in one unified dashboard. Fast, intuitive search. All reporting needs are covered. One-click data fetching from the income tax portal for all PANs, including all PDF files (i.e. notices, challans, attachments). Simple, easy-to-use interface to track demand, e-proceedings, return status, and notices. Built-in help on the jargon used by the income tax portal.
    Downloads: 1 This Week
  • 24
    Documind

    Open-source platform for extracting structured data from documents

    Documind is an advanced document processing tool that leverages AI to extract structured data from PDFs. It is built to handle PDF conversions, extract relevant information, and format results as specified by customizable schemas.
    Downloads: 7 This Week
  • 25
    ustcthesis

    LaTeX template for USTC thesis

    Official LaTeX thesis template maintained by USTC TeX Users Group for undergraduate and graduate theses at University of Science and Technology of China, strictly adhering to formatting guidelines updated as of December 2024. Compatible with major TeX distributions.
    Downloads: 9 This Week