pdf data mining free download

53 projects for "pdf data mining" with 2 filters applied:

Filters: Artificial Intelligence, BSD

1

Open Semantic Search

Open source semantic search and text analytics for large document sets

...It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and enrich the data with extracted information such as named entities and metadata. It also supports optical character recognition to extract text from images and scanned documents, including images embedded inside PDF files. It integrates text mining and analytics capabilities that allow users to examine relationships, topics, and structured data within document collections.

Downloads: 3 This Week

Last Update: 11 hours ago
2

JimuReport

Open source drag-and-drop reporting and dashboard builder platform

JimuReport is an open source data visualization and reporting platform designed to help developers and organizations build reports, dashboards, and large screen data displays through a visual interface. It provides an online report designer that uses an Excel-like editing experience, allowing users to construct reports with drag-and-drop components and cell-based layouts. It focuses on simplifying complex report development by enabling visual configuration instead of manual coding....

Downloads: 7 This Week

Last Update: 2026-05-23
3

canvas-editor

Canvas-based WYSIWYG rich text editor with advanced layout tools

...It is designed to provide a WYSIWYG editing experience similar to word processors, enabling precise control over layout, rendering, and document structure. canvas-editor supports a wide range of formatting and document features, including text styling, tables, images, and embedded elements, all managed through a structured data model. Its architecture is modular, allowing developers to extend functionality through plugins, custom commands, and event hooks. It includes support for page-based layouts with headers, footers, pagination, and print-ready output, including PDF generation. It also provides interactive components such as form controls and context menus, making it suitable for building complex document editing systems.

Downloads: 0 This Week

Last Update: 1 day ago
4

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
5

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API

text-extract-api is an open-source service designed to extract readable text from a wide variety of document formats through a simple API interface. The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models. Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction...

Downloads: 5 This Week

Last Update: 2026-03-05
6

DeepSeek Prover V2

Advancing Formal Mathematical Reasoning via Reinforcement Learning

...It also includes a PDF of the paper or project overview and sample formalization datasets. Because theorem proving is a cutting-edge area in LLM research, Prover-V2 is positioned as an effort to push forward formal reasoning for LLMs.

Downloads: 0 This Week

Last Update: 2025-10-03
7

GeoDMA

Geographic feature extraction and data mining

GeoDMA is a plugin for TerraView software, used for geographical data mining. With a single image, the user can perform segmentation, attribute extraction, normalization and classification.

1 Review

Downloads: 3 This Week

Last Update: 2026-01-20
8

LangChain Extract

Did you say you like data?

LangChain Extract is an open-source reference application designed to demonstrate how large language models can be used to extract structured data from unstructured text and document files. The project implements a lightweight web service that allows developers to define extraction schemas and apply them to various sources such as plain text, HTML, or PDF documents. Built using FastAPI and the LangChain framework, the application exposes a REST API that can process documents and return structured outputs that match user-defined JSON schemas. ...

Downloads: 1 This Week

Last Update: 2026-03-09
9

stkpp

C++ Statistical ToolKit

...As a convenience, we provide the source packages on SourceForge. The library offers a dense set of (mostly) template classes in C++ and is suitable for projects ranging from small one-off projects to complete data mining application suites.

Downloads: 0 This Week

Last Update: 2026-03-20
10

UnBBayes

Framework & GUI for Bayes Nets and other probabilistic models.

UnBBayes is a probabilistic network framework written in Java. It has both a GUI and an API with inference, sampling, learning and evaluation. It supports Bayesian networks, influence diagrams, MSBN, OOBN, HBN, MEBN/PR-OWL, PRM, and structure, parameter and incremental learning. Please visit our wiki (https://sourceforge.net/p/unbbayes/wiki/Home/) for more information. Check out the license section (https://sourceforge.net/p/unbbayes/wiki/License/) for our licensing policy.

7 Reviews

Downloads: 8 This Week

Last Update: 2025-11-25
11

General Knowledge Machine Project

Intellect Modeling Kit: assisting research, diagnostics, consulting

...Intellect Modeling Kit (IMK) is intended to build knowledge machines (KM) that assist experts in the following steps of activity:
* Observation;
* Producing propositions based on knowledge;
* Elimination of impossible propositions;
* Selection and verification of the most appropriate propositions;
* Memorizing: creation of new knowledge items;
* Abstraction: building objects that represent typical signs of groups of similar objects, data mining.
KM is not intended to replace human experts; it is built to multiply their abilities. The machine should not be responsible for decisions. IMK is designed to create ready-to-use software applications from simple text files. Any human knowledge can be uploaded to a KM by an expert unfamiliar with software coding. Demos are included in the kit. ...

1 Review

Downloads: 0 This Week

Last Update: 2025-07-27
12

ADAMS

ADAMS is a workflow engine for building complex knowledge workflows.

ADAMS is a flexible workflow engine aimed at quickly building and maintaining data-driven, reactive workflows, easily integrated into business processes. Instead of placing operators on a canvas and manually connecting them, a tree structure and flow control operators determine how data is processed (sequentially or in parallel). This allows rapid development and easy maintenance of large workflows, with hundreds or thousands of operators. Operators include machine learning (WEKA, MOA, MEKA)...

Downloads: 2 This Week

Last Update: 2024-03-21
13

Common Resource Grep - crgrep

Common Resource Grep

CRGREP searches for matching text in databases, various document formats, archives and other difficult-to-access resources. It is a command line tool for name and content text matching in database tables, plain files, MS Office documents, PDF, archives, MP3 audio, image meta-data, scanned documents, maven dependencies and web resources. CRGREP will search resources nested within resources in any combination and to any depth, for example text within a document within a zip archive. Here you will find binary downloads and discussion (https://sourceforge.net/p/crgrep/discussion/). ...

3 Reviews

Downloads: 8 This Week

Last Update: 2023-04-23
14

MANTI

MANTI - Mastering Advanced N-Termini Interpretation

...For a very detailed explanation of script parameters and the evaluation strategy, please consult the extensive manual PDF.

Downloads: 1 This Week

Last Update: 2022-12-01
15

DynaQ

Innovative text document search. http://dynaq.opendfki.de for details.

The goal of DynaQ is to develop an inquiry system for exploring the personal information space, supporting you with the search paradigm 'orienteering'. DynaQ is a (desktop) search engine with enhanced functionality for file, email and blog search. See our GitLab homepage for source code and documentation: http://dynaq.opendfki.de

Downloads: 0 This Week

Last Update: 2021-08-05
16

VIKAMINE

VIKAMINE is a flexible environment for visual analytics, data mining and business intelligence - implemented in pure Java. It features several powerful visualization and mining methods, and can utilize background knowledge.

Downloads: 2 This Week

Last Update: 2021-03-09
17

MANTI.pl / muda.pl

muda.pl - MQ unified data assembler

...For a more thorough explanation of script parameters and evaluation strategy, please consult the extensive manual PDF.

Downloads: 0 This Week

Last Update: 2020-11-11
18

AI Cheatsheets

Essential Cheat Sheets for deep learning and machine learning research

cheatsheets-ai is an open-source repository that collects essential cheat sheets covering many tools and concepts used in machine learning, deep learning, and data science. The project aims to provide quick-reference materials that help engineers, researchers, and students review key techniques and frameworks without reading extensive documentation. It compiles cheat sheets for widely used libraries and technologies such as TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, and...

Downloads: 0 This Week

Last Update: 2026-03-10
19

Siamese and triplet learning

Siamese and triplet networks with online triplet mining in PyTorch

...The repository demonstrates how to train these models using contrastive loss and triplet loss functions, which encourage embeddings of similar samples to be close while pushing dissimilar samples farther apart. It includes data loaders, training scripts, neural network architectures, and evaluation metrics that allow researchers to experiment with different embedding learning strategies. The project also implements online pair and triplet mining techniques to efficiently generate training examples during model training.

Downloads: 0 This Week

Last Update: 2026-03-15
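
The contrastive and triplet losses mentioned in the Siamese and triplet learning entry above can be sketched in a few lines. This is a minimal, dependency-free illustration of one common form of each loss, not the repository's PyTorch code; the function names, the default `margin`, and the "hardest negative" selection rule are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same, margin=1.0):
    """One common contrastive form (assumed here): pull similar pairs
    together, push dissimilar pairs at least `margin` apart."""
    d = euclidean(a, b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hardest_negative(anchor, negatives):
    """Illustrative mining rule (an assumption): pick the negative
    closest to the anchor, i.e. the one most likely to violate the margin."""
    return min(negatives, key=lambda n: euclidean(anchor, n))


if __name__ == "__main__":
    anchor, positive = [0.0, 0.0], [0.0, 1.0]
    negatives = [[3.0, 4.0], [0.0, 1.5]]
    hard = hardest_negative(anchor, negatives)
    print(hard, triplet_loss(anchor, positive, hard))
```

A far-away negative contributes zero loss, which is why mining negatives that are close to the anchor tends to yield more informative training examples.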
20

Fast Frequent Subgraph Mining (FFSM)

This project aims to develop and share fast frequent subgraph mining and graph learning algorithms. Currently we release the frequent subgraph mining package FFSM; later we will add new functions for graph regression and classification.

Downloads: 0 This Week

Last Update: 2017-11-12
21

BioRec:Bird Census field data annotation

Recognizing biological data from a notebook.

This project helps to digitize field data for a certain Bird Census method: a census based on personal inspection of small (~10 km^2) regions, with birds' positions and behaviour recorded on paper. The project makes it easy to annotate such field data and to make it available for statistical analysis.

Downloads: 0 This Week

Last Update: 2017-09-25
22

Bolt ML

10x faster matrix and vector operations

Bolt is an open-source research project focused on accelerating machine learning and data mining workloads through efficient vector compression and approximate computation techniques. The core idea behind Bolt is to compress large collections of dense numeric vectors and perform mathematical operations directly on the compressed representations instead of decompressing them first. This approach significantly reduces both memory usage and computational overhead when working with high-dimensional data commonly used in machine learning systems. ...

Downloads: 0 This Week

Last Update: 2026-03-15
23

MYRA

A collection of ACO algorithms for the data mining classification task

MYRA is a collection of Ant Colony Optimization (ACO) algorithms for the data mining classification task. It includes popular rule induction and decision tree induction algorithms. The algorithms are ready to be used from the command line or can be easily called from your own Java code. They are built using a modular architecture, so they can be easily extended to incorporate different procedures and/or use different parameter values.

Downloads: 7 This Week

Last Update: 2017-06-22
24

PyDaMelo

Python-compatible Data mining elementary objects

An attempt to offer machine learning and data mining algorithms at the finest granularity we can, as Lego-like bricks that are easy to combine through Python scripting.

Downloads: 0 This Week

Last Update: 2019-02-19
25

GUI Ant-Miner

GUI Ant-Miner is a tool for extracting classification rules from data. It is an updated version of a data mining algorithm called Ant-Miner (Ant Colony-based Data Miner).

Downloads: 1 This Week

Last Update: 2016-09-17