Showing 26 open source projects for "pdf data mining"

  • 1
    ProM is the comprehensive, extensible framework for process mining. Process Mining deals with the a-posteriori analysis of (business) processes using enactment logs.
    Downloads: 45 This Week
  • 2
    WebHarvest - web data extraction tool
    Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.
    Downloads: 4 This Week
  • 3
    The Lemur Project

    Search engine and data mining applications and ClueWeb datasets.

    The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine in C++, the Galago search engine research framework in Java, the RankLib learning to rank library, ClueWeb09 and ClueWeb12 datasets and the Sifaka data mining application.
    Downloads: 143 This Week
  • 4

    ConcatPDF

    PDF Concatenation Tool

    ConcatPDF is a tool for concatenating PDF files. It can concatenate, extract, encrypt, decrypt, and configure PDF files, and convert image files to PDF. Both GUI and CUI versions are available. iText.NET is a port of iText to the .NET Framework using J#. This library allows you to generate PDF, (X)HTML, XML, and RTF files on the Microsoft .NET Framework, including ASP.NET.
    Downloads: 39 This Week
  • 5
    iText®, a JAVA PDF library

    PDF Library for Developers

    iText is an open-source PDF library available for Java and .NET (C#). iText allows you to effortlessly generate and manipulate standards-compliant PDF documents with a powerful and feature-rich SDK. With iText, you can create archivable and accessible PDFs, split and merge documents, fill and flatten forms, digitally sign documents, and more. iText add-ons enable additional functionality, such as PDF creation from HTML templates, secure redaction, OCR, and much more. The latest...
    Downloads: 134 This Week
  • 6

    eXtensible Text Framework (XTF)

    Framework for search and display of heterogeneous document collections.

    ...Please visit https://github.com/cdlib/xtf for the latest updates. Obsolete Description: The eXtensible Text Framework (XTF) is an architecture that supports searching across collections of heterogeneous textual data (XML, PDF, HTML, text, and more), and the presentation of results and documents in a highly configurable manner. Includes highly customized versions of the proven open-source components Lucene and Saxon.
    Downloads: 0 This Week
  • 7
    Aspose Java for Liferay

    Provides export options for blogs, journals and dynamic lists

    This is a Liferay CMS / Portal plugin released by Aspose Pty Ltd. Aspose.Total Java for Liferay (a hook plugin app) exports web content and blogs created in HTML to MS Word, MS Excel, and PDF file formats using the Aspose.Total Java APIs (Aspose.Words, Aspose.Cells, and Aspose.PDF). The plugin can also export Dynamic Data Lists to the same formats using the same APIs.
    Downloads: 0 This Week
  • 8
    Regain is a Java search engine based on Jakarta Lucene. It indexes and searches files in many formats (HTML, XML, doc(x), xls(x), ppt(x), oo, PDF, RTF, mp3, mp4, Java). A TagLibrary eases integrating search results into your JSP-based web page.
    Downloads: 6 This Week
  • 9
    Framework for text mining, data integration and data analysis. Keywords: ontology and graph alignment, relation mining, warehouse, semantic database integration, bioinformatics, systems biology, microarray, Java.
    Downloads: 1 This Week
  • 10

    SquidCube

    Squid log data warehouse

    Feed Squid logs into PostgreSQL database, then use Pentaho BI server for data mining.
    Downloads: 0 This Week
  • 11
    File Type Checker checks the file data to determine the actual file type. As of this writing, filetypechecker supports doc, rtf, xls, pdf, jpg, jpeg, and gif. More file support will be added soon.
    Downloads: 0 This Week
  • 12
    Calenco XML CMS
    Calenco is a collaborative Web platform that enables remote teams of writers, proofreaders, graphic designers, translators, etc. to produce XML documents together, such as user guides and security procedures.
    Downloads: 0 This Week
  • 13
    Inforama - Document Automation. Document templates, generation and distribution. Create letter templates using OpenOffice and import existing Acrobat forms. Merge data to produce high quality PDF documents and automatically email, print and view.
    Downloads: 0 This Week
  • 14
    TAXOMO
    Data mining tool for sequences (e.g. trajectories on a map, visited web pages, etc.) that creates a succinct description of the sequences, given a taxonomy (e.g. regions and sub-regions in the map, categories and sub-categories of pages, etc.).
    Downloads: 0 This Week
  • 15
    WSRF-compliant tools and services for data mining in grid computing environments, based on: Globus Toolkit 4, Condor and Triana workflow system. Learn more at: http://www.datamininggrid.org Copyright (c) 2008 DataMiningGrid Consortium.
    Downloads: 0 This Week
  • 16
    Automatically embed Wikipedia topic information into PDF documents via pop-up annotations. This relies on the Wikipedia Miner service, which is also available on SourceForge.
    Downloads: 0 This Week
  • 17
    Webstats Solr is an attempt to make Apache access logs easier to data mine. By adding a powerful search engine (Solr) as a backend and using JavaScript, HTML, and maybe PHP, I hope to outdate AWStats.
    Downloads: 0 This Week
  • 18
    The ProM Import Framework allows you to extract process enactment event logs from a set of information systems. These can be exported in the MXML format, which is the standard event log data format for Process Mining analysis techniques.
    Downloads: 1 This Week
  • 19
    JODConverter automates conversions between office document formats using OpenOffice.org. Supported formats include OpenDocument, PDF, RTF, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a Web application.
    Downloads: 9 This Week
  • 20
    Crow - Computational Representation Of Whatever. A platform for the integration and mining of complex and distributed data. Represents cross-linked semantic web documents as a network of software objects and offers easy ways to filter and sort them.
    Downloads: 0 This Week
  • 21
    This project intends to create an indexing search engine for knowledge management. The primary objective is to apply an information retrieval core and to implement a knowledge discovery approach, such as data mining or text mining algorithms.
    Downloads: 0 This Week
  • 22
    The Nheengatu Project is a Java library that provides an HTML markup abstraction, allowing you to reuse the same markup to generate PDF files, OpenOffice documents, image files, etc. The goal of this project is to maximize the use of HTML markup procedures.
    Downloads: 0 This Week
  • 23
    Yawda is an MVC Web development application framework based on Struts. With it, you can easily output HTML, SVG, PNG, JPEG, RTF, and PDF with data from several sources. It uses Cayenne, Rhino, iText, Batik, Struts, Velocity, and Jakarta Commons.
    Downloads: 0 This Week
  • 24
    webExtractor is a Java application for extracting specific content from web-based HTML, XML, CSV, and free-form text. The extracted data can be used for data gathering and mining purposes.
    Downloads: 3 This Week
  • 25
    Harvestman is a context-aware metasearch engine that functions as a universal information gatherer and data mining system for the internet.
    Downloads: 0 This Week