This project is a compilation of tools/libraries to help with tasks related to Text Analytics mainly in Java. These tools range from simple wrappers to sophisticated mining tasks that can improve the productivity of researchers and engineers.
OpenDMAP (Open Source Direct Memory Access Parser) is a natural language processing (text mining) application: a semantic parser for information extraction.
Open extensible system analysis report tool for Java, based on numerous open source analysis initiatives. The XML/XSL batch-processing framework produces integrated HTML/SVG reports of the systems current state and the development over time.
OCR c++ library. Include: contour recognition; vectorisation; matrix letter feature recognition; auto page segmentation and detect rotation; SS3 ASM core; XML base; web-based GUI; 99,6% printed Unicode text recognition; letter base up to 1200 letters.
Run everything from popular models with on-demand NVIDIA L4 GPUs to web apps without infrastructure management.
Run frontend and backend services, batch jobs, host LLMs, and queue processing workloads without the need to manage infrastructure. Cloud Run gives you on-demand GPU access for hosting LLMs and running real-time AI—with 5-second cold starts and automatic scale-to-zero so you only pay for actual usage. New customers get $300 in free credit to start.
We are using a large archive of newspaper stories(GigaWordCorpus) as input to a parallel MPI program, and produce from that a list of top R terms of varying lengths M through N that are especially interesting.
The program is done in C using MPI.
TetraPack is a package with Delphi components for the TextTransformer by Dr. Detlef Meyer-Eltz. The components make it easy to parse and transform strings and files, or to build an parse tree from them.
n-squared is a light weight, super powered note pad application that stores notes in an embedded database for easy searching. It has a tabbed interface, syntax highlighting, encryption, and more!
The Java Text Categorizing Library (JTCL) is a pure java implementation of libTextCat which in turn is "a library that was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy."
hypKNOWsys aims at developing a Java-based workbench for knowledge discovery and knowledge management. Currently, hypKNOWsys has released two intermediate tools: DIAsDEM Workbench (text mining for semantic tagging) and WUMprep (Web mining pre-processing)
Java Expert Rule Based Inference Language. Jerbil is an open source rule processing engine written in Java. Currently Jerbil supports a full set of processing functions with text-based and XML interfaces; a Java interface is planned.
Flesh is a Java application designed to analyze a document (plain text, rich text, Word documents, and PDFs) and display the difficulty associated with comprehending using the Flesch-Kincaid Grade Level and the Flesch Reading Ease Score.
The Word Vector Tool is a simple but flexible Java library to create word vector representations of text documents. Word vectors can be used for various textprocessing tasks, as text classification, text clustering or information retrieval.
JTextPro: A Java-based TextProcessing tool that includes sentence boundary detection (using maximum entropy classifier), word tokenization (following Penn conventions), part-of-speech tagging (using CRFTagger), and phrase chunking (using CRFChunker).
Refal.NET - Versatile, Compact yet Powfull Text Transformer and Compiler-Writing System. Based on Refal.NET Virtual Machine (+Refal.NET Compiler), this RAD-tool might be used for rapid prototyping, decreasing up to 10 times development efforts.
AutoSummary uses Natural Language Processing to generate a contextually-relevant synopsis of plain text. It uses statistical and rule-based methods for part-of-speech tagging, word sense disambiguation, sentence deconstruction and semantic analysis.
The "Universal Content Evaluation and Categorisation Software" is a program for analysing a websites, or more generally, a texts content. The text is arranged in dozens of categories, permitting more efficient web searches and information processing.
a cross-platform application to decode, search, browse, view, print, and export TLG/PHI BetaCode texts. Project is currently being ported from wxWindows to Java. (For more info, see the project homepage at http://wxtlg.sourceforge.net)
A knowledgment management system written in Java under JBoss 4.2.3 Server, with richfaces 3.3.0BETA4. Including fileconversion from html to pdf and rich:editor component without special syntaxing.