Search engines, data mining applications, and the ClueWeb datasets.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software. These include the Indri search engine in C++, the Galago search engine research framework in Java, the RankLib learning-to-rank library, the ClueWeb09 and ClueWeb12 datasets, and the Sifaka data mining application.
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mailboxes, ...) and the file formats (documents, images, ...) occurring in these systems.
Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.
Simple yet feature-rich Document Management System
XODA is a KISSed (Keep It Simple, Stupid) System for Organizing Documents using AJAX. It is a Document Management System with no backend database, yet it still lets you organize files and directories by descriptions, filters, and more. Visit xoda.org
A collection of DokuWiki plugins that add spatial features to a wiki. Currently included: openlayersmap (a map) and geotag (ways of geotagging a page).
Framework for search and display of heterogeneous document collections.
The eXtensible Text Framework (XTF) is an architecture that supports searching across collections of heterogeneous textual data (XML, PDF, HTML, text, and more), and the presentation of results and documents in a highly configurable manner. Includes highly customized versions of the proven open-source components Lucene and Saxon.