YouSeer is an open source search engine framework built on top of other open source components. It is part of the general SeerSuite framework. YouSeer uses Heritrix as its crawler and Solr as its indexing system.
Framework for search and display of heterogeneous document collections.
The eXtensible Text Framework (XTF) is an architecture that supports searching across collections of heterogeneous textual data (XML, PDF, HTML, text, and more), and the presentation of results and documents in a highly configurable manner. Includes highly customized versions of the proven open-source components Lucene and Saxon.
This project is an attempt to create a database suitable for storing book reviews, and links to book reviews found on the internet and elsewhere. The intent is to re-use Apache components wherever possible - including Xindice XML database, Cocoon2 XM
This forum software is a Java-based discussion forum that uses JDBC to store data in a database. It is available in several languages and has features for easy integration into a site and easy forum administration.
The "mimor" project is a Java-based software project to implement a meta search engine system that relies on relevance feedback data in order to improve the quality of the retrieved documents based on query, user and document parameters.
Syncato is a Weblog Web Services system built on top of Berkeley DB XML, Webware and Python. It has a number of unique features: XPath access to all content via URLs, XSLT presentation and an extremely flexible database structure.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
OpenPipe is a scalable platform for manipulating a stream of documents. PipeLines are created from building bricks doing atomic operations on documents, like language detection, field manipulation, POS tagging, entity extraction or posting to Solr.
arachne is a C++ library for HTTP crawling and for link, text and metadata extraction, designed to run in a distributed environment.
A threaded Web graph (Power law random graph) generator written in Python. It can generate a synthetic Web graph of about one million nodes in a few minutes on a desktop machine. It implements a threaded variant of the RMAT algorithm.
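The RMAT idea mentioned above can be illustrated with a minimal, single-threaded sketch. It is not this project's code: the function names are hypothetical, the threading is left out, and the quadrant probabilities (0.57, 0.19, 0.19, 0.05) are an assumed default rather than values taken from the project. Each edge is placed by repeatedly choosing one of the four quadrants of the adjacency matrix, which is what produces the skewed, power-law-like degree distribution.

```python
import random


def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
    """Pick one directed edge in a graph with 2**scale nodes.

    At each of the `scale` levels, one quadrant of the adjacency matrix
    is chosen with probabilities a, b, c and d = 1 - a - b - c
    (assumed values, not taken from the project described above).
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass            # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1        # top-right quadrant
        elif r < a + b + c:
            src |= 1        # bottom-left quadrant
        else:
            src |= 1        # bottom-right quadrant
            dst |= 1
    return src, dst


def rmat_graph(scale, n_edges, seed=None):
    """Return a set of n_edges distinct directed edges (src, dst)."""
    if n_edges > 4 ** scale:
        raise ValueError("more edges requested than the matrix can hold")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:   # duplicates are dropped by the set
        edges.add(rmat_edge(scale, rng))
    return edges
```

For example, rmat_graph(scale=20, n_edges=5_000_000, seed=1) would describe a graph of about one million nodes; a threaded variant could generate disjoint batches of edges in parallel and merge them afterwards.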
TM4J is a topic map engine implemented entirely in Java. Topic maps are a standard paradigm for the interchange of knowledge structures. This project aims to produce a complete suite of tools for creating, processing and publishing topic map information.
LIMO stands for Lucene Index Monitor. It is a web application that gives basic information about indexes used by the Lucene search engine (http://lucene.apache.org). It allows you to browse and search the index, and reconstruct stored fields.
Catacomb is a WebDAV repository module for use with the Apache WebDAV module, mod_dav. Apache mod_dav parses WebDAV and DeltaV protocol requests into operations on a repository providing persistent storage of resources and their properties.
Classifier4J is a Java library that provides an API for automatic classification of text. The default (and only current) implementation of this API is a Bayesian classifier. This library can be used for multiple purposes - as a spam filter or a blog cl
DBPrism is a framework to generate dynamic XML from a database. It provides a high-performance DBGenerator for Cocoon2 and is also a J2EE replacement for Oracle mod_plsql. This project also includes a Restlet-Oracle connector exam. and Lucene Domain In
Harkat is a social media search platform that aggregates user-generated content from across the web into a single stream of information.
IGLU is a Java class library designed to facilitate sharing of code among Artificial Intelligence/Information Retrieval researchers to illustrate how various problems can be solved in Java. It is developed and maintained by the IGLU Research Group.
TouchGraph provides a set of interfaces for graph visualization using force-based layout and focus+context techniques. For now only older code is available, but we are planning to release new versions as well.
A configurable knowledge management framework. It works out of the box, but it is meant mainly as a framework for building complex information retrieval and analysis systems. The three major components (Crawler, Analyzer and Indexer) can also be used separately.
LJ loader: a Java program that extracts postings and comments from http://www.livejournal.com (blog) into a database and lets you view, classify and process them. Reusable components: a Perl-like but efficient web page scraper, a tree analyzer and a concurrent scheduler.
A project to develop a Java API (JAR library, with an example web GUI) for content management. Simple but powerful and based on the Apache Lucene project, it is meant to be embedded in projects requiring content management.
Lucene Server is a Java server application for easily creating and managing Jakarta Lucene indexes. It is designed to help you integrate Lucene in distributed environments.
Java API for creating Rich Site Summary (RSS) feed files. Created for people who want to create RSS files from within their applications but don't want to get into the nitty gritty of working out XML specs.
RewriteFilter is a Java servlet filter that tries to solve a very common problem: pages not being well represented in search engines. Indexers consider pages whose URLs contain ? too transient. See the Home Page for more info.
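The general idea of hiding query strings from indexers can be sketched as a pair of URL mappings. This is only an illustration in Python, not RewriteFilter's actual behaviour or API: the /key/value path scheme and both function names are assumptions made for the example.

```python
from urllib.parse import (parse_qsl, quote, unquote, urlencode,
                          urlsplit, urlunsplit)


def to_path_url(url):
    """Rewrite .../page?k=v&k2=v2 as .../page/k/v/k2/v2 (hypothetical scheme)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query)  # parameters with blank values are dropped
    if not pairs:
        return url
    extra = "/".join(
        f"{quote(k, safe='')}/{quote(v, safe='')}" for k, v in pairs
    )
    path = parts.path.rstrip("/") + "/" + extra
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))


def to_query_url(url, base_path):
    """Inverse mapping: turn the path segments after base_path back into a query."""
    parts = urlsplit(url)
    if not parts.path.startswith(base_path):
        return url
    rest = parts.path[len(base_path):].strip("/").split("/")
    pairs = [(unquote(k), unquote(v)) for k, v in zip(rest[0::2], rest[1::2])]
    return urlunsplit((parts.scheme, parts.netloc, base_path,
                       urlencode(pairs), parts.fragment))
```

A filter built on this idea would emit the path-style form in generated links and apply the inverse mapping to incoming requests before handing them to the application.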