NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
LIMO stands for Lucene Index Monitor. It is a web application that gives basic information about indexes used by the Lucene search engine (http://lucene.apache.org). It allows you to browse and search the index, and reconstruct stored fields.
Classifier4J is a Java library that provides an API for automatic classification of text. The default (and only current) implementation of this API is a Bayesian classifier. This library can be used for multiple purposes - as a spam filter or a blog cl
TouchGraph provides a set of interfaces for graph visualization using force-based layout and focus+context techniques. For now only older code is available, but we are planning to release new versions as well.
Framework for search and display of heterogeneous document collections.
The eXtensible Text Framework (XTF) is an architecture that supports searching across collections of heterogeneous textual data (XML, PDF, HTML, text, and more), and the presentation of results and documents in a highly configurable manner. Includes highly customized versions of the proven open-source components Lucene and Saxon.
Java API for creating Rich Site Summary (RSS) feed files. Created for people who want to create RSS files from within their applications but don't want to get into the nitty-gritty of working out XML specs.
Lucene Server is a Java server application for easily creating and managing Jakarta Lucene indexes. It is designed to help you integrate Lucene into distributed environments.
IGLU is a Java class library designed to facilitate sharing of code among Artificial Intelligence/Information Retrieval researchers to illustrate how various problems can be solved in Java. It is developed and maintained by the IGLU Research Group.
TM4J is a topic map engine implemented entirely in Java. Topic maps are a standard paradigm for the interchange of knowledge structures. This project aims to produce a complete suite of tools for creating, processing and publishing topic map information.
arachne is a C++ library for HTTP crawling and for link, text, and metadata extraction, designed to run in a distributed environment.
This project will implement DAV Searching & Locating (DASL), an application of HTTP/1.1 that forms a lightweight search protocol for transporting queries and result sets and allows clients to make use of server-side search facilities.
DBPrism is a framework for generating dynamic XML from a database; it provides a high-performance DBGenerator for Cocoon2. It is also a J2EE replacement for Oracle mod_plsql. This project also includes a Restlet-Oracle connector exam. and Lucene Domain In
This project is an attempt to create a database suitable for storing book reviews, and links to book reviews found on the internet and elsewhere. The intent is to re-use Apache components wherever possible - including Xindice XML database, Cocoon2 XM
Alternative web server technology for publication, s and searching.
A SOAP-based document/file-sharing solution written in Java. It includes a basic web interface, but other clients are possible. You can share and download all common office document formats, such as MS Word, Excel, OpenOffice, and PDF.
Harkat is a social media search platform that aggregates user-generated content from across the web into a single stream of information.
This project is a Dmoz RDF parser and utilities to allow you to manipulate, display, and navigate the Dmoz RDF data on your web site. It will make use of software at jakarta.apache.org and xml.apache.org to display the data and will attempt to tightly int
Java program that extracts postings and comments from http://www.livejournal.com (blog) into a database so you can view, classify, and process them. LJ loader. Reusable components: a Perl-like but efficient web page scraper, a tree analyzer, and a concurrent scheduler.
A servlet that generates HTML indexes for your music collection, for easy browsing and listening to your remote Vorbis files ;) Uses XSLT for the stylesheet.
The OpenBorges project intends to provide a humble place to experiment with, and debate, what an open, distributed, adaptive, and collaborative semantic virtual library can be. Inspirations are: As We May Think, Library of Babel, and Weaving the web
OpenPipe is a scalable platform for manipulating a stream of documents. PipeLines are assembled from building blocks that each perform an atomic operation on documents, such as language detection, field manipulation, POS tagging, entity extraction, or posting to Solr.
A threaded Web graph (Power law random graph) generator written in Python. It can generate a synthetic Web graph of about one million nodes in a few minutes on a desktop machine. It implements a threaded variant of the RMAT algorithm.
RewriteFilter is a Java servlet filter that tries to solve the very common problem of pages not being well represented in search engines. Indexers consider pages containing ? too transient. See the Home Page for more info.
Develop a Java API (JAR library, with an example web GUI) for content management. Simple but powerful and based on the Apache Lucene project, it would be embedded in projects requiring content management.