Web-as-corpus tools in Java.
* Simple Crawler (and also integration with Nutch and Heritrix)
* HTML cleaner to remove boiler plate code
* Language recognition
* Corpus builder
Open Source Semantic Web Search Engine Software: If two machines anywhere on the web can agree on the same definition of a digital service or digital good, then machine to machine transactions can use this lingua franca to transact on the users behalf.
This project intends to create an indexing search engine, for knowledge management. The primary object is to apply an information retrieval core. And implement a knowledge data discovery theory such as data mining algorithm, text mining.