...First, over-training and second, shortage of data for a training of categories. Instead, each TXT file is a category on its own, rather than an assigned category. In a way, this is similar to clustering but not really a clustering algorithm since there is some training involved. The summarizer from Classifier4J has been adjusted to accept two inputs (lets call them A and B). Then, the summarizer gets trained with A to summarize a document B, and vice versa. This extracts a relevant structure for both documents (and thus avoids the over-training) which are then compared using the Vector-Space analysis to give a range of belonging of one document to another (and thus avoids the shortage of information). ...