Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review

> Matt Collier wrote:
> 
>> See attached, you will need Xerces and NekoHTML in your classpath.  

Just a note: there is another option for dealing with HTML tag soup 
(for example, when you need to clean up MS Word HTML). NekoHTML seems 
to do the same thing, but there is also a library called JTidy 
(http://lempinen.net/sami/jtidy/), based on the code of W3C Tidy 
(http://www.w3.org/People/Raggett/tidy/). Since I have not worked with 
NekoHTML, I cannot compare the two, but I can say that JTidy is a 
pretty good library. It can be used as a JavaBean 
(http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153), 
and it has a very nice option: draconianWord2000Cleaning 
(http://www.w3.org/People/Raggett/tidy/#word2000). I have used it for 
exactly this kind of thing. Also, it is not bound to a specific Xerces 
version.