Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: moedusa <mo...@in...> - 2003-11-16 10:38:41
> Matt Collier wrote:
>
>> See attached, you will need Xerces and NekoHTML in your classpath.

Just a note: there is one more option for dealing with HTML soup (when you need to clean up MS Word HTML, for example). NekoHTML seems to do the same thing, but there is another library called JTidy (http://lempinen.net/sami/jtidy/), based on the code of W3C Tidy (http://www.w3.org/People/Raggett/tidy/).

Since I have not worked with NekoHTML, I cannot compare the two, but I must say that JTidy is a pretty good library. It can be used like a JavaBean (http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153), and it has a very nice option: draconianWord2000Cleaning (http://www.w3.org/People/Raggett/tidy/#word2000). I have used it for exactly this kind of thing. Also, it is not tied to a specific Xerces version.