[Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: Matt C. <MCo...@my...> - 2003-11-15 07:27:14
See attached. You will need Xerces and NekoHTML in your classpath. Run TestHTMLDOM and pass either a file name or an HTTP URL as an argument.

Although it took me a while to figure out how Xerces works, I think this is an excellent solution. Very flexible. As for implementation, you tell me.

This particular code only leaves in the following items:

  content of meta tags
  alt text of images
  plain text

It's a cinch to configure alternative parameters.

The current output has carriage returns, line feeds and spaces a-plenty. Anybody have a good way of cleaning this mess up?

I'm thinking the thing to do would be to replace all the System.out.println calls with a call to some other method. Do we already have an appropriate method in place for this? Do we need a new one?

How will this code integrate into c4J? How are we going to get this data into the stop-list-->stemmer?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
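
One possible shape for the "some other method" idea, as a rough sketch rather than the attached code: instead of printing each piece, the DOM walk appends it to a buffer, and a single pass at the end collapses every run of carriage returns, line feeds, tabs and spaces into one space. The class and method names (HtmlTextCollector, collect, normalize, extract, parseXhtml) are hypothetical and not part of Classifier4J. The sketch uses only the org.w3c.dom interfaces, so it should work on a Document from any DOM parser. parseXhtml uses the JDK parser on well-formed markup purely so the example runs without extra jars.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; names are illustrative and not taken from Classifier4J.
class HtmlTextCollector {

    // Recursively walks the tree, appending meta content, image alt text,
    // and plain text nodes to the buffer instead of printing them.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Compare case-insensitively, since an HTML parser may report
            // element names in upper case.
            String name = el.getNodeName();
            if (name.equalsIgnoreCase("meta")) {
                out.append(el.getAttribute("content")).append(' ');
            } else if (name.equalsIgnoreCase("img")) {
                out.append(el.getAttribute("alt")).append(' ');
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Collapses every run of whitespace (CR, LF, tab, space) into a single
    // space and trims both ends.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    // Collects the text of a whole document and returns it cleaned up.
    static String extract(Document doc) {
        StringBuilder out = new StringBuilder();
        collect(doc, out);
        return normalize(out.toString());
    }

    // Demo-only parser for well-formed markup, using the JDK's XML parser.
    // Real-world HTML would need a tolerant parser instead.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }
}
```

Calling HtmlTextCollector.extract(doc) on the parsed Document would then return one whitespace-normalized String, which seems like a convenient thing to hand to whatever tokenizer sits in front of the stop list and stemmer. Note that this sketch does not skip script or style text; that filtering is assumed to happen elsewhere.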