[Classifier4j-devel] HTML Tokenize v0.000001 Ready for review

See the attached code.  You will need Xerces and NekoHTML on your classpath.
Run TestHTMLDOM and pass either a file name or an HTTP URL as the argument.

Although it took me a while to figure out how Xerces works, I think this is an 
excellent solution.  Very flexible.  As for implementation, you tell me.

This particular code keeps only the following items and drops everything else:

content of meta tags
alt text of images
plain text

It's a cinch to configure alternative parameters.
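
For readers without the attachment, the filtering described above can be sketched roughly as follows. This is not the attached TestHTMLDOM code; the class name HtmlTextWalker and its methods are hypothetical, and the sketch relies only on the standard org.w3c.dom interfaces. It assumes the Document comes from a DOM parser such as NekoHTML's org.cyberneko.html.parsers.DOMParser. Element names are compared case-insensitively because an HTML parser may report them in upper case.

```java
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: walks a DOM tree and keeps only meta content,
// image alt text, and plain text nodes, in document order.
final class HtmlTextWalker {

    private HtmlTextWalker() {}

    static List<String> extract(Node root) {
        List<String> out = new ArrayList<>();
        walk(root, out);
        return out;
    }

    private static void walk(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getNodeName();
            if (name.equalsIgnoreCase("meta")) {
                // getAttribute returns "" when the attribute is absent
                add(el.getAttribute("content"), out);
            } else if (name.equalsIgnoreCase("img")) {
                add(el.getAttribute("alt"), out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            add(node.getNodeValue(), out);
        }
        // Recurse into children regardless of the node type
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
    }

    // Keep a fragment only if it contains something besides whitespace
    private static void add(String s, List<String> out) {
        if (s != null && !s.trim().isEmpty()) {
            out.add(s.trim());
        }
    }
}
```

Changing which tags and attributes are kept means editing the two name checks in walk, which is one way the "alternative parameters" could be exposed.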

The current output has carriage returns, line feeds and spaces a-plenty.  
Anybody have a good way of cleaning this mess up?
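
One possible way to clean up the whitespace, as a minimal sketch: collapse every run of whitespace into a single space and trim the ends. The class name WhitespaceCleaner is hypothetical. The pattern also covers the non-breaking space (U+00A0), on the assumption that &nbsp; entities in the HTML end up as that character in the parsed text.

```java
// Hypothetical helper for normalizing extracted text.
final class WhitespaceCleaner {

    private WhitespaceCleaner() {}

    /** Collapses each run of whitespace (CR, LF, tab, space, NBSP) into one space. */
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("[\\s\\u00A0]+", " ").trim();
    }
}
```

For example, WhitespaceCleaner.clean("plain \r\n\r\n   text") yields "plain text".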

I'm thinking the thing to do would be to replace all the System.out.println 
calls with a call to some other method.  Do we already have an appropriate 
method in place for this?  Do we need a new one?
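
If nothing suitable exists yet, a new method could look something like the sketch below. The names TextSink and CollectingSink are hypothetical and are not part of Classifier4J. The idea is that each System.out.println(text) call becomes sink.accept(text), and the caller reads the collected text back once parsing is done.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback that would replace the System.out.println calls.
interface TextSink {
    void accept(String text);
}

// Collects non-blank fragments so they can be handed on as one string.
final class CollectingSink implements TextSink {

    private final List<String> fragments = new ArrayList<>();

    @Override
    public void accept(String text) {
        if (text != null && !text.trim().isEmpty()) {
            fragments.add(text.trim());
        }
    }

    /** Returns all fragments received so far, joined by single spaces. */
    String text() {
        return String.join(" ", fragments);
    }
}
```

Keeping the sink as an interface would leave the extraction code unaware of where the text goes, whether that is the console, a string, or whatever comes next in the chain.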

How will this code integrate into c4J?

How are we going to get this data into the stop-list --> stemmer chain?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN