RE: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: Nick L. <nl...@es...> - 2003-11-17 05:50:43
What are people's general requirements for an HTML tokenizer? Personally, I want to get rid of all the tags and just get the pure text of the document.

I'm thinking about writing a tokenizer based on this (obviously cleaned up and turned into a tokenizer):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Stack;

    public class Main {
        // One entry per '<' that has not yet been closed by a '>'.
        static Stack stack = new Stack();

        public static void main(String[] args) throws IOException {
            // Read the whole file into memory.
            BufferedReader reader = new BufferedReader(new FileReader("input.html"));
            StringBuffer contents = new StringBuffer();
            try {
                String line = reader.readLine();
                while (line != null) {
                    // readLine() strips the line terminator, so restore it
                    // to keep words on adjacent lines apart.
                    contents.append(line).append('\n');
                    line = reader.readLine();
                }
            } finally {
                reader.close();
            }

            // Print every character that is not inside a tag.
            char[] chars = contents.toString().toCharArray();
            for (int i = 0; i < chars.length; i++) {
                if (chars[i] == '<') {
                    stack.push(Boolean.TRUE);
                } else if (chars[i] == '>') {
                    // A stray '>' in the text must not pop an empty stack.
                    if (!stack.isEmpty()) {
                        stack.pop();
                    }
                } else if (stack.isEmpty()) {
                    System.out.print(chars[i]);
                }
            }
        }
    }

What do people think? The advantage is that it doesn't require any external libraries. The disadvantage is that it can't return things like meta tag information, or things in alt or text attributes.

Opinions?

Nick

> -----Original Message-----
> From: moedusa [mailto:mo...@in...]
> Sent: Sunday, 16 November 2003 9:10 PM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
>
> Matt Collier wrote:
>
> > See attached, you will need Xerces and NekoHTML in your classpath.
>
> Just to make a note: there is one more option for dealing with HTML soup
> (when you need to clean up MS Word HTML, for example). It seems that
> NekoHTML does the same thing, but there is one more library called JTidy
> (http://lempinen.net/sami/jtidy/), based on code from the W3C Tidy
> (http://www.w3.org/People/Raggett/tidy/). Since I have not worked with
> NekoHTML, I cannot compare them, but I must say that JTidy is a pretty
> good library. It can be used like a JavaBean
> (http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153),
> and, finally, it has a very nice option: draconianWord2000Cleaning
> (http://www.w3.org/People/Raggett/tidy/#word2000). I have used it for this
> kind of thing. Also, it is not bound to a specific Xerces version.
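
For reference, here is one way the tag-stripping idea from Nick's message might look once it is "turned into a tokenizer". This is only a sketch under assumptions: the class name `HtmlTextTokenizer` and the method `tokenize` are hypothetical and not part of Classifier4J, a boolean flag stands in for the stack, and a tag is treated as a word boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Classifier4J. */
final class HtmlTextTokenizer {

    /** Returns the whitespace-separated words found outside of tags. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        boolean insideTag = false; // replaces the stack: tags do not nest

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;
                flush(word, tokens); // assumption: a tag ends the current word
            } else if (insideTag) {
                if (c == '>') {
                    insideTag = false;
                }
            } else if (Character.isWhitespace(c)) {
                flush(word, tokens);
            } else {
                word.append(c); // includes a stray '>' outside any tag
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Moves the pending word, if any, into the token list. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [Hello, world, a, >, b]
        System.out.println(tokenize("<p>Hello <b>world</b></p><p>a > b</p>"));
    }
}
```

Like the original, this needs no external libraries and shares its limits: attribute values such as alt text are discarded, entities like `&amp;` are not decoded, script and style bodies are kept as text, and a literal `<` in the text swallows everything up to the next `>`.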