Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: moedusa <mo...@in...> - 2003-11-17 06:14:39
Nick Lothian wrote:
> What are people's general requirements for an HTML Tokenizer?
>
> Personally, I want to get rid of all the tags and just get the pure text of
> the document.

I think meta tags are required if you need to classify (or train) already classified HTML documents. Also remember the Dublin Core meta tags (http://www.ietf.org/rfc/rfc2731.txt). But alt attributes, titles, etc. could be skipped, since the only real metadata is in the meta tags...

Also remember that if you need to classify non-ASCII text, the meta tag may be the only source for the encoding.
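To make the requirement concrete, here is a rough sketch of what "strip the tags but keep the meta information" could look like. It is not Classifier4j code: the class name HtmlMetaSketch and its methods are made up for illustration, and it uses plain java.util.regex rather than a real HTML parser, so it assumes reasonably well-formed input (quoted attribute values, no ">" inside attributes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Classifier4j: a regex-based sketch only.
final class HtmlMetaSketch {
    // A whole <meta ...> tag, matched case-insensitively.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // attribute="value" or attribute='value' pairs inside one tag.
    private static final Pattern ATTR =
            Pattern.compile("([\\w-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");
    // charset=... either as its own attribute or inside a content value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Collects the quoted attributes of a single tag, keyed by lower-cased name.
    static Map<String, String> attributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            String value = m.group(3) != null ? m.group(3) : m.group(4);
            attrs.put(m.group(1).toLowerCase(), value);
        }
        return attrs;
    }

    // Maps each meta name (or http-equiv), lower-cased, to its content value.
    // This also picks up Dublin Core entries such as DC.Title.
    static Map<String, String> metaTags(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = META_TAG.matcher(html);
        while (m.find()) {
            Map<String, String> a = attributes(m.group());
            String key = a.containsKey("name") ? a.get("name") : a.get("http-equiv");
            if (key != null && a.containsKey("content")) {
                result.put(key.toLowerCase(), a.get("content"));
            }
        }
        return result;
    }

    // Returns the charset declared in the first meta tag that names one, or null.
    static String declaredCharset(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            Matcher c = CHARSET.matcher(tags.group());
            if (c.find()) {
                return c.group(1);
            }
        }
        return null;
    }

    // Drops script/style blocks and all remaining tags, then collapses whitespace.
    static String plainText(String html) {
        return html.replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ")
                   .replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">"
                + "<meta name=\"DC.Title\" content=\"Test page\">"
                + "</head><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(metaTags(html));
        System.out.println(declaredCharset(html));
        System.out.println(plainText(html));
    }
}
```

In practice the charset lookup would have to run on the raw bytes (decoded as ASCII or ISO-8859-1) before the rest of the document is decoded, and the meta map could be handed to the classifier alongside the plain text. A real tokenizer would probably want a proper parser instead of regular expressions.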