RE: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review


> Nick Lothian wrote:
> > What are people's general requirements for an HTML Tokenizer?
> >
> > Personally, I want to get rid of all the tags and just get the pure
> > text of the document.
>
> I think meta tags are required if you need to classify (or train)
> already classified HTML documents. Also remember Dublin Core meta tags
> (http://www.ietf.org/rfc/rfc2731.txt). But alts, titles etc. could be
> missed, since the only real metadata is in meta tags... Also remember
> that if you need to classify non-ASCII text, the only source for the
> encoding is the meta tag.

So you are suggesting using the Dublin Core tags to decide which category
a document says it is in when training? That's a good idea - I hadn't
thought of that.
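
A rough sketch of that idea, assuming the category is carried in a
DC.Subject meta tag written in the name/content form that RFC 2731
describes. The class name DublinCoreSubjects and its extract method are
hypothetical and not part of Classifier4J; the regular expression only
handles the name attribute appearing before content, and it stops at the
first quote character inside the value, so a real tokenizer would want a
proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Classifier4J.
class DublinCoreSubjects {

    // Assumes markup of the form <meta name="DC.Subject" content="...">,
    // with name before content. Attribute order is not checked both ways.
    private static final Pattern DC_SUBJECT = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']DC\\.Subject[\"']\\s+"
            + "content\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns every DC.Subject value found, in document order.
    static List<String> extract(String html) {
        List<String> subjects = new ArrayList<>();
        Matcher m = DC_SUBJECT.matcher(html);
        while (m.find()) {
            subjects.add(m.group(1).trim());
        }
        return subjects;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"DC.Subject\" content=\"sports\">"
            + "</head><body>Match report</body></html>";
        // Each value could be used as the training category for the document.
        System.out.println(extract(html));
    }
}
```

The returned values would then be the categories the document claims for
itself, which could be passed to whatever training call is in use.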

With respect to non-ASCII text, why does C4J need to know what encoding the
source is in? I understand that the definition of word breaks etc. is encoding
(and language) specific, but this is a limitation of the current default
tokenizer, too. Is that the only reason to find the encoding?
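
For reference, one place the encoding can matter apart from word breaks is
the step that turns raw bytes into a Java String: decoded with the wrong
charset, non-ASCII words come out as different character sequences. The
sketch below is only an illustration under assumptions. MetaCharsetSniffer
is a hypothetical class, not part of Classifier4J, and it assumes the
charset is declared in a meta tag such as
<meta http-equiv="Content-Type" content="text/html; charset=...">.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Classifier4J.
class MetaCharsetSniffer {

    // Looks for charset=NAME inside any <meta ...> tag.
    private static final Pattern CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag, or the fallback if none
    // is declared or the declared name is not supported by this JVM.
    static Charset sniff(byte[] raw, Charset fallback) {
        // ISO-8859-1 maps each byte to one char, so the ASCII markup
        // survives this first pass whatever the real encoding is
        // (for ASCII-compatible encodings).
        String probe = new String(raw, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(probe);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name.
                return fallback;
            }
        }
        return fallback;
    }

    // Decodes the document using the sniffed charset.
    static String decode(byte[] raw, Charset fallback) {
        return new String(raw, sniff(raw, fallback));
    }

    public static void main(String[] args) {
        byte[] raw = ("<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=UTF-8\"><p>caf\u00e9</p>")
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(raw, StandardCharsets.ISO_8859_1));
    }
}
```

If the bytes arrive with an HTTP Content-Type header, that header would
normally be consulted as well; the sketch ignores it.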