RE: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: Nick L. <nl...@es...> - 2003-11-17 06:27:47
> Nick Lothian wrote:
> > What are people's general requirements for an HTML Tokenizer?
> >
> > Personally, I want to get rid of all the tags and just get the pure text of
> > the document.
>
> I think meta tags are required if you need to classify (or train)
> already classified html documents. Also remember Dublin Core meta tags
> (http://www.ietf.org/rfc/rfc2731.txt). But alts, titles etc could be
> missed, since the only real meta are in meta tags... Also remember that
> if you need to classify non-ASCII text, the only source for encoding is
> meta tag.

So you are suggesting using the Dublin Core tags to decide which category a document says it is in when training? That's a good idea - I hadn't thought of that.

With respect to non-ASCII text, why does C4J need to know what encoding the source is in? I understand that the definition of word breaks etc. is encoding (and language) specific, but this is a limitation of the current default tokenizer, too. Is that the only reason to find the encoding?
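
To make the two requirements discussed above concrete (strip the tags to get plain text, but keep meta tag values such as a Dublin Core subject for training), here is a rough sketch. It is not Classifier4J code: the class name HtmlTextSketch and both method names are made up for illustration, and it assumes well-formed meta tags with double-quoted attributes and name appearing before content. A real tokenizer would want a proper HTML parser rather than regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Classifier4J.
final class HtmlTextSketch {

    // Assumption: attributes are double-quoted and name comes before content.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the content of the first meta tag whose name matches
    // (case-insensitively), or null if there is none.
    static String metaContent(String html, String name) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(name)) {
                return m.group(2);
            }
        }
        return null;
    }

    // Drops script/style blocks and all remaining tags, then collapses
    // whitespace, leaving only the text content of the document.
    static String stripTags(String html) {
        String noBlocks = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        String noTags = noBlocks.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

With something like this, a trainer could call metaContent(html, "DC.Subject") to read the category a document claims for itself, and pass stripTags(html) to the existing tokenizer. Note that it works on a String that has already been decoded, so it does not address the encoding question.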