Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: moedusa <mo...@in...> - 2003-11-17 06:39:37
Nick Lothian wrote:
> So you are suggesting using the Dublin Core tags for deciding what category
> a document says it is in when training?

Well, if it is not too time-consuming to implement, it could be a nice option. As far as I know, the DC metadata set is the only near-standard way to write real metadata into meta tags, since the HTML spec leaves metas almost undefined and it is up to the document author to decide what to do with them...

> With respect to non-ASCII text, why does C4J need to know what encoding the
> source is in? I know the definition of word breaks etc is encoding (and language)
> specific, but this is a limitation of the current default tokenizer, too. Is
> that the only reason to find the encoding?

I am not very well informed on how the Bayesian algorithm works. I know that it can be used without knowing the encoding, but we have talked about stemmers and stop-words, and it seems that this stuff is language- and encoding-specific... Correct me if I am wrong.
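
To illustrate the stop-word concern, here is a minimal Java sketch. It assumes a hypothetical one-word German stop list and hypothetical helper names (`EncodingDemo`, `decode`, `isStopWord`); none of this is Classifier4J's actual API. It only shows that the same bytes, decoded with the wrong charset, no longer match a stop-word entry.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EncodingDemo {
    // Hypothetical stop list for illustration only ("f\u00fcr" is German "fuer" with u-umlaut).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("f\u00fcr"));

    // Turn raw document bytes into text using the charset the caller believes is correct.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Plain set lookup; it can only succeed if the token was decoded correctly.
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token);
    }

    public static void main(String[] args) {
        // The document was saved as UTF-8.
        byte[] raw = "f\u00fcr".getBytes(StandardCharsets.UTF_8);

        // Decoded with the right charset, the token is recognised as a stop word.
        System.out.println(isStopWord(decode(raw, StandardCharsets.UTF_8)));

        // Decoded as ISO-8859-1, the two UTF-8 bytes of the umlaut become two
        // separate characters, so the lookup fails.
        System.out.println(isStopWord(decode(raw, StandardCharsets.ISO_8859_1)));
    }
}
```

The Bayesian word counting itself would still run on the mis-decoded token, but any language-specific step placed in front of it (stop list, stemmer) would silently stop matching.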