Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review


Nick Lothian wrote:
> So you are suggesting using the Dublin Core tags for deciding what category
> a document says it is in when training?

Well, if it is not too time-consuming to implement, it could be a nice
option. As far as I know, the Dublin Core (DC) metadata set is the only
near-standard way to write real metadata into meta tags, since the HTML
spec leaves meta tags almost undefined and lets the document author
decide what to do with them...
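
To make the idea concrete, here is a rough sketch of how the category
could be read from a Dublin Core meta tag before training. It assumes
the page uses the common <meta name="DC.subject" content="..."> form
with the name attribute written before content. The class name
DublinCoreSubjects and the method extract are hypothetical, not part of
Classifier4J, and a real implementation would probably use an HTML
parser rather than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Classifier4J.
public class DublinCoreSubjects {

    // Assumption: the tag looks like <meta name="DC.subject" content="...">,
    // with name before content. Other attribute orders are not handled.
    private static final Pattern DC_SUBJECT = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']dc\\.subject[\"']"
            + "\\s+content\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the content of every DC.subject meta tag found in the HTML.
    public static List<String> extract(String html) {
        List<String> subjects = new ArrayList<>();
        Matcher m = DC_SUBJECT.matcher(html);
        while (m.find()) {
            subjects.add(m.group(1).trim());
        }
        return subjects;
    }
}
```

Each returned subject could then be used as the category the document
claims for itself, and an empty list would simply mean the page carries
no such metadata.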

> With respect to non-ASCII text, why does C4J need to know what encoding the
> source is in? I know the definition of word breaks etc. is encoding (and
> language) specific, but this is a limitation of the current default
> tokenizer, too. Is that the only reason to find the encoding?

I am not very well informed about how the Bayesian algorithm works. I
know that it can also be used without knowing the encoding, but we have
talked about stemmers and stop words, and it seems that this stuff is
language- and encoding-specific... Correct me if I am wrong.
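
A small sketch of what I mean, under my own assumptions: the same bytes
give different strings depending on the charset used to decode them, so
a stop-word lookup only works if the encoding is known. The class
EncodingAwareStopWords, its method isStopWord, and the tiny German word
list are all made up for illustration and are not part of Classifier4J.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical example, not part of Classifier4J.
public class EncodingAwareStopWords {

    // Illustrative stop-word list; "f\u00fcr" is the German word "fuer"
    // written with a u-umlaut, escaped so the source file stays ASCII.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("f\u00fcr", "und", "der"));

    // Decodes the raw bytes with the given charset, then checks the list.
    public static boolean isStopWord(byte[] rawWord, Charset charset) {
        String word = new String(rawWord, charset).toLowerCase(Locale.ROOT);
        return STOP_WORDS.contains(word);
    }
}
```

If the bytes of a UTF-8 encoded word are decoded as ISO-8859-1, the
umlaut turns into two unrelated characters and the word no longer
matches the list, while pure ASCII words such as "und" match under
either charset.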