RE: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: Nick L. <nl...@es...> - 2003-11-17 06:48:14
> > Nick Lothian wrote:
> > So you are suggesting using the Dublin Core tags for deciding what
> > category a document says it is in when training?
>
> Well, if it is not too time-consuming to implement, it could be a nice
> option. As far as I know, the DC metadata set is the only near-standard
> way to write real metadata to meta tags, since metas in the HTML spec are
> almost undefined, and left to the document author to decide what to do
> with them...
>

Yes - it's a good idea. I'm not sure if I'll get to implement it, though ;-)

> > With respect to non-ASCII text, why does C4J need to know what encoding
> > the source is in? I know the definition of word breaks etc. is encoding
> > (and language) specific, but this is a limitation of the current default
> > tokenizer, too. Is that the only reason to find the encoding?
>
> I am not very informed on how the Bayesian algorithm works. I know that it
> can also be used without knowing the encoding, but we have talked about
> stemmers and stop-words, and it seems that this stuff is language/encoding
> specific... Correct me if I am wrong.
>

Yes, I think you are right. Our stemmers & stop words will be language
specific. Unfortunately I don't see a way around this, unless there is some
magic way to generate a stop word list & stemmer in any language...
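For anyone who wants to experiment with the Dublin Core idea, here is a rough
sketch of pulling the DC.subject value out of a document so it could be fed in
as a training category. The class and method names are made up for
illustration and are not part of Classifier4J:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper - not part of Classifier4J.
    public class DublinCoreCategorySketch {

        // Matches e.g. <meta name="DC.subject" content="java programming">
        private static final Pattern DC_SUBJECT = Pattern.compile(
                "<meta\\s+name=[\"']DC\\.subject[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);

        // Returns the DC.subject content, or null if the tag is not present.
        public static String extractCategory(String html) {
            Matcher m = DC_SUBJECT.matcher(html);
            return m.find() ? m.group(1) : null;
        }

        public static void main(String[] args) {
            String html = "<html><head><meta name=\"DC.subject\" "
                    + "content=\"java programming\"></head><body>...</body></html>";
            // The category could then be handed to whatever training API C4J
            // ends up with, e.g. something like teachMatch(category, bodyText)
            // - that call is only a placeholder here.
            System.out.println("Category: " + extractCategory(html));
        }
    }

And a similarly hypothetical sketch of how stop words could be picked per
language, since they will be language specific (the word lists are tiny
placeholders only; this does not reflect any existing Classifier4J class):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical sketch of a language-specific stop word provider.
    public class LanguageStopWords {

        private final Set<String> stopWords;

        public LanguageStopWords(Locale locale) {
            if ("de".equals(locale.getLanguage())) {
                // Placeholder German list; a real one would be far longer.
                stopWords = new HashSet<String>(
                        Arrays.asList("und", "oder", "der", "die", "das"));
            } else {
                // Fall back to a (tiny) English list.
                stopWords = new HashSet<String>(
                        Arrays.asList("and", "or", "the", "a", "of"));
            }
        }

        public boolean isStopWord(String word) {
            return stopWords.contains(word.toLowerCase());
        }
    }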