RE: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
From: Nick L. <nl...@es...> - 2003-11-17 06:48:14
> > Nick Lothian wrote:
> > So you are suggesting using the Dublin Core tags for deciding what
> > category a document says it is in when training?
>
> Well, if it is not too time-consuming to implement, it could be a nice
> option. As far as I know, the DC metadata set is the only near-standard
> way to write real metadata to meta tags, since metas in the HTML spec are
> almost undefined, and left to the document author to decide what to do
> with them...
>

Yes - it's a good idea. I'm not sure if I'll get to implement it, though ;-)

> > With respect to non-ASCII text, why does C4J need to know what encoding
> > the source is in? I know the definition of word breaks etc. is encoding
> > (and language) specific, but this is a limitation of the current default
> > tokenizer, too. Is that the only reason to find the encoding?
>
> I am not very informed on how the Bayesian algorithm works. I know that it
> can also be used without knowing the encoding, but we have talked about
> stemmers and stop-words, and it seems that this stuff is language/encoding
> specific... Correct me if I am wrong.
>

Yes, I think you are right. Our stemmers & stop words will be language
specific. Unfortunately I don't see a way around this, unless there is some
magic way to generate a stop word list & stemmer in any language...
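For anyone who wants to experiment with the Dublin Core idea, here is a rough
sketch of pulling the DC.subject value out of a document so it could be fed in
as a training category. The class and method names are made up for
illustration and are not part of Classifier4J:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper - not part of Classifier4J.
    public class DublinCoreCategorySketch {

        // Matches e.g. <meta name="DC.subject" content="java programming">
        private static final Pattern DC_SUBJECT = Pattern.compile(
                "<meta\\s+name=[\"']DC\\.subject[\"']\\s+content=[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);

        // Returns the DC.subject content, or null if the tag is not present.
        public static String extractCategory(String html) {
            Matcher m = DC_SUBJECT.matcher(html);
            return m.find() ? m.group(1) : null;
        }

        public static void main(String[] args) {
            String html = "<html><head><meta name=\"DC.subject\" "
                    + "content=\"java programming\"></head><body>...</body></html>";
            // The category could then be handed to whatever training API C4J
            // ends up with, e.g. something like teachMatch(category, bodyText)
            // - that call is only a placeholder here.
            System.out.println("Category: " + extractCategory(html));
        }
    }

And a similarly hypothetical sketch of how stop words could be picked per
language, since they will be language specific (the word lists are tiny
placeholders only; this does not reflect any existing Classifier4J class):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical sketch of a language-specific stop word provider.
    public class LanguageStopWords {

        private final Set<String> stopWords;

        public LanguageStopWords(Locale locale) {
            if ("de".equals(locale.getLanguage())) {
                // Placeholder German list; a real one would be far longer.
                stopWords = new HashSet<String>(
                        Arrays.asList("und", "oder", "der", "die", "das"));
            } else {
                // Fall back to a (tiny) English list.
                stopWords = new HashSet<String>(
                        Arrays.asList("and", "or", "the", "a", "of"));
            }
        }

        public boolean isStopWord(String word) {
            return stopWords.contains(word.toLowerCase());
        }
    }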