From: Bill Y <ws...@me...> - 2007-07-25 21:37:08
Eugene:

You're absolutely correct: very short and ambiguous texts often do not have enough signal to get a score outside the thick threshold. For example, the "null spams" sometimes don't make it; it depends on how long the "from" address is.

I've tried score normalizers but haven't come up with one that didn't hurt overall accuracy by _devaluing_ long message scores. It's almost like a truncation of Brownian motion.

So, if you can come up with a patch that actually helps, I'd be glad to test it, and if it proves out against the test corpora, we'll put it in the mainline.

Assuming you're using OSB, the place to look in the code is in file crm_osb_bayes.c, around line 1762:

      for (m = succhash; m < maxhash; m++)
        if (bestseen != m)
          {
            remainder = remainder + ptc[m];
          };
      overall_pR = log10 (accumulator) - log10 (remainder);
------^^^^^^^^^^ right here!

... and the total number of words found in the input is "unk_features", which would be the parameter to normalize with.

If your patch can approach an asymptote of 1.0 when unk_features is > 100 to 200 (that's 25 to 50 "words"), that would be very useful.

- Bill Yerazunis

Date: Wed, 25 Jul 2007 17:32:08 +0400
From: Eugene Crosser <cr...@av...>

Hello,

I noticed that very small messages (with just a few words in the body) usually fall into the "need learning" category; that is, their scores have the correct sign but do not reach the "thick threshold" line (+/- 10). Can it be because the absolute value of the score rises with the number of tokens processed? If so, wouldn't it be right to normalize by the size of the text, or by the number of "significant" tokens processed, or somehow?

Eugene