From: Paolo <oo...@us...> - 2007-10-27 08:27:54
On Fri, Oct 26, 2007 at 06:37:32PM -0600, Brad Waite wrote:
> Is there a benefit to training/classifying with or without markup (HTML, XML,
> etc)?

Experience shows that testing the raw text yields better results, in general.

> Or would the tags be ignored if they show up in both CSS files?

They add equally to the feature count for both classes, so to some extent
they 'cancel out'. A lot of common features lowers the diffs/common ratio,
though, and hence pulls pR toward the midpoint.

For HTML+CSS messages it's OK to preprocess, e.g. via unhtml(1), so you get
only the real text. But spam comes with a lot of *hidden* tags and/or even no
clear text at all (just images, web bugs, etc.), so preprocessing would just
throw away *all* the body spam markers.

In short, preprocessing might be good/desirable for classifying good text,
not for spam/good razors.

-- 
paolo
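
To illustrate the point about preprocessing, here is a minimal sketch (not from the original message) of what a markup stripper leaves behind. It uses Python's standard html.parser as a stand-in; it does not reproduce the behaviour of unhtml(1), and the sample messages, the example.invalid URLs and the names TextOnly and strip_markup are all hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only character data, dropping every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never stored.
        self.chunks.append(data)


def strip_markup(html):
    """Return the visible text of an HTML fragment, whitespace-normalised."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical legitimate message: the text survives stripping.
ham = "<html><body><p>Meeting moved to <b>Friday</b> at noon.</p></body></html>"

# Hypothetical image-only spam: every clue lives in tags and attributes.
spam = (
    '<html><body><a href="http://example.invalid/x">'
    '<img src="http://example.invalid/offer.gif"></a>'
    '<img src="http://example.invalid/bug.gif" width="1" height="1">'
    "</body></html>"
)

print(repr(strip_markup(ham)))
print(repr(strip_markup(spam)))
```

With these inputs the legitimate message keeps its sentence, while the image-only message is reduced to an empty string, so a classifier fed the stripped body would have nothing left to learn from.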