From: martin f k. <ma...@ma...> - 2007-05-24 23:53:54
|
Hi,

a colleague showed me his way of invoking crm114, which is

  |formail -kxFrom: -xSubject: | crm …

so effectively he deletes all headers but From and Subject.

This got me thinking: do headers contain valuable information for
crm114? There is a lot of redundant or constant information in
there, and spam these days comes via random routes and from random
senders with random subjects anyway.

Would it not make sense to simply crop all headers other than
Subject and then to train and classify based on the subject and body
only?

Cheers,

--
martin; (greetings from the heart of the sun.)
\____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: mad...@ma...

http://www.vcnet.com/bms/ |
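To make the header-cropping idea concrete, here is a minimal sketch of the same filtering step written with plain awk instead of formail. The function name `keep_headers` and its `|`-separated argument are made up for this illustration; they are not part of formail or crm114, and the sketch assumes a single message with RFC 822 style headers on stdin.

```shell
# keep_headers: hypothetical helper, not part of formail or crm114.
# Reads one mail message on stdin and prints only the header fields
# named in $1 (a |-separated list of lowercase field names), followed
# by the blank separator line and the unmodified body.
keep_headers() {
    awk -v keep="$1" '
        in_body { print; next }                 # body: pass through unchanged
        /^$/    { in_body = 1; print; next }    # first blank line ends the headers
        /^[ \t]/ {                              # folded continuation line:
            if (printing) print                 # follows the fate of its field
            next
        }
        {
            name = tolower($0)
            sub(/:.*/, "", name)                # field name up to the first colon
            printing = (index("|" keep "|", "|" name "|") > 0)
            if (printing) print
        }
    '
}
```

Calling `keep_headers 'from|subject' < message` should leave roughly what the formail command above keeps, and `keep_headers 'subject'` gives the Subject-plus-body variant asked about, so both can be fed to the classifier and compared.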
From: Tony G. <to...@of...> - 2007-05-25 04:27:01
|
Good idea. Run the test and report back on the improvement, if any, in accuracy.

On 5/24/07, martin f krafft <ma...@ma...> wrote:
> Would it not make sense to simply crop all headers other than
> Subject and then to train and classify based on the subject and body
> only?
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Crm114-discuss mailing list
> Crm...@li...
> https://lists.sourceforge.net/lists/listinfo/crm114-discuss

--
Tony Godshall |
From: Paolo <oo...@us...> - 2007-05-25 05:22:17
Attachments:
pf.awk
|
On Fri, May 25, 2007 at 01:53:48AM +0200, martin f krafft wrote:
> |formail -kxFrom: -xSubject: | crm …
...
> This got me thinking: do headers contain valuable information for

yes

> Would it not make sense to simply crop all headers other than
> Subject and then to train and classify based on the subject and body

That depends on your stuff: if you want to classify your mail into
something other than spam/good, you *may* want to keep just To:, Cc:,
Subject: and the body, since you're mostly interested in the actual
body contents.

But for spam/good, extensive tests have shown that headers play a big
role, though how useful that role is depends on the classifier (e.g.
ifile(1) didn't like them).
I usually trash all ID headers and zero the date fields though, as
that's pure noise.

See the attached awk script to get an(other) idea to play with.
(You may want to (try to) re-implement it in crm as an exercise ;)
if so pls post back :) )

--
paolo |
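The noise-reduction step described above (drop ID headers, zero date fields) might look roughly like the sketch below. This is NOT the attached pf.awk, whose contents are not shown in the thread. The function name `denoise_headers` is hypothetical, and the choice of which fields count as "ID" or "date" fields (names ending in `id` or `date`, such as Message-ID, Content-ID, Date, Resent-Date) is an assumption made for illustration.

```shell
# denoise_headers: hypothetical helper; NOT the pf.awk attached to the post.
# Assumption: "ID headers" are fields named "id" or ending in "-id", and
# "date fields" are fields named "date" or ending in "-date".
denoise_headers() {
    awk '
        in_body { print; next }                  # body: pass through unchanged
        /^$/    { in_body = 1; print; next }     # first blank line ends the headers
        /^[ \t]/ {                               # folded continuation line
            if (mode == "drop") next             # belongs to a dropped field
            if (mode == "zero") gsub(/[0-9]/, "0")
            print
            next
        }
        {
            name = tolower($0)
            sub(/:.*/, "", name)                 # field name up to the first colon
            if (name ~ /(^|-)id$/) { mode = "drop"; next }
            if (name ~ /(^|-)date$/) {
                mode = "zero"
                colon = index($0, ":")
                value = substr($0, colon + 1)
                gsub(/[0-9]/, "0", value)        # zero digits in the value only
                print substr($0, 1, colon) value
                next
            }
            mode = "keep"
            print
        }
    '
}
```

Zeroing the digits rather than deleting the Date line keeps the field's shape (weekday, month, timezone layout) visible to the classifier while removing the values that differ on every message. Whether that helps in practice is exactly the kind of thing to measure, not assume.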
From: Tony G. <to...@of...> - 2007-05-27 21:12:55
|
In my tests, I found that running mail through spamassassin (with razor
tests) and keeping its headers did better than giving crm114 the raw
headers, since spamassassin generalized the whole
open-relay-identification issue. But I haven't done the same test with
recent classifiers.

On 5/24/07, Paolo <oo...@us...> wrote:
> But for spam/good extensive tests have shown that headers play a big role,
> though how useful such role is, depends on classifier (eg ifile(1) didn't
> like them).

--
Tony Godshall |