From: Pavel K. <ko...@fz...> - 2004-11-24 16:16:33
Recently I reported unexpectedly high error counts with the OSBF method (about 120 errors in the last 500 messages). I have now tested the situation with "toer.crm", the learning script by Fidelis Assis, and this mail reports my results.

First, I fully reproduced the error values reported by Fidelis for the SA corpus. Then I tested my private corpus. I added 9 new shuffles to the original one (i.e. the one with the bad results mentioned above). My tests support the "bad" results I obtained earlier by another method.

At the end of this long mail there is a table with my results for 10 shuffles. The original shuffle is denoted as 41, and the above-mentioned bad results are marked by the arrow (122 errors in the last 500 messages of the shuffle). The original shuffle turned out to be the most difficult one in all tests, including SBPH.

The reasons for this behavior seem to be the following:

(i) Message preprocessing. It is the main reason for the observed errors. I tested normalizemime (denoted as N) and a simple normalization (S), which deletes some header items from each message (the details are explained below). With either preprocessing method the number of errors increased by a factor of 4-5 (see table). No such behavior was observed for SBPH, where both preprocessing methods are fully plausible and can lead to better results. It seems that OSBF (at least for my corpus) doesn't like preprocessing.

(ii) Number of buckets. OSBF has a strange dependence on the number of buckets. The default is 94321, which is very low. The number of errors increases with the number of buckets. The best results were obtained for b=5321! The number of buckets can be decreased only to some limit, of course. Nevertheless, I was surprised that with such a small number of buckets I can achieve results comparable with SBPH (with 1,000,000 buckets).

(iii) The ratio parameter r (suggested by Fidelis; crm is called as crm -r).
This parameter also has a strong influence on the error rate. The default value is r=9, but the best results for my corpus were obtained for r=1! The error values for r=2-5 are also very good. Combining r=1 with a low number of buckets, I got an error rate comparable to or even better than that of SBPH. The details are in the tables below.

The message from this analysis could be: OSBF should be carefully tuned. One can encounter a very wide range of error counts (here from 0 to 122!). Of course, all my conclusions are, strictly speaking, valid only for my corpus and should be checked on different data.

I have no comparison with OSB, due to the problem reported by Fidelis (computer freezing).

OSBF results:
-------------
10 shuffles of 2750 hams and 2750 spams
Errors from the last 500 mails in each shuffle.
OSBF threshold=10
preprocessing: 0 = no normalization
               N = normalizemime
               S = simple normalization (explained below)

------------------------------------------------
shuffle | buckets | preproc. |  r | errors/500 |
------------------------------------------------
     41 |    5321 |        0 |  9 |          6 |
     41 |    5321 |        0 |  1 |          8 |
     41 |   15321 |        0 |  9 |          5 |
     41 |   54321 |        0 |  9 |          4 |
     41 |   94321 |        0 |  9 |          3 |
     41 |   94321 |        N |  9 |          7 |
     41 |   94321 |        S |  9 |         93 |
     41 |   94321 |        0 | 11 |          5 |
     41 |   94321 |        0 | 10 |          3 |
     41 |   94321 |        0 |  7 |          4 |
     41 |   94321 |        0 |  6 |          5 |
     41 |   94321 |        0 |  3 |          4 |
     41 |   94321 |        0 |  2 |          5 |
     41 |   94321 |        0 |  1 |         10 |
     41 |  800001 |        0 |  9 |          4 |
     41 |  800001 |        N |  9 |          6 |
     41 |  800001 |        S |  9 |         22 | <-----
------------------------------------------------
     42 |    5321 |        0 |  9 |          0 |
     42 |    5321 |        0 |  1 |          0 |
     42 |   15321 |        0 |  9 |          4 |
     42 |   54321 |        0 |  9 |          6 |
     42 |   94321 |        0 |  9 |          7 |
     42 |   94321 |        N |  9 |         30 |
     42 |   94321 |        S |  9 |         35 |
     42 |   94321 |        0 | 11 |          7 |
     42 |   94321 |        0 | 10 |          7 |
     42 |   94321 |        0 |  7 |          6 |
     42 |   94321 |        0 |  6 |          6 |
     42 |   94321 |        0 |  3 |          3 |
     42 |   94321 |        0 |  2 |          2 |
     42 |   94321 |        0 |  1 |          0 |
     42 |  800001 |        0 |  9 |          9 |
     42 |  800001 |        N |  9 |         36 |
     42 |  800001 |        S |  9 |         53 |
------------------------------------------------
     43 |    5321 |        0 |  9 |          2 |
     43 |    5321 |        0 |  1 |          0 |
     43 |   15321 |        0 |  9 |          3 |
     43 |   54321 |        0 |  9 |          9 |
     43 |   94321 |        0 |  9 |         10 |
     43 |   94321 |        N |  9 |         34 |
     43 |   94321 |        S |  9 |         31 |
     43 |   94321 |        0 | 11 |         14 |
     43 |   94321 |        0 | 10 |         10 |
     43 |   94321 |        0 |  7 |         10 |
     43 |   94321 |        0 |  6 |          8 |
     43 |   94321 |        0 |  3 |          5 |
     43 |   94321 |        0 |  2 |          0 |
     43 |   94321 |        0 |  1 |          1 |
     43 |  800001 |        0 |  9 |         11 |
     43 |  800001 |        N |  9 |         41 |
     43 |  800001 |        S |  9 |         41 |
------------------------------------------------
     44 |    5321 |        0 |  9 |          4 |
     44 |    5321 |        0 |  1 |          3 |
     44 |   15321 |        0 |  9 |          3 |
     44 |   54321 |        0 |  9 |          4 |
     44 |   94321 |        0 |  9 |          7 |
     44 |   94321 |        N |  9 |         41 |
     44 |   94321 |        S |  9 |         42 |
     44 |   94321 |        0 | 11 |          7 |
     44 |   94321 |        0 | 10 |          7 |
     44 |   94321 |        0 |  7 |          9 |
     44 |   94321 |        0 |  6 |          9 |
     44 |   94321 |        0 |  3 |          5 |
     44 |   94321 |        0 |  2 |          2 |
     44 |   94321 |        0 |  1 |          3 |
     44 |  800001 |        0 |  9 |         10 |
     44 |  800001 |        N |  9 |         49 |
     44 |  800001 |        S |  9 |         57 |
------------------------------------------------
     45 |    5321 |        0 |  9 |          1 |
     45 |    5321 |        0 |  1 |          0 |
     45 |   15321 |        0 |  9 |          2 |
     45 |   54321 |        0 |  9 |          4 |
     45 |   94321 |        0 |  9 |          9 |
     45 |   94321 |        N |  9 |         36 |
     45 |   94321 |        S |  9 |         24 |
     45 |   94321 |        0 | 11 |          8 |
     45 |   94321 |        0 | 10 |          9 |
     45 |   94321 |        0 |  7 |          9 |
     45 |   94321 |        0 |  6 |          8 |
     45 |   94321 |        0 |  3 |          2 |
     45 |   94321 |        0 |  2 |          1 |
     45 |   94321 |        0 |  1 |          0 |
     45 |  800001 |        0 |  9 |         12 |
     45 |  800001 |        N |  9 |         43 |
     45 |  800001 |        S |  9 |         41 |
------------------------------------------------
     46 |    5321 |        0 |  9 |          1 |
     46 |    5321 |        0 |  1 |          0 |
     46 |   15321 |        0 |  9 |          5 |
     46 |   54321 |        0 |  9 |          7 |
     46 |   94321 |        0 |  9 |         10 |
     46 |   94321 |        N |  9 |         37 |
     46 |   94321 |        S |  9 |         39 |
     46 |   94321 |        0 | 11 |         11 |
     46 |   94321 |        0 | 10 |         10 |
     46 |   94321 |        0 |  7 |          9 |
     46 |   94321 |        0 |  6 |         11 |
     46 |   94321 |        0 |  3 |          2 |
     46 |   94321 |        0 |  2 |          7 |
     46 |   94321 |        0 |  1 |          1 |
     46 |  800001 |        0 |  9 |         13 |
     46 |  800001 |        N |  9 |         40 |
     46 |  800001 |        S |  9 |         52 |
------------------------------------------------
     47 |    5321 |        0 |  9 |          2 |
     47 |    5321 |        0 |  1 |          1 |
     47 |   15321 |        0 |  9 |          2 |
     47 |   54321 |        0 |  9 |          8 |
     47 |   94321 |        0 |  9 |          5 |
     47 |   94321 |        N |  9 |         38 |
     47 |   94321 |        S |  9 |         21 |
     47 |   94321 |        0 | 11 |          3 |
     47 |   94321 |        0 | 10 |          5 |
     47 |   94321 |        0 |  7 |          7 |
     47 |   94321 |        0 |  6 |          7 |
     47 |   94321 |        0 |  3 |          1 |
     47 |   94321 |        0 |  2 |          2 |
     47 |   94321 |        0 |  1 |          1 |
     47 |  800001 |        0 |  9 |         11 |
     47 |  800001 |        N |  9 |         36 |
     47 |  800001 |        S |  9 |         35 |
------------------------------------------------
     48 |    5321 |        0 |  9 |          2 |
     48 |    5321 |        0 |  1 |          2 |
     48 |   15321 |        0 |  9 |          4 |
     48 |   54321 |        0 |  9 |          8 |
     48 |   94321 |        0 |  9 |          7 |
     48 |   94321 |        N |  9 |         37 |
     48 |   94321 |        S |  9 |         39 |
     48 |   94321 |        0 | 11 |          7 |
     48 |   94321 |        0 | 10 |          7 |
     48 |   94321 |        0 |  7 |         10 |
     48 |   94321 |        0 |  6 |          6 |
     48 |   94321 |        0 |  3 |          3 |
     48 |   94321 |        0 |  2 |          2 |
     48 |   94321 |        0 |  1 |          2 |
     48 |  800001 |        0 |  9 |         13 |
     48 |  800001 |        N |  9 |         50 |
     48 |  800001 |        S |  9 |         56 |
------------------------------------------------
     49 |    5321 |        0 |  9 |          3 |
     49 |    5321 |        0 |  1 |          2 |
     49 |   15321 |        0 |  9 |          3 |
     49 |   54321 |        0 |  9 |         10 |
     49 |   94321 |        0 |  9 |         10 |
     49 |   94321 |        N |  9 |         34 |
     49 |   94321 |        S |  9 |         33 |
     49 |   94321 |        0 | 11 |         13 |
     49 |   94321 |        0 | 10 |         10 |
     49 |   94321 |        0 |  7 |         12 |
     49 |   94321 |        0 |  6 |          9 |
     49 |   94321 |        0 |  3 |          7 |
     49 |   94321 |        0 |  2 |          5 |
     49 |   94321 |        0 |  1 |          2 |
     49 |  800001 |        0 |  9 |         15 |
     49 |  800001 |        N |  9 |         46 |
     49 |  800001 |        S |  9 |         53 |
------------------------------------------------
    410 |    5321 |        0 |  9 |          1 |
    410 |    5321 |        0 |  1 |          1 |
    410 |   15321 |        0 |  9 |          4 |
    410 |   54321 |        0 |  9 |          8 |
    410 |   94321 |        0 |  9 |         10 |
    410 |   94321 |        N |  9 |         43 |
    410 |   94321 |        S |  9 |         31 |
    410 |   94321 |        0 | 11 |          9 |
    410 |   94321 |        0 | 10 |         10 |
    410 |   94321 |        0 |  7 |         11 |
    410 |   94321 |        0 |  6 |          9 |
    410 |   94321 |        0 |  3 |          2 |
    410 |   94321 |        0 |  2 |          2 |
    410 |   94321 |        0 |  1 |          2 |
    410 |  800001 |        0 |  9 |         12 |
    410 |  800001 |        N |  9 |         51 |
    410 |  800001 |        S |  9 |         43 |
------------------------------------------------

S = simple normalization is a severe reduction of the header items. The following items are deleted: X-, Lines, Received (with exception of the last), Date and all dates in header, Status, List-, MIME-version, Precedence, Delivered-To, Priority and Content-Length.

SBPH results
------------
threshold = 25

-------------------------------------------
shuffle | buckets | preproc. | errors/500 |
-------------------------------------------
     41 | 1000001 |        0 |         12 |
     41 | 1000001 |        N |         10 |
     41 | 1000001 |        S |          5 |
     41 | 1300001 |        0 |         12 |
     41 | 1300001 |        N |         10 |
     41 | 1300001 |        S |          5 |
-------------------------------------------
     42 | 1000001 |        0 |          1 |
     42 | 1000001 |        N |          0 |
     42 | 1000001 |        S |          1 |
     42 | 1300001 |        0 |          1 |
     42 | 1300001 |        N |          0 |
     42 | 1300001 |        S |          0 |
-------------------------------------------
     43 | 1000001 |        0 |          2 |
     43 | 1000001 |        N |          2 |
     43 | 1000001 |        S |          0 |
     43 | 1300001 |        0 |          2 |
     43 | 1300001 |        N |          2 |
     43 | 1300001 |        S |          0 |
-------------------------------------------
     44 | 1000001 |        0 |          1 |
     44 | 1000001 |        N |          1 |
     44 | 1000001 |        S |          1 |
     44 | 1300001 |        0 |          1 |
     44 | 1300001 |        N |          1 |
     44 | 1300001 |        S |          1 |
-------------------------------------------
     45 | 1000001 |        0 |          2 |
     45 | 1000001 |        N |          2 |
     45 | 1000001 |        S |          1 |
     45 | 1300001 |        0 |          2 |
     45 | 1300001 |        N |          1 |
     45 | 1300001 |        S |          1 |
-------------------------------------------
     46 | 1000001 |        0 |          2 |
     46 | 1000001 |        N |          4 |
     46 | 1000001 |        S |          2 |
     46 | 1300001 |        0 |          2 |
     46 | 1300001 |        N |          4 |
     46 | 1300001 |        S |          2 |
-------------------------------------------
     47 | 1000001 |        0 |          0 |
     47 | 1000001 |        N |          0 |
     47 | 1000001 |        S |          1 |
     47 | 1300001 |        0 |          0 |
     47 | 1300001 |        N |          0 |
     47 | 1300001 |        S |          1 |
-------------------------------------------
     48 | 1000001 |        0 |          2 |
     48 | 1000001 |        N |          2 |
     48 | 1000001 |        S |          2 |
     48 | 1300001 |        0 |          2 |
     48 | 1300001 |        N |          2 |
     48 | 1300001 |        S |          1 |
-------------------------------------------
     49 | 1000001 |        0 |          1 |
     49 | 1000001 |        N |          2 |
     49 | 1000001 |        S |          0 |
     49 | 1300001 |        0 |          1 |
     49 | 1300001 |        N |          0 |
     49 | 1300001 |        S |          0 |
-------------------------------------------
    410 | 1000001 |        0 |          2 |
    410 | 1000001 |        N |          3 |
    410 | 1000001 |        S |          3 |
    410 | 1300001 |        0 |          2 |
    410 | 1300001 |        N |          3 |
    410 | 1300001 |        S |          3 |
-------------------------------------------

Pavel Kolar
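
The per-shuffle numbers in the tables above are easier to compare once they are averaged over the 10 shuffles. The following is only a minimal sketch in Python of that arithmetic. The error counts are copied from the tables in this mail, but the choice of rows, the labels and the names ERRORS, DEFAULT, mean_errors and ratio_to_default are illustrative assumptions and are not part of toer.crm or CRM114.

```python
# Errors per 500 mails for shuffles 41..49 and 410, copied from the tables
# in this mail. Which rows to include and how to label them is an
# illustrative choice, not something defined by toer.crm.
ERRORS = {
    "OSBF b=94321 r=9 preproc=0": [3, 7, 10, 7, 9, 10, 5, 7, 10, 10],
    "OSBF b=94321 r=9 preproc=N": [7, 30, 34, 41, 36, 37, 38, 37, 34, 43],
    "OSBF b=94321 r=9 preproc=S": [93, 35, 31, 42, 24, 39, 21, 39, 33, 31],
    "OSBF b=94321 r=1 preproc=0": [10, 0, 1, 3, 0, 1, 1, 2, 2, 2],
    "OSBF b=5321 r=1 preproc=0": [8, 0, 0, 3, 0, 0, 1, 2, 2, 1],
    "SBPH b=1000001 preproc=0": [12, 1, 2, 1, 2, 2, 0, 2, 1, 2],
}

# The OSBF row with default buckets and default r, used as the baseline.
DEFAULT = "OSBF b=94321 r=9 preproc=0"


def mean_errors(label: str) -> float:
    """Average number of errors per 500 mails over all shuffles."""
    values = ERRORS[label]
    return sum(values) / len(values)


def ratio_to_default(label: str) -> float:
    """How many times more (or fewer) errors than the default OSBF row."""
    return mean_errors(label) / mean_errors(DEFAULT)


if __name__ == "__main__":
    for label in ERRORS:
        print(f"{label:28s} mean={mean_errors(label):5.1f} "
              f"ratio={ratio_to_default(label):4.2f}")
```

For these rows the means are 7.8 errors for the default OSBF setting, 33.7 with N and 38.8 with S (ratios of about 4.3 and 5.0, in line with the factor of 4-5 mentioned above), 2.2 for r=1 with the default buckets, 1.7 for r=1 with b=5321, and 2.5 for SBPH with 1000001 buckets and no preprocessing.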