From: Bill Y. <ws...@me...> - 2004-11-15 16:22:26
I took a look at Fidelis's patch last night. It's a good patch.
It's now in the main stream. I advise others to use it.

(Though I want to rethink -p; I think that there may be a better way.)

   -Bill Yerazunis

   From: Fidelis Assis <fi...@po...>
   Cc: Crm114-SF <crm...@li...>, Crm114-devel <crm...@li...>
   Date: Fri, 12 Nov 2004 20:11:55 -0200

   Bill,

   This is the new OSBF patch, now updated for 20041110.BlameFidelisMore,
   which improves OSBF accuracy even further, reaching the measurable,
   and reproducible, mark of 1.03 errors in 500 messages! It also fixes
   some bugs.

   It is about 15k and can be downloaded from:

   ftp://ftp.embratel.net.br/pub/opensource/crm114/patch-20041110.BlameFidelisMore.gz

   md5sum - 77103165f0e87bf981c94a53d82bc232

   Here are the results of this new patch on 6 10xshuffles from the
   SpamAssassin corpus (4147 msgs each shuffle), using thresholds from
   10 to 15:

             --------------------------------
             | Errors in last 500 messages  |
             |      10xshuffles 0 to 5      |
   ------------------------------------------------------
   Threshold |  0    1    2    3    4    5  | Avg
   ------------------------------------------------------
      10     | 1.2  1.1  0.6  1.3  0.9  1.1 | 1.03
      12     | 0.8  1.2  0.9  1.2  1.2  1.2 | 1.08
      11     | 1.0  1.2  0.9  1.6  1.0  1.2 | 1.15
      14     | 1.5  1.2  0.7  1.4  1.1  1.1 | 1.17
      15     | 1.2  1.2  0.3  1.4  1.4  1.6 | 1.18  (0.3!)
      13     | 1.4  1.6  0.6  1.5  1.2  1.1 | 1.23
   ------------------------------------------------------

   With threshold = 12 it gives the best error rate on Bill's
   10xshuffle, 0.8/500. With threshold = 10 it gives the best average
   error rate on the 6 10xshuffles, 1.03/500, and shows more uniform
   results.
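The "Avg" column can be re-derived from the per-shuffle figures. The following is only a small sketch for checking the arithmetic, not part of the patch: the numbers are copied from the table, the variable names are made up for illustration, and it assumes that column 0 is Bill's 10xshuffle (which matches the 0.8/500 figure quoted for threshold 12).

```python
# Errors in the last 500 messages, copied from the table above.
# Keys are thresholds; each list holds 10xshuffles 0 to 5, in order.
errors = {
    10: [1.2, 1.1, 0.6, 1.3, 0.9, 1.1],
    12: [0.8, 1.2, 0.9, 1.2, 1.2, 1.2],
    11: [1.0, 1.2, 0.9, 1.6, 1.0, 1.2],
    14: [1.5, 1.2, 0.7, 1.4, 1.1, 1.1],
    15: [1.2, 1.2, 0.3, 1.4, 1.4, 1.6],
    13: [1.4, 1.6, 0.6, 1.5, 1.2, 1.1],
}

# "Avg" column as reported in the message.
reported_avg = {10: 1.03, 12: 1.08, 11: 1.15, 14: 1.17, 15: 1.18, 13: 1.23}

# Recompute the mean error count per threshold.
averages = {t: sum(v) / len(v) for t, v in errors.items()}

# Threshold with the lowest mean over the six shuffles.
best_avg_threshold = min(averages, key=averages.get)

# Assumption: shuffle 0 is Bill's 10xshuffle.
best_on_shuffle0 = min(errors, key=lambda t: errors[t][0])

for t in sorted(averages, key=averages.get):
    print(f"threshold {t}: avg {averages[t]:.2f} (reported {reported_avg[t]:.2f})")
print("best average:", best_avg_threshold)
print("best on shuffle 0:", best_on_shuffle0)
```

Sorting by the recomputed mean gives the same row order as the table.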
   Conditions of the tests:

     buckets: 94.321 (default)
     max chain length: 29 (default)
     pmax/pmin ratio: 9 (default)

   New records :)

     - average error rate on 6 10xshuffles = 1.03, for threshold = 10
     - average error rate on Bill's 10xshuffle = 0.8, for threshold = 12
     - 65 seconds to classify 4147 messages, and train 260, on a
       Pentium III, 800 MHz, 256M, running Linux 2.4.18.

   Evan, if you have the time to give it a try with your Monkeyplexer
   for multi-class, I'd appreciate it very much :)

   Feedback is more than welcome!

   --
   Fidelis Assis

   PS: Bill, I noticed what seems to be a big problem with your new OSB
   code in this version. After classifying about 700 msgs, with 30
   learnings, it became so slow that my PC was completely unusable, not
   responding to the mouse or keyboard. I tried with your original code
   and with the one in this patch, which makes just minor (type)
   corrections.

   ---------------------------------------------------------------------
   Changes:

   - Improved confidence factor (OSBF), with additional per class
     factor: unique/total features;
       - better accuracy;
       - allows for reduced threshold (approx. 10);
       - reduces the number of reinforcements needed for getting final
         accuracy.

   - New hash function compatible with the original but portable;
       - not tested with 64 bits, but it's supposed to return the same
         values as for 32 bits because it's not affected by endianness;

   - OSBF "classify" accepts an extra, optional, argument to set the pR
     success/failure decision point (0 by default). This can be easily
     extended to the other classifiers, if desired. Ex:

       classify <osbf microgroom> (:*:nonspamcss: | :*:spamcss: ) ( :stats: ) [:msg:] /:*:lcr:/ /:*:offset:/

     "classify" succeeds if the calculated pR is >= :offset:. The
     default offset is 0 and, if this argument is not given, the
     behaviour is exactly the same as in the original code. If it is
     given, the pR shown in :stats: will have the form "pR/offset".
     This extra parameter makes it easier to bias against false
     positives and to use thresholds in mailfilter.

   - Fixed a bug with chain average length calculation in cssutil and
     osbf-util;

   - New command line option, -r, to set min pmax/pmin ratio
     (default=9);

   - Other minor bug fixes and code improvements;
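The decision rule for the optional pR offset described in the change list can be written out as a short sketch. This is not the patch's C code: the function name and the plain number formatting are hypothetical, and only the rule itself (success when pR >= offset, default offset 0, "pR/offset" shown when the argument is given) comes from the message.

```python
def classify_succeeds(pr, offset=None):
    """Illustrative model of the pR decision point (hypothetical helper).

    Returns (success, stats_text). With no offset the decision point is 0
    and only pR is shown; with an offset the text has the form "pR/offset".
    """
    decision_point = 0.0 if offset is None else offset
    success = pr >= decision_point
    # Assumption: the numbers are printed as-is; the real :stats: output
    # may format them differently.
    stats_text = f"{pr}" if offset is None else f"{pr}/{offset}"
    return success, stats_text


# Default behaviour: decision point at pR = 0.
print(classify_succeeds(5.0))
# Same pR with an offset of 10: the classify now fails.
print(classify_succeeds(5.0, 10.0))
```

Moving the offset away from 0 shifts the point at which "classify" reports success, which is what lets a filter script apply a threshold without extra comparisons of its own.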