From: Ger H. <ge...@ho...> - 2008-09-03 15:08:54
On Wed, Sep 3, 2008 at 3:09 PM, Bill Yerazunis <ws...@me...> wrote:
> More like:
>
>    LEARN ( c1.stat c2.stat | c5.stat ... c127.stat) < osbf unique> [my.txt]
>
> which means "train my.txt in as a positive example in statistics files C1
> and C2, and as a negative example in files C5 through C127". If a
> file is not found, initialize it as "osbf unique", otherwise use
> the self identification in the file to choose the correct learning
> method.

Whoa. I am probably OD'ing on Microsoft Excel right now so my 'grok' is down to zero, but can you please run that "self identification" bit by me again? Or is that something along the lines of 'open file, read header, check classifier id+config in there, *then* jump to classifier'?

(Which can be done, if you provide a 'csscreate' script opcode or some such (which is only a stupid stub in GerH now, btw), which is then to be used to 'create/set-up' any new CSS file. mailreaver's 'learn zilch' trick to create css on the fly then has to be replaced with such a csscreate opcode.)

Am I thinking too 'classical/procedural' here regarding learn?

Anyway, what I read in your text is that you're going for something like this: assume message M which will be classified, then [unidentified intelligent code] will train message M as 'spam' or 'ham'. The code below assumes the auto-ID'ing classifier as described above, so no attributes are needed:

   classify (S|H) [M]
   ...
   learn (S|H) [M]    --> learn as spam (left side is 'S'pam CSS files, right is 'H'am CSS)
   ...
   learn (H|S) [M]    --> learn as ham (because now 'H'am is at left)

which means you rotate the S/H CSS file[s] [collections] around that | pipe symbol there. That would be identical - I think - to Paolo's

   learn (S|H) <1> [M]    --> learn '1st' side == left side == spam
   ...
   learn (S|H) <2> [M]    --> learn '2nd' side == ham

Now for multiclass A|B|C|D|... it would probably work the same: you just rotate the proper class E {A,B,C,D,...} (E == element of, no math symbols in email) to the front, while Paolo's would send along the proper 'index' value as an attribute or some such.

If it's like that, I'd rather have the 'indexed' variant instead of the 'rotated around | pipe' style, because it would take only one isolated var to schlep that bunch around, and it saves on possible if/else conditionals as well: I might be able to bluntly derive index i E {<1>, <2>, ...} from a previously determined pR using a bit of :@: math. But that's just me.

The 'rotating' style is automatically backwards compatible (keeping 'details' like <refute> outside that equation for now) when you have an 'optional pipe' instead of a 'required pipe' (and provided the "you know what you are doing" caveat applies to the script writer).

Meanwhile, SVM still has 2 pipes and 3 files where everybody else uses A|B (1 pipe, 2 files) for the same thing, so there's still a bit of 'irregularity' there to my mind. But then I probably should stick to looking at lotsa numbers in rectangles instead of attempting brain activity today.

-- 
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
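
For comparison, the 'rotated' and 'indexed' learn styles discussed above can be modelled in a few lines of Python. This is only a rough sketch: the function names learn_rotated and learn_indexed are made up for illustration, the generated strings mimic the syntax proposed in this thread (not confirmed CRM114 syntax), and "rotate to the front" is assumed to mean moving the target class first while leaving the others in their original order.

```python
from typing import List


def learn_rotated(classes: List[str], target: str, msg: str = "M") -> str:
    """Rotating style (hypothetical helper): put the class to be trained
    at the left of the pipe, keep the remaining classes in order."""
    if target not in classes:
        raise ValueError(f"unknown class: {target}")
    ordered = [target] + [c for c in classes if c != target]
    return f"learn ({'|'.join(ordered)}) [{msg}]"


def learn_indexed(classes: List[str], target: str, msg: str = "M") -> str:
    """Indexed style (hypothetical helper): keep the class order fixed
    and pass the 1-based position of the class to be trained."""
    index = classes.index(target) + 1  # raises ValueError if target is unknown
    return f"learn ({'|'.join(classes)}) <{index}> [{msg}]"


if __name__ == "__main__":
    # Two-class case from the mail: S == spam files, H == ham files.
    for target in ("S", "H"):
        print(learn_rotated(["S", "H"], target))
        print(learn_indexed(["S", "H"], target))
    # Multiclass case: only the index changes in the indexed style.
    print(learn_rotated(["A", "B", "C", "D"], "C"))
    print(learn_indexed(["A", "B", "C", "D"], "C"))
```

In the indexed form the class list is written once and only a single number varies, which is the "one isolated var" advantage argued for above; in the rotated form the whole list has to be rebuilt for each training target.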