[Crm114-discuss] The new LEARN syntax


On Wed, Sep 3, 2008 at 3:09 PM, Bill Yerazunis <ws...@me...> wrote:
> More like:
>
>  LEARN ( c1.stat c2.stat | c5.stat ... c127.stat) < osbf unique> [my.txt]
>
> which means "train my.txt in as a positive example in statistics files C1
> and C2, and as a negative example in files C5 through C127".  If a
> file is not found, initialize it as "osbf unique", otherwise use
> the self identification in the file to choose the correct learning
> method.

Whoa. I am probably OD'ing on Microsoft Excel right now so my 'grok'
is down to zero, but can you please run that "self identification" bit
by me again?

Or is that something along the lines of 'open file, read header, check
classifier id+config in there, *then* jump to classifier'? That can
be done if you provide a 'csscreate' script opcode or some such
(which is only a stupid stub in GerH now, btw), which is then to be
used to create/set up any new CSS file. mailreaver's 'learn zilch'
trick to create a CSS file on the fly would then have to be replaced
with such a csscreate opcode.
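
To make the 'read header, then jump to classifier' idea concrete, here is a
minimal Python sketch. Everything in it is an assumption for illustration
only: the one-line text header, the names css_create / identify /
resolve_learner, and the "osbf unique" default are hypothetical and do not
describe the real CRM114 or GerH file layout or opcodes.

```python
import os

# Hypothetical header: first line is b"CLASSIFIER <name> [<flag> ...]\n".
# This is NOT the real .css layout, only an illustration of self-identification.
MAGIC = b"CLASSIFIER"


def css_create(path, classifier, flags=()):
    """Stand-in for a 'csscreate' opcode: write a fresh file with an ID header."""
    fields = [MAGIC, classifier.encode()] + [flag.encode() for flag in flags]
    with open(path, "wb") as f:
        f.write(b" ".join(fields) + b"\n")


def identify(path):
    """Read the header and return (classifier, flags) stored in the file."""
    with open(path, "rb") as f:
        fields = f.readline().split()
    if len(fields) < 2 or fields[0] != MAGIC:
        raise ValueError("no classifier identification in " + path)
    return fields[1].decode(), tuple(flag.decode() for flag in fields[2:])


def resolve_learner(path, default=("osbf", ("unique",))):
    """If the file is missing, create it with the default; else trust its header."""
    if not os.path.exists(path):
        css_create(path, default[0], default[1])
    return identify(path)
```

With this shape, the attributes on the LEARN statement only matter when a
file has to be created; an existing file always wins with its own header.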

Am I thinking too 'classical/procedural' here regarding learn? Anyway,
what I read in your text is that you're going for something like
this:

Assume a message M which will be classified; then [unidentified
intelligent code] will train message M as 'spam' or 'ham'. The code
below assumes the auto-ID'ing classifier as described above, so no
attributes are needed:

classify (S|H) [M]
...
learn (S|H) [M]  --> learn as spam (left side is 'S'pam CSS files,
right is 'H'am CSS)
...
learn (H|S) [M] --> learn as ham (because now 'H'am is at left)

which means you rotate the S/H CSS file[s] [collections] around that |
pipe symbol there.
That would be identical - I think - to Paolo's

learn (S|H) <1> [M] --> learn '1st' side == left side == spam
...
learn (S|H) <2> [M] --> learn '2nd' side == ham

Now for multiclass A|B|C|D|... it would probably work the same: you
just rotate the proper class E {A,B,C,D,...} (E == element of, no
math symbols in email) to the front, while Paolo's would send along the
proper 'index' value as an attribute or some such.
If it's like that, I'd rather have the 'indexed' variant instead of
the 'rotated around | pipe' style, because it would take only one isolated
var to schlepp that bunch around. It saves on possible if/else
conditionals as well, because I might be able to bluntly derive index i
E {<1>, <2>, ..} from a previously determined pR using a bit of :@:
math, but that's just me. The 'rotating' style is automatically backwards
compatible (keeping 'details' like <refute> outside that
equation for now) when you have an 'optional pipe' instead of a 'required
pipe' (and provided the "you know what you are doing" caveat applies to
the script writer).
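
The claimed equivalence of the two styles can be sketched in a few lines of
Python. This is only an illustration under my own assumptions: the function
names are hypothetical, the index is 1-based as in the <1>/<2> examples
above, and the rotated multiclass form is assumed to put the class to learn
left of the pipe and all others right of it.

```python
def indexed_to_rotated(classes, index):
    """Turn 'learn (A|B|C) <index>' into the rotated form (positive, negatives).

    `classes` is the fixed class order; `index` is 1-based, as in <1>, <2>.
    """
    if not 1 <= index <= len(classes):
        raise IndexError("index %d out of range 1..%d" % (index, len(classes)))
    i = index - 1
    # The chosen class goes left of the pipe, everything else to the right.
    return [classes[i]], classes[:i] + classes[i + 1:]


def render(left, right):
    """Format the rotated form the way it would appear in a learn statement."""
    return "(" + " ".join(left) + " | " + " ".join(right) + ")"


# Two-class case from the examples above: <1> learns spam, <2> learns ham.
spam_ham = ["S", "H"]
print(render(*indexed_to_rotated(spam_ham, 1)))  # (S | H)
print(render(*indexed_to_rotated(spam_ham, 2)))  # (H | S)
```

The indexed variant needs one integer per learn call, whereas the rotated
variant needs the script to rebuild the file list each time, which is the
if/else overhead mentioned above.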

Meanwhile, SVM still has 2 pipes and 3 files where everybody else uses
A|B (1 pipe, 2 files) for the same thing, so there's still a bit of
'irregularity' there to my mind, but then I probably should stick to
looking at lotsa numbers in rectangles instead of attempting brain
activity today.

-- 
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------