From: Paolo <oo...@us...> - 2008-09-03 07:27:40
On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
...
> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
> my pie too. Simple.

let me stress once again that I question the _requirement_.

> The set passed to classify is a set and should be passed to learn as

right, if we had vector/array struct it'd be 'natural'

...
> isolate (:c:) /class1 | class2 | and so on .../
...
> classify (:*:c:) [message]

which is a fake vector, works on strict assumptions on how to name
var/classes. Like in other situations, having true array data structure
would be quite useful.

> learn (:*:c:) (index) [message]
>
> Both look good to _me_. ;-)

agreed, provided that

! learn (:*:s:) [message] <i flags>

where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the
N classes in use, is (remains) legal (where allowed).

> Because you always pass along the whole set at script level, the
> classifier code (both learn and classify implementation) gets to pick

there's no need for that, where's the binding between script level and
classifiers implementation? eg I can define N classes, but use any subset
for both LEARN / CLASSIFY at any point to my taste/needs, with the limit
of the actual classifier's requirement:

!# use classes: one two three four five six seven
! learn (one two three four five six seven) <i flags> [msg_x]
! learn (three four five) <i flags> [msg_y]
! learn (one) <flags> [msg1]
! ...
! classify (one two three four five six seven) <flags>
! classify (five six seven) <flags>
! classify (three six seven) <flags>
! classify (one three four six seven) <flags>
! classify (six) <flags> (cm) # class membership -> cm, unsupported atm

...
> what they want/need, you get the chance to apply filters & processes
> in learn that are simply impossible right now PLUS you don't have to

that's C level, SVM wants 3 because it uses 3 in both cases.
> worry anymore either which classifier you're gonna use because today
> all the bloody buggers require their own particular incantation when
> it comes to number of css files (classes) passed to learn.

there are categories of classifiers that have same requirements wrt
#classes and params. Now suppose the actual classes are compatible, but
one classifier needs 1+ extras (eg SVM) and I want to compare classifiers,
then it'd be nice to do (SVM case, forget 4now actual class compatibility):

! learn (a b a_v_b) <svm flags> # wants all 3
! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3
! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b

> So no unified ... mess; I'd say it's unified ... structure / design.

maybe, but that's not as simple as saying :
  define: N classes
  hence: LEARN(1 2 ... N)
         CLASSIFY(1 2 ... N)
which might turn into a mess, or better shift the mess from one place to
another.

> Cost for Trever @ 60 classes? nil.

wasn't thinking of run time cost, but script readability.

> You save far more time when you find a way to reduce disc I/O cache
> misses on your memory-mapped CSS files, even when you achieve such a
> feat for learn alone (which would be rather weird and besides, unless
> you 'Train Everything', optimizing classify is the winner). I have a

yes, though once N classes get mmaped for a CLASSIFY a single class LEARN
can check for it and won't mmap() again, and mmsync() can be deferred
iff other processes that use same class(es) do that via shared mem.

> Want some real, achievable gain? convert crm114 to play 'server', i.e.
> permanently loaded and CSS files (close to) permanently mapped in

yes yes yes yes - the endless daemon saga :)

> invocation of crm114 and the moment the script *tokenizer* kicks in.
> You're not even *executing* script yet by then! The rest (8%) is
> spread across tokenizing ('compiling the [small!] script'), tokenized
> script code execution, wrap-up and unidentified fluff elsewhere.
> Believe me, if I'd see an easy way to kick that bugger into higher
> gear, you'd already have it.

yeah, maybe the ability to run pre-compiled scripts can be good idea
for a number of applications.

> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
> Apache2 plugin: you get the server, the socket I/O and the

like Apache's Lucene and derivatives.

> live in there like a wicked PHP-alike server-side scripting language
> and you will definitely achieve instant notoriety. ;-)

and support headache ;)

> Anyhow, I don't see any good reason why the learn (classes) argument
> cannot be identical to the related classify (classes) argument, except

see above: CAN but definitely should not be a MUST.

> ONE: strict adherence to 'backwards compatibility' at CRM114 script

just one good reason.

--
paolo
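The mmap remark in this message (a single-class LEARN can reuse a mapping that an earlier CLASSIFY already created) could look roughly like the sketch below. It is written in Python purely for illustration and is not crm114's C code: the `CssMapCache` class and its method names are hypothetical, and it only covers reuse inside one process, not the shared-memory case between processes.

```python
import mmap
import os


class CssMapCache:
    """Hypothetical per-process cache of memory-mapped class files."""

    def __init__(self):
        self._maps = {}  # resolved path -> mmap object

    def get(self, path):
        """Return the mapping for path, creating it only on first use."""
        key = os.path.realpath(path)
        mapped = self._maps.get(key)
        if mapped is None:
            # The file must be non-empty; mmap keeps its own handle,
            # so the file object can be closed right away.
            with open(key, "r+b") as fh:
                mapped = mmap.mmap(fh.fileno(), 0)
            self._maps[key] = mapped
        return mapped

    def close_all(self):
        """Flush deferred writes to disc and drop every mapping."""
        for mapped in self._maps.values():
            mapped.flush()
            mapped.close()
        self._maps.clear()
```

A classify over seven classes followed by a learn on one of them would call `get()` eight times but map only seven files; the flush is deferred until `close_all()`.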
From: Ger H. <ge...@ho...> - 2008-09-03 10:28:19
On Wed, Sep 3, 2008 at 9:27 AM, Paolo <oo...@us...> wrote:
> On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
> ...
>> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
>> my pie too. Simple.
>
> let me stress once again that I question the _requirement_.

[...]

> agreed, provided that
>
> ! learn (:*:s:) [message] <i flags>
>
> where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the
> N classes in use, is (remains) legal (where allowed).

Yes, that should be possible in my line of thought. Assuming "you know
what you're doing" i.e. are aware of classifier internals, you can do
this in the new 'learn'. Take existing OSB for example (*forget* my
'delta' stuff for a sec there), which touches only a single CSS file on
learn, then

  learn (A|B) <1>

is identical to

  learn (A) <1>

is identical to

  learn (A)

is identical to

  learn (A|B)

because <1> is a possible 'default' -- though that might be a disputable
thing - I'd rather see an error report, because learn (A|B) isn't
'obviously' going to teach the way of A.

The thing I'm really after is that at script level

  learn (A|B|...) <i>

is supported for _all_ classifiers. When you're doing smart stuff
script-wise where you like to code

  learn (A)

while your classify code is

  classify (A|B|C|D|E|F|..)

fine. The bit of 'cut at pipe, pick the ones you want' code I envision
can handle it, so you've got options script-code-wise.

In other words: a 'set' of one, is still a set in my book. That you as a
script writer might want to take that thought to the edge (set of 1) is
fine with me. I always appreciate that kind of craftiness. It's just that
the starting point shifts for people new to this: keep the set around and
apply to both classify and learn equally. When you are ready to read the
fine print in the manual, you can decide to use 'set of 1' as a valid
'fringe case' (fringe from script-language structural point of view).
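The 'cut at pipe, pick the ones you want' step could be sketched as below. This is a hedged Python illustration, not crm114 code: `split_class_set` and `pick_learn_target` are hypothetical names, the index is assumed to be 1-based as in `<1>`, and a missing index on a multi-class set is treated as an error (the stricter of the two behaviours discussed above) rather than defaulting to the first class.

```python
from typing import List, Optional


def split_class_set(arg: str) -> List[str]:
    """Cut a '(A|B|C)'-style argument at the pipes; drop empty pieces."""
    inner = arg.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    names = [part.strip() for part in inner.split("|")]
    names = [name for name in names if name]
    if not names:
        raise ValueError("empty class set")
    return names


def pick_learn_target(arg: str, index: Optional[int] = None) -> str:
    """Pick the class to learn into from a set plus a 1-based index."""
    names = split_class_set(arg)
    if index is None:
        # A set of one is still a set: the index may be dropped.
        if len(names) == 1:
            return names[0]
        raise ValueError("index required when the set has several classes")
    if not 1 <= index <= len(names):
        raise ValueError("index %d outside 1..%d" % (index, len(names)))
    return names[index - 1]
```

With this, `pick_learn_target("(A|B)", 1)` and `pick_learn_target("(A)")` both select `A`, while `pick_learn_target("(A|B)")` raises instead of guessing.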
What I *need* is learn (A|B) support for classifiers that don't have it
yet (OSB and friends) and currently there's no possibility for coding

  learn (A|B) <i osb>

so I am prevented from testing my ideas for the classifier itself.

>> what they want/need, you get the chance to apply filters & processes
>> in learn that are simply impossible right now PLUS you don't have to
>
> that's C level, SVM wants 3 because it uses 3 in both cases.

Aicks! You _got_ me there. Forgot the 3rd one in SVM. DANG! Still a
remaining 'oddity' hence. :-(( No good answer there except mumbling about
the implicit 'variable size' of a 'set' as I approach it.

> ! learn (a b a_v_b) <svm flags> # wants all 3
> ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3
> ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b
>
>> So no unified ... mess; I'd say it's unified ... structure / design.
>
> maybe, but that's not as simple as saying :
>   define: N classes
>   hence: LEARN(1 2 ... N)
>          CLASSIFY(1 2 ... N)
> which might turn into a mess, or better shift the mess from one place to
> another.

Sure it's a shift: out of the [script] language, so it's 'black boxing'
learn as it is classify, and into the [C] code. I think for general use
it's less mess because you need to 'remember' less about the script
language and the 'learn' interface, because apart from the extra index
(in a sense you're _feeding_ it the pR which would pop out of classify as
a result) it's exactly like classify. I really like language layout where
general use requires the least number of 'rules' and 'details' to be
remembered: it makes for a simpler language overall which is good for me
as I work with multiple languages and a limited brain.
;-)

(This learn/classify stuff is - in a way - comparable to old discussions
about 'coding standards' and such for Pascal or C, where there's a class
of folks that say: "you can skip the braces/begin-end and the semicolons
so you should" while I am clearly with the folks that say: "don't matter
what you do, always apply the same structure: braces/begin-end and
semicolons and stuff, unless it is _prohibited_ by the language". Right
now 'learn (A)' is prohibiting me from using 'learn (A|B)'. I think that
bit didn't make it through last night.)

>> Cost for Trever @ 60 classes? nil.
>
> wasn't thinking of run time cost, but script readability.

Same here. But Trever was starting to worry, it seemed to me, performance
would drop, if ever so slightly, if we'd be introducing this. And in case
others were going to think it mattered.

> yes, though once N classes get mmaped for a CLASSIFY a single class LEARN
> can check for it and won't mmap() again, and mmsync() can be deferred
> iff other processes that use same class(es) do that via shared mem.

Yep. When you construct your scripts to handle classification and
subsequent learning in the same crm114 instance, you get that advantage
today. A (very limited) 'server'-y approach doable right now is writing a
script which loops, waiting for messages available on disc or stdin, and
keep on processing them one after the other in the same instance: you
have the 'CSS stays in mem' benefit then as well (note: ignoring how to
code for cutting up stdin into messages and/or poll/wait for disc-based
messages here - that's another subject)

>> Want some real, achievable gain? convert crm114 to play 'server', i.e.
>> permanently loaded and CSS files (close to) permanently mapped in
>
> yes yes yes yes - the endless daemon saga :)

[...]

> yeah, maybe the ability to run pre-compiled scripts can be good idea
> for a number of applications.

You mean a kind of .java p-coded crm114 scripts, i.e.
a real crm114 *compiler* (.crm --> .114 binary file) and, er,
accompanying 'virtual machine'? Oh boy, the table rises here. ;-P But
that's just the geek in me getting all excited. It's not on my list of
'things worth doing @ mid/short-term' though, but fun anyway.

A crude/cheap way might be an option to 'dump' and 'load' tokenized
script as it leaves the crm114 tokenizer going to the execution unit.
Tokenize once, run multiple times. It's not worth it for me (I ran tiny
scripts) but all the folks out there enjoying mailreaver and friends
might get some good delight out of that as mailreaver/mailtrainer are
_significant_ sized scripts.

>> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
>> Apache2 plugin: you get the server, the socket I/O and the
>
> like Apache's Lucene and derivatives.

Sorta. Yes.

>> live in there like a wicked PHP-alike server-side scripting language
>> and you will definitely achieve instant notoriety. ;-)
>
> and support headache ;)

I like my native Americans ;-)) Granted, moving from 1.3 to 2.0/2.2
wasn't easy for a mod_xxx, but still I like it way more than 'roll your
own [TCP-based] server' again: linking it to Apache (and no, despite the
fact that I do Win32/64, I don't think I'll be the go-to guy if you want
IIS plugin support: IIS6 is nice, in a way, and has good performance, but
I run Apache on Windows for free projects and only do IIS for paying
customers. Got to draw the line _somewhere_. If they open-source IIS,
I'll reconsider that statement.)

Anyhow, the crm114 scripts would still be there as they are right now; I
would just take the std I/O and bend it so stdin = request and stdout
(and stderr?) == response. Maybe add a touch of XML if you want to have a
freeze-dried instant low-cal 'web service' (which is hot stuff these
days, but rather old wine in fashionable new Walmart bags if you ask me,
but then folks don't seem to study IT history anymore)

Why Apache really?
Because I can then 'lean on' the stick provided by them when I need to
scale up: distributed servers, pardon, *services*, and the whole bloody
lot are documented already. Besides, my purposes lead me towards a
production environment as a 'web backend' anyhow, so why not bolt it to
the web server itself? Yup, doing so requires some understanding of the
Apache API interfacing and that's raising the tech level by +1, but at
least you can be spared some significant intricacies regarding TCP/server
performance tactics at server level. It's fun to write it, but in this
case, my feeling was it's faster to go for mod_crm114 in dev time. And
yes: that's 'faster' regarding a _production quality_ mod_crm114 compared
to _production quality_ crm114d (note the 'd'). (For free as well: SSL
secured communications with the crm114 'service' - which might be
something to cheer the 'remote services' folks up quite a bit.)

Anyway, I don't 'do' the alpha release of mod_crm114 in one week, nor can
I deliver alpha stage crm114d in the same timeframe, so it'll probably
stay a great idea over whiskey on Friday as I don't see Bill getting his
hands on a particular red phone booth with free access either. ;-)

> see above: CAN but definitely should not be a MUST.

Does my approach of 'set' as described at start of this email match your
CAN, or does it still sound like MUST to you?

>> ONE: strict adherence to 'backwards compatibility' at CRM114 script
>
> just one good reason.

well.... ;-)

--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
     http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
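The 'dump and load the tokenized script, tokenize once, run multiple times' idea from this message could be prototyped along the lines below. This is a Python sketch under loud assumptions: `toy_tokenize` is a whitespace splitter standing in for the real crm114 tokenizer, and `load_or_tokenize`, the `.tok` suffix and the hash-keyed cache directory are all hypothetical choices made for illustration.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Callable, List


def toy_tokenize(source: str) -> List[str]:
    """Placeholder only: NOT the crm114 tokenizer, just whitespace splitting."""
    return source.split()


def load_or_tokenize(
    source: str,
    cache_dir: Path,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> List[str]:
    """Return tokens for source, reusing a dump from an earlier run if present."""
    # Key the dump on the script text so an edited script is re-tokenized.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = cache_dir / (digest + ".tok")
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    tokens = tokenize(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(tokens, fh)
    return tokens
```

The gain would only matter for large scripts; for tiny ones the file lookup may cost about as much as tokenizing. The dump should also be treated as trusted local data, since loading a pickle from an untrusted source is unsafe.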