From: Chris B. <cba...@as...> - 2008-07-21 23:21:30
|
The answer to these questions might very well be, "test and measure". If that's the case, I'd appreciate a pointer to whatever help is available for the methodology, since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.)

How do you calculate the hardware requirements, especially the size of CSS files needed, with a CRM114 program?

One of my long range projects is to write an AI for the boardgame Diplomacy using CRM114. The approach is to archive combinations of turn results and moves, sorted by how favorable the outcome was. The program builds movement sets by parsing game results to determine the disposition of its units, then consults a movement matrix to generate all possible order sets. Each movement set and result combination is submitted to the classifier to determine how closely it "resembles" winning combinations from games in its training.

What do I need to know in order to estimate the necessary size of the CSS file? There are ~175 unit dispositions and an average of 5 possible destinations for each unit. A well trained classifier which has not eliminated any trivial cases will have no more than 17*5*(175!/158!) documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases.

I also need to be able to determine the optimal classification granularity (the number of categories to sort into). Logically, I think I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor do any other outcome-based refinements in classification. If I divide the classification according to a metric of game progress, then I can effectively reduce the size of the CSS files at the expense of calculating that metric each turn. Are there any guidelines for determining how the size of the CSS files affects classification speed?

Chris
|
From: Ger H. <ge...@ho...> - 2008-07-22 06:00:52
|
On Tue, Jul 22, 2008 at 1:21 AM, Chris Babcock <cba...@as...> wrote:
> The answer to these questions might very well be, "test and measure". If
> that's the case, I'd appreciate a pointer to whatever help is available
> for the methodology, since CRM114 and working with text classification in
> general are new to me. (I hate Perl, but I'm pretty handy with sed... Go
> figure.)
>
> How do you calculate the hardware requirements, especially the size of
> CSS files needed, with a CRM114 program?

Okay, let me try my hand at this on the quick.

First of all, there's the classifier you pick. Different classifier, different behaviour, different size requirements. A few of them require unlimited space (their CSS files grow a little every time), but most of them have fixed space requirements. I'm a bottom-up guy most of the time, so let's start bottom up for some goodness.

All 'production quality' classifiers (that's OSB and friends) are based on a fixed-size hash table. Since the method chosen to store stuff in that hash table is the 'linear probe' algorithm, you should never ever try to get beyond the 50% fill rate point, because beyond that the hash table's performance deteriorates QUICKLY (quite a few papers on that; 50% isn't 'hard', it's a 'rule of thumb' number).

Since the hash table is filled with hash elements and we assume a reasonable quality hash here (flat distribution in N dimensions and bla bla bla), the best case is a flat fill. To satisfy the mechanical engineer in me, who's learned there's no such thing as a 'best case' in daily practice unless you get your lifetime's joss delivered in a single day, we add a fudge factor; after sucking on my thumb (mjam) I'd say a fill of about 20-30% would be swell.

Performance? Can be assumed to be about flat (O(1)). I don't have 'live numbers' on this one, but my guess is that bigger CSS files will be slower due to more chances at disc and CPU cache misses while poking at the hash table entries. Fast disc I/O helps there. RAID5, maybe RAID6 or other dandiness...

Now, one element is one hash plus a number, clocking in at 4 bytes each, so at N elements that's N*8 bytes of disc space per 'feature'. Given a 32-bit box and a safety margin of 2 for signed-versus-unsigned queerness -- NOT to be mistaken with the use of signed versus unsigned int discussed elsewhere ;-) -- that's a max size of 2GB / 8 = 2**(31-3) = 2**28 elements. Then use a 25% fill because it's a nice number (1/2**2), and that means we can store 2**26 elements max on 32-bit 'without noticeable loss of performance'. (Thanks to taking the 2G edge instead of 4G, I also have a fighting chance of getting this to actually /work/ on such a box, since CRM114 uses memory mapping and we can't eat the whole address space for just the CSS file -- no space left for the binary and misc data otherwise.)

Ah! But to classify you need at least two CSS files! And given that our memory mapping is done all at the same time, I'll dial down that max(N) number to 2**25.

Because you can smell the napalm from here when looking at your numbers, let's quickly see what 64-bit has on offer in the 'best case' -- and that would include additional money for harddisc technology researchers an' all: 2**(63-3-2-1) ==> a max(N) storage capacity of 2**57, which would mean you're good to go, topping out somewhere beyond 0.125 ExaFeatures (where one Feature is one CRM114 token, a.k.a. 'hash'). If you get that kinda space, could I maybe charge you a measly commission fee in the form of .00001% of your disc space, yes? MY problems are solved then. ;-)

So far the 'practical' limits.
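If you want to fiddle with those limits yourself, here's the same back-of-the-envelope in runnable form. This is only a sketch of the arithmetic above, not anything from the CRM114 code base; the constants (8 bytes per element, 25% fill, two CSS files) and the function name are just my illustrative assumptions:

    # Rough feature-capacity ceiling, following the arithmetic above.
    # Purely illustrative constants -- estimate assumptions, not CRM114 internals.

    BYTES_PER_ELEMENT = 8      # one 4-byte hash plus one 4-byte number
    FILL_RATE = 0.25           # stay well clear of the 50% linear-probe knee
    MIN_CSS_FILES = 2          # you need at least two CSS files to classify

    def max_features(usable_address_space_bytes):
        """Ceiling on stored features per CSS file for a given address space."""
        slots = usable_address_space_bytes // BYTES_PER_ELEMENT
        return int(slots * FILL_RATE) // MIN_CSS_FILES

    print(max_features(2 ** 31))   # 32-bit box, ~2 GB usable: 2**25 features
    print(max_features(2 ** 63))   # idealized 64-bit box:     2**57 features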
Now, from your side of the fence. Taking 17*5*(175!/158!) on faith (this is my morning coffee, and it's gotta be fast, so I DO believe) at 6K sized docs? Hmmmm.... Let's just assume one doc is one(1) Feature (it probably isn't, but what the heck, my backbone already feels where this is going; trying to beat Big Blue at it, are we, eh? :-)) ). That would mean, say, n!/m! =~= 100**(n-m) for m >= 100 here (and that's a BIG lie! but a really sweet one) ==> we're going to be hit by a feed of over 17*5*(100**(175-158)), and since we're ballparking here like there's no tomorrow, that'd be somewhere beyond 100**(175+1-158) = 100**18 == 10**36, which is somewhere over the rainbow and beyond an Exa SQUARED. Like the backbone already knew: ...OOPS?!

Not to be the bee in your bonnet - I like the idea! :-)) - but (a) all them Features are never ever gonna fit, even if you get unlimited sponsoring by Hitachi and IBM, heck, even if you /buy/ them, and (b) assuming for now that (pre-)calculating/learning/whatever one such item takes about a single modern-day CPU clock tick, i.e. ~1.0e-9 seconds, which is rather optimistic out the /other/ side, you'd /still/ be at it when the Four Horsemen are having a snack on our offspring.

Of course, we can make the bugger 'learn on the job' (don't we all?) and then it turns into a question of 'lifetime': how much do you want it to learn, and how good should the bugger be at playing Diplomacy... in the end? Because there are surely 'pathways' to be found in that data, a.k.a. 'successful strategies'. Guesstimating what learning _those_ will cost is _way_ beyond the morning coffee, though. Sooooo... to get that Diplomacy-playing Big Blue going somewhere during /this/ lifetime, brute force ain't gonna cut it.

Assuming the above was kicking in an open stable door (but fun!), the plus benefit of it all is that we have one practical, usable result here: if you know how many different words ('features') you want this Bayes box to 'remember', you can take that number (N), multiply it by 4*8 = 32 for a 25% filled OSB[F].Markov/... classifier, and your advised/preferred CSS file size would be N*32 bytes. To be eligible for one(1) yes/no style classification question ("is it or isn't it?"), that takes two(2) CSS files, so the total disc /cost/ would be about N*64 bytes, excluding a negligible bit of header icing on the cake. (There's a little sizing sketch at the end of this mail if you want to play with those numbers.)

Of course, that says nothing about what a 'feature' would BE in your case; in email it's generally one word, but that's also an 'it depends...', so there's lots of puzzling to do before you hit the Bayes box. Big question before plugging it in: exactly _what_ are we going to feed the animal? (See also a blurb about stocks analysis a few months ago in this ML. Simply plonking in raw data ain't gonna cut it. Same here.)

On another note - before I run: if you want 'win/lose/draw' three-way or other 'multiway' decisions, that is theoretically (and practically) possible with CRM114; for every extra choice you have to add one(1) more CSS file (and a | pipe symbol in your script). Multiway weighting is 'supported' in the code, but I haven't heard about anybody actively using it since I first popped by in autumn 2007, so software-wise YMMV, Caveat Emptor, pick your classifier wisely and all that, and here's a rabbit foot as well. 'Cause you're gonna _need_ it.

Having unsettled you sufficiently, I'm exiting stage left outa here. The laboring masses and all that. Still, I love your idea. Tip if you want to pursue this: check out what the chess boys have been doing. Same problem; smaller scale (cough).
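And to put the corpus estimate next to that sizing rule, a second quick sketch (again plain Python, nothing from CRM114 itself; the 17*5*(175!/158!) figure is straight from Chris's mail, and the 32-bytes-per-feature rule is just my 25%-fill assumption from above):

    # Sanity-check the ballpark above; purely illustrative constants.
    from math import factorial

    # Corpus size as quoted: 17 * 5 * (175!/158!) candidate documents.
    docs = 17 * 5 * (factorial(175) // factorial(158))
    print("%.3e candidate documents" % docs)   # roughly 5e+39 -- well past 10**36

    def css_bytes(n_features, css_files=2):
        """Advised disc cost: 32 bytes per feature per CSS file (header ignored)."""
        return n_features * 32 * css_files

    print(css_bytes(10 ** 6))   # e.g. one million features -> 64,000,000 bytes total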
--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
|