From: Chris B. <cba...@as...> - 2008-07-21 23:21:30
|
The answer to these questions might very well be, "test and measure". If that's the case, I'd appreciate a pointer to whatever help is available for the methodology, since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.)

How do you calculate the hardware requirements, especially the size of CSS files needed, with a CRM114 program?

One of my long range projects is to write an AI for the boardgame Diplomacy using CRM114. The approach is to archive combinations of turn results and moves, sorted by how favorable the outcome was. The program builds movement sets by parsing game results to determine the disposition of its units, then consults a movement matrix to generate all possible order sets. Each movement set and result combination is submitted to the classifier to determine how closely it "resembles" winning combinations from games in its training.

What do I need to know in order to estimate the necessary size of the CSS file? There are ~175 unit dispositions and an average of 5 possible destinations for each unit. A well trained classifier which has not eliminated any trivial cases will have no more than 17*5*(175!/158!) documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases.

I also need to be able to determine the optimal classification granularity (the number of categories to sort into). Logically, I think I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor do any other outcome-based refinements in classification. If I divide the classification according to a metric of game progress, then I can effectively reduce the size of the CSS files at the expense of calculating that metric each turn. Are there any guidelines for determining how the size of the CSS files affects classification speed?

Chris
|
From: Ger H. <ge...@ho...> - 2008-07-22 06:00:52
|
On Tue, Jul 22, 2008 at 1:21 AM, Chris Babcock <cba...@as...> wrote:
> The answer to these questions might very well be, "test and measure". If
> that's the case, I'd appreciate a pointer to whatever help is available
> for the methodology, since CRM114 and working with text classification in
> general are new to me. (I hate Perl, but I'm pretty handy with sed... Go
> figure.)
>
> How do you calculate the hardware requirements, especially the size of
> CSS files needed, with a CRM114 program?

Okay, let me try my hand at this on the quick.

First of all, there's the classifier you pick. Different classifier, different behaviour, different size requirements. A few of them require unlimited space (their CSS files grow a little every time), but most of them have fixed space requirements. I'm a bottom-up guy most of the time, so let's start bottom up for some goodness.

All 'production quality' classifiers (that's OSB and friends) are based on a fixed-size hash table. Since the method chosen to store stuff in that hash table is the 'linear probe' algorithm, you should never ever try to get beyond the 50% fill rate point, because beyond that the hash table's performance deteriorates QUICKLY (quite a few papers on that; 50% isn't 'hard', it's a 'rule of thumb' number).

Since the hash table is filled with hash elements and we assume a reasonable quality hash here (flat distribution in N dimensions and bla bla bla), the best case is a flat fill. To satisfy the mechanical engineer in me, who's learned there's no such thing as a 'best case' in daily practice unless you get your lifetime's joss delivered in a single day, we add a fudge factor; after sucking on my thumb (mjam) I'd say a fill of about 20-30% would be swell.

Performance? Can be assumed to be about flat (O(1)). I don't have 'live numbers' on this one, but my guess is that bigger CSS files will be slower due to more chances at disc and CPU cache misses while poking at the hash table entries. Fast disc I/O helps there. RAID5, maybe RAID6 or other dandiness...

Now, one element is one hash plus a number, clocking in at 4 bytes each, so at N elements that's N*8 bytes of disc space per 'feature'. Given a 32-bit box and a safety margin of 2 for signed-versus-unsigned queerness -- NOT to be mistaken with the use of signed versus unsigned int discussed elsewhere ;-) -- that's a max size of 2GB / 8 = 2**(31-3) = 2**28 elements. Then use a 25% fill because it's a nice number (1/2**2), and that means we can store 2**26 elements max on 32-bit 'without noticeable loss of performance'. (Thanks to taking the 2G edge instead of 4G, I also have a fighting chance of getting this to actually /work/ on such a box, since CRM114 uses memory mapping and we can't eat the whole address space for just the CSS file -- no space left for the binary and misc data otherwise.)

Ah! But to classify you need at least two CSS files! And given that our memory mapping is done all at the same time, I'll dial down that max(N) number to 2**25.

Because you can smell the napalm from here when looking at your numbers, let's quickly see what 64-bit has on offer in the 'best case' -- and that would include additional money for harddisc technology researchers an' all: 2**(63-3-2-1) ==> a max(N) storage capacity of 2**57, which would mean you're good to go, topping out somewhere beyond 0.125 ExaFeatures (where one Feature is one CRM114 token, a.k.a. 'hash'). If you get that kinda space, could I maybe charge you a measly commission fee in the form of .00001% of your disc space, yes? MY problems are solved then. ;-)

So far the 'practical' limits.
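If you want to fiddle with those limits yourself, here's the same back-of-the-envelope in runnable form. This is only a sketch of the arithmetic above, not anything from the CRM114 code base; the constants (8 bytes per element, 25% fill, two CSS files) and the function name are just my illustrative assumptions:

    # Rough feature-capacity ceiling, following the arithmetic above.
    # Purely illustrative constants -- estimate assumptions, not CRM114 internals.

    BYTES_PER_ELEMENT = 8      # one 4-byte hash plus one 4-byte number
    FILL_RATE = 0.25           # stay well clear of the 50% linear-probe knee
    MIN_CSS_FILES = 2          # you need at least two CSS files to classify

    def max_features(usable_address_space_bytes):
        """Ceiling on stored features per CSS file for a given address space."""
        slots = usable_address_space_bytes // BYTES_PER_ELEMENT
        return int(slots * FILL_RATE) // MIN_CSS_FILES

    print(max_features(2 ** 31))   # 32-bit box, ~2 GB usable: 2**25 features
    print(max_features(2 ** 63))   # idealized 64-bit box:     2**57 features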
Now, from your side of the fence. Taking 17*5*(175!/158!) on faith (this is my morning coffee, and it's gotta be fast, so I DO believe) at 6K sized docs? Hmmmm.... Let's just assume one doc is one(1) Feature (it probably isn't, but what the heck, my backbone already feels where this is going; trying to beat Big Blue at it, are we, eh? :-)) ). That would mean, say, n!/m! =~= 100**(n-m) for m >= 100 here (and that's a BIG lie! but a really sweet one) ==> we're going to be hit by a feed of over 17*5*(100**(175-158)), and since we're ballparking here like there's no tomorrow, that'd be somewhere beyond 100**(175+1-158) = 100**18 == 10**36, which is somewhere over the rainbow and beyond an Exa SQUARED. Like the backbone already knew: ...OOPS?!

Not to be the bee in your bonnet - I like the idea! :-)) - but (a) all them Features are never ever gonna fit, even if you get unlimited sponsoring by Hitachi and IBM, heck, even if you /buy/ them, and (b) assuming for now that (pre-)calculating/learning/whatever one such item takes about a single modern-day CPU clock tick, i.e. ~1.0e-9 seconds, which is rather optimistic out the /other/ side, you'd /still/ be at it when the Four Horsemen are having a snack on our offspring.

Of course, we can make the bugger 'learn on the job' (don't we all?) and then it turns into a question of 'lifetime': how much do you want it to learn, and how good should the bugger be at playing Diplomacy... in the end? Because there are surely 'pathways' to be found in that data, a.k.a. 'successful strategies'. Guesstimating what learning _those_ will cost is _way_ beyond the morning coffee, though. Sooooo... to get that Diplomacy-playing Big Blue going somewhere during /this/ lifetime, brute force ain't gonna cut it.

Assuming the above was kicking in an open stable door (but fun!), the plus benefit of it all is that we have one practical, usable result here: if you know how many different words ('features') you want this Bayes box to 'remember', you can take that number (N), multiply it by 4*8 = 32 for a 25% filled OSB[F].Markov/... classifier, and your advised/preferred CSS file size would be N*32 bytes. To be eligible for one(1) yes/no style classification question ("is it or isn't it?"), that takes two(2) CSS files, so the total disc /cost/ would be about N*64 bytes, excluding a negligible bit of header icing on the cake. (There's a little sizing sketch at the end of this mail if you want to play with those numbers.)

Of course, that says nothing about what a 'feature' would BE in your case; in email it's generally one word, but that's also an 'it depends...', so there's lots of puzzling to do before you hit the Bayes box. Big question before plugging it in: exactly _what_ are we going to feed the animal? (See also a blurb about stocks analysis a few months ago in this ML. Simply plonking in raw data ain't gonna cut it. Same here.)

On another note - before I run: if you want 'win/lose/draw' three-way or other 'multiway' decisions, that is theoretically (and practically) possible with CRM114; for every extra choice you have to add one(1) more CSS file (and a | pipe symbol in your script). Multiway weighting is 'supported' in the code, but I haven't heard about anybody actively using it since I first popped by in autumn 2007, so software-wise YMMV, Caveat Emptor, pick your classifier wisely and all that, and here's a rabbit foot as well. 'Cause you're gonna _need_ it.

Having unsettled you sufficiently, I'm exiting stage left outa here. The laboring masses and all that. Still, I love your idea. Tip if you want to pursue this: check out what the chess boys have been doing. Same problem; smaller scale (cough).
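And to put the corpus estimate next to that sizing rule, a second quick sketch (again plain Python, nothing from CRM114 itself; the 17*5*(175!/158!) figure is straight from Chris's mail, and the 32-bytes-per-feature rule is just my 25%-fill assumption from above):

    # Sanity-check the ballpark above; purely illustrative constants.
    from math import factorial

    # Corpus size as quoted: 17 * 5 * (175!/158!) candidate documents.
    docs = 17 * 5 * (factorial(175) // factorial(158))
    print("%.3e candidate documents" % docs)   # roughly 5e+39 -- well past 10**36

    def css_bytes(n_features, css_files=2):
        """Advised disc cost: 32 bytes per feature per CSS file (header ignored)."""
        return n_features * 32 * css_files

    print(css_bytes(10 ** 6))   # e.g. one million features -> 64,000,000 bytes total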
--
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
|