From: Francesc A. <fa...@py...> - 2009-05-19 08:42:37
On Tuesday 19 May 2009 05:03:48 you wrote:
> On May 18, 2009, at 3:06 AM, Francesc Alted wrote:
> > On Monday 18 May 2009 10:31:47 Francesc Alted wrote:
> >> On Sunday 17 May 2009 15:31:00 Robert Ferrell wrote:
> >>> I have an elementary question.
> >>>
> >>> I have a dictionary with about 10,000 keys.  The keys are (shortish)
> >>> strings.  Each value is a time series of structured arrays (record
> >>> arrays) with 5 fields.  Each value totals about 100,000 bytes, so the
> >>> total data size isn't huge, about 1GB.
> >>>
> >>> What would be a good way to store this in PyTables?  I've been
> >>> creating a group for each key, but that is a bad idea (since it's
> >>> very slow).
> >>>
> >>> I have very little knowledge/experience with either data bases or
> >>> PyTables, so I'm pretty sure I'm just missing a basic concept.
> >>
> >> Mmh, there are several ways to implement what you want.  However,
> >> provided that your values are structured arrays, the easiest (and
> >> probably one of the fastest) way is to implement the dictionary as a
> >> monolithic table.
> >
> > Er, this is the fastest, if you have PyTables Pro and you index the
> > key field, of course ;)
> >
> > Another solution in case you don't want to buy Pro is to set up a
> > VLArray of ObjectAtom atoms and save every recarray in a single row.
> > Then, build a table with two fields: 'key' where you save your key and
> > 'vrow' where you save the row location of your value in the VLArray.
> > With this, you can fetch the value quickly by using an idiom like:
> >
> > print 'key == "2" -->', vlarray[keys.readWhere('key == "2"')['vrow'][0]]
> > print 'key == "1001" -->', vlarray[keys.readWhere('key == "1001"')['vrow'][0]]
> >
> > I'm attaching a new script based on this approach.
>
> Thanks for your quick response.  I'll try this out.  I neglected to
> mention that the time series vary somewhat in length.  I'm thinking
> that makes the VLArray desirable.
> In any case, I get the idea of
> putting the keys in the table.  That's a step forward in my
> understanding.

Yet another solution is to use a single table for keeping the time series
and another one where you keep the key, the starting row for a specific
time series and the length of this time series.  Something like:

    import tables as tb

    class Record(tb.IsDescription):
        key = tb.StringCol(itemsize=10, pos=0)
        srow = tb.Int64Col(pos=1)  # start row in recarray table
        rlen = tb.Int64Col(pos=2)  # length of recarray in recarray table

With this the queries would be:

    (_, srow, rlen) = k.readWhere('key == "2"')[0]
    print 'key == "2" -->', v[srow:srow+rlen]
    (_, srow, rlen) = k.readWhere('key == "1001"')[0]
    print 'key == "1001" -->', v[srow:srow+rlen]

Attached is a simple example of this.  As I said before, there are many
possibilities :)

-- 
Francesc Alted
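
For readers without the attachment, here is a minimal pure-Python sketch of the two-table layout described above. It is not PyTables code: a plain list stands in for the values table `v`, and a dict stands in for the key table `k` (in PyTables the lookup would be a query on the `key` column instead). The helper names `append_series` and `fetch_series` are hypothetical and exist only for illustration.

```python
# Stand-ins for the two tables (assumption: plain Python containers,
# not PyTables objects).
values = []   # all time-series rows stored back to back (role of "v")
index = {}    # key -> (srow, rlen), the role played by the key table "k"


def append_series(key, rows):
    """Append one variable-length series and record where it lives."""
    if key in index:
        raise KeyError("duplicate key: %r" % (key,))
    srow = len(values)            # start row of this series in `values`
    values.extend(rows)
    index[key] = (srow, len(rows))


def fetch_series(key):
    """Return the rows for `key` using the stored start row and length."""
    srow, rlen = index[key]
    return values[srow:srow + rlen]


# Series of different lengths share one flat store.
append_series("2", [(0, 1.5), (1, 2.5), (2, 3.5)])
append_series("1001", [(0, 9.0), (1, 8.0)])

print('key == "2" -->', fetch_series("2"))
print('key == "1001" -->', fetch_series("1001"))
```

Because only a start row and a length are kept per key, series of different lengths fit in the same store without padding.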