From: B C. <clo...@ya...> - 2008-10-21 14:57:05
PyTables Users,

I've read the following thread in an attempt to better understand how to organize a 2D EArray/CArray and retain the ability to efficiently select rows or columns:

http://www.mail-archive.com/pyt...@li.../msg00723.html

In this thread it was suggested (at least by my reading) that access to the columns of an EArray built by appending rows can be done efficiently if the appropriate chunkshape is passed. It was also suggested that a second copy of the data be stored in a different orientation, but that statement was a bit unclear. What I'm looking for is a clear example of how to efficiently access the columns of an array built by appending rows. My data come in as a series of rows, but I would like to be able to read the columns in a reasonable amount of time.

Below I have a code snippet that creates a fairly large EArray by appending rows. Can anyone provide some insight on how to access these columns efficiently and/or how to make a second copy of the data in the file using the appropriate chunkshape? (It is the chunkshape aspect that I'm unclear on: how is that size chosen?)

Thanks for all your help.

Brian

#################Begin Snippet:

import tables as T
import numpy as N
import time

t1 = time.clock()

hdf = T.openFile('test.h5', mode="w", title='')
atom = T.Int32Atom()
#shape = (?,?)
#chunkshape = (?,?)
rows = 400
columns = 350000
arr = N.random.random(rows)*100
shape = (rows, columns)  #(rows, columns)
filters = T.Filters(complevel=5, complib='zlib')
ea = hdf.createEArray(hdf.root, "EArray", atom, (0, rows),
                      filters=filters, expectedrows=rows)

for i in xrange(columns):
    arr = N.random.random(rows)*100
    #print i
    ea.append(arr[N.newaxis, :])
    ea.flush()
    if i % 10000 == 0:
        print i

#ea[:,1]  # is really slow, whereas,
#ea[1]    # is fast; how do I use chunkshape in order to efficiently access
#         # columns when the array was built by rows?

hdf.close()
print "Done"
t2 = time.clock()
print t2 - t1
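For what it's worth, the kind of thing I imagine a "second copy in a different orientation" might look like is sketched below. This is a minimal, untested sketch assuming PyTables 2.x; the node name "EArray_colwise", the block size, and the chunkshape values are placeholders I made up, not anything from the thread above. The idea is to copy the row-appended EArray into a CArray whose chunks are tall (many rows) and narrow (few columns), so that reading one column only touches a handful of chunks.

import tables as T

# Hypothetical sketch: re-open the file written by the snippet above and make
# a column-friendly copy of the EArray.
hdf = T.openFile('test.h5', mode="a")
ea = hdf.root.EArray
n, m = ea.shape  # roughly (350000, 400) after the loop above

# chunkshape=(16384, 8) is a guess: each chunk spans many rows but only a few
# columns, so ca[:, j] reads ~n/16384 chunks instead of every row chunk.
ca = hdf.createCArray(hdf.root, "EArray_colwise", T.Int32Atom(), (n, m),
                      filters=T.Filters(complevel=5, complib='zlib'),
                      chunkshape=(16384, 8))

# Copy in row blocks: each block is cheap to read from the row-chunked EArray
# and is written once into the column-friendly CArray.
block = 16384
for start in xrange(0, n, block):
    stop = min(start + block, n)
    ca[start:stop, :] = ea[start:stop, :]
ca.flush()

# ca[:, 1] should now be much faster than ea[:, 1].
hdf.close()

Is something along these lines what was meant, and if so, how should the chunkshape (and the block size used for the copy) actually be chosen?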