From: Alvaro T. C. <al...@mi...> - 2012-04-18 17:33:31
A single array with 312,000,000 int16 values. Two (uncompressed) ways to
store it:

* Array

>>> wa02[:10]
array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')

>>> wtab02[:10]
array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
       (338,), (357,)],
      dtype=[('val', '<i2')])

Read times are 120 ms and 220 ms respectively.

>>> timeit big = np.nonzero(wa02[:] > 1)
1 loops, best of 3: 1.66 s per loop

>>> timeit bigtab = wtab02.getWhereList('val>1')
1 loops, best of 3: 119 s per loop

With a Complete Sorted Index (CSI) on 'val' and blosc (level 9) compression:
1 loops, best of 3: 149 s per loop

Specifying expectedrows=312,000,000 (so that chunklen goes from 32K to 132K):
1 loops, best of 3: 119 s per loop

(I wanted to compare getting a boolean mask as well, but it seems that
Tables don't have a .wheretrue like the carrays in Francesc's carray
package (?). For reference, computing just the mask takes 344 ms.)

---

Question: is the difference in speed due to in-core vs. out-of-core
processing? If so, and if the largest unit of data fits in memory (even
when loading a few columns to operate on together), is the corollary
'stay in memory at all costs'?

With this exercise I was trying to find out which structure is best for
holding raw data (just one column in this case), and whether indexing
could help with queries.

-á.
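
[Editor's note: for readers who want to reproduce the setup, here is a
minimal sketch of the two layouts and the queries, written against the
PyTables 2.x API the post uses (createArray / createTable / getWhereList /
createCSIndex). The file name, node names and row count below are
placeholders, not the 312,000,000-value dataset from the post.]

import numpy as np
import tables

N = 1000000  # placeholder size; the post uses 312,000,000 rows
data = np.random.randint(0, 1000, size=N).astype(np.int16)

f = tables.openFile('bench.h5', mode='w')

# Layout 1: a plain, uncompressed Array node.
wa02 = f.createArray(f.root, 'wa02', data)

# Layout 2: a single-column Table node. expectedrows influences the
# chunk length PyTables chooses for the table.
class Record(tables.IsDescription):
    val = tables.Int16Col()

wtab02 = f.createTable(f.root, 'wtab02', Record, expectedrows=N)
rec = np.empty(N, dtype=[('val', np.int16)])
rec['val'] = data
wtab02.append(rec)
wtab02.flush()

# In-core query on the Array: read everything, then filter with numpy.
big = np.nonzero(wa02[:] > 1)

# Out-of-core query on the Table.
bigtab = wtab02.getWhereList('val > 1')

# Complete Sorted Index (CSI) on the column, with blosc level-9 filters,
# roughly what the post describes trying.
wtab02.cols.val.createCSIndex(
    filters=tables.Filters(complevel=9, complib='blosc'))
bigtab_idx = wtab02.getWhereList('val > 1')

f.close()

[One observation (mine, not from the post): a condition like 'val > 1'
matches nearly every row, so an index has very little to prune, which
would be consistent with the CSI not improving the timings here.]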