From: Alvaro T. C. <al...@mi...> - 2012-04-18 17:33:31
A single array with 312,000,000 int16 values. Two (uncompressed) ways to
store it:

* Array

>>> wa02[:10]
array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')

>>> wtab02[:10]
array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
       (338,), (357,)],
      dtype=[('val', '<i2')])

Read times are 120 ms and 220 ms respectively.

>>> timeit big = np.nonzero(wa02[:] > 1)
1 loops, best of 3: 1.66 s per loop

>>> timeit bigtab = wtab02.getWhereList('val>1')
1 loops, best of 3: 119 s per loop

With a Complete Sorted Index (CSI) on 'val' and blosc (level 9) compression:
1 loops, best of 3: 149 s per loop

Specifying expectedrows=312,000,000 (so that chunklen goes from 32K to 132K):
1 loops, best of 3: 119 s per loop

(I wanted to compare getting a boolean mask as well, but it seems that
Tables don't have a .wheretrue like the carrays in Francesc's carray
package (?). For reference, computing just the mask takes 344 ms.)

---

Question: is the difference in speed due to in-core vs. out-of-core
processing? If so, and if the largest unit of data fits in memory (even
when loading a few columns to operate on together), is the corollary
'stay in memory at all costs'?

With this exercise I was trying to find out which structure is best for
holding raw data (just one column in this case), and whether indexing
could help with queries.

-á.
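
[Editor's note: for readers who want to reproduce the setup, here is a
minimal sketch of the two layouts and the queries, written against the
PyTables 2.x API the post uses (createArray / createTable / getWhereList /
createCSIndex). The file name, node names and row count below are
placeholders, not the 312,000,000-value dataset from the post.]

import numpy as np
import tables

N = 1000000  # placeholder size; the post uses 312,000,000 rows
data = np.random.randint(0, 1000, size=N).astype(np.int16)

f = tables.openFile('bench.h5', mode='w')

# Layout 1: a plain, uncompressed Array node.
wa02 = f.createArray(f.root, 'wa02', data)

# Layout 2: a single-column Table node. expectedrows influences the
# chunk length PyTables chooses for the table.
class Record(tables.IsDescription):
    val = tables.Int16Col()

wtab02 = f.createTable(f.root, 'wtab02', Record, expectedrows=N)
rec = np.empty(N, dtype=[('val', np.int16)])
rec['val'] = data
wtab02.append(rec)
wtab02.flush()

# In-core query on the Array: read everything, then filter with numpy.
big = np.nonzero(wa02[:] > 1)

# Out-of-core query on the Table.
bigtab = wtab02.getWhereList('val > 1')

# Complete Sorted Index (CSI) on the column, with blosc level-9 filters,
# roughly what the post describes trying.
wtab02.cols.val.createCSIndex(
    filters=tables.Filters(complevel=9, complib='blosc'))
bigtab_idx = wtab02.getWhereList('val > 1')

f.close()

[One observation (mine, not from the post): a condition like 'val > 1'
matches nearly every row, so an index has very little to prune, which
would be consistent with the CSI not improving the timings here.]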