Thread: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

[Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-18 17:33:31

A single array with 312 000 000 int16 values.

Two (uncompressed) ways to store it:

* Array

>>> wa02[:10]
array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')
>>> wtab02[:10]
array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
       (338,), (357,)],
      dtype=[('val', '<i2')])

read time respectively 120 ms, 220 ms.

>>> timeit big=np.nonzero(wa02[:]>1)
1 loops, best of 3: 1.66 s per loop

>>> timeit bigtab=wtab02.getWhereList('val>1')
1 loops, best of 3: 119 s per loop

with a Complete Sorted Index on val and blosc9 compression:
1 loops, best of 3: 149 s per loop

indicating expectedrows=312 000 000 (so that chunklen goes from 32K to 132K)
1 loops, best of 3: 119 s per loop

(I wanted to compare getting a boolean mask, but it seems that Tables
don't have a .wheretrue like carrays in Francesc's carray package (?).
For reference just the mask times to 344 ms).
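
For anyone wanting to reproduce the setup on a smaller scale, here is a
minimal sketch. It assumes PyTables 3.x naming (get_where_list instead
of getWhereList, create_csindex instead of createCSIndex), uses random
data as a stand-in for the real recording, and keeps the HDF5 file in
memory; the node names wa, wtab and wctab are made up for the example.

```python
import numpy as np
import tables

# Small synthetic stand-in for the 312-million-value dataset (assumption).
rng = np.random.default_rng(0)
data = rng.integers(0, 400, size=100_000, dtype=np.int16)

# Same values as a one-column record array, for appending to the tables.
rec = np.empty(len(data), dtype=[("val", "<i2")])
rec["val"] = data


class Val(tables.IsDescription):
    val = tables.Int16Col()


# In-memory HDF5 file (CORE driver, nothing is written to disk).
h5 = tables.open_file("sketch.h5", "w", driver="H5FD_CORE",
                      driver_core_backing_store=0)

# 1) Plain, uncompressed Array.
wa = h5.create_array("/", "wa", data)

# 2) Uncompressed single-column Table; expectedrows guides the chunkshape.
wtab = h5.create_table("/", "wtab", Val, expectedrows=len(data))
wtab.append(rec)
wtab.flush()

# 3) Blosc level 9 Table with a completely sorted index (CSI) on 'val'.
filters = tables.Filters(complevel=9, complib="blosc", shuffle=True)
wctab = h5.create_table("/", "wctab", Val, filters=filters,
                        expectedrows=len(data))
wctab.append(rec)
wctab.flush()
wctab.cols.val.create_csindex()

# The same query expressed three ways; each yields row coordinates.
idx_np = np.nonzero(wa[:] > 1)[0]
idx_tab = wtab.get_where_list("val > 1")
idx_ctab = wctab.get_where_list("val > 1", sort=True)

h5.close()
```

The sort=True argument is there only so that the indexed result can be
compared element by element with the other two.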

---

Question: is the difference in speed due to in-core vs out-of-core
processing?

If so, and if the largest unit of data fits in memory (even considering
loading a few columns to operate among them), is the corollary
'stay in memory at all costs'?

With this exercise, I was trying to find out what is the best
structure to hold raw data (just one col in this case), and whether
indexing could help in queries.

-á.

Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Anthony S. <sc...@gm...> - 2012-04-18 18:02:33

Hello Alvaro,

What are the timings using the normal where() method?
http://pytables.github.com/usersguide/libref.html?highlight=where#tables.Table.where

Be Well
Anthony


Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-19 11:46:29

where will give me an iterator over the /values/; in this case I
wanted the indexes. Plus, it will give me an iterator, so it will be
trivially fast.

Are you interested in the timings of where + building a list? or where
+ building an array?
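
To illustrate what I mean by "trivially fast", here is a pure-Python
sketch. The generator function matching() is a hypothetical stand-in
for Table.where(), not PyTables code: creating the iterator does no
work, and the scan only happens when the iterator is consumed.

```python
import time


def matching(values, threshold):
    """Lazily yield (index, value) pairs whose value exceeds threshold."""
    for i, v in enumerate(values):
        if v > threshold:
            yield i, v


values = list(range(1_000_000))

t0 = time.perf_counter()
it = matching(values, 1)          # nothing has been scanned yet
t_create = time.perf_counter() - t0

t0 = time.perf_counter()
indexes = [i for i, _ in it]      # the actual scan happens here
t_consume = time.perf_counter() - t0

print(f"create: {t_create:.6f} s, consume: {t_consume:.6f} s")
```

So a meaningful timing has to include building the list or the array.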


-á.




Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-19 13:43:59

Some complementary info (I copy the details of the tables below)

timeit vals = numpy.fromiter((x['val'] for x in
my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
1 loops, best of 3: 30.4 s per loop


Using the compressed and indexed version, it mysteriously does not
work (output is empty list)
>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
>>> cvals
array([], dtype=int16)

But it does if we skip using where ( I don't print cvals, but it is correct )
>>> timeit cvals = np.fromiter((x['val'] for x in wctab02 if x['val']>1), dtype=np.int16)
1 loops, best of 3: 54.8 s per loop

(the version with longer chunklen works fine and times to 30.7s).


-á.

wtab02: not compressed, not indexed, small chunklen:
/raw/t0/wtab02 (Table(312000000,)) ''
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (32768,)

larger chunklen (as calculated from expectedrows=312000000)
/raw/t0/wcetab02 (Table(312000000,)) 'test'
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (131072,)

wctab02: compressed, with CSI index
/raw/t0/wctab02 (Table(312000000,), shuffle, blosc(9)) 'test'
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (32768,)
  autoIndex := True
  colindexes := {
    "val": Index(9, full, shuffle, zlib(1)).is_CSI=True}




Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Anthony S. <sc...@gm...> - 2012-04-19 14:33:38

I was interested in how long it takes to iterate, since this is arguably
where the majority of the time is spent.

On Thu, Apr 19, 2012 at 8:43 AM, Alvaro Tejero Cantero <al...@mi...>wrote:

> Some complementary info (I copy the details of the tables below)
>
> timeit vals = numpy.fromiter((x['val'] for x in
> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
> 1 loops, best of 3: 30.4 s per loop
>
>
> Using the compressed and indexed version, it mysteriously does not
> work (output is empty list)
> >>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
> dtype=np.int16)
> >>> cvals
> array([], dtype=int16)
>

This doesn't work because numpy doesn't accept generators.  The following
should work:
>>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')],
dtype=np.int16)
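
As a sanity check on the generator question, the two spellings can be
compared on a small in-memory stand-in. The rows list below is
hypothetical and only mimics row['val'] access; np.fromiter is
documented as taking any iterable, so both forms are expected to give
the same array.

```python
import numpy as np

# Hypothetical stand-in for table rows; only row["val"] access is mimicked.
rows = [{"val": v} for v in (306, 0, 353, 1, 345)]

# Generator expression: consumed lazily by np.fromiter.
from_gen = np.fromiter((r["val"] for r in rows if r["val"] > 1),
                       dtype=np.int16)

# List comprehension: fully built first, then copied into the array.
from_list = np.fromiter([r["val"] for r in rows if r["val"] > 1],
                        dtype=np.int16)

print(from_gen, from_list)
```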

Also, I am a little concerned that np.nonzero() doesn't really compare to
Table.getWhereList('val>1').  Testing for all zero bits *should be* a lot
faster than a numeric comparison.  Could you instead try the same actual
operation in numpy as whereList():

>>> timeit big=np.argwhere(np.greater(wa02[:], 1))

Thanks!
Anthony



Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-19 16:47:07

I have to run, but here's what you requested (I won't be back on this
computer until monday)

>>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')], dtype=np.int16)
>>> cvals
array([], dtype=int16)

>>> timeit big=np.argwhere(np.greater(wa02[:], 1))
1 loops, best of 3: 15.3 s per loop

this gives me a mask, that I can get with

>>> big2 = wa02[:]>1
>>> np.alltrue(big == big2)
True

and in far less time:
>>> timeit big2 = wa02[:]>1
1 loops, best of 3: 348 ms per loop




-á.

/raw/t0/wa02 (Array(312000000,)) ''
  atom := Int16Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None



Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Anthony S. <sc...@gm...> - 2012-04-19 17:24:02

On Thu, Apr 19, 2012 at 11:46 AM, Alvaro Tejero Cantero <al...@mi...>wrote:

> I have to run, but here's what you requested (I won't be back on this
> computer until monday)
>
> >>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')],
> dtype=np.int16)
> >>> cvals
> array([], dtype=int16)
>

Hmmm...


>
> >>> timeit big=np.argwhere(np.greater(wa02[:], 1))
> 1 loops, best of 3: 15.3 s per loop
>
> this gives me a mask,


argwhere() should not give you a mask.  It should give you the
coordinates:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argwhere.html

Also it seems like np.argwhere(np.greater(wa02[:], 1)) and
np.argwhere(wa02[:]>1)  should run in the same amount of time.
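
A small sketch of the distinction, on a made-up five-element array
rather than wa02:

```python
import numpy as np

a = np.array([306, 0, 353, 1, 345], dtype=np.int16)

mask = a > 1                  # boolean array with the same shape as a
coords = np.argwhere(mask)    # shape (n_hits, a.ndim), one row per hit
idx = np.nonzero(mask)[0]     # 1-D array of matching indices

# np.greater(a, 1) and a > 1 are the same elementwise comparison.
same = np.array_equal(np.argwhere(np.greater(a, 1)), np.argwhere(a > 1))

print(mask)
print(coords.shape, idx)
```

For a 1-D array the coordinates from argwhere() come back as an
(n_hits, 1) column, whereas nonzero() returns a tuple holding one flat
index array.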

At this point though we are just comparing the performance of numpy
routines.  What we really want is to compare numpy to PyTables.

Maybe I'll try playing around with this this weekend.



Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Francesc A. <fa...@py...> - 2012-04-24 02:10:21

On 4/18/12 12:33 PM, Alvaro Tejero Cantero wrote:
> A single array with 312 000 000 int16 values.
>
> Two (uncompressed) ways to store it:
>
> * Array
>
> >>> wa02[:10]
> array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)
>
> * Table wtab02 (single column, named 'val')
> >>> wtab02[:10]
> array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
>        (338,), (357,)],
>       dtype=[('val', '<i2')])
>
> read time respectively 120 ms, 220 ms.
>
> >>> timeit big=np.nonzero(wa02[:]>1)
> 1 loops, best of 3: 1.66 s per loop
>
> >>> timeit bigtab=wtab02.getWhereList('val>1')
> 1 loops, best of 3: 119 s per loop

Yes, this is expected.  One method is much faster than the other
precisely because one is designed to operate out-of-core, while the
other operates completely in memory, and this has a cost.  But that
does not mean that out-of-core is necessarily slower.  Look at this:

In [107]: da
Out[107]:
/da (Array(10000000,)) ''
   atom := Int16Atom(shape=(), dflt=0)
   maindim := 0
   flavor := 'numpy'
   byteorder := 'little'
   chunkshape := None

In [108]: dra
Out[108]:
/dra (Table(10000000,), shuffle, blosc(5)) ''
   description := {
   "a": Int16Col(shape=(), dflt=0, pos=0)}
   byteorder := 'little'
   chunkshape := (65536,)

In [127]: time r = np.argwhere(da[:] == 1)
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.10 s

In [111]: time l = dra.getWhereList('a == 1')
CPU times: user 0.10 s, sys: 0.01 s, total: 0.11 s
Wall time: 0.11 s

So, the performance of tables' getWhereList() is pretty close to NumPy,
even though the former is using compression.  This is a great
achievement.  The reason I'm getting very different results from yours
is this:

In [119]: len(l)
Out[119]: 153

That is, the selectivity of the query is extremely high (153 out of 10 
million elements), which is the scenario where queries are designed to 
shine.  If you use indexing, then you can get even more speed:

In [131]: dra.cols.a.createCSIndex()
Out[131]: 10000000

In [132]: time l = dra.getWhereList('a == 1')
CPU times: user 0.02 s, sys: 0.01 s, total: 0.03 s
Wall time: 0.02 s

In your case, the selectivity is low (you are asking for possibly
almost 50% of the initial dataset, perhaps less or perhaps more,
depending on your data pattern), which makes data object creation (one
object per loop iteration) the big overhead in PyTables:

In [134]: time r = np.argwhere(da[:] > 1)
CPU times: user 1.03 s, sys: 0.03 s, total: 1.06 s
Wall time: 1.12 s

In [135]: time l = dra.getWhereList('a > 1')
CPU times: user 5.62 s, sys: 0.16 s, total: 5.78 s
Wall time: 5.89 s

Now getWhereList() is more than 5x slower.  Removing the index
helps a bit here:

In [136]: dra.cols.a.removeIndex()

In [137]: time l = dra.getWhereList('a > 1')
CPU times: user 5.10 s, sys: 0.12 s, total: 5.22 s
Wall time: 5.30 s

But, if the internal query machinery in PyTables is the same, why does
it take longer?  The short answer is object creation (and some data
copying).  getWhereList() can be expressed like this:

In [165]: time l = np.array([r.nrow for r in dra.where('a > 1')])
CPU times: user 5.54 s, sys: 0.09 s, total: 5.63 s
Wall time: 5.71 s
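
The equivalence can be checked with a minimal sketch like the one
below.  It assumes PyTables 3.x names (get_where_list, read_where),
random data in place of the real table, and an in-memory file; the
node name dra is reused only for readability.

```python
import numpy as np
import tables

# Random stand-in data (assumption), one int16 column named 'a'.
rng = np.random.default_rng(0)
rec = np.empty(50_000, dtype=[("a", "<i2")])
rec["a"] = rng.integers(0, 400, size=len(rec))

# In-memory HDF5 file; nothing is written to disk.
h5 = tables.open_file("sketch_nrow.h5", "w", driver="H5FD_CORE",
                      driver_core_backing_store=0)
dra = h5.create_table("/", "dra", rec.dtype,
                      filters=tables.Filters(complevel=5, complib="blosc"))
dra.append(rec)
dra.flush()

# Coordinates: the built-in method versus the explicit Row loop.
coords_a = dra.get_where_list("a > 1")
coords_b = np.array([r.nrow for r in dra.where("a > 1")])

# Values: the explicit Row loop versus a single bulk read.
values_loop = np.array([r["a"] for r in dra.where("a > 1")], dtype=np.int16)
values_bulk = dra.read_where("a > 1")["a"]

h5.close()
```

The loop versions create one Python object per matching row, which is
where the extra time goes when most rows match.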

Now, if we count the time to get the coordinates only:

In [159]: time s = [r.nrow for r in dra.where('a > 1')]
CPU times: user 3.86 s, sys: 0.08 s, total: 3.95 s
Wall time: 4.02 s

This time is a bit long, but this is due to the .nrow implementation (a 
Cython property of the Row class; I wonder if this could be accelerated 
somewhat).  In general, the Row iterator can be much faster, like for 
example, in getting values:

In [161]: time s = [r['a'] for r in dra.where('a > 1')]
CPU times: user 1.57 s, sys: 0.07 s, total: 1.63 s
Wall time: 1.61 s

and you can notice that this is barely more than the time that a pure
list creation takes:

In [139]: time l = [r for r in xrange(len(l))]
CPU times: user 1.44 s, sys: 0.11 s, total: 1.55 s
Wall time: 1.53 s

So, the 'slow' times that you are seeing are a consequence of the
creation of the different data objects and the internal data copies
(for building the final NumPy array).  NumPy is much faster because
this whole process is done in pure C.
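
A rough way to see the cost of per-element object creation without
involving PyTables at all is sketched below.  The data is random and
the timings will vary by machine; this only illustrates the effect and
is not a model of the PyTables internals.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 400, size=1_000_000, dtype=np.int16)

# One Python object per element: slow when most elements match.
t0 = time.perf_counter()
hits_py = np.array([i for i, v in enumerate(a) if v > 1])
t_py = time.perf_counter() - t0

# The same query done entirely inside NumPy's C loops.
t0 = time.perf_counter()
hits_np = np.flatnonzero(a > 1)
t_np = time.perf_counter() - t0

print(f"python loop: {t_py:.3f} s, vectorised: {t_np:.3f} s")
```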

But again, this does not preclude the fact that queries in PyTables are 
actually fast --and potentially much faster than NumPy for high 
selectivities and indexing.

Hope this helps,

-- 
Francesc Alted

Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Francesc A. <fa...@py...> - 2012-04-24 02:14:49

On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
> Some complementary info (I copy the details of the tables below)
>
> timeit vals = numpy.fromiter((x['val'] for x in
> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
> 1 loops, best of 3: 30.4 s per loop
>
>
> Using the compressed and indexed version, it mysteriously does not
> work (output is empty list)
> >>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
> >>> cvals
> array([], dtype=int16)

This smells like a bug, but I cannot reproduce it.  Could you send a
self-contained example reproducing this behavior?

-- 
Francesc Alted

Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Anthony S. <sc...@gm...> - 2012-04-24 03:40:23

On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:

> This smells like a bug, but I cannot reproduce it.  Could you send a
> self-contained example reproducing this behavior?
>

I am not able to reproduce this either...



Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-25 11:13:37

Hi,

Thanks for the clarification.

I retried today with both a normal and a completely sorted index on a
blosc-compressed table (complevel 5) and could not reproduce the
putative bug either.

-á.



Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Francesc A. <fa...@py...> - 2012-04-26 03:10:22

On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
> Hi,
>
> Thanks for the clarification.
>
> I retried today with both a normal and a completely sorted index on a
> blosc-compressed table (complevel 5) and could not reproduce the
> putative bug either.

So could you please confirm if you can reproduce the problem with blosc 
level 9?

Thanks!



-- 
Francesc Alted

Re: [Pytables-users] Performance of tables vs. arrays (out vs in core?)

From: Alvaro T. C. <al...@mi...> - 2012-04-26 10:47:53

Hi,

I tried again, also with different chunklens, and couldn't reproduce it.
Unfortunately the session where I had this result was killed by a power
outage and the history buffer does not go that far back, so I can't find out
what exactly triggered it.
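
For anyone who wants to retry this, a minimal self-contained check could look
like the sketch below. It is only an illustration under several assumptions:
it uses the PyTables 3.x names (open_file, create_table, create_csindex)
rather than the 2.x names in use when this thread was written, the data are
small synthetic random values rather than the original 312 000 000 samples,
and the function name, file name and chunk lengths are made up for the
example. It builds a blosc-compressed table with a completely sorted index on
'val', runs the same where('val>1') query, and compares the result against a
plain NumPy mask.

```python
import os
import tempfile

import numpy as np
import tables  # PyTables 3.x API assumed


def query_compressed_indexed(n_rows=100_000, complevel=9, chunklen=None):
    """Build a small blosc-compressed, CSI-indexed table and query it.

    Returns (values obtained through Table.where, values from a NumPy mask).
    All names here are hypothetical; only the query mirrors the thread.
    """
    rng = np.random.default_rng(0)
    data = np.zeros(n_rows, dtype=[("val", "<i2")])
    data["val"] = rng.integers(0, 400, size=n_rows, dtype=np.int16)

    path = os.path.join(tempfile.mkdtemp(), "repro.h5")  # throwaway file
    filters = tables.Filters(complevel=complevel, complib="blosc")
    chunkshape = None if chunklen is None else (chunklen,)

    with tables.open_file(path, mode="w") as h5:
        tab = h5.create_table("/", "wctab02", description=data.dtype,
                              filters=filters, expectedrows=n_rows,
                              chunkshape=chunkshape)
        tab.append(data)
        tab.flush()
        # Completely sorted index on the single column.
        tab.cols.val.create_csindex()
        # Same query as in the thread, materialised while the file is open.
        cvals = np.fromiter((row["val"] for row in tab.where("val > 1")),
                            dtype=np.int16)

    expected = data["val"][data["val"] > 1]
    return cvals, expected


if __name__ == "__main__":
    for complevel in (5, 9):
        for chunklen in (None, 4096, 32768):
            cvals, expected = query_compressed_indexed(
                complevel=complevel, chunklen=chunklen)
            # Compare sorted values so the check does not depend on the
            # order in which the query yields rows.
            same = np.array_equal(np.sort(cvals), np.sort(expected))
            print(complevel, chunklen, len(cvals), len(expected), same)
```

An empty cvals next to a non-empty expected array would be the behaviour
reported earlier in the thread; on a healthy install the two should agree for
every combination.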


-á.


On Thu, Apr 26, 2012 at 04:10, Francesc Alted <fa...@py...> wrote:

> On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
> > Hi,
> >
> > Thanks for the clarification.
> >
> > I retried today with both a normal and a completely sorted index on
> > a blosc-compressed table (complevel 5) and could not reproduce the
> > putative bug either.
>
> So could you please confirm if you can reproduce the problem with blosc
> level 9?
>
> Thanks!
>
> >
> > -á.
> >
> >
> > On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
> >> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
> >>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
> >>>> Some complementary info (I copy the details of the tables below)
> >>>>
> >>>> timeit vals = np.fromiter((x['val'] for x in
> >>>> my.root.raw.t0.wtab02.where('val>1')), dtype=np.int16)
> >>>> 1 loops, best of 3: 30.4 s per loop
> >>>>
> >>>>
> >>>> Using the compressed and indexed version, it mysteriously does not
> >>>> work (output is empty list)
> >>>> >>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
> >>>> ...                     dtype=np.int16)
> >>>> >>> cvals
> >>>> array([], dtype=int16)
> >>> This smells like a bug, but I cannot reproduce it.  Could you send a
> >>> self-contained example reproducing this behavior?
> >>
> >> I am not able to reproduce this either...
> >>
> >>>
> >>> --
> >>> Francesc Alted
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
>
>
> --
> Francesc Alted
>
>
>
>