From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:29:47
Hi there,

I am following advice by Anthony and giving a go at representing different sensors in my dataset as columns in a Table, or in several Tables. This is about in-kernel queries.

The documentation of condvars in Table.where [1] says "condvars should consist of identifier-like strings pointing to Column (see The Column class) instances of this table, or to other values (which will be converted to arrays)".

Conversion to arrays will likely exhaust the memory and be slow. Furthermore, when I tried with a toy example (naively extrapolating the behaviour of indexing in numpy), I obtained

In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18) & (a<4)""", condvars={'a':tet1.cols.V01, 'b':tet2.cols.V02})]
(... elided output)
ValueError: variable ``b`` refers to a column which is not part of table ``/tetrode1

I am interested in the scenario where an in-kernel query is applied to a table based on columns *from other tables* that are still aligned with the current table (same number of elements). These conditions may be sophisticated and mix columns from the local table as well.

One obvious solution would be to put all aligned columns in the same table. But adding columns to a table is cumbersome, and I cannot think beforehand of all the precomputed columns that I would like to use as query conditions.

What do you recommend in this scenario?

-á.

[1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where
From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:43:53
Would it be an option to have

* raw data in one table
* all imaginable columns used for query conditions in another table (but how to grow it in columns without deleting & recreating?)

and fetch indexes for the first based on .whereList(condition) of the second?

Are there alternatives?

-á.
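A minimal sketch of that two-table pattern with the PyTables 2.x API (the file, table and column names are hypothetical, and it assumes both tables are row-aligned):

import tables

# /raw holds the signal columns; /conds holds the row-aligned, precomputed
# columns that are only used for querying.
h5 = tables.openFile('recording.h5', mode='r')
raw = h5.root.raw
conds = h5.root.conds

# In-kernel query on the condition table only...
coords = conds.getWhereList('(b > 18) & (a < 4)')

# ...then fetch the matching rows (or one column) from the raw table.
matching = raw.readCoordinates(coords)
v01 = raw.readCoordinates(coords, field='V01')

h5.close()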
From: Francesc A. <fa...@py...> - 2012-03-26 22:57:33
Hi Alvaro,

On 3/26/12 12:43 PM, Alvaro Tejero Cantero wrote:
> Would it be an option to have
>
> * raw data in one table
> * all imaginable columns used for query conditions in another table

Yes, that sounds like a good solution to me.

> (but how to grow it in columns without deleting & recreating?)

You can't (at least not cheaply). You may want to create additional tables and group them in terms of the columns you are going to need for your queries.

> and fetch indexes for the first based on .whereList(condition) of the second?

Exactly.

> Are there alternatives?

Yes. The alternative would be to have column-wise tables, which would allow you to add and remove columns at a cost of almost zero. This idea of column-wise tables is quite flexible, and would let you have even variable-length columns, as well as computed columns (that is, data that is generated from other columns). These would have a lot of applications, IMO. I would like to add this proposal to our next round of applications for projects to improve PyTables. Let's see how it goes.

Francesc

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-03-27 07:21:11
>> (but how to grow it in columns without deleting & recreating?)
>
> You can't (at least not cheaply). You may want to create additional
> tables and group them in terms of the columns you are going to need
> for your queries.

Sorry, it is not clear to me: create new tables and group them (grouping = grouping in HDF5 Groups?) in terms of the columns? As far as I understood, only columns in the same table (regardless of the table's group) can be queried together with the in-kernel engine?

>> Are there alternatives?
>
> Yes. The alternative would be to have column-wise tables, which would
> allow you to add and remove columns at a cost of almost zero. [...]

This sounds definitely interesting. But I see the value of PyTables being able to query columns in different tables in-kernel, because it removes one big constraint on data layout (and this is in turn important because attr dictionaries can only be attached to whole tables, AFAIK).

The solution I suggest would be that whenever other columns are involved, the in-kernel engine loops over the zip of the columns. It could do a pre-check on column length before starting. This would be a quite useful enhancement for me.
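For illustration, a user-level stand-in for the zipped evaluation proposed above, written with the PyTables 2.x row iterators (table and column names are made up, and this plain-Python loop is far slower than a real in-kernel query; it only shows the intended semantics):

import itertools
import tables

h5 = tables.openFile('recording.h5', mode='r')
tet1, tet2 = h5.root.tetrode1, h5.root.tetrode2

# The pre-check on column length mentioned above.
assert tet1.nrows == tet2.nrows

# Walk both row-aligned tables in lockstep and apply the mixed condition.
values = [r1['V01']
          for r1, r2 in itertools.izip(tet1.iterrows(), tet2.iterrows())
          if r2['V02'] > 18 and r1['V01'] < 4]

h5.close()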
From: Francesc A. <fa...@py...> - 2012-03-27 23:11:53
On 3/27/12 2:20 AM, Alvaro Tejero Cantero wrote:
>>> (but how to grow it in columns without deleting & recreating?)
>> You can't (at least not cheaply). You may want to create additional
>> tables and group them in terms of the columns you are going to need
>> for your queries.
> Sorry, it is not clear to me: create new tables and group them (grouping =
> grouping in HDF5 Groups?) in terms of the columns?
> As far as I understood, only columns in the same table (regardless of the
> table's group) can be queried together with the in-kernel engine?

Yes, but the idea is to get rid of this limitation of only being able to query columns in the same table (which is somewhat artificial). In fact, now that I think about this, you can actually implement queries on different unidimensional arrays (think of them as independent columns) by using the `tables.Expr` class. More on this later.

--
Francesc Alted
From: Francesc A. <fa...@py...> - 2012-03-27 23:34:35
Another option that occurred to me recently is to save all your columns as unidimensional arrays (Array objects or, if you want compression, CArray or EArray), and then use them as components of a boolean expression via the class `tables.Expr`. For example, if a, b and c are unidimensional arrays of the same size, you can do:

bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
indices = [ind for ind, bool_val in bool_cond if bool_val]
results = your_dataset[indices]

Does that make sense for your problem? Of course, this class uses numexpr behind the scenes, so it is perfectly equivalent to classical queries on tables, but without being restricted to tables. Please see more details about `tables.Expr` in:

http://pytables.github.com/usersguide/libref.html#the-expr-class-a-general-purpose-expression-evaluator

Francesc

--
Francesc Alted
From: Francesc A. <fa...@gm...> - 2012-03-28 14:36:56
On 3/27/12 6:34 PM, Francesc Alted wrote:
> Another option that occurred to me recently is to save all your columns
> as unidimensional arrays [...] For example, if a, b and c are
> unidimensional arrays of the same size, you can do:
>
> bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
> indices = [ind for ind, bool_val in bool_cond if bool_val]

Of course, the above line needs to read:

indices = [ind for ind, bool_val in enumerate(bool_cond) if bool_val]

> results = your_dataset[indices]

Another solution, probably faster, although you need to make sure that you have enough memory to keep your boolean array, is this:

bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
bool_arr = bool_cond.eval()
results = your_dataset[bool_arr]

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-03-28 15:15:47
That is a perfectly fine solution for me, as long as the arrays aren't copied in memory for the query. Thank you!

Thinking that your proposed solution uses iterables to avoid that, I tried

boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)')
indices = [i for i,v in boolcond if v]
(...)
TypeError: 'numpy.bool_' object is not iterable

I can, however, do

boolarr = boolcond.eval()
indices = np.nonzero(boolarr)

but then I get boolarr into memory.

Did I miss something? What is your advice on how to monitor the use of memory? (I need this until PyTables is second skin.)

It is very rewarding to see that these numexpr expressions are 3-4 times faster than the same with arrays in memory. However, I didn't find a way to set the number of threads used.

When evaluating the blosc benchmarks I found that on my system with two 6-core processors, using 12 threads is best for writing and 6 for reading. Interesting...

Another question (maybe for a separate thread): is there any way to shrink the memory usage of booleans to 1 bit? It might well be that this optimizes the use of the memory bus (at some processing cost), but I am not aware of a numpy container for this.

-á.
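As an aside on the 1-bit question: NumPy itself can at least pack a boolean mask into one bit per element with np.packbits, at the cost of packing/unpacking around each use. A small sketch (only the sizes are taken from the discussion; the slice position is invented):

import numpy as np

mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True

# One bit per element: ~12.5 MB instead of ~100 MB for the bool array.
packed = np.packbits(mask.view(np.uint8))

# Unpack and trim the padding before using the mask again.
restored = np.unpackbits(packed)[:mask.size].astype(np.bool_)
assert np.array_equal(mask, restored)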
From: Francesc A. <fa...@py...> - 2012-03-28 18:04:05
On 3/28/12 10:15 AM, Alvaro Tejero Cantero wrote:
> That is a perfectly fine solution for me, as long as the arrays aren't
> copied in memory for the query.

No, the arrays are not copied in memory. They are just read from disk block by block, and the output is directed to the iterator or to an array (depending on the context).

> Thinking that your proposed solution uses iterables to avoid that, I tried
>
> boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)')
> indices = [i for i,v in boolcond if v]
> (...)
> TypeError: 'numpy.bool_' object is not iterable
>
> I can, however, do
>
> boolarr = boolcond.eval()
> indices = np.nonzero(boolarr)
>
> but then I get boolarr into memory.
>
> Did I miss something?

Yes, that was an error on my part. The correct way is:

indices = [i for i, v in enumerate(boolcond) if v]

> What is your advice on how to monitor the use of
> memory? (I need this until PyTables is second skin.)

top?

> It is very rewarding to see that these numexpr expressions are 3-4 times
> faster than the same with arrays in memory. However, I didn't find a way
> to set the number of threads used.

Well, you can use the `MAX_THREADS` variable in 'parameters.py', but this does not offer separate controls for numexpr and blosc. Feel free to open a ticket asking for improving this functionality.

> When evaluating the blosc benchmarks I found that on my system with
> two 6-core processors, using 12 threads is best for writing and 6 for
> reading. Interesting...

Yes, it is :)

> Another question (maybe for a separate thread): is there any way to
> shrink the memory usage of booleans to 1 bit? It might well be that this
> optimizes the use of the memory bus (at some processing cost), but I
> am not aware of a numpy container for this.

Maybe a compressed array? That could lead to using less than 1 bit per element in many situations. If you are interested in this, look into:

https://github.com/FrancescAlted/carray

--
Francesc Alted
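A hedged sketch of how that could look in practice, assuming the PyTables 2.x behaviour that settings from tables/parameters.py can be passed as keyword arguments to openFile (at the time MAX_THREADS governed both numexpr and Blosc threads):

import tables

# Override MAX_THREADS for this file handle only (assumption: parameter
# overrides via openFile keyword arguments, as in PyTables 2.x).
h5 = tables.openFile('recording.h5', mode='r', MAX_THREADS=6)
# ... run queries / Expr evaluations tied to this file ...
h5.close()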
From: Alvaro T. C. <al...@mi...> - 2012-03-29 15:50:01
>> What is your advice on how to monitor the use of
>> memory? (I need this until PyTables is second skin.)
>
> top?

I had so far used it only in a very rudimentary way and found the man page quite intimidating. Would you care to share your tips for this particular scenario? (e.g. how do you keep the ipython process 'focused'?)

>> However, I didn't find a way to set the number of threads used.
>
> Well, you can use the `MAX_THREADS` variable in 'parameters.py', but
> this does not offer separate controls for numexpr and blosc. Feel free to
> open a ticket asking for improving this functionality.

Ok, I opened the following tickets (since I have to build the application first and then revisit the infrastructural issues, I cannot do more about them now):

* one for the implementation of references: https://github.com/PyTables/PyTables/issues/140
* one for the estimation of dataset (group?) size: https://github.com/PyTables/PyTables/issues/141
* one for an interface function to set MAX_THREADS for numexpr independently of blosc's: https://github.com/PyTables/PyTables/issues/142

>> When evaluating the blosc benchmarks I found that on my system with
>> two 6-core processors, using 12 threads is best for writing and 6 for
>> reading. Interesting...
>
> Yes, it is :)

Are you interested in my .out bench output file for the SyntheticBenchmarks page?

>> Another question (maybe for a separate thread): is there any way to
>> shrink the memory usage of booleans to 1 bit? [...]
>
> Maybe a compressed array? That could lead to using less than 1 bit per
> element in many situations. If you are interested in this, look into:
>
> https://github.com/FrancescAlted/carray

Ok, I did some playing around with this:

* A bool array of 10**8 elements with True in two separate slices of length 10**6 each compresses by ~350. Using .wheretrue to obtain indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy array). The resulting file size is 248 kB, still far from storing the 4 or 6 integer indexes that define the slices (I am experimenting with an approach for scientific databases where this is a concern).

* A sample of my normal electrophysiological data (15M int16 data points) compresses by about 1.7-1.8.

* How blosc chooses the chunklen is black magic to me, but it seems to be quite spot-on (e.g. it changed from '1' for a 64x15M array to 64*1024 when CArraying only one row).

* A quick way to know how well your data will compress in PyTables, if you will be using blosc, is to test in the REPL with CArray. I guess for the other compressors we still need to go (for the moment) to checking filesystem-reported sizes.

Best,

á.
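For concreteness, the boolean-mask experiment above can be reproduced roughly like this with the carray package (slice positions are invented; only the sizes match the description):

import numpy as np
import carray as ca

# 10**8 booleans with two True runs of 10**6 elements each.
mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True
mask[6 * 10**7:6 * 10**7 + 10**6] = True

cmask = ca.carray(mask)                  # repr reports nbytes, cbytes, ratio
true_indices = list(cmask.wheretrue())   # the iterator compared to np.nonzero(mask)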
From: Francesc A. <fa...@py...> - 2012-03-29 23:53:07
On 3/29/12 10:49 AM, Alvaro Tejero Cantero wrote:
>>> What is your advice on how to monitor the use of memory? [...]
>> top?
> I had so far used it only in a very rudimentary way and found the man
> page quite intimidating. Would you care to share your tips for this
> particular scenario? (e.g. how do you keep the ipython process 'focused'?)

Well, top by default keeps the most CPU-consuming process always on the top (hence the name), so I think it is quite easy to spot the interesting process. vmstat is another interesting utility, but it reports only on general virtual memory consumption, not on a per-process basis.

Finally, if you can afford to instrument your code and you use Linux (I assume this is the case), then you may want to use a small routine that tells you the memory used by the caller process each time it is called. Here is an example of how this is used in the PyTables test suite:

https://github.com/PyTables/PyTables/blob/master/tables/tests/common.py#L483

I'm sure you will figure out how to use it in your own scenario.

> Ok, I opened the following tickets [...]:
>
> * one for the implementation of references: https://github.com/PyTables/PyTables/issues/140
> * one for the estimation of dataset (group?) size: https://github.com/PyTables/PyTables/issues/141
> * one for an interface function to set MAX_THREADS for numexpr
>   independently of blosc's: https://github.com/PyTables/PyTables/issues/142

Excellent. Thanks!

> Are you interested in my .out bench output file for the SyntheticBenchmarks page?

Yes, I am! And if you can produce the matplotlib figures, that would cause much rejoicing :)

> * A bool array of 10**8 elements with True in two separate slices of
> length 10**6 each compresses by ~350. Using .wheretrue to obtain
> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
> array). The resulting file size is 248 kB, still far from storing the 4
> or 6 integer indexes that define the slices (I am experimenting with
> an approach for scientific databases where this is a concern).

Oh, you were asking for an 8-to-1 compressor (booleans as bits), but apparently 350-to-1 is not enough? :)

> * A sample of my normal electrophysiological data (15M int16 data
> points) compresses by about 1.7-1.8.

Well, I was expecting somewhat more for these time-series data, but it is not that bad for int16. Probably int32 or float64 would reach better compression ratios.

> * How blosc chooses the chunklen is black magic to me, but it seems to
> be quite spot-on (e.g. it changed from '1' for a 64x15M array to
> 64*1024 when CArraying only one row).

Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could you detail a bit more how you got this result? Providing an example would be very useful.

> * A quick way to know how well your data will compress in PyTables, if
> you will be using blosc, is to test in the REPL with CArray. I guess
> for the other compressors we still need to go (for the moment) to
> checking filesystem-reported sizes.

Just be sure that you experiment with different chunk lengths by using the `chunklen` parameter in the carray constructor too.

--
Francesc Alted
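A minimal Linux-only variant of such a memory probe, in the spirit of the helper linked above (it reads the VmSize/VmRSS lines for the current process from /proc; illustrative only):

import os

def report_memory(tag=''):
    # /proc/<pid>/status exposes per-process memory counters on Linux.
    with open('/proc/%d/status' % os.getpid()) as f:
        for line in f:
            if line.startswith(('VmSize:', 'VmRSS:')):
                print('%s %s' % (tag, line.strip()))

report_memory('after eval:')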
From: Alvaro T. C. <al...@mi...> - 2012-04-25 12:05:51
Hi, a minor update on this thread.

>> * A bool array of 10**8 elements with True in two separate slices of
>> length 10**6 each compresses by ~350. [...]
>
> Oh, you were asking for an 8-to-1 compressor (booleans as bits), but
> apparently 350-to-1 is not enough? :)

Here I expected more from a run-length-like compression scheme. My array would be compressible to the following representation:

(0, x) : 0
(x, x+10**6) : 1
(x+10**6, y) : 0
(y, y+10**6) : 1
(y+10**6, 10**8) : 0

or just:

(x, x+10**6) : 1
(y, y+10**6) : 1

where x and y are two reasonable integers (i.e. in range and with no overlap).

>> * How blosc chooses the chunklen is black magic to me, but it seems to
>> be quite spot-on (e.g. it changed from '1' for a 64x15M array to
>> 64*1024 when CArraying only one row).
>
> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
> you detail a bit more how you got this result? Providing an example
> would be very useful.

I revisited this issue. While in PyTables CArray the guesses are reasonable, the problem is in carray.carray (or in its reporting of chunklen).

This is the offender:

carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
  cparams := cparams(clevel=5, shuffle=True)

In [87]: x.chunklen
Out[87]: 1

Could it be that carray is not reporting the second dimension of the chunkshape? (In PyTables, this is 262144.)

The fact that both PyTables' CArray and carray.carray are named carray is a bit confusing.
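For reference, the interval representation sketched above can be extracted from a boolean mask with a few lines of NumPy (names and slice positions are illustrative):

import numpy as np

def true_runs(mask):
    # Pad with False so runs touching either end are detected too.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
    return edges.reshape(-1, 2)   # each row is a (start, stop) pair, stop exclusive

mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True
mask[6 * 10**7:6 * 10**7 + 10**6] = True
runs = true_runs(mask)            # two rows: [1000000, 2000000] and [60000000, 61000000]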
From: Francesc A. <fa...@py...> - 2012-04-26 03:07:54
On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
> Hi, a minor update on this thread.
>
>>> * A bool array of 10**8 elements with True in two separate slices of
>>> length 10**6 each compresses by ~350. [...]
>> Oh, you were asking for an 8-to-1 compressor (booleans as bits), but
>> apparently 350-to-1 is not enough? :)
> Here I expected more from a run-length-like compression scheme. My
> array would be compressible to the following representation:
>
> (0, x) : 0
> (x, x+10**6) : 1
> (x+10**6, y) : 0
> (y, y+10**6) : 1
> (y+10**6, 10**8) : 0
>
> or just:
>
> (x, x+10**6) : 1
> (y, y+10**6) : 1
>
> where x and y are two reasonable integers (i.e. in range and with no overlap).

Sure, but this is not the spirit of a compressor adapted to the blocking technique (in the sense of [1]). For a compressor that works with blocks, you need to add some metainformation for each block, and that takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.

[1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

>>> * How blosc chooses the chunklen is black magic to me, but it seems to
>>> be quite spot-on [...]
>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>> you detail a bit more how you got this result? Providing an example
>> would be very useful.
> I revisited this issue. While in PyTables CArray the guesses are
> reasonable, the problem is in carray.carray (or in its reporting of
> chunklen).
>
> This is the offender:
>
> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>   cparams := cparams(clevel=5, shuffle=True)
>
> In [87]: x.chunklen
> Out[87]: 1
>
> Could it be that carray is not reporting the second dimension of the
> chunkshape? (In PyTables, this is 262144.)

Ah yes, this is it. The carray package is not as sophisticated as HDF5, and it only blocks on the leading dimension. In this case, it is saying that the block is a complete row. So this is the intended behaviour.

> The fact that both PyTables' CArray and carray.carray are named carray
> is a bit confusing.

Yup, agreed. Don't know what to do here. carray was more a proof-of-concept than anything else, but if development for it continues in the future, I should ponder changing the names.

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-04-26 09:05:29
On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fa...@py...> wrote:
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]). For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

Absolutely!

Blocking seems a good approach for most data, where the many possible values degrade very fast the potential compression gains of a run-length-encoding (RLE) based scheme.

But boolean arrays, which are used extremely often as masks in scientific applications and already suffer an 8x penalty in storage, would be an excellent candidate for RLE. Boolean arrays are also an interesting way to encode attributes as 'bit-vectors', i.e. instead of storing an enum column 'car color' with values in {red, green, blue}, you store three boolean arrays 'red', 'green', 'blue'. Where this gets interesting is in allowing more generality, because you don't need a taxonomy: red and green need not be exclusive if they are tags on a genetic sequence (or, in my case, an electrophysiological recording). To compute ANDs and ORs you just perform the corresponding bit-wise operations if you reconstruct the bit-vector, or you can use some smart algorithm on the intervals themselves (as mentioned in another mail, I think; R*-trees or Nested Containment Lists are two viable candidates).

I don't know whether it's possible to have such a specialization for the compression of boolean arrays in PyTables. Maybe a simple, alternative route is to make the chunklength dependent on the likelihood of repeated data (i.e. the range of the type domain), or at the very least to special-case chunklength estimation for booleans to be somewhat higher than for other datatypes. This, again, I think is an exception that would do justice to the main use-case of PyTables.

>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen). [...]
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (In PyTables, this is 262144.)
>
> Ah yes, this is it. The carray package is not as sophisticated as HDF5,
> and it only blocks on the leading dimension. In this case, it is saying
> that the block is a complete row. So this is the intended behaviour.

Ok, it makes sense, and in my particular use case the rows do fit in memory, so there is no need for further chunking.

>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed. Don't know what to do here. carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder changing the names.

It's a neat package and I hope it gets the appreciation and support it deserves!

Cheers,

Álvaro.
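A small sketch of the bit-vector tagging idea with plain NumPy masks (tag names and data are invented; each tag is one boolean array, and tags need not be exclusive):

import numpy as np

n = 10**6
spikes = np.zeros(n, dtype=np.bool_)
artifact = np.zeros(n, dtype=np.bool_)
spikes[1000:2000] = True
artifact[1500:1600] = True
artifact[5000:5100] = True

clean_spikes = spikes & ~artifact          # AND / NOT on the bit-vectors
any_event = spikes | artifact              # OR
selected = np.flatnonzero(clean_spikes)    # indices usable to slice the raw data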
From: Francesc A. <fa...@py...> - 2012-04-29 22:41:06
On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the blocking
>> technique (in the sense of [1]). For a compressor that works with
>> blocks, you need to add some metainformation for each block, and that
>> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
> Absolutely!
>
> Blocking seems a good approach for most data, where the many possible
> values degrade very fast the potential compression gains of a
> run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x penalty in storage,
> would be an excellent candidate for RLE. [...]
>
> I don't know whether it's possible to have such a specialization for the
> compression of boolean arrays in PyTables. Maybe a simple, alternative
> route is to make the chunklength dependent on the likelihood of repeated
> data (i.e. the range of the type domain), or at the very least to
> special-case chunklength estimation for booleans to be somewhat higher
> than for other datatypes. This, again, I think is an exception that
> would do justice to the main use-case of PyTables.

Yes, I think you raised a good point here. Well, there are quite a few possibilities for reducing the space of highly redundant data, and the first should be to add a special case in blosc so that, before passing control to blosclz, it first checks for identical data across the whole block, and if found, collapses everything to a counter and a value. This would require a bit more CPU effort during compression (so it could be active only at higher compression levels), but would lead to far better compression ratios.

Another possibility is to add code to deal directly with compressed data, but that should be done more at the PyTables (or carray, the package) level, with some help from the blosc compressor. In particular, it would be very interesting to implement interval algebra on top of such extremely compressed interval data.

>> Yup, agreed. Don't know what to do here. carray was more a
>> proof-of-concept than anything else, but if development for it continues
>> in the future, I should ponder changing the names.
> It's a neat package and I hope it gets the appreciation and support it deserves!

Thanks, I also think it can be useful in some situations. But before it sees wider use, more work should be put into the range of operations supported. Also, defining a C API and being able to use it straight from C could help spread package adoption too.

--
Francesc Alted