From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:29:47
Hi there,

I am following advice by Anthony and giving a go at representing different sensors in my dataset as columns in a Table, or in several Tables. This is about in-kernel queries.

The documentation of condvars in Table.where [1] says "condvars should consist of identifier-like strings pointing to Column (see The Column class) instances of this table, or to other values (which will be converted to arrays)".

Conversion to arrays will likely exhaust the memory and be slow. Furthermore, when I tried with a toy example (naively extrapolating the behaviour of indexing in numpy), I obtained

In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18) & (a<4)""", condvars={'a':tet1.cols.V01, 'b':tet2.cols.V02})]
(... elided output)
ValueError: variable ``b`` refers to a column which is not part of table ``/tetrode1

I am interested in the scenario where an in-kernel query is applied to a table based on columns *from other tables* that are still aligned with the current table (same number of elements). These conditions may be sophisticated and mix columns from the local table as well.

One obvious solution would be to put all aligned columns in the same table. But adding columns to a table is cumbersome, and I cannot think beforehand of all the precomputed columns that I would like to use as query conditions.

What do you recommend in this scenario?

-á.

[1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where
From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:43:53
Would it be an option to have

* raw data in one table
* all imaginable columns used for query conditions in another table (but how to grow it in columns without deleting & recreating?)

and fetch indexes for the first based on .whereList(condition) of the second?

Are there alternatives?

-á.
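A minimal sketch of that two-table pattern with the PyTables 2.x API (the file, table and column names are hypothetical, and it assumes both tables are row-aligned):

import tables

# /raw holds the signal columns; /conds holds the row-aligned, precomputed
# columns that are only used for querying.
h5 = tables.openFile('recording.h5', mode='r')
raw = h5.root.raw
conds = h5.root.conds

# In-kernel query on the condition table only...
coords = conds.getWhereList('(b > 18) & (a < 4)')

# ...then fetch the matching rows (or one column) from the raw table.
matching = raw.readCoordinates(coords)
v01 = raw.readCoordinates(coords, field='V01')

h5.close()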
From: Francesc A. <fa...@py...> - 2012-03-26 22:57:33
Hi Alvaro,

On 3/26/12 12:43 PM, Alvaro Tejero Cantero wrote:
> Would it be an option to have
>
> * raw data in one table
> * all imaginable columns used for query conditions in another table

Yes, that sounds like a good solution to me.

> (but how to grow it in columns without deleting & recreating?)

You can't (at least not cheaply). You may want to create additional tables and group them in terms of the columns you are going to need for your queries.

> and fetch indexes for the first based on .whereList(condition) of the second?

Exactly.

> Are there alternatives?

Yes. The alternative would be to have column-wise tables, which would allow you to add and remove columns at a cost of almost zero. This idea of column-wise tables is quite flexible, and would let you have even variable-length columns, as well as computed columns (that is, data that is generated from other columns). These would have a lot of applications, IMO. I would like to add this proposal to our next round of applications for projects to improve PyTables. Let's see how it goes.

Francesc

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-03-27 07:21:11
>> (but how to grow it in columns without deleting & recreating?)
>
> You can't (at least not cheaply). You may want to create additional
> tables and group them in terms of the columns you are going to need
> for your queries.

Sorry, it is not clear to me: create new tables and group them (grouping = grouping in HDF5 Groups?) in terms of the columns? As far as I understood, only columns in the same table (regardless of the table's group) can be queried together with the in-kernel engine?

>> Are there alternatives?
>
> Yes. The alternative would be to have column-wise tables, which would
> allow you to add and remove columns at a cost of almost zero. [...]

This sounds definitely interesting. But I see the value of PyTables being able to query columns in different tables in-kernel, because it removes one big constraint on data layout (and this is in turn important because attr dictionaries can only be attached to whole tables, AFAIK).

The solution I suggest would be that whenever other columns are involved, the in-kernel engine loops over the zip of the columns. It could do a pre-check on column length before starting. This would be a quite useful enhancement for me.
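For illustration, a user-level stand-in for the zipped evaluation proposed above, written with the PyTables 2.x row iterators (table and column names are made up, and this plain-Python loop is far slower than a real in-kernel query; it only shows the intended semantics):

import itertools
import tables

h5 = tables.openFile('recording.h5', mode='r')
tet1, tet2 = h5.root.tetrode1, h5.root.tetrode2

# The pre-check on column length mentioned above.
assert tet1.nrows == tet2.nrows

# Walk both row-aligned tables in lockstep and apply the mixed condition.
values = [r1['V01']
          for r1, r2 in itertools.izip(tet1.iterrows(), tet2.iterrows())
          if r2['V02'] > 18 and r1['V01'] < 4]

h5.close()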
From: Francesc A. <fa...@py...> - 2012-03-27 23:11:53
On 3/27/12 2:20 AM, Alvaro Tejero Cantero wrote:
>>> (but how to grow it in columns without deleting & recreating?)
>> You can't (at least not cheaply). You may want to create additional
>> tables and group them in terms of the columns you are going to need
>> for your queries.
> Sorry, it is not clear to me: create new tables and group them (grouping =
> grouping in HDF5 Groups?) in terms of the columns?
> As far as I understood, only columns in the same table (regardless of the
> table's group) can be queried together with the in-kernel engine?

Yes, but the idea is to get rid of this limitation of only being able to query columns in the same table (which is somewhat artificial). In fact, now that I think about this, you can actually implement queries on different unidimensional arrays (think of them as independent columns) by using the `tables.Expr` class. More on this later.

--
Francesc Alted
From: Francesc A. <fa...@py...> - 2012-03-27 23:34:35
Another option that occurred to me recently is to save all your columns as unidimensional arrays (Array objects or, if you want compression, CArray or EArray), and then use them as components of a boolean expression via the class `tables.Expr`. For example, if a, b and c are unidimensional arrays of the same size, you can do:

bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
indices = [ind for ind, bool_val in bool_cond if bool_val]
results = your_dataset[indices]

Does that make sense for your problem? Of course, this class uses numexpr behind the scenes, so it is perfectly equivalent to classical queries on tables, but without being restricted to tables. Please see more details about `tables.Expr` in:

http://pytables.github.com/usersguide/libref.html#the-expr-class-a-general-purpose-expression-evaluator

Francesc

--
Francesc Alted
From: Francesc A. <fa...@gm...> - 2012-03-28 14:36:56
On 3/27/12 6:34 PM, Francesc Alted wrote:
> Another option that occurred to me recently is to save all your columns
> as unidimensional arrays [...] For example, if a, b and c are
> unidimensional arrays of the same size, you can do:
>
> bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
> indices = [ind for ind, bool_val in bool_cond if bool_val]

Of course, the above line needs to read:

indices = [ind for ind, bool_val in enumerate(bool_cond) if bool_val]

> results = your_dataset[indices]

Another solution, probably faster, although you need to make sure that you have enough memory to keep your boolean array, is this:

bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)')
bool_arr = bool_cond.eval()
results = your_dataset[bool_arr]

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-03-28 15:15:47
That is a perfectly fine solution for me, as long as the arrays aren't copied in memory for the query. Thank you!

Thinking that your proposed solution uses iterables to avoid that, I tried

boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)')
indices = [i for i,v in boolcond if v]
(...)
TypeError: 'numpy.bool_' object is not iterable

I can, however, do

boolarr = boolcond.eval()
indices = np.nonzero(boolarr)

but then I get boolarr into memory.

Did I miss something? What is your advice on how to monitor the use of memory? (I need this until PyTables is second skin.)

It is very rewarding to see that these numexpr expressions are 3-4 times faster than the same with arrays in memory. However, I didn't find a way to set the number of threads used.

When evaluating the blosc benchmarks I found that on my system with two 6-core processors, using 12 threads is best for writing and 6 for reading. Interesting...

Another question (maybe for a separate thread): is there any way to shrink the memory usage of booleans to 1 bit? It might well be that this optimizes the use of the memory bus (at some processing cost), but I am not aware of a numpy container for this.

-á.
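As an aside on the 1-bit question: NumPy itself can at least pack a boolean mask into one bit per element with np.packbits, at the cost of packing/unpacking around each use. A small sketch (only the sizes are taken from the discussion; the slice position is invented):

import numpy as np

mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True

# One bit per element: ~12.5 MB instead of ~100 MB for the bool array.
packed = np.packbits(mask.view(np.uint8))

# Unpack and trim the padding before using the mask again.
restored = np.unpackbits(packed)[:mask.size].astype(np.bool_)
assert np.array_equal(mask, restored)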
From: Francesc A. <fa...@py...> - 2012-03-28 18:04:05
On 3/28/12 10:15 AM, Alvaro Tejero Cantero wrote:
> That is a perfectly fine solution for me, as long as the arrays aren't
> copied in memory for the query.

No, the arrays are not copied in memory. They are just read from disk block by block, and the output is directed to the iterator or to an array (depending on the context).

> Thinking that your proposed solution uses iterables to avoid that, I tried
>
> boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)')
> indices = [i for i,v in boolcond if v]
> (...)
> TypeError: 'numpy.bool_' object is not iterable
>
> I can, however, do
>
> boolarr = boolcond.eval()
> indices = np.nonzero(boolarr)
>
> but then I get boolarr into memory.
>
> Did I miss something?

Yes, that was an error on my part. The correct way is:

indices = [i for i, v in enumerate(boolcond) if v]

> What is your advice on how to monitor the use of
> memory? (I need this until PyTables is second skin.)

top?

> It is very rewarding to see that these numexpr expressions are 3-4 times
> faster than the same with arrays in memory. However, I didn't find a way
> to set the number of threads used.

Well, you can use the `MAX_THREADS` variable in 'parameters.py', but this does not offer separate controls for numexpr and blosc. Feel free to open a ticket asking for improving this functionality.

> When evaluating the blosc benchmarks I found that on my system with
> two 6-core processors, using 12 threads is best for writing and 6 for
> reading. Interesting...

Yes, it is :)

> Another question (maybe for a separate thread): is there any way to
> shrink the memory usage of booleans to 1 bit? It might well be that this
> optimizes the use of the memory bus (at some processing cost), but I
> am not aware of a numpy container for this.

Maybe a compressed array? That could lead to using less than 1 bit per element in many situations. If you are interested in this, look into:

https://github.com/FrancescAlted/carray

--
Francesc Alted
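A hedged sketch of how that could look in practice, assuming the PyTables 2.x behaviour that settings from tables/parameters.py can be passed as keyword arguments to openFile (at the time MAX_THREADS governed both numexpr and Blosc threads):

import tables

# Override MAX_THREADS for this file handle only (assumption: parameter
# overrides via openFile keyword arguments, as in PyTables 2.x).
h5 = tables.openFile('recording.h5', mode='r', MAX_THREADS=6)
# ... run queries / Expr evaluations tied to this file ...
h5.close()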
From: Alvaro T. C. <al...@mi...> - 2012-03-29 15:50:01
>> What is your advice on how to monitor the use of
>> memory? (I need this until PyTables is second skin.)
>
> top?

I had so far used it only in a very rudimentary way and found the man page quite intimidating. Would you care to share your tips for this particular scenario? (e.g. how do you keep the ipython process 'focused'?)

>> However, I didn't find a way to set the number of threads used.
>
> Well, you can use the `MAX_THREADS` variable in 'parameters.py', but
> this does not offer separate controls for numexpr and blosc. Feel free to
> open a ticket asking for improving this functionality.

Ok, I opened the following tickets (since I have to build the application first and then revisit the infrastructural issues, I cannot do more about them now):

* one for the implementation of references: https://github.com/PyTables/PyTables/issues/140
* one for the estimation of dataset (group?) size: https://github.com/PyTables/PyTables/issues/141
* one for an interface function to set MAX_THREADS for numexpr independently of blosc's: https://github.com/PyTables/PyTables/issues/142

>> When evaluating the blosc benchmarks I found that on my system with
>> two 6-core processors, using 12 threads is best for writing and 6 for
>> reading. Interesting...
>
> Yes, it is :)

Are you interested in my .out bench output file for the SyntheticBenchmarks page?

>> Another question (maybe for a separate thread): is there any way to
>> shrink the memory usage of booleans to 1 bit? [...]
>
> Maybe a compressed array? That could lead to using less than 1 bit per
> element in many situations. If you are interested in this, look into:
>
> https://github.com/FrancescAlted/carray

Ok, I did some playing around with this:

* A bool array of 10**8 elements with True in two separate slices of length 10**6 each compresses by ~350. Using .wheretrue to obtain indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy array). The resulting file size is 248 kB, still far from storing the 4 or 6 integer indexes that define the slices (I am experimenting with an approach for scientific databases where this is a concern).

* A sample of my normal electrophysiological data (15M int16 data points) compresses by about 1.7-1.8.

* How blosc chooses the chunklen is black magic to me, but it seems to be quite spot-on (e.g. it changed from '1' for a 64x15M array to 64*1024 when CArraying only one row).

* A quick way to know how well your data will compress in PyTables, if you will be using blosc, is to test in the REPL with CArray. I guess for the other compressors we still need to go (for the moment) to checking filesystem-reported sizes.

Best,

á.
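For concreteness, the boolean-mask experiment above can be reproduced roughly like this with the carray package (slice positions are invented; only the sizes match the description):

import numpy as np
import carray as ca

# 10**8 booleans with two True runs of 10**6 elements each.
mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True
mask[6 * 10**7:6 * 10**7 + 10**6] = True

cmask = ca.carray(mask)                  # repr reports nbytes, cbytes, ratio
true_indices = list(cmask.wheretrue())   # the iterator compared to np.nonzero(mask)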
From: Francesc A. <fa...@py...> - 2012-03-29 23:53:07
On 3/29/12 10:49 AM, Alvaro Tejero Cantero wrote:
>>> What is your advice on how to monitor the use of memory? [...]
>> top?
> I had so far used it only in a very rudimentary way and found the man
> page quite intimidating. Would you care to share your tips for this
> particular scenario? (e.g. how do you keep the ipython process 'focused'?)

Well, top by default keeps the most CPU-consuming process always on the top (hence the name), so I think it is quite easy to spot the interesting process. vmstat is another interesting utility, but it reports only on general virtual memory consumption, not on a per-process basis.

Finally, if you can afford to instrument your code and you use Linux (I assume this is the case), then you may want to use a small routine that tells you the memory used by the caller process each time it is called. Here is an example of how this is used in the PyTables test suite:

https://github.com/PyTables/PyTables/blob/master/tables/tests/common.py#L483

I'm sure you will figure out how to use it in your own scenario.

> Ok, I opened the following tickets [...]:
>
> * one for the implementation of references: https://github.com/PyTables/PyTables/issues/140
> * one for the estimation of dataset (group?) size: https://github.com/PyTables/PyTables/issues/141
> * one for an interface function to set MAX_THREADS for numexpr
>   independently of blosc's: https://github.com/PyTables/PyTables/issues/142

Excellent. Thanks!

> Are you interested in my .out bench output file for the SyntheticBenchmarks page?

Yes, I am! And if you can produce the matplotlib figures, that would cause much rejoicing :)

> * A bool array of 10**8 elements with True in two separate slices of
> length 10**6 each compresses by ~350. Using .wheretrue to obtain
> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
> array). The resulting file size is 248 kB, still far from storing the 4
> or 6 integer indexes that define the slices (I am experimenting with
> an approach for scientific databases where this is a concern).

Oh, you were asking for an 8-to-1 compressor (booleans as bits), but apparently 350-to-1 is not enough? :)

> * A sample of my normal electrophysiological data (15M int16 data
> points) compresses by about 1.7-1.8.

Well, I was expecting somewhat more for these time-series data, but it is not that bad for int16. Probably int32 or float64 would reach better compression ratios.

> * How blosc chooses the chunklen is black magic to me, but it seems to
> be quite spot-on (e.g. it changed from '1' for a 64x15M array to
> 64*1024 when CArraying only one row).

Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could you detail a bit more how you got this result? Providing an example would be very useful.

> * A quick way to know how well your data will compress in PyTables, if
> you will be using blosc, is to test in the REPL with CArray. I guess
> for the other compressors we still need to go (for the moment) to
> checking filesystem-reported sizes.

Just be sure that you experiment with different chunk lengths by using the `chunklen` parameter in the carray constructor too.

--
Francesc Alted
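A minimal Linux-only variant of such a memory probe, in the spirit of the helper linked above (it reads the VmSize/VmRSS lines for the current process from /proc; illustrative only):

import os

def report_memory(tag=''):
    # /proc/<pid>/status exposes per-process memory counters on Linux.
    with open('/proc/%d/status' % os.getpid()) as f:
        for line in f:
            if line.startswith(('VmSize:', 'VmRSS:')):
                print('%s %s' % (tag, line.strip()))

report_memory('after eval:')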
From: Alvaro T. C. <al...@mi...> - 2012-04-25 12:05:51
Hi, a minor update on this thread.

>> * A bool array of 10**8 elements with True in two separate slices of
>> length 10**6 each compresses by ~350. [...]
>
> Oh, you were asking for an 8-to-1 compressor (booleans as bits), but
> apparently 350-to-1 is not enough? :)

Here I expected more from a run-length-like compression scheme. My array would be compressible to the following representation:

(0, x) : 0
(x, x+10**6) : 1
(x+10**6, y) : 0
(y, y+10**6) : 1
(y+10**6, 10**8) : 0

or just:

(x, x+10**6) : 1
(y, y+10**6) : 1

where x and y are two reasonable integers (i.e. in range and with no overlap).

>> * How blosc chooses the chunklen is black magic to me, but it seems to
>> be quite spot-on (e.g. it changed from '1' for a 64x15M array to
>> 64*1024 when CArraying only one row).
>
> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
> you detail a bit more how you got this result? Providing an example
> would be very useful.

I revisited this issue. While in PyTables CArray the guesses are reasonable, the problem is in carray.carray (or in its reporting of chunklen).

This is the offender:

carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
  cparams := cparams(clevel=5, shuffle=True)

In [87]: x.chunklen
Out[87]: 1

Could it be that carray is not reporting the second dimension of the chunkshape? (In PyTables, this is 262144.)

The fact that both PyTables' CArray and carray.carray are named carray is a bit confusing.
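For reference, the interval representation sketched above can be extracted from a boolean mask with a few lines of NumPy (names and slice positions are illustrative):

import numpy as np

def true_runs(mask):
    # Pad with False so runs touching either end are detected too.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
    return edges.reshape(-1, 2)   # each row is a (start, stop) pair, stop exclusive

mask = np.zeros(10**8, dtype=np.bool_)
mask[10**6:2 * 10**6] = True
mask[6 * 10**7:6 * 10**7 + 10**6] = True
runs = true_runs(mask)            # two rows: [1000000, 2000000] and [60000000, 61000000]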
From: Francesc A. <fa...@py...> - 2012-04-26 03:07:54
On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
> Hi, a minor update on this thread.
>
>>> * A bool array of 10**8 elements with True in two separate slices of
>>> length 10**6 each compresses by ~350. [...]
>> Oh, you were asking for an 8-to-1 compressor (booleans as bits), but
>> apparently 350-to-1 is not enough? :)
> Here I expected more from a run-length-like compression scheme. My
> array would be compressible to the following representation:
>
> (0, x) : 0
> (x, x+10**6) : 1
> (x+10**6, y) : 0
> (y, y+10**6) : 1
> (y+10**6, 10**8) : 0
>
> or just:
>
> (x, x+10**6) : 1
> (y, y+10**6) : 1
>
> where x and y are two reasonable integers (i.e. in range and with no overlap).

Sure, but this is not the spirit of a compressor adapted to the blocking technique (in the sense of [1]). For a compressor that works with blocks, you need to add some metainformation for each block, and that takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.

[1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

>>> * How blosc chooses the chunklen is black magic to me, but it seems to
>>> be quite spot-on [...]
>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>> you detail a bit more how you got this result? Providing an example
>> would be very useful.
> I revisited this issue. While in PyTables CArray the guesses are
> reasonable, the problem is in carray.carray (or in its reporting of
> chunklen).
>
> This is the offender:
>
> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>   cparams := cparams(clevel=5, shuffle=True)
>
> In [87]: x.chunklen
> Out[87]: 1
>
> Could it be that carray is not reporting the second dimension of the
> chunkshape? (In PyTables, this is 262144.)

Ah yes, this is it. The carray package is not as sophisticated as HDF5, and it only blocks on the leading dimension. In this case, it is saying that the block is a complete row. So this is the intended behaviour.

> The fact that both PyTables' CArray and carray.carray are named carray
> is a bit confusing.

Yup, agreed. Don't know what to do here. carray was more a proof-of-concept than anything else, but if development for it continues in the future, I should ponder changing the names.

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-04-26 09:05:29
On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fa...@py...> wrote:
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]). For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

Absolutely!

Blocking seems a good approach for most data, where the many possible values degrade very fast the potential compression gains of a run-length-encoding (RLE) based scheme.

But boolean arrays, which are used extremely often as masks in scientific applications and already suffer an 8x penalty in storage, would be an excellent candidate for RLE. Boolean arrays are also an interesting way to encode attributes as 'bit-vectors', i.e. instead of storing an enum column 'car color' with values in {red, green, blue}, you store three boolean arrays 'red', 'green', 'blue'. Where this gets interesting is in allowing more generality, because you don't need a taxonomy: red and green need not be exclusive if they are tags on a genetic sequence (or, in my case, an electrophysiological recording). To compute ANDs and ORs you just perform the corresponding bit-wise operations if you reconstruct the bit-vector, or you can use some smart algorithm on the intervals themselves (as mentioned in another mail, I think; R*-trees or Nested Containment Lists are two viable candidates).

I don't know whether it's possible to have such a specialization for the compression of boolean arrays in PyTables. Maybe a simple, alternative route is to make the chunklength dependent on the likelihood of repeated data (i.e. the range of the type domain), or at the very least to special-case chunklength estimation for booleans to be somewhat higher than for other datatypes. This, again, I think is an exception that would do justice to the main use-case of PyTables.

>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen). [...]
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (In PyTables, this is 262144.)
>
> Ah yes, this is it. The carray package is not as sophisticated as HDF5,
> and it only blocks on the leading dimension. In this case, it is saying
> that the block is a complete row. So this is the intended behaviour.

Ok, it makes sense, and in my particular use case the rows do fit in memory, so there is no need for further chunking.

>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed. Don't know what to do here. carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder changing the names.

It's a neat package and I hope it gets the appreciation and support it deserves!

Cheers,

Álvaro.
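A small sketch of the bit-vector tagging idea with plain NumPy masks (tag names and data are invented; each tag is one boolean array, and tags need not be exclusive):

import numpy as np

n = 10**6
spikes = np.zeros(n, dtype=np.bool_)
artifact = np.zeros(n, dtype=np.bool_)
spikes[1000:2000] = True
artifact[1500:1600] = True
artifact[5000:5100] = True

clean_spikes = spikes & ~artifact          # AND / NOT on the bit-vectors
any_event = spikes | artifact              # OR
selected = np.flatnonzero(clean_spikes)    # indices usable to slice the raw data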
From: Francesc A. <fa...@py...> - 2012-04-29 22:41:06
On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the blocking
>> technique (in the sense of [1]). For a compressor that works with
>> blocks, you need to add some metainformation for each block, and that
>> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
> Absolutely!
>
> Blocking seems a good approach for most data, where the many possible
> values degrade very fast the potential compression gains of a
> run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x penalty in storage,
> would be an excellent candidate for RLE. [...]
>
> I don't know whether it's possible to have such a specialization for the
> compression of boolean arrays in PyTables. Maybe a simple, alternative
> route is to make the chunklength dependent on the likelihood of repeated
> data (i.e. the range of the type domain), or at the very least to
> special-case chunklength estimation for booleans to be somewhat higher
> than for other datatypes. This, again, I think is an exception that
> would do justice to the main use-case of PyTables.

Yes, I think you raised a good point here. Well, there are quite a few possibilities for reducing the space of highly redundant data, and the first should be to add a special case in blosc so that, before passing control to blosclz, it first checks for identical data across the whole block, and if found, collapses everything to a counter and a value. This would require a bit more CPU effort during compression (so it could be active only at higher compression levels), but would lead to far better compression ratios.

Another possibility is to add code to deal directly with compressed data, but that should be done more at the PyTables (or carray, the package) level, with some help from the blosc compressor. In particular, it would be very interesting to implement interval algebra on top of such extremely compressed interval data.

>> Yup, agreed. Don't know what to do here. carray was more a
>> proof-of-concept than anything else, but if development for it continues
>> in the future, I should ponder changing the names.
> It's a neat package and I hope it gets the appreciation and support it deserves!

Thanks, I also think it can be useful in some situations. But before it sees wider use, more work should be put into the range of operations supported. Also, defining a C API and being able to use it straight from C could help spread package adoption too.

--
Francesc Alted