From: Francesc A. <fa...@py...> - 2012-04-02 20:55:51
|
I don't know. I personally do not have experience with PyDev. If you don't see the message about PyTables closing files, then there is a high probability that it does not do that. In this case, your suggestion on using try-except-finally block is your best bet, IMO. Francesc On 4/2/12 2:59 PM, Daπid wrote: > I noticed that if a program raises an error, it shows a message > indicating the file is closed, but it doesn't show anything if I > terminate it from outside (in my case, stop from PyDev). > > Is it being flushed? Is there any way of doing that, apart from > enveloping the whole program in a try-except-finally block? > > On Mon, Apr 2, 2012 at 9:48 PM, Francesc Alted<fa...@py...> wrote: >> On 4/2/12 12:38 PM, Alvaro Tejero Cantero wrote: >>> Hi, >>> >>> should PyTables flush on __exit__ ? >>> https://github.com/PyTables/PyTables/blob/master/tables/file.py#L2164 >>> >>> it is not clear to me if a File.close() call results in automatic >>> flushing all the nodes, since Node()._f_close() promises only "On >>> nodes with data, it may be flushed to disk." >>> https://github.com/PyTables/PyTables/blob/master/tables/node.py#L512 >> Yup, it does flush. The message should be more explicit on this. >> >> -- >> Francesc Alted >> >> >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
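For readers landing on this thread from a web search, here is a minimal sketch of the try-finally pattern recommended above (the file name and the work done inside the block are made up; the call is spelled tables.openFile in the PyTables 2.x series discussed here and tables.open_file from 3.0 on):

    import tables

    h5file = tables.openFile("data.h5", mode="a")   # tables.open_file() in PyTables >= 3.0
    try:
        # ... create tables/arrays and append data here ...
        pass
    finally:
        # File.close() flushes every dirty node before closing, so whatever was
        # written up to the point of an exception reaches disk. It cannot help,
        # though, if the process is killed from outside before this handler runs.
        h5file.close()

The context-manager form, "with tables.openFile('data.h5', mode='a') as h5file: ...", goes through the same __exit__ -> close() path discussed in this thread.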
From: Daπid <dav...@gm...> - 2012-04-02 19:59:35
|
I noticed that if a program raises an error, it shows a message indicating the file is closed, but it doesn't show anything if I terminate it from outside (in my case, stop from PyDev). Is it being flushed? Is there any way of doing that, apart from enveloping the whole program in a try-except-finally block? On Mon, Apr 2, 2012 at 9:48 PM, Francesc Alted <fa...@py...> wrote: > On 4/2/12 12:38 PM, Alvaro Tejero Cantero wrote: >> Hi, >> >> should PyTables flush on __exit__ ? >> https://github.com/PyTables/PyTables/blob/master/tables/file.py#L2164 >> >> it is not clear to me if a File.close() call results in automatic >> flushing all the nodes, since Node()._f_close() promises only "On >> nodes with data, it may be flushed to disk." >> https://github.com/PyTables/PyTables/blob/master/tables/node.py#L512 > > Yup, it does flush. The message should be more explicit on this. > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Francesc A. <fa...@py...> - 2012-04-02 19:48:45
|
On 4/2/12 12:38 PM, Alvaro Tejero Cantero wrote: > Hi, > > should PyTables flush on __exit__ ? > https://github.com/PyTables/PyTables/blob/master/tables/file.py#L2164 > > it is not clear to me if a File.close() call results in automatic > flushing all the nodes, since Node()._f_close() promises only "On > nodes with data, it may be flushed to disk." > https://github.com/PyTables/PyTables/blob/master/tables/node.py#L512 Yup, it does flush. The message should be more explicit on this. -- Francesc Alted |
From: Alvaro T. C. <al...@mi...> - 2012-04-02 17:38:43
|
Hi, should PyTables flush on __exit__? https://github.com/PyTables/PyTables/blob/master/tables/file.py#L2164 It is not clear to me whether a File.close() call results in automatically flushing all the nodes, since Node()._f_close() promises only "On nodes with data, it may be flushed to disk." https://github.com/PyTables/PyTables/blob/master/tables/node.py#L512 Cheers, -á. |
From: Daπid <dav...@gm...> - 2012-03-31 12:37:30
|
On Sat, Mar 31, 2012 at 9:13 AM, Antonio Valentino <ant...@ti...> wrote: > Some work on in broken inheritance of IsDescriptor has been done in the > development branch (see [#65]). Thanks to Andrea Bedini for the patch. > > Which version of PyTables are you using David? That is indeed a nice addition. I am using the released 2.3.1 version. By the way, sorry for the incomplete subject. |
From: Francesc A. <fa...@py...> - 2012-03-31 12:30:07
|
On 3/31/12 2:13 AM, Antonio Valentino wrote: > Hi Danid, hi Francesc, > > Il 31/03/2012 03:08, Francesc Alted ha scritto: >> On 3/30/12 7:57 PM, Daπid wrote: >>> Hello, >>> >>> I have several different kinds of data tables, absolutely independent, >>> defined as in the tutorial >>> (http://pytables.github.com/usersguide/tutorials.html): >>> > [CUT] > >>> One, naively, saw the repetition and would want to do something like: >>> >>> class Pie(IsDescription): # All kind of pies >>> dough=Int64Col() >>> baking=Float64Col() >>> >>> >>> class SaltedPie(Pie): >>> anchovy=Float32Col() >>> >>> class SweetPie(Pie): >>> apple=UInt32Col() >>> >>> >>> but, when I try to set 'dough', I get: >>> >>> KeyError: 'no such column: dough' >>> >>> >>> Of course, my approach is not correct. Is there a valid way of doing it? >> Right, subclassing IsDescription is not supported. Sorry, but I think >> that the only way is to do the repetition explicitly. Or, maybe you can >> use a NumPy dtype instead, that allows you to create table schemas more >> succinctly. >> > Some work on in broken inheritance of IsDescriptor has been done in the > development branch (see [#65]). Thanks to Andrea Bedini for the patch. > > Which version of PyTables are you using David? > > > cheers > > [#65] https://github.com/PyTables/PyTables/issues/65 Oh, that's a nice addition indeed. Thanks for the remind! -- Francesc Alted |
From: Antonio V. <ant...@ti...> - 2012-03-31 07:13:20
|
Hi David, hi Francesc, On 31/03/2012 03:08, Francesc Alted wrote: > On 3/30/12 7:57 PM, Daπid wrote: >> Hello, >> >> I have several different kinds of data tables, absolutely independent, >> defined as in the tutorial >> (http://pytables.github.com/usersguide/tutorials.html): >> [CUT] >> >> One, naively, saw the repetition and would want to do something like: >> >> class Pie(IsDescription): # All kind of pies >> dough=Int64Col() >> baking=Float64Col() >> >> >> class SaltedPie(Pie): >> anchovy=Float32Col() >> >> class SweetPie(Pie): >> apple=UInt32Col() >> >> >> but, when I try to set 'dough', I get: >> >> KeyError: 'no such column: dough' >> >> >> Of course, my approach is not correct. Is there a valid way of doing it? > > Right, subclassing IsDescription is not supported. Sorry, but I think > that the only way is to do the repetition explicitly. Or, maybe you can > use a NumPy dtype instead, that allows you to create table schemas more > succinctly. > Some work on the broken inheritance of IsDescription has been done in the development branch (see [#65]). Thanks to Andrea Bedini for the patch. Which version of PyTables are you using, David? cheers [#65] https://github.com/PyTables/PyTables/issues/65 -- Antonio Valentino |
From: Francesc A. <fa...@py...> - 2012-03-31 01:08:57
|
On 3/30/12 7:57 PM, Daπid wrote: > Hello, > > I have several different kinds of data tables, absolutely independent, > defined as in the tutorial > (http://pytables.github.com/usersguide/tutorials.html): > > > from tables import * > > class SaltedPie(IsDescription): > dough=Int64Col() > baking=Float64Col() > anchovy=Float32Col() > > class SweetPie(IsDesctiption): > dough=Int64Col() > baking=Float64Col() > apple=UInt32Col() > > > One, naively, saw the repetition and would want to do something like: > > class Pie(IsDescription): # All kind of pies > dough=Int64Col() > baking=Float64Col() > > > class SaltedPie(Pie): > anchovy=Float32Col() > > class SweetPie(Pie): > apple=UInt32Col() > > > but, when I try to set 'dough', I get: > > KeyError: 'no such column: dough' > > > Of course, my approach is not correct. Is there a valid way of doing it? Right, subclassing IsDescription is not supported. Sorry, but I think that the only way is to do the repetition explicitly. Or, maybe you can use a NumPy dtype instead, that allows you to create table schemas more succinctly. -- Francesc Alted |
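A minimal sketch of the NumPy dtype route suggested above; it sidesteps the inheritance problem because the shared columns live in a plain Python list (the file and table names are invented; createTable is spelled create_table in PyTables >= 3.0):

    import numpy as np
    import tables

    # Shared "pie" columns defined once, then extended per table.
    pie_fields = [('dough', np.int64), ('baking', np.float64)]
    salted_dtype = np.dtype(pie_fields + [('anchovy', np.float32)])
    sweet_dtype = np.dtype(pie_fields + [('apple', np.uint32)])

    h5file = tables.openFile("pies.h5", mode="w")   # tables.open_file() in >= 3.0
    salted = h5file.createTable('/', 'salted_pies', salted_dtype)
    sweet = h5file.createTable('/', 'sweet_pies', sweet_dtype)

    row = salted.row
    row['dough'], row['baking'], row['anchovy'] = 3, 180.0, 0.25
    row.append()
    salted.flush()
    h5file.close()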
From: Daπid <dav...@gm...> - 2012-03-31 00:57:48
|
Hello, I have several different kinds of data tables, absolutely independent, defined as in the tutorial (http://pytables.github.com/usersguide/tutorials.html):

from tables import *

class SaltedPie(IsDescription):
    dough=Int64Col()
    baking=Float64Col()
    anchovy=Float32Col()

class SweetPie(IsDescription):
    dough=Int64Col()
    baking=Float64Col()
    apple=UInt32Col()

One, naively, saw the repetition and would want to do something like:

class Pie(IsDescription): # All kind of pies
    dough=Int64Col()
    baking=Float64Col()

class SaltedPie(Pie):
    anchovy=Float32Col()

class SweetPie(Pie):
    apple=UInt32Col()

but, when I try to set 'dough', I get:

KeyError: 'no such column: dough'

Of course, my approach is not correct. Is there a valid way of doing it? Thanks. Best regards, David. |
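For reference, one way that does work on the 2.x series, where description inheritance is not supported, is to build the descriptions as dictionaries, which createTable also accepts. A hedged sketch (the helper function and file name are invented):

    from tables import Int64Col, Float64Col, Float32Col, UInt32Col
    import tables

    def pie_cols():
        # Return fresh Col instances each call, so tables never share the same objects.
        return {'dough': Int64Col(), 'baking': Float64Col()}

    salted_desc = dict(pie_cols(), anchovy=Float32Col())
    sweet_desc = dict(pie_cols(), apple=UInt32Col())

    h5file = tables.openFile("pies.h5", mode="w")   # tables.open_file() in >= 3.0
    salted = h5file.createTable('/', 'salted_pies', salted_desc)
    sweet = h5file.createTable('/', 'sweet_pies', sweet_desc)
    h5file.close()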
From: Francesc A. <fa...@py...> - 2012-03-29 23:53:07
|
On 3/29/12 10:49 AM, Alvaro Tejero Cantero wrote: >>> What is your advice on how to monitor the use of >>> memory? (I need this until PyTables is second skin). >> top? > I had so far used it only in a very rudimentary way and found the man > page quite intimidating. Would you care to share your tips for this > particular scenario? (e.g. how do you keep the ipython process > 'focused'?) Well, top by default keeps the most CPU consuming process always on the top (hence the name), so I think it is quite easy to spot the interesting process. vmstat is another interesting utility, but will report only about general virtual memory consumption, not on a per-process basis. Finally, if you can afford to instrument your code and you use Linux (I assume this is the case), then you may want to use a small routine that tells you the memory used by the caller process each time it is called. Here is an example of how this is used in the PyTables test suite: https://github.com/PyTables/PyTables/blob/master/tables/tests/common.py#L483 I'm sure you will figure out how to use it in your own scenario. > >>> It is very rewarding to see that these numexpr's are 3-4 times faster >>> than the same with arrays in memory. However, I didn't find a way to >>> set the number of threads used >> Well, you can use the `MAX_THREADS` variable in 'parameters.py', but >> this do not offer separate controls for numexpr and blosc. Feel free to >> open a ticket asking for imporving this functionality. > Ok, I opened the following tickets (since I have to build the > application first and then revisit the infrastructural issues, I > cannot do more about them now): > > * one for implementation of references > https://github.com/PyTables/PyTables/issues/140 > * one for the estimation of dataset (group?) size > https://github.com/PyTables/PyTables/issues/141 > * one for an interface function to set MAX_THREADS for numexpr > independently of blosc's > https://github.com/PyTables/PyTables/issues/142 Excellent. Thanks! >>> When evaluating the blosc benchmarks I found that in my system with >>> two 6-core processors , using 12 is best for writing and 6 for >>> reading. Interesting... >> Yes, it is :) > Are you interested in my .out bench output file for the > SyntheticBenchmarks page? Yes, I am! And if you can produce the matplotlib figures, that would be much rejoice :) >>> Another question (maybe for a separate thread): is there any way to >>> shrink memory usage of booleans to 1 bit? It might well be that this >>> optimizes the use of the memory bus (at some processing cost). But I >>> am not aware of a numpy container for this. >> Maybe a compressed array? That would lead to using less that 1 bit per >> element in many situations. If you are interested in this, look into: >> >> https://github.com/FrancescAlted/carray > Ok, I did some playing around with this: > > * a bool array of 10**8 elements with True in two separate slices of > length 10**6 each compresses by ~350. Using .wheretrue to obtain > indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy > array). The resulting filesize is 248kb, still far from storing the 4 > or 6 integer indexes that define the slices (I am experimenting with > an approach for scientific databases where this is a concern). Oh, you were asking for an 8 to 1 compressor (booleans as bits), but apparently a 350 to 1 is not enough? :) > > * a sample of my normal electrophysiological data (15M Int16 data > points) compresses by about 1.7-1.8.
Well, I was expecting something more for these time series data, but it is not that bad for int16. Probably int32 or float64 would reach better compression rates. > > * how blosc choses the chunklen is black magic for me, but it seems to > be quite spot-on. (e.g. it changed from '1' for a 64x15M array to > 64*1024 when CArraying only one row). Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could you detail a bit more how you achieve this result? Providing an example would be very useful. > > * A quick way to know how well your data will compress in PyTables if > you will be using blosc is to test in the REPL with CArray. I guess > for the other compressors we still need to go (for the moment) to > checking filesystem-reported sizes. Just be sure that you experiment with different chunklengths by using the `chunklen` parameter in carray constructor too. -- Francesc Alted |
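A rough sketch of the instrumentation idea mentioned above, along the lines of the helper in the PyTables test suite: read the kernel-reported figures for the current process from /proc (Linux only; the function name and the fields printed are just one possible choice):

    import os

    def print_mem(label=""):
        # /proc/<pid>/status reports VmSize and VmRSS in kB.
        info = {}
        with open("/proc/%d/status" % os.getpid()) as f:
            for line in f:
                key, _, value = line.partition(":")
                info[key] = value.strip()
        print("%-10s VmSize: %-12s VmRSS: %s"
              % (label, info.get("VmSize"), info.get("VmRSS")))

    print_mem("before")
    # ... run the PyTables query or carray experiment here ...
    print_mem("after")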
From: Alvaro T. C. <al...@mi...> - 2012-03-29 15:52:54
|
Thanks for the link to Gaël's study (that shows well the strengths of pytables/blosc). Here' s the issue I created related to dataset size estimation linked for reference of people landing from web searches on this thread: https://github.com/PyTables/PyTables/issues/141 (there are pointers there to everything I know about current approaches). -á. On Wed, Mar 28, 2012 at 22:43, Francesc Alted <fa...@py...> wrote: > In case you want more feedback on compression filters, this study might > be interesting for you: > > http://gael-varoquaux.info/blog/?p=159 > > Francesc > > On 3/28/12 12:33 PM, Alvaro Tejero Cantero wrote: >> Hi, >> >> Trying to evaluate compression filters, I was looking for a call in >> PyTables to get the size of a dataset (in bytes). As I didn't find it >> I remembered the many benchmarks and found instead [1] that the way to >> do it is to create single-dataset files and interrogate the >> filesystem. Curiously enough, I didn't find that feature either in >> h5ls, hdfview or ViTables. Is there a structural reason why this size >> cannot be computed from library calls? >> >> -á. >> >> [1] https://github.com/PyTables/PyTables/blob/master/bench/blosc.py >> >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Alvaro T. C. <al...@mi...> - 2012-03-29 15:50:01
|
>> What is your advice on how to monitor the use of >> memory? (I need this until PyTables is second skin). > > top? I had so far used it only in a very rudimentary way and found the man page quite intimidating. Would you care to share your tips for this particular scenario? (e.g. how do you keep the ipython process 'focused'?) >> It is very rewarding to see that these numexpr's are 3-4 times faster >> than the same with arrays in memory. However, I didn't find a way to >> set the number of threads used > > Well, you can use the `MAX_THREADS` variable in 'parameters.py', but > this do not offer separate controls for numexpr and blosc. Feel free to > open a ticket asking for imporving this functionality. Ok, I opened the following tickets (since I have to build the application first and then revisit the infrastructural issues, I cannot do more about them now): * one for implementation of references https://github.com/PyTables/PyTables/issues/140 * one for the estimation of dataset (group?) size https://github.com/PyTables/PyTables/issues/141 * one for an interface function to set MAX_THREADS for numexpr independently of blosc's https://github.com/PyTables/PyTables/issues/142 >> When evaluating the blosc benchmarks I found that in my system with >> two 6-core processors , using 12 is best for writing and 6 for >> reading. Interesting... > > Yes, it is :) Are you interested in my .out bench output file for the SyntheticBenchmarks page? >> Another question (maybe for a separate thread): is there any way to >> shrink memory usage of booleans to 1 bit? It might well be that this >> optimizes the use of the memory bus (at some processing cost). But I >> am not aware of a numpy container for this. > > Maybe a compressed array? That would lead to using less that 1 bit per > element in many situations. If you are interested in this, look into: > > https://github.com/FrancescAlted/carray Ok, I did some playing around with this: * a bool array of 10**8 elements with True in two separate slices of length 10**6 each compresses by ~350. Using .wheretrue to obtain indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy array). The resulting filesize is 248kb, still far from storing the 4 or 6 integer indexes that define the slices (I am experimenting with an approach for scientific databases where this is a concern). * a sample of my normal electrophysiological data (15M Int16 data points) compresses by about 1.7-1.8. * how blosc choses the chunklen is black magic for me, but it seems to be quite spot-on. (e.g. it changed from '1' for a 64x15M array to 64*1024 when CArraying only one row). * A quick way to know how well your data will compress in PyTables if you will be using blosc is to test in the REPL with CArray. I guess for the other compressors we still need to go (for the moment) to checking filesystem-reported sizes. Best, á. > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
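A small sketch of the boolean-array experiment described above, for anyone who wants to reproduce the numbers (the package and attribute names are those of the carray 0.x series, which later evolved into bcolz; the sizes are arbitrary):

    from itertools import islice
    import numpy as np
    import carray as ca

    # Boolean vector with two True stretches of length 10**6 each.
    a = np.zeros(10**8, dtype=np.bool_)
    a[10**6:2 * 10**6] = True
    a[6 * 10**7:6 * 10**7 + 10**6] = True

    c = ca.carray(a)
    print("compression ratio: %.1f" % (float(c.nbytes) / c.cbytes))

    # wheretrue() iterates over the indices of the True elements without
    # materialising a full index array, unlike np.nonzero() on the plain array.
    print(list(islice(c.wheretrue(), 5)))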
From: Francesc A. <fa...@py...> - 2012-03-28 21:43:22
|
In case you want more feedback on compression filters, this study might be interesting for you: http://gael-varoquaux.info/blog/?p=159 Francesc On 3/28/12 12:33 PM, Alvaro Tejero Cantero wrote: > Hi, > > Trying to evaluate compression filters, I was looking for a call in > PyTables to get the size of a dataset (in bytes). As I didn't find it > I remembered the many benchmarks and found instead [1] that the way to > do it is to create single-dataset files and interrogate the > filesystem. Curiously enough, I didn't find that feature either in > h5ls, hdfview or ViTables. Is there a structural reason why this size > cannot be computed from library calls? > > -á. > > [1] https://github.com/PyTables/PyTables/blob/master/bench/blosc.py > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
From: Anthony S. <sc...@gm...> - 2012-03-28 18:16:01
|
On Wed, Mar 28, 2012 at 1:05 PM, Francesc Alted <fa...@py...> wrote: > On 3/28/12 12:33 PM, Alvaro Tejero Cantero wrote: > > Hi, > > > > Trying to evaluate compression filters, I was looking for a call in > > PyTables to get the size of a dataset (in bytes). As I didn't find it > > I remembered the many benchmarks and found instead [1] that the way to > > do it is to create single-dataset files and interrogate the > > filesystem. Curiously enough, I didn't find that feature either in > > h5ls, hdfview or ViTables. Is there a structural reason why this size > > cannot be computed from library calls? > > No, no structural reasons, but rather that nobody bothered to implement > this. A patch for this would be more than welcome ;) > +1 > > -- > Francesc Alted > > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Francesc A. <fa...@py...> - 2012-03-28 18:05:23
|
On 3/28/12 12:33 PM, Alvaro Tejero Cantero wrote: > Hi, > > Trying to evaluate compression filters, I was looking for a call in > PyTables to get the size of a dataset (in bytes). As I didn't find it > I remembered the many benchmarks and found instead [1] that the way to > do it is to create single-dataset files and interrogate the > filesystem. Curiously enough, I didn't find that feature either in > h5ls, hdfview or ViTables. Is there a structural reason why this size > cannot be computed from library calls? No, no structural reasons, but rather that nobody bothered to implement this. A patch for this would be more than welcome ;) -- Francesc Alted |
From: Francesc A. <fa...@py...> - 2012-03-28 18:04:05
|
On 3/28/12 10:15 AM, Alvaro Tejero Cantero wrote: > That is a perfectly fine solution for me, as long as the arrays aren't > copied in memory for the query. No, the arrays are not copied in memory. They are just read from disk block-by-block and then the output is directed to the iterator, or an array (depending on the context). > > Thank you! > > Thinking that your proposed solution uses iterables to avoid it I tried > > boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)') > indices = [i for i,v in boolcond if v] > (...) TypeError: 'numpy.bool_' object is not iterable > > I can, however, do > boolarr = boolcond.eval() > indices = np.nonzero(boolarr) > > but then I get boolarr into memory. > > Did I miss something? Yes, that was an error on my part. The correct way is: indices = [i for i, v in enumerate(boolcond) if v] > What is your advice on how to monitor the use of > memory? (I need this until PyTables is second skin). top? > > It is very rewarding to see that these numexpr's are 3-4 times faster > than the same with arrays in memory. However, I didn't find a way to > set the number of threads used Well, you can use the `MAX_THREADS` variable in 'parameters.py', but this does not offer separate controls for numexpr and blosc. Feel free to open a ticket asking for improving this functionality. > > When evaluating the blosc benchmarks I found that in my system with > two 6-core processors , using 12 is best for writing and 6 for > reading. Interesting... Yes, it is :) > > Another question (maybe for a separate thread): is there any way to > shrink memory usage of booleans to 1 bit? It might well be that this > optimizes the use of the memory bus (at some processing cost). But I > am not aware of a numpy container for this. Maybe a compressed array? That would lead to using less than 1 bit per element in many situations. If you are interested in this, look into: https://github.com/FrancescAlted/carray -- Francesc Alted |
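Putting the corrected pieces of this thread together, a hedged end-to-end sketch of querying two aligned, disk-backed columns with tables.Expr (the array names, sizes and condition are invented; openFile/createCArray are the 2.x spellings):

    import numpy as np
    import tables

    h5file = tables.openFile("aligned.h5", mode="w")
    a = h5file.createCArray('/', 'a', tables.Float64Atom(), (1000,))
    b = h5file.createCArray('/', 'b', tables.Float64Atom(), (1000,))
    a[:] = np.random.rand(1000)
    b[:] = np.random.rand(1000)

    expr = tables.Expr('(a > 0.9) & (b < 0.5)', uservars={'a': a, 'b': b})

    # Iterator form: no full boolean array is kept in memory.
    indices = [i for i, hit in enumerate(expr) if hit]

    # eval() form: usually faster, but the whole boolean mask is materialised.
    mask = expr.eval()
    indices_again = np.nonzero(mask)[0]

    h5file.close()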
From: Alvaro T. C. <al...@mi...> - 2012-03-28 17:34:14
|
Hi, Trying to evaluate compression filters, I was looking for a call in PyTables to get the size of a dataset (in bytes). As I didn't find it I remembered the many benchmarks and found instead [1] that the way to do it is to create single-dataset files and interrogate the filesystem. Curiously enough, I didn't find that feature either in h5ls, hdfview or ViTables. Is there a structural reason why this size cannot be computed from library calls? -á. [1] https://github.com/PyTables/PyTables/blob/master/bench/blosc.py |
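Until such a call exists, a rough sketch of the single-dataset-file trick used in the benchmarks, for anyone who just wants a number (the helper and file names are invented, and the result includes some fixed HDF5 file overhead):

    import os
    import numpy as np
    import tables

    def on_disk_size(data, filters):
        # Write `data` alone into a throwaway HDF5 file and return the file size in bytes.
        fname = "_size_probe.h5"
        f = tables.openFile(fname, mode="w")        # tables.open_file() in >= 3.0
        carr = f.createCArray('/', 'probe', tables.Atom.from_dtype(data.dtype),
                              data.shape, filters=filters)
        carr[:] = data
        f.close()
        size = os.path.getsize(fname)
        os.remove(fname)
        return size

    data = np.random.randint(0, 100, 10**6).astype('int16')
    print(on_disk_size(data, tables.Filters(complevel=5, complib='blosc')))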
From: Alvaro T. C. <al...@mi...> - 2012-03-28 15:15:47
|
That is a perfectly fine solution for me, as long as the arrays aren't copied in memory for the query. Thank you!

Thinking that your proposed solution uses iterables to avoid it, I tried

boolcond = pt.Expr('(exp(a)<0.9)&(a*b>0.7)|(b*sin(a)<0.1)')
indices = [i for i,v in boolcond if v]
(...) TypeError: 'numpy.bool_' object is not iterable

I can, however, do

boolarr = boolcond.eval()
indices = np.nonzero(boolarr)

but then I get boolarr into memory. Did I miss something? What is your advice on how to monitor the use of memory? (I need this until PyTables is second skin).

It is very rewarding to see that these numexpr's are 3-4 times faster than the same with arrays in memory. However, I didn't find a way to set the number of threads used.

When evaluating the blosc benchmarks I found that in my system with two 6-core processors, using 12 is best for writing and 6 for reading. Interesting...

Another question (maybe for a separate thread): is there any way to shrink memory usage of booleans to 1 bit? It might well be that this optimizes the use of the memory bus (at some processing cost). But I am not aware of a numpy container for this.

-á.

On Wed, Mar 28, 2012 at 00:34, Francesc Alted <fa...@py...> wrote: > Another option that occurred to me recently is to save all your columns > as unidimensional arrays (Array object, or, if you want compression, a > CArray or EArray), and then use them as components of a boolean > expression using the class `tables.Expr`. For example, if a, b and c > are unidimensional arrays of the same size, you can do: > > bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)') > indices = [ind for ind, bool_val in bool_cond if bool_val ] > results = your_dataset[indices] > > Does that make sense for your problem? Of course, this class uses > numexpr behind the scenes, so it is perfectly equivalent to classical > queries in tables, but without being restricted to use tables. Please > see more details about the `tables.Expr` in: > > http://pytables.github.com/usersguide/libref.html#the-expr-class-a-general-purpose-expression-evaluator > > Francesc > > On 3/26/12 12:43 PM, Alvaro Tejero Cantero wrote: >> Would it be an option to have >> >> * raw data on one table >> * all imaginable columns used for query conditions in another table >> (but how to grow it in columns without deleting& recreating?) >> >> and fetch indexes for the first based on .whereList(condition) of the second? >> >> Are there alternatives? >> >> -á. >> >> >> >> On Mon, Mar 26, 2012 at 18:29, Alvaro Tejero Cantero<al...@mi...> wrote: >>> Hi there, >>> >>> I am following advice by Anthony and giving a go at representing >>> different sensors in my dataset as columns in a Table, or in several >>> Tables. This is about in-kernel queries. >>> >>> The documentation of condvars in Table.where [1] says "condvars should >>> consist of identifier-like strings pointing to Column (see The Column >>> class) instances of this table, or to other values (which will be >>> converted to arrays)". >>> >>> Conversion to arrays will likely exhaust the memory and be slow. >>> Furthermore, when I tried with a toy example (naively extrapolating >>> the behaviour of indexing in numpy), I obtained >>> >>> In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18)& >> (a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})] >>> >>> (...
elided output) >>> ValueError: variable ``b`` refers to a column which is not part of >>> table ``/tetrode1 >>> >>> I am interested in the scenario where an in-kernel query is applied to >>> a table based in columns *from other tables* that still are aligned >>> with the current table (same number of elements). These conditions may >>> be sophisticated and mix columns from the local table as well. >>> >>> One obvious solution would be to put all aligned columns on the same >>> table. But adding columns to a table is cumbersome, and I cannot think >>> beforehand of the many precomputed columns that I would like to use as >>> query conditions. >>> >>> What do you recommend in this scenario? >>> >>> -á. >>> >>> [1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Francesc A. <fa...@gm...> - 2012-03-28 14:36:56
|
On 3/27/12 6:34 PM, Francesc Alted wrote: > Another option that occurred to me recently is to save all your > columns as unidimensional arrays (Array object, or, if you want > compression, a CArray or EArray), and then use them as components of a > boolean expression using the class `tables.Expr`. For example, if a, > b and c are unidimensional arrays of the same size, you can do: > > bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)') > indices = [ind for ind, bool_val in bool_cond if bool_val ] Of course, the above line needs to read: indices = [ind for ind, bool_val in enumerate(bool_cond) if bool_val ] > results = your_dataset[indices] Another solution, probably faster, although you need to make sure that you have memory enough to keep your boolean array, is this: bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)') bool_arr = bool_cond.eval() results = your_dataset[bool_arr] -- Francesc Alted |
From: Francesc A. <fa...@py...> - 2012-03-27 23:34:35
|
Another option that occurred to me recently is to save all your columns as unidimensional arrays (Array object, or, if you want compression, a CArray or EArray), and then use them as components of a boolean expression using the class `tables.Expr`. For example, if a, b and c are unidimensional arrays of the same size, you can do: bool_cond = tables.Expr('(2*a>0) & (cos(b) < .5) & (c**3 < 1)') indices = [ind for ind, bool_val in bool_cond if bool_val ] results = your_dataset[indices] Does that make sense for your problem? Of course, this class uses numexpr behind the scenes, so it is perfectly equivalent to classical queries in tables, but without being restricted to use tables. Please see more details about the `tables.Expr` in: http://pytables.github.com/usersguide/libref.html#the-expr-class-a-general-purpose-expression-evaluator Francesc On 3/26/12 12:43 PM, Alvaro Tejero Cantero wrote: > Would it be an option to have > > * raw data on one table > * all imaginable columns used for query conditions in another table > (but how to grow it in columns without deleting& recreating?) > > and fetch indexes for the first based on .whereList(condition) of the second? > > Are there alternatives? > > -á. > > > > On Mon, Mar 26, 2012 at 18:29, Alvaro Tejero Cantero<al...@mi...> wrote: >> Hi there, >> >> I am following advice by Anthony and giving a go at representing >> different sensors in my dataset as columns in a Table, or in several >> Tables. This is about in-kernel queries. >> >> The documentation of condvars in Table.where [1] says "condvars should >> consist of identifier-like strings pointing to Column (see The Column >> class) instances of this table, or to other values (which will be >> converted to arrays)". >> >> Conversion to arrays will likely exhaust the memory and be slow. >> Furthermore, when I tried with a toy example (naively extrapolating >> the behaviour of indexing in numpy), I obtained >> >> In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18)& >> (a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})] >> >> (... elided output) >> ValueError: variable ``b`` refers to a column which is not part of >> table ``/tetrode1 >> >> I am interested in the scenario where an in-kernel query is applied to >> a table based in columns *from other tables* that still are aligned >> with the current table (same number of elements). These conditions may >> be sophisticated and mix columns from the local table as well. >> >> One obvious solution would be to put all aligned columns on the same >> table. But adding columns to a table is cumbersome, and I cannot think >> beforehand of the many precomputed columns that I would like to use as >> query conditions. >> >> What do you recommend in this scenario? >> >> -á. >> >> [1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
From: Francesc A. <fa...@py...> - 2012-03-27 23:11:53
|
On 3/27/12 2:20 AM, Alvaro Tejero Cantero wrote: >>> (but how to grow it in columns without deleting& recreating?) >> You can't (at least on cheap way). Maybe you may want to create >> additional tables and grouping them in terms of the columns you are >> going to need for your queries. > Sorry, it is not clear to me: create new tables and (grouping = > grouping in HDF5 Groups) them in terms of the columns? > As far as I understood, only columns in the same table (regardless the > group of the table) can be queried together with the in-kernel engine? Yes, but the idea is to get rid of this limitation for querying columns in the same table (which is somewhat artificial). In fact, now that I think about this, you can actually implement queries on different unidimensional arrays (think of them as independent columns) by using the `tables.Expr` class. More on this later. -- Francesc Alted |
From: Alvaro T. C. <al...@mi...> - 2012-03-27 15:24:20
|
Hi, I came across the "Million Song Dataset" https://github.com/tb2332/MSongsDB It uses PyTables. Their descriptors perhaps connect better to certain people than a particle detector ;) https://github.com/tb2332/MSongsDB/blob/master/PythonSrc/hdf5_descriptors.py -á. |
From: Francesc A. <fa...@py...> - 2012-03-27 15:05:26
|
Ahhhh. The problem seems rather that you have not compiled/installed the szip compressor, or that the library cannot be found for some reason. Please double check if your homebrew installation of HDF5 is sane. If not, my advice is to ask the HDF5 users' list directly. Francesc On 3/27/12 10:00 AM, Tobias Erhardt wrote: > Hey Francesc > > Hm, seems that I've forgotten the actual error I get, as soon as I import pyTables: > > import tables > > ImportError: dlopen(/Library/Python/2.7/site-packages/tables/utilsExtension.so, 2): Symbol not found: _SZ_BufftoBuffCompress > Referenced from: /Library/Python/2.7/site-packages/tables/utilsExtension.so > Expected in: flat namespace > in /Library/Python/2.7/site-packages/tables/utilsExtension.so > > > I guess that means that utilsExtension.so was actually not compiled due to the warning in line 471 in the log > > The HDF5 installation was done that way with homebrew, that seems to be the default way there. > > Tobias > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
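A quick, unofficial check of the szip side of this, assuming the shared library is called libsz and that ctypes can find it on the loader path:

    import ctypes
    import ctypes.util

    libname = ctypes.util.find_library("sz")
    print("szip library found: %r" % libname)
    if libname:
        libsz = ctypes.CDLL(libname)
        # hasattr() tries to resolve the very symbol the PyTables extension reports missing.
        print("SZ_BufftoBuffCompress resolvable: %r"
              % hasattr(libsz, "SZ_BufftoBuffCompress"))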
From: Tobias E. <tob...@gm...> - 2012-03-27 15:00:46
|
Hey Francesc, Hm, it seems that I've forgotten the actual error I get as soon as I import PyTables:

import tables

ImportError: dlopen(/Library/Python/2.7/site-packages/tables/utilsExtension.so, 2): Symbol not found: _SZ_BufftoBuffCompress
  Referenced from: /Library/Python/2.7/site-packages/tables/utilsExtension.so
  Expected in: flat namespace
  in /Library/Python/2.7/site-packages/tables/utilsExtension.so

I guess that means that utilsExtension.so was actually not compiled, due to the warning in line 471 in the log. The HDF5 installation was done that way with homebrew; that seems to be the default way there. Tobias |