From: Antonio V. <ant...@ti...> - 2013-06-25 18:02:57
Hi Sebastian,

On 25/06/2013 09:36, Wagner Sebastian wrote:
> Thanks for your fast responses. It's great to hear all features are now
> free to use, though it took me a week and a half to find this out. [...]
> I would suggest that these texts be updated.

Thank you for reporting the issue; I will fix it ASAP. The same problem
also affects the corresponding cookbook page [1]. Anyway, please feel free
to update the wiki if you find outdated material.

[1] http://pytables.github.io/cookbook/hints_for_sql_users.html

--
Antonio Valentino

From: Antonio V. <ant...@ti...> - 2013-06-25 17:48:28
Hi Andre',

On 25/06/2013 10:26, Andre' Walker-Loud wrote:
> In my case, I won't have a table, but really just want a single object
> containing my metadata. I am wondering if there is a recommended way to
> do this? [...]

For leaf nodes (Tables, Arrays, etc.) you can use the "attrs" attribute
set [1] as described in [2]. For group objects (e.g. "root") you can use
the "set_node_attr" method [3] of File objects, or "_v_attrs".

cheers

[1] http://pytables.github.io/usersguide/libref/declarative_classes.html#attributesetclassdescr
[2] http://pytables.github.io/usersguide/tutorials.html#setting-and-getting-user-attributes
[3] http://pytables.github.io/usersguide/libref/file_class.html#tables.File.set_node_attr

--
Antonio Valentino

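For reference, Antonio's attribute approach in a minimal, runnable form;
the file name and the metadata values are assumptions, and the /data_1
layout follows this thread:

    import numpy as np
    import tables

    # Sketch: store fit metadata as HDF5 attributes (values are made up).
    with tables.open_file("fits.h5", "w") as h5:
        stat = h5.create_array("/data_1", "stat", np.zeros(10),
                               createparents=True)
        # Leaf nodes expose their AttributeSet as ``attrs``:
        stat.attrs.rng_seed = 42
        stat.attrs.fit_range = (0.5, 2.0)
        # For Group nodes, use ``_v_attrs`` or File.set_node_attr():
        h5.root.data_1._v_attrs.model = "two-exponential"
        h5.set_node_attr("/data_1", "initial_guess", [1.0, 0.1])
        print(stat.attrs.rng_seed, h5.get_node_attr("/data_1", "model"))
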
From: Anthony S. <sc...@gm...> - 2013-06-25 15:08:36
Also, depending on how much metadata you really need to store, you could
just use attributes. That is what they are there for.

On Tue, Jun 25, 2013 at 10:06 AM, Josh Ayers <jos...@gm...> wrote:
> Another option is to create a Python object - dict, list, or whatever
> works - containing the metadata and then store a pickled version of it
> in a PyTables array. [...]

From: Josh A. <jos...@gm...> - 2013-06-25 15:06:52
Another option is to create a Python object - dict, list, or whatever
works - containing the metadata and then store a pickled version of it in
a PyTables array. It's nice for this sort of thing because you have the
full flexibility of Python's data containers.

For example, if the Python object is called 'fit', then
numpy.frombuffer(pickle.dumps(fit), 'u1') will pickle it and convert the
result to a NumPy array of unsigned bytes. It can be stored in a PyTables
array using a UInt8Atom. To retrieve the Python object, just use
pickle.loads(hdf5_file.root.data_1.fit[:]).

It gets a little more complicated if you want to be able to modify the
Python object, because the length of the pickle will change. In that case,
you can use an EArray (for the case when the pickle grows), and store the
number of bytes as an attribute. Storing the number of bytes handles the
case when the pickle shrinks and doesn't use the full length of the
on-disk array. To load it, use
pickle.loads(hdf5_file.root.data_1.fit[:num_bytes]), where num_bytes is
the previously stored attribute. To modify it, just overwrite the array
with the new version, expanding if necessary, then update the num_bytes
attribute.

Using a PyTables VLArray with an 'object' atom uses a similar technique
under the hood, so that may be easier. It doesn't allow resizing though.

Hope that helps,
Josh

On Tue, Jun 25, 2013 at 1:33 AM, Andreas Hilboll <li...@hi...> wrote:
> For complex information I'd probably indeed use a table object. [...]

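For reference, a runnable sketch of Josh's recipe; the file name and the
contents of 'fit' are made up, and num_bytes follows his attribute
suggestion:

    import pickle

    import numpy as np
    import tables

    fit = {"seed": 42, "p0": [1.0, 0.1], "range": (0.5, 2.0)}  # hypothetical

    with tables.open_file("fits.h5", "w") as h5:
        # Pickle to bytes, view as unsigned bytes, and store in an EArray
        # so the node can grow if a later pickle is longer.
        buf = np.frombuffer(pickle.dumps(fit), "u1")
        arr = h5.create_earray("/data_1", "fit", atom=tables.UInt8Atom(),
                               shape=(0,), createparents=True)
        arr.append(buf)
        arr.attrs.num_bytes = len(buf)  # remember the live length

    with tables.open_file("fits.h5", "r") as h5:
        node = h5.root.data_1.fit
        restored = pickle.loads(node[:node.attrs.num_bytes].tobytes())
        print(restored["seed"])
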
From: Andreas H. <li...@hi...> - 2013-06-25 08:34:15
On 25.06.2013 10:26, Andre' Walker-Loud wrote:
> In my case, I won't have a table, but really just want a single object
> containing my metadata. I am wondering if there is a recommended way to
> do this? The "Table" does not seem optimal, but I don't see what else I
> would use. [...]

For complex information I'd probably indeed use a table object. It doesn't
matter if the table only has one row, but still you have all the
information there nicely structured.

-- Andreas.

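For reference, a sketch of the one-row metadata table Andreas suggests;
the column set and values are invented for illustration:

    import tables

    class FitMeta(tables.IsDescription):
        # Hypothetical columns describing one fit; adapt to your options.
        seed = tables.Int64Col()
        guess = tables.Float64Col(shape=(3,))
        fit_range = tables.Float64Col(shape=(2,))

    with tables.open_file("fits.h5", "w") as h5:
        meta = h5.create_table("/data_1", "fit", FitMeta,
                               createparents=True)
        row = meta.row
        row["seed"] = 42
        row["guess"] = (1.0, 0.1, 0.0)
        row["fit_range"] = (0.5, 2.0)
        row.append()  # the table's single row
        meta.flush()
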
From: Andre' Walker-L. <wal...@gm...> - 2013-06-25 08:25:42
Dear PyTables users,

I am trying to figure out the best way to write some metadata into some
files I have. The hdf5 file looks like

    /root/data_1/stat
    /root/data_1/sys

where "stat" and "sys" are Arrays containing statistical and systematic
fluctuations of numerical fits to some data I have. What I would like to
do is add another object

    /root/data_1/fit

where "fit" is just a metadata key that describes all the choices I made
in performing the fit, such as the seed for the random number generator,
and many choices of fitting options, like initial guesses for parameters,
the fitting range, etc.

I began to follow the example in the PyTables manual, in Section 1.2 "The
Object Tree", where first a class is defined

    class Particle(tables.IsDescription):
        identity = tables.StringCol(itemsize=22, dflt=" ", pos=0)
        ...

and then this class is used to populate a table.

In my case, I won't have a table, but really just want a single object
containing my metadata. I am wondering if there is a recommended way to do
this? The "Table" does not seem optimal, but I don't see what else I would
use.

Thanks,

Andre

From: Wagner S. <Seb...@ai...> - 2013-06-25 07:36:31
Hi Anthony and Antonio,

Thanks for your fast responses. It's great to hear all features are now
free to use, though it took me a week and a half to find this out.

The first reference I read to learn the usage of PyTables was Hints for
SQL Users [1], where it is stated several times, for example in the
section 'Creating an index':

> Indexing is supported in the commercial version of PyTables
> (PyTablesPro).

I would suggest that these texts be updated. Being convinced it was only
available in the Pro version after reading it so often, I also overlooked
the warning on the PyTables Pro page [2] (as I was only interested in the
features not available in the free version, I just scrolled down
immediately, reading diagonally...). So the next suggestion is to give the
warning text there a color :)

[1]
http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
http://www.pytables.org/moin/HintsForSQLUsers#Selectingdata
[2]
http://www.pytables.org/moin/PyTablesPro

regards,
Sebastian

From: Antonio V. <ant...@ti...> - 2013-06-24 18:24:00
Hi Sebastian,

On 24/06/2013 11:25, Wagner Sebastian wrote:
> So the thing I don't understand: How can PyTables be so fast without
> any indexing? [...]
> I'm using 3.0.0rc2 coming with WinPython.

The indexing features of PyTables Pro have been available in the open
source version of PyTables since version 2.3 (please see [1]).

[1]
http://pytables.github.io/release-notes/RELEASE_NOTES_v2.3.x.html#changes-from-2-2-1-to-2-3

ciao

--
Antonio Valentino

From: Anthony S. <sc...@gm...> - 2013-06-24 18:17:35
On Mon, Jun 24, 2013 at 4:25 AM, Wagner Sebastian <Seb...@ai...> wrote:
> For testing purposes I use a PyTables DB with 4 columns (1x Uint8 and
> 3x Float) with 750k rows; the total file size is about 90MB. [...] But
> PyTables took only 0.05 seconds for a full table search (in-kernel, so
> near C-speed, but nevertheless a full table scan), while my bisecting
> algorithm with a precomputed sorted list wrapped around PyTables (but
> saved in there) took about 0.5 seconds.
>
> So the thing I don't understand: How can PyTables be so fast without
> any indexing?

Hi Sebastian,

First, there is no longer a non-free version of PyTables, and v3.0 *does*
have indexing capabilities. However, you have to enable them, so you
probably weren't using them.

PyTables is fast because HDF5 is a binary format, it uses pthreads under
the covers to parallelize some tasks, and it uses numexpr (which is also
parallel) to evaluate many expressions. All of these things help make
PyTables great!

Be Well
Anthony

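For reference, a minimal sketch of both query modes mentioned above: a
plain in-kernel where-query, and the same query after explicitly creating
an index (the file name, schema, and data are made up):

    import numpy as np
    import tables

    with tables.open_file("testdb.h5", "w") as h5:
        table = h5.create_table("/", "data",
                                {"value": tables.Float64Col()})
        rows = np.zeros(750000, dtype=[("value", "f8")])
        rows["value"] = np.random.random(750000)
        table.append(rows)
        table.flush()

        # In-kernel query: the condition is compiled by numexpr and run
        # at near-C speed over the whole table; no index required.
        hits = table.read_where("(value > 0.5) & (value < 0.50001)")

        # Indexing (free since PyTables 2.3) must be created explicitly:
        table.cols.value.create_index()
        hits_idx = table.read_where("(value > 0.5) & (value < 0.50001)")
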
From: Anthony S. <sc...@gm...> - 2013-06-24 18:11:38
Hello Giovanni,

Great to hear that everything is working much better for you now and that
everything is much faster and smaller than NPY ;)

> Do you know how the default value is set btw?

This is computed via a magical heuristic algorithm written by Francesc (?)
called computechunksize() [1]. This is really optimized for dense data
(Tables), so it is not surprising that it performs poorly in your case.
Any updates you want to make to PyTables to also handle sparse data well
out of the box would be very welcome ;)

1. https://github.com/PyTables/PyTables/blob/develop/tables/idxutils.py#L54

On Mon, Jun 24, 2013 at 10:51 AM, Giovanni Luca Ciampaglia <glc...@gm...> wrote:
> thanks for the explanation and the links, it's much clearer now. So
> without compression a CArray is really a smarter type of sparse file,
> but you have to set a sensible chunk shape. [...]

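For what it's worth, the chunk shape that the heuristic picks can be
inspected on any chunked leaf; a sketch (file and node names are made up):

    import tables

    with tables.open_file("tmp.h5", "w") as h5:
        # With chunkshape omitted, PyTables computes one automatically;
        # the chosen value is stored with the node.
        carr = h5.create_carray("/", "adj", tables.Float64Atom(),
                                shape=(3400000, 3400000))
        print(carr.chunkshape)  # heuristic default; dataset-dependent
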
From: Giovanni L. C. <glc...@gm...> - 2013-06-24 15:52:13
Hi Anthony,

thanks for the explanation and the links, it's much clearer now. So
without compression a CArray is really a smarter type of sparse file, but
you have to set a sensible chunk shape. Do you know how the default value
is set, btw? I am asking because I didn't see any change in performance
between using the default value and using (1, N), where (N, N) is the
shape of the matrix. I guess that the write performance depends crucially
on the size of the I/O buffer, so the default must be choosing a similar
setting.

Anyway, I have played a bit with other values of the chunk shape in
conjunction with the compression level, and using a shape of (1, 100) and
complevel=5 gives speeds that are only 10-15% slower than what I get with
shape=(1, 1) and complevel=0. The resulting file is 10 times smaller, and
something like 35 times smaller than a NPY sparse file, btw!

Thanks!

Giovanni

--
Giovanni Luca Ciampaglia

Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University

✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gci...@in...

From: Wagner S. <Seb...@ai...> - 2013-06-24 09:25:23
Dear PyTables-Users,

For testing purposes I use a PyTables DB with 4 columns (1x Uint8 and 3x
Float) with 750k rows; the total file size is about 90MB. As the free
version does not support indexing, I thought that a (full-table) search on
this database would take at least one or two seconds, because the file has
to be loaded first (the bottleneck being I/O) and then the search over
~20k rows can begin. But PyTables took only 0.05 seconds for a full table
search (in-kernel, so near C-speed, but nevertheless a full table scan),
while my bisecting algorithm with a precomputed sorted list wrapped around
PyTables (but saved in there) took about 0.5 seconds.

So the thing I don't understand: How can PyTables be so fast without any
indexing?

I'm using 3.0.0rc2 coming with WinPython.

Regards,
Sebastian

From: Anthony S. <sc...@gm...> - 2013-06-23 06:00:38
Hi Giovanni!

I think that you may have some misunderstanding about how chunking works,
which is leading you to get terrible performance. In fact what you
describe (write it all and zip it) is a great strategy for using normal
Arrays.

However, chunking and CArrays don't work like this. If a chunk contains no
data, it is not written at all! Also, all zipping takes place on the chunk
level. Thus for very small chunks you can actually increase the file size
and access time by using compression.

For sparse matrices and CArrays, you need to play around with the
chunkshape argument to create_carray() and compression. Performance is
going to be affected by how dense the matrix is and how grouped it is. For
example, for a very dense and randomly distributed matrix, chunkshape=1
and no compression is best. For block diagonal matrices, the chunkshape
should be the nominal block shape. Compression is only useful here if the
blocks all have similar values or the block shape is large. For example

    1 1 0 0 0 0
    1 1 0 0 0 0
    0 0 1 1 0 0
    0 0 1 1 0 0
    0 0 0 0 1 1
    0 0 0 0 1 1

is well suited to a chunkshape of (2, 2).

For more information on the HDF model please see my talk slides and video
:) [1,2] I hope this helps.

Be Well
Anthony

PS. Glad to see you using the new API ;)

1. https://github.com/scopatz/hdf5-is-for-lovers
2. http://www.youtube.com/watch?v=Nzx0HAd3FiI

On Sat, Jun 22, 2013 at 6:34 PM, Giovanni Luca Ciampaglia <glc...@gm...> wrote:
> I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted
> to see if CArray was an appropriate solution for storing it. [...]

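For reference, a sketch of the two knobs named above, the chunkshape
argument to create_carray() and compression, using the block-diagonal
pattern from the example (the shape, file name, and filter settings are
assumptions):

    import numpy as np
    import tables

    with tables.open_file("adjacency.h5", "w") as h5:
        # Chunk shape chosen to match the nominal block size of the
        # matrix; chunks that never receive data are not written at all.
        carr = h5.create_carray("/", "adj", tables.Float64Atom(),
                                shape=(6000, 6000), chunkshape=(2, 2),
                                filters=tables.Filters(complevel=5,
                                                       complib="zlib"))
        carr[0:2, 0:2] = np.ones((2, 2))  # touches exactly one 2x2 chunk
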
From: Giovanni L. C. <glc...@gm...> - 2013-06-22 23:34:20
Hi all,

I have a sparse 3.4M x 3.4M adjacency matrix with nnz = 23M and wanted to
see if CArray was an appropriate solution for storing it. Right now I am
using the NumPy binary format for storing the data in coordinate format
and loading the matrix with Scipy's sparse coo_matrix class. As far as I
understand, with CArray the matrix would be written in full (zeros
included), but a) since it's chunked, accessing it does not take memory,
and b) with compression enabled it would be possible to keep the size of
the file reasonable.

If my assumptions are correct, then here is my problem: I am running into
problems when writing the CArray to disk. I adapted the example from the
documentation [1] and when I run the code on a 6000x6000 matrix with
nnz = 17K I achieve a decent speed of roughly 4100 elements/s. However,
when I try it on the full matrix the writing speed drops to 4 elements/s.
Am I doing something wrong? Any feedback would be greatly appreciated!

Code: https://gist.github.com/junkieDolphin/5843064

Cheers,

Giovanni

[1]
http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-carray-class

--
Giovanni Luca Ciampaglia

☞ http://www.inf.usi.ch/phd/ciampaglia/
✆ (812) 287-3471
✉ glc...@gm...

From: Francesc A. <fa...@gm...> - 2013-06-10 21:35:31
Ah, that's good to know. Yes, I see the warning is definitely helping
people to flush periodically and helping to prevent data corruption.

Thanks for the feedback,
Francesc

On 6/10/13 5:16 PM, Edward Vogel wrote:
> I initially didn't sync at all until after completing writing - about
> 1 million rows total. My main concern was preventing data corruption.
> After seeing the warning I had a sync for every iteration of the inner
> loop, which was slow. Syncing after the inner loop is a little slower
> than not syncing, but seems fine. [...]

--
Francesc Alted

From: Edward V. <edw...@gm...> - 2013-06-10 21:16:17
I initially didn't sync at all until after completing writing - about
1 million rows total. My main concern was preventing data corruption.
After seeing the warning I had a sync for every iteration of the inner
loop, which was slow. Syncing after the inner loop is a little slower than
not syncing, but seems fine.

Thanks,
Ed

On Mon, Jun 10, 2013 at 4:37 PM, Francesc Alted <fa...@gm...> wrote:
> After fixing the issue, has performance been enhanced? I'm the one who
> put the warning, so I'm curious whether this actually helps people or
> not. [...]

From: Francesc A. <fa...@gm...> - 2013-06-10 20:37:43
Hi Ed,

After fixing the issue, has performance been enhanced? I'm the one who put
the warning, so I'm curious whether this actually helps people or not.

Thanks,
Francesc

On 6/10/13 3:28 PM, Edward Vogel wrote:
> Adding the flush after the inner loop does fix the issue. (Thanks!)
> So, my followup question: why do I need a flush after the inner loop,
> but not when moving from the outer loop to the inner loop? [...]

--
Francesc Alted

From: Anthony S. <sc...@gm...> - 2013-06-10 19:42:45
On Mon, Jun 10, 2013 at 2:28 PM, Edward Vogel <edw...@gm...> wrote:
> Adding the flush after the inner loop does fix the issue. (Thanks!)

No problem! I am glad this worked.

> So, my followup question: why do I need a flush after the inner loop,
> but not when moving from the outer loop to the inner loop?

It has to do with when the write buffer gets created / filled / flushed.
These steps need to happen at the proper time or you can lose the data you
were writing, overflow memory, etc.

Be Well
Anthony

From: Edward V. <edw...@gm...> - 2013-06-10 19:28:59
Yes, exactly.

I'm pulling data out of C that has a 1-to-many relationship, and dumping
it into PyTables for easier analysis. I'm creating extension classes in
Cython to get access to the C structures. It looks like this (basically,
each cv1 has several cv2s):

    h5.create_table('/', 'cv1', schema_cv1)
    h5.create_table('/', 'cv2', schema_cv2)
    cv1_row = h5.root.cv1.row
    cv2_row = h5.root.cv2.row
    for cv in sf.itercv():
        cv1_row['addr'] = cv['addr']
        ...
        cv1_row.append()
        for cv2 in cv.itercv2():
            cv2_row['cv1_addr'] = cv['addr']
            cv2_row['foo'] = cv2['foo']
            ...
            cv2_row.append()
        h5.root.cv2.flush()  # This fixes the issue

Adding the flush after the inner loop does fix the issue. (Thanks!) So, my
followup question: why do I need a flush after the inner loop, but not
when moving from the outer loop to the inner loop?

Thanks!

From: Anthony S. <sc...@gm...> - 2013-06-10 18:48:32
Hi Ed,

Are you inside of a nested loop? You probably just need to flush after the
innermost loop.

Do you have some sample code you can share?

Be Well
Anthony

On Mon, Jun 10, 2013 at 1:44 PM, Edward Vogel <edw...@gm...> wrote:
> I have a dataset that I want to split between two tables. But, when I
> iterate over the data and append to both tables, I get a warning: [...]

From: Edward V. <edw...@gm...> - 2013-06-10 18:44:58
I have a dataset that I want to split between two tables. But, when I
iterate over the data and append to both tables, I get a warning:

    /usr/local/lib/python2.7/site-packages/tables/table.py:2967:
    PerformanceWarning: table ``/cv2`` is being preempted from alive nodes
    without its buffers being flushed or with some index being dirty. This
    may lead to very ineficient use of resources and even to fatal errors
    in certain situations. Please do a call to the .flush() or
    .reindex_dirty() methods on this table before start using other nodes.

However, if I flush after every append, I get awful performance. Is there
a correct way to append to two tables without doing a flush? Note, I don't
have any indices defined, so it seems reindex_dirty() doesn't apply.

Thanks,
Ed

From: Anthony S. <sc...@gm...> - 2013-06-06 02:29:03
|
Thanks Tim! You are the best. Hopefully I will get to this later tonight.

Be Well
Anthony

On Wed, Jun 5, 2013 at 9:20 PM, Tim Burgess <tim...@ma...> wrote:

> On Jun 06, 2013, at 04:19 AM, Anthony Scopatz <sc...@gm...> wrote:
>
>> Thanks Antonio and Tim!
>>
>> These are great. I think that one of these should definitely make it into
>> the examples/ dir.
>>
>> Be Well
>> Anthony
>
> OK. I have put up a pull request with the code added.
> https://github.com/PyTables/PyTables/pull/266
>
> Cheers, Tim
|
From: Tim B. <tim...@ma...> - 2013-06-06 02:21:46
|
On Jun 06, 2013, at 04:19 AM, Anthony Scopatz <sc...@gm...> wrote:

> Thanks Antonio and Tim!
>
> These are great. I think that one of these should definitely make it into
> the examples/ dir.
>
> Be Well
> Anthony

OK. I have put up a pull request with the code added.
https://github.com/PyTables/PyTables/pull/266

Cheers, Tim
|
From: Anthony S. <sc...@gm...> - 2013-06-05 18:31:37
|
Hi Jeff,

I have made some comments in the issue. Thanks for investigating this so thoroughly.

Be Well
Anthony

On Tue, Jun 4, 2013 at 8:16 PM, Jeff Reback <jr...@ya...> wrote:

> Anthony,
>
> I created an issue with more info.
>
> I am not sure if this is a bug, or just the way both numexpr and PyTables
> treat strings that need to touch an encoded value.
>
> I found a workaround by specifying the condvars to readWhere. Any more
> thoughts on this?
>
> thanks Jeff
>
> https://github.com/PyTables/PyTables/issues/265
>
> I can be reached on my cell (917)971-6387
>
> From: Anthony Scopatz <sc...@gm...>
> To: Jeff Reback <je...@re...>
> Cc: Discussion list for PyTables <pyt...@li...>
> Sent: Tuesday, June 4, 2013 6:39 PM
> Subject: Re: [Pytables-users] pytable 30 - encoding
>
> Hi Jeff,
>
> Hmmm, could you try doing the same thing on just an in-memory numpy array
> using numexpr? If this succeeds, it tells us that the problem is in
> PyTables, not numexpr.
>
> Be Well
> Anthony
>
> On Tue, Jun 4, 2013 at 11:35 AM, Jeff Reback <jr...@ya...> wrote:
>
> Anthony,
>
> I am using numexpr 2.1 (latest).
>
> This is puzzling; it doesn't matter what I pass (bytes or str), same result:
>
> (column == 'str-2')
>
> /mnt/code/arb/test/pytables-3.py(38)<module>()
> -> result = handle.root.test.table.readWhere(selector)
> (Pdb) handle.root.test.table.readWhere(selector)
> *** TypeError: string argument without an encoding
> (Pdb) handle.root.test.table.readWhere(selector.encode(encoding))
> *** TypeError: string argument without an encoding
> (Pdb)
>
> From: Anthony Scopatz <sc...@gm...>
> To: Jeff Reback <je...@re...>; Discussion list for PyTables <pyt...@li...>
> Sent: Tuesday, June 4, 2013 12:25 PM
> Subject: Re: [Pytables-users] pytable 30 - encoding
>
> Hi Jeff,
>
> Have you also updated numexpr to the most recent version? The error is
> coming from numexpr not compiling the expression correctly. Also, you
> might try making selector a str, rather than bytes:
>
> selector = "(column == 'str-2')"
>
> rather than
>
> selector = "(column == 'str-2')".encode(encoding)
>
> Be Well
> Anthony
>
> On Tue, Jun 4, 2013 at 8:51 AM, Jeff Reback <jr...@ya...> wrote:
>
> Anthony, where am I going wrong here?
>
> #!/usr/local/bin/python3
> import tables
> import numpy as np
> import datetime, time
>
> encoding = 'UTF-8'
> test_file = 'test_select.h5'
> handle = tables.openFile(test_file, "w")
> node = handle.createGroup(handle.root, 'test')
> table = handle.createTable(node, 'table', dict(
>     index=tables.Int64Col(),
>     column=tables.StringCol(25),
>     values=tables.FloatCol(shape=(3,)),
> ))
>
> # add data
> r = table.row
> for i in range(10):
>     r['index'] = i
>     r['column'] = ("str-%d" % (i % 5)).encode(encoding)
>     r['values'] = np.arange(3)
>     r.append()
> table.flush()
> handle.close()
>
> # read
> handle = tables.openFile(test_file, "r")
> result = handle.root.test.table.read()
> print("table data\n")
> print(result)
>
> # where
> print("\nselector\n")
> selector = "(column == 'str-2')".encode(encoding)
> print(selector)
> result = handle.root.test.table.readWhere(selector)
> print(result)
>
> and the following output:
>
> [sheep-jreback-/code/arb/test] python3 pytables-3.py
> table data
>
> [(b'str-0', 0, [0.0, 1.0, 2.0]) (b'str-1', 1, [0.0, 1.0, 2.0])
>  (b'str-2', 2, [0.0, 1.0, 2.0]) (b'str-3', 3, [0.0, 1.0, 2.0])
>  (b'str-4', 4, [0.0, 1.0, 2.0]) (b'str-0', 5, [0.0, 1.0, 2.0])
>  (b'str-1', 6, [0.0, 1.0, 2.0]) (b'str-2', 7, [0.0, 1.0, 2.0])
>  (b'str-3', 8, [0.0, 1.0, 2.0]) (b'str-4', 9, [0.0, 1.0, 2.0])]
>
> selector
>
> b"(column == 'str-2')"
> Traceback (most recent call last):
>   File "pytables-3.py", line 37, in <module>
>     result = handle.root.test.table.readWhere(selector)
>   File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/_past.py", line 35, in oldfunc
>     return obj(*args, **kwargs)
>   File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1522, in read_where
>     self._where(condition, condvars, start, stop, step)]
>   File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1484, in _where
>     compiled = self._compile_condition(condition, condvars)
>   File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/table.py", line 1358, in _compile_condition
>     compiled = compile_condition(condition, typemap, indexedcols)
>   File "/usr/local/lib/python3.3/site-packages/tables-3.0.0-py3.3-linux-x86_64.egg/tables/conditions.py", line 419, in compile_condition
>     func = NumExpr(expr, signature)
>   File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 559, in NumExpr
>     precompile(ex, signature, context)
>   File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 511, in precompile
>     constants_order, constants = getConstants(ast)
>   File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 294, in getConstants
>     for a in constants_order]
>   File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 294, in <listcomp>
>     for a in constants_order]
>   File "/usr/local/lib/python3.3/site-packages/numexpr-2.1-py3.3-linux-x86_64.egg/numexpr/necompiler.py", line 284, in convertConstantToKind
>     return kind_to_type[kind](x)
> TypeError: string argument without an encoding
> Closing remaining open files: test_select.h5... done
|
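For anyone landing on this thread with the same TypeError, here is a minimal sketch of the condvars workaround Jeff mentions; the name val is illustrative, and this assumes the test_select.h5 file produced by the script above:

import tables

with tables.open_file('test_select.h5', 'r') as handle:
    table = handle.root.test.table
    # Keep the condition itself a plain str and pass the bytes comparison
    # value separately via condvars, so numexpr never has to re-encode a
    # literal embedded in the expression.
    result = table.read_where("column == val", condvars={'val': b'str-2'})
    print(result)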
From: Anthony S. <sc...@gm...> - 2013-06-05 18:20:16
|
Thanks Antonio and Tim!

These are great. I think that one of these should definitely make it into the examples/ dir.

Be Well
Anthony

On Wed, Jun 5, 2013 at 8:10 AM, Francesc Alted <fa...@gm...> wrote:

> On 6/5/13 11:45 AM, Andreas Hilboll wrote:
> > On 05.06.2013 10:31, Andreas Hilboll wrote:
> >> On 05.06.2013 03:29, Tim Burgess wrote:
> >>> I was playing around with in-memory HDF5 prior to the 3.0 release.
> >>> Here's an example based on what I was doing.
> >>> I looked over the docs and it does mention that there is an option to
> >>> throw away the 'file' rather than write it to disk.
> >>> Not sure how to do that and can't actually think of a use case where I
> >>> would want to :-)
> >>>
> >>> And be wary, it is H5FD_CORE.
> >>>
> >>> On Jun 05, 2013, at 08:38 AM, Anthony Scopatz <sc...@gm...> wrote:
> >>>> I think that you want to set parameters.DRIVER to H5FD_CORE [1]. I
> >>>> haven't ever used this personally, but it would be great to have an
> >>>> example script, if someone wants to write one ;)
> >>>
> >>> import numpy as np
> >>> import tables
> >>>
> >>> CHUNKY = 30
> >>> CHUNKX = 8640
> >>>
> >>> if __name__ == '__main__':
> >>>
> >>>     # create dataset and add global attrs
> >>>     file_path = 'demofile_chunk%sx%d.h5' % (CHUNKY, CHUNKX)
> >>>
> >>>     with tables.open_file(file_path, 'w', title='PyTables HDF5 In-memory example', driver='H5FD_CORE') as h5f:
> >>>
> >>>         # dummy some data
> >>>         lats = np.empty([4320])
> >>>         lons = np.empty([8640])
> >>>
> >>>         # create some simple arrays
> >>>         lat_node = h5f.create_array('/', 'lat', lats, title='latitude')
> >>>         lon_node = h5f.create_array('/', 'lon', lons, title='longitude')
> >>>
> >>>         # create a 365 x 4320 x 8640 CArray of 32bit float
> >>>         shape = (365, 4320, 8640)
> >>>         atom = tables.Float32Atom(dflt=np.nan)
> >>>
> >>>         # chunk into daily slices and then further chunk days
> >>>         sst_node = h5f.create_carray(h5f.root, 'sst', atom, shape, chunkshape=(1, CHUNKY, CHUNKX))
> >>>
> >>>         # dummy up an ndarray
> >>>         sst = np.empty([4320, 8640], dtype=np.float32)
> >>>         sst.fill(30.0)
> >>>
> >>>         # write ndarray to a 2D plane in the HDF5
> >>>         sst_node[0] = sst
> >>
> >> Thanks Tim,
> >>
> >> I adapted your example for my use case (I'm using the EArray class,
> >> because I need to continuously update my database), and it works well.
> >>
> >> However, when I use this with my own data (but also creating the arrays
> >> like you did), I'm running into errors like "Could not wait on barrier".
> >> It seems like the HDF library is spawning several threads.
> >>
> >> Any idea what's going wrong? Can I somehow avoid HDF5 multithreading at
> >> runtime?
> >
> > Update:
> >
> > When setting max_blosc_threads=2 and max_numexpr_threads=2, everything
> > seems to work as expected (but a bit on the slow side ...).
>
> BTW, can you really notice the difference between using 1, 2 or 4
> threads? Can you show some figures? Just curious.
>
> --
> Francesc Alted
|
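Following up on Tim's aside about throwing the 'file' away rather than writing it to disk: a hedged sketch, assuming the driver_core_backing_store parameter behaves as described for PyTables 3.0 in-memory image files (a false value discards the H5FD_CORE image on close) and that the lowercase thread-limit parameters Andreas mentions can be passed straight to open_file:

import numpy as np
import tables

# Purely in-memory HDF5: with driver_core_backing_store=0 nothing is
# written to disk on close, so the filename acts only as a label.
with tables.open_file('scratch.h5', 'w',
                      driver='H5FD_CORE',
                      driver_core_backing_store=0,
                      max_blosc_threads=2,      # thread limits as reported
                      max_numexpr_threads=2) as h5f:  # to avoid the barrier error
    lats = h5f.create_array('/', 'lat', np.zeros(4320), title='latitude')
    print(lats[:5])

# 'scratch.h5' should not exist on disk after the block above.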