From: Jon W. <js...@fn...> - 2012-06-06 15:24:44
|
Hi Anthony, On 06/06/2012 12:45 AM, Anthony Scopatz wrote: > > I think something like > histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 & > abs(col3) < 5')).eval()) > would be ideal, but since where() returns a row iterator, and not > something that I can extract Column objects from, I don't see any > way to make it work. > > > You are probably looking for the readWhere() method > <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> which > normally returns a numpy structured array. The line you are looking > for is thus: > > histogram(tables.Expr('col0 + col1**2', mytable.readWhere('col2 > 15 & > abs(col3) < 5')).eval()) > > This will likely be fast in both cases. I hope this helps. Oddly, it doesn't work with tables.Expr, but does work with numexpr.evaluate. In the case I talked about before with 7M rows, when selecting very few rows, it does just fine (between the other two solutions), but when selecting all rows, it is still about 2.75x slower than the technique of using tables.Expr for both the histogram var and the condition. I think that this is because .readWhere() pulls all the table rows satisfying the where condition into memory first, and it furthermore does so for all columns of all selected rows, so, for a table with many columns, it has to read many times as much data into memory. I can use the field parameter, but it only accepts one single field, so I would have to perform the query once per variable used in the histogram variable expression to do that. Using .readWhere() gives a medium-fast performance in both cases, but I still feel like it is not quite the right thing because it reads the data completely into memory instead of allowing the computation to be performed out-of-core. Perhaps it is not really feasible, but I think the ideal would be to have a .where type query operator that returns Column objects or a Table object, with a "view" imposed in either case. Regards, Jon |
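A rough sketch of the per-column readWhere() workaround Jon alludes to above, with the histogram variable then evaluated by numexpr. The file, table and column names follow the thread's hypothetical example (col0..col3), and this is untested; note that the selection gets re-run once per requested field:

    import numpy as np
    import numexpr as ne
    import tables

    h5file = tables.openFile('data.h5', 'r')   # hypothetical file
    mytable = h5file.root.mytable              # table with columns col0..col3

    cond = 'col2 > 15 & abs(col3) < 5'

    # One readWhere() call per column used in the histogram expression; the
    # selection still runs in the numexpr kernel, but only the requested field
    # is materialized in memory (at the cost of evaluating the condition twice).
    col0 = mytable.readWhere(cond, field='col0')
    col1 = mytable.readWhere(cond, field='col1')

    # Compute the histogram variable itself with numexpr (multithreaded).
    histvar = ne.evaluate('col0 + col1**2',
                          local_dict={'col0': col0, 'col1': col1})
    counts, edges = np.histogram(histvar, bins=100)

    h5file.close()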
From: Anthony S. <sc...@gm...> - 2012-06-06 05:45:40
|
On Tue, Jun 5, 2012 at 10:32 PM, Jon Wilson <js...@fn...> wrote: [snip] > I think something like > histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 & > abs(col3) < 5')).eval()) > would be ideal, but since where() returns a row iterator, and not > something that I can extract Column objects from, I don't see any way to > make it work. > You are probably looking for the readWhere() method<http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> which normally returns a numpy structured array. The line you are looking for is thus: histogram(tables.Expr('col0 + col1**2', mytable.readWhere('col2 > 15 & abs(col3) < 5')).eval()) This will likely be fast in both cases. I hope this helps. Be Well Anthony > > So, am I missing some way to compute the histogram variable in the numexpr > kernel, but only for rows I'm interested in? > Regards, > Jon > > > On 06/05/2012 09:45 PM, Anthony Scopatz wrote: > > Hello Jon, > > I believe that the where() method just uses the Expr / numexpr > functionality under the covers. Anything that you can do in Expr you > should be able to do in where(). Can you provide a short example where > this is not the case? > > Be Well > Anthony > > On Tue, Jun 5, 2012 at 6:17 PM, Jon Wilson <js...@fn...> wrote: > >> Hi all, >> In looking through the docs, I see two very nice features: the .where() >> query method, and the tables.Expr computation mechanism. But, it >> doesn't appear to be possible to combine the two. It appears that, if I >> want to compute some function of my columns, but only for certain rows, >> I have two options. >> - I can use tables.Expr to compute the function, and then filter the >> results in python >> - I can use mytable.where() to select the rows I'm interested in, and >> then compute the function in python >> >> Am I missing anything? Is it possible to perform fast out-of-core >> computations with numexpr, but only on a subset of the existing rows? >> Regards, >> Jon Wilson >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > Pytables-users mailing lis...@li...https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Jon W. <js...@fn...> - 2012-06-06 03:32:07
|
Hi Anthony, Allow me to clarify. I wish to perform a reduction (histogramming, specifically) over a function of some values, but only including certain rows. For instance, say I have a table with three columns, col0 -- col3. I would like to create a histogram of col0 + col1**2, but only where col2 > 15 and abs(col3) < 5. As far as I understand, I can do the following: histogram(array([row['col0'] + row['col1']**2 for row in mytable.where('col2 > 15 & abs(col3) < 5')])) And this does produce the desired histogram. However, mytable.where() returns an iterator over rows, and then the list comprehension computes col0 + col1**2 for each row in python space, which lacks the optimization and multithreading of the numexpr kernel. It seems as though it should be possible to have both the condition and the histogramming variable (col0 + col1**2) computed in the parallelized and optimized numexpr kernel, but I do not see a way to do this using where(). The alternative that I can see would be to do something like: histvar = tables.Expr('col0 + col1**2', vars(mytable.cols)).eval() selection = tables.Expr('col2 > 15 & abs(col3) < 5', vars(mytable.cols)).eval() histogram(histvar, weights = selection) This should produce the same histogram as above, and it does compute both the histogram variable and the query condition in the numexpr kernel, but it requires the computation of the histogram variable even for rows I do not wish to include in the histogram. If the table is very large and relatively few rows are selected, or if computing the histogram variable is expensive, this is quite undesirable. So, it seems that I can either a) use the fast query operator where(); or, b) perform all computation in numexpr. But not both. FWIW, a quick timeit test shows that, on a table with ~1M rows, for a very simple condition and a very simple histogram variable, the first method is faster than the second method even when all rows are selected. For a table with ~7M rows, for a more complex histogram variable and still a very simple condition, the first method is faster than the second method when only a few rows are selected, but when all rows are selected, the second method is more than 10x faster. (2.16s vs 3.27s for few rows, 43.1s vs 3.19s for all 7M rows) So it is clear that in some cases, method 2 could be sped up substantially, and in other cases, method 1 could be sped up enormously. I think something like histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 & abs(col3) < 5')).eval()) would be ideal, but since where() returns a row iterator, and not something that I can extract Column objects from, I don't see any way to make it work. So, am I missing some way to compute the histogram variable in the numexpr kernel, but only for rows I'm interested in? Regards, Jon On 06/05/2012 09:45 PM, Anthony Scopatz wrote: > Hello Jon, > > I believe that the where() method just uses the Expr / numexpr > functionality under the covers. Anything that you can do in Expr you > should be able to do in where(). Can you provide a short example > where this is not the case? > > Be Well > Anthony > > On Tue, Jun 5, 2012 at 6:17 PM, Jon Wilson <js...@fn... > <mailto:js...@fn...>> wrote: > > Hi all, > In looking through the docs, I see two very nice features: the > .where() > query method, and the tables.Expr computation mechanism. But, it > doesn't appear to be possible to combine the two. It appears > that, if I > want to compute some function of my columns, but only for certain > rows, > I have two options. 
> - I can use tables.Expr to compute the function, and then filter the > results in python > - I can use mytable.where() to select the rows I'm interested in, and > then compute the function in python > > Am I missing anything? Is it possible to perform fast out-of-core > computations with numexpr, but only on a subset of the existing rows? > Regards, > Jon Wilson > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
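For reference, the two approaches Jon compares above, condensed into a runnable sketch (the file, table and column names are the thread's hypothetical example; untested):

    import numpy as np
    import tables

    h5file = tables.openFile('data.h5', 'r')   # hypothetical file
    mytable = h5file.root.mytable              # table with columns col0..col3

    # Method 1: fast in-kernel selection via where(), but the histogram
    # variable is computed row by row in pure Python.
    hv1 = np.array([r['col0'] + r['col1']**2
                    for r in mytable.where('col2 > 15 & abs(col3) < 5')])
    h1, edges1 = np.histogram(hv1, bins=100)

    # Method 2: both expressions run in the numexpr kernel, but the histogram
    # variable is computed for every row, selected or not; unselected rows
    # only get zero weight.
    cols = vars(mytable.cols)
    histvar = tables.Expr('col0 + col1**2', cols).eval()
    selected = tables.Expr('(col2 > 15) & (abs(col3) < 5)', cols).eval()
    h2, edges2 = np.histogram(histvar, weights=selected, bins=100)

    h5file.close()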
From: Anthony S. <sc...@gm...> - 2012-06-06 02:46:17
|
Hello Jon, I believe that the where() method just uses the Expr / numexpr functionality under the covers. Anything that you can do in Expr you should be able to do in where(). Can you provide a short example where this is not the case? Be Well Anthony On Tue, Jun 5, 2012 at 6:17 PM, Jon Wilson <js...@fn...> wrote: > Hi all, > In looking through the docs, I see two very nice features: the .where() > query method, and the tables.Expr computation mechanism. But, it > doesn't appear to be possible to combine the two. It appears that, if I > want to compute some function of my columns, but only for certain rows, > I have two options. > - I can use tables.Expr to compute the function, and then filter the > results in python > - I can use mytable.where() to select the rows I'm interested in, and > then compute the function in python > > Am I missing anything? Is it possible to perform fast out-of-core > computations with numexpr, but only on a subset of the existing rows? > Regards, > Jon Wilson > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Jon W. <js...@fn...> - 2012-06-05 23:31:17
|
Hi all, In looking through the docs, I see two very nice features: the .where() query method, and the tables.Expr computation mechanism. But, it doesn't appear to be possible to combine the two. It appears that, if I want to compute some function of my columns, but only for certain rows, I have two options. - I can use tables.Expr to compute the function, and then filter the results in python - I can use mytable.where() to select the rows I'm interested in, and then compute the function in python Am I missing anything? Is it possible to perform fast out-of-core computations with numexpr, but only on a subset of the existing rows? Regards, Jon Wilson |
From: Francesc A. <fa...@py...> - 2012-06-02 09:58:43
|
Hi Chao, On 6/2/12 11:55 AM, Chao YUE wrote: > if I use gdalinfo to check the file: > chaoyue@chaoyue-Aspire-4750:~/Downloads/LISOTD$ gdalinfo > LISOTD_HRMC_V2.3.2011.hdf > Driver: HDF4/Hierarchical Data Format Release 4 [clip] This says that the file is in HDF4 format, not HDF5. Please note that PyTables can only deal with HDF5 files. For HDF4 I'd rather use pyhdf: http://pysclint.sourceforge.net/pyhdf/ -- Francesc Alted |
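A small sketch of how the format mix-up could be caught up front: check for HDF5 with tables.isHDF5File() and fall back to pyhdf for HDF4 files. The dataset name comes from the gdalinfo listing above; the snippet is untested:

    import tables

    fname = 'LISOTD_HRMC_V2.3.2011.hdf'

    if tables.isHDF5File(fname):
        h5file = tables.openFile(fname, 'r')
        print h5file
        h5file.close()
    else:
        # HDF4 file: read it with pyhdf instead of PyTables.
        from pyhdf.SD import SD, SDC
        sd = SD(fname, SDC.READ)
        print sd.datasets()                  # list the scientific datasets
        data = sd.select('HRMC_COM_FR')[:]   # 360x720x12 float32 array
        sd.end()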
From: Chao Y. <cha...@gm...> - 2012-06-02 09:56:02
|
if I use gdalinfo to check the file: chaoyue@chaoyue-Aspire-4750:~/Downloads/LISOTD$ gdalinfo LISOTD_HRMC_V2.3.2011.hdf Driver: HDF4/Hierarchical Data Format Release 4 Files: LISOTD_HRMC_V2.3.2011.hdf Size is 512, 512 Coordinate System is `' Subdatasets: SUBDATASET_1_NAME=HDF4_SDS:UNKNOWN:"LISOTD_HRMC_V2.3.2011.hdf":0 SUBDATASET_1_DESC=[360x720x12] HRMC_COM_FR (32-bit floating-point) SUBDATASET_2_NAME=HDF4_SDS:UNKNOWN:"LISOTD_HRMC_V2.3.2011.hdf":4 SUBDATASET_2_DESC=[360x720x12] HRSC_COM_FR (32-bit floating-point) Corner Coordinates: Upper Left ( 0.0, 0.0) Lower Left ( 0.0, 512.0) Upper Right ( 512.0, 0.0) Lower Right ( 512.0, 512.0) Center ( 256.0, 256.0) Chao 2012/6/2 Chao YUE <cha...@gm...> > Dear all, > > I tried to use pytalbes to read a hdf file, but I got error: > I searched a little bit online, there might be cases you have more than 2 > file handlers for the same file and they are opened for both read and > write, you'll probably have this error. > But it's not my case that I open it only for the first time. From the > error message, it seems that there is no root group. > > In [1]: h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf','r') > --------------------------------------------------------------------------- > HDF5ExtError Traceback (most recent call last) > /home/chaoyue/Downloads/LISOTD/<ipython-input-1-b53d861308cf> in <module>() > ----> 1 h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf','r') > > /usr/local/lib/python2.7/dist-packages/tables/file.pyc in > openFile(filename, mode, title, rootUEP, filters, **kwargs) > 256 return filehandle > 257 # Finally, create the File instance, and return it > > --> 258 return File(filename, mode, title, rootUEP, filters, **kwargs) > 259 > 260 > > /usr/local/lib/python2.7/dist-packages/tables/file.pyc in __init__(self, > filename, mode, title, rootUEP, filters, **kwargs) > 565 > 566 # Get the root group from this file > > --> 567 self.root = root = self.__getRootGroup(rootUEP, title, > filters) > 568 # Complete the creation of the root node > > 569 # (see the explanation in ``RootGroup.__init__()``. > > > /usr/local/lib/python2.7/dist-packages/tables/file.pyc in > __getRootGroup(self, rootUEP, title, filters) > 614 # Create new attributes for the root Group instance and > > 615 # create the object tree > > --> 616 return RootGroup(self, rootUEP, title=title, new=new, > filters=filters) > 617 > 618 > > /usr/local/lib/python2.7/dist-packages/tables/group.pyc in __init__(self, > ptFile, name, title, new, filters) > 1155 self._g_new(ptFile, name, init=True) > 1156 # Open the node and get its object ID. > > -> 1157 self._v_objectID = self._g_open() > 1158 > 1159 # Set disk attributes and read children names. > > > /usr/local/lib/python2.7/dist-packages/tables/hdf5Extension.so in > tables.hdf5Extension.Group._g_open (tables/hdf5Extension.c:5521)() > > HDF5ExtError: Can't open the group: '/'. 
> > If I open the file with this command, it gave no error: > > In [2]: h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf',mode='r') > > But when I try to print the file information, I get: > > In [3]: print h5file > Exception tables.exceptions.HDF5ExtError: HDF5ExtError('Problems closing > the Group /',) in ignored > --------------------------------------------------------------------------- > AttributeError Traceback (most recent call last) > /home/chaoyue/Downloads/LISOTD/<ipython-input-3-64c76de88957> in <module>() > ----> 1 print h5file > > /usr/local/lib/python2.7/dist-packages/tables/file.pyc in __str__(self) > 2197 # Print all the nodes (Group and Leaf objects) on object > tree > > 2198 date = > time.asctime(time.localtime(os.stat(self.filename)[8])) > -> 2199 astring = self.filename + ' (File) ' + repr(self.title) + > '\n' > 2200 # astring += 'rootUEP :=' + repr(self.rootUEP) + '; ' > > 2201 # astring += 'format_version := ' + self.format_version + > '\n' > > > /usr/local/lib/python2.7/dist-packages/tables/file.pyc in _gettitle(self) > 474 > 475 def _gettitle(self): > --> 476 return self.root._v_title > 477 def _settitle(self, title): > 478 self.root._v_title = title > > AttributeError: 'File' object has no attribute 'root' > > I have not too much experience handling HDF data but it's the second time > I have this problem. > In both cases the data are downloaded from official release of research > data so I think it's unlikely that the data itself are badly produced. > But if anyone has any interest trying to have a look of the issue, the > data is at: > ftp://ghrc.nsstc.nasa.gov/pub/lis/climatology/HRMC/data/ > > The ftp is anonymous and the data released by NASA. > > thanks for any help in advance, > > best regards, > > Chao > > -- > > *********************************************************************************** > Chao YUE > Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) > UMR 1572 CEA-CNRS-UVSQ > Batiment 712 - Pe 119 > 91191 GIF Sur YVETTE Cedex > Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16 > > ************************************************************************************ > > -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16 ************************************************************************************ |
From: Chao Y. <cha...@gm...> - 2012-06-02 09:36:21
|
Dear all, I tried to use pytalbes to read a hdf file, but I got error: I searched a little bit online, there might be cases you have more than 2 file handlers for the same file and they are opened for both read and write, you'll probably have this error. But it's not my case that I open it only for the first time. From the error message, it seems that there is no root group. In [1]: h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf','r') --------------------------------------------------------------------------- HDF5ExtError Traceback (most recent call last) /home/chaoyue/Downloads/LISOTD/<ipython-input-1-b53d861308cf> in <module>() ----> 1 h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf','r') /usr/local/lib/python2.7/dist-packages/tables/file.pyc in openFile(filename, mode, title, rootUEP, filters, **kwargs) 256 return filehandle 257 # Finally, create the File instance, and return it --> 258 return File(filename, mode, title, rootUEP, filters, **kwargs) 259 260 /usr/local/lib/python2.7/dist-packages/tables/file.pyc in __init__(self, filename, mode, title, rootUEP, filters, **kwargs) 565 566 # Get the root group from this file --> 567 self.root = root = self.__getRootGroup(rootUEP, title, filters) 568 # Complete the creation of the root node 569 # (see the explanation in ``RootGroup.__init__()``. /usr/local/lib/python2.7/dist-packages/tables/file.pyc in __getRootGroup(self, rootUEP, title, filters) 614 # Create new attributes for the root Group instance and 615 # create the object tree --> 616 return RootGroup(self, rootUEP, title=title, new=new, filters=filters) 617 618 /usr/local/lib/python2.7/dist-packages/tables/group.pyc in __init__(self, ptFile, name, title, new, filters) 1155 self._g_new(ptFile, name, init=True) 1156 # Open the node and get its object ID. -> 1157 self._v_objectID = self._g_open() 1158 1159 # Set disk attributes and read children names. /usr/local/lib/python2.7/dist-packages/tables/hdf5Extension.so in tables.hdf5Extension.Group._g_open (tables/hdf5Extension.c:5521)() HDF5ExtError: Can't open the group: '/'. If I open the file with this command, it gave no error: In [2]: h5file=tables.openFile('LISOTD_HRMC_V2.3.2011.hdf',mode='r') But when I try to print the file information, I get: In [3]: print h5file Exception tables.exceptions.HDF5ExtError: HDF5ExtError('Problems closing the Group /',) in ignored --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) /home/chaoyue/Downloads/LISOTD/<ipython-input-3-64c76de88957> in <module>() ----> 1 print h5file /usr/local/lib/python2.7/dist-packages/tables/file.pyc in __str__(self) 2197 # Print all the nodes (Group and Leaf objects) on object tree 2198 date = time.asctime(time.localtime(os.stat(self.filename)[8])) -> 2199 astring = self.filename + ' (File) ' + repr(self.title) + '\n' 2200 # astring += 'rootUEP :=' + repr(self.rootUEP) + '; ' 2201 # astring += 'format_version := ' + self.format_version + '\n' /usr/local/lib/python2.7/dist-packages/tables/file.pyc in _gettitle(self) 474 475 def _gettitle(self): --> 476 return self.root._v_title 477 def _settitle(self, title): 478 self.root._v_title = title AttributeError: 'File' object has no attribute 'root' I have not too much experience handling HDF data but it's the second time I have this problem. In both cases the data are downloaded from official release of research data so I think it's unlikely that the data itself are badly produced. 
But if anyone has any interest trying to have a look of the issue, the data is at: ftp://ghrc.nsstc.nasa.gov/pub/lis/climatology/HRMC/data/ The ftp is anonymous and the data released by NASA. thanks for any help in advance, best regards, Chao -- *********************************************************************************** Chao YUE Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL) UMR 1572 CEA-CNRS-UVSQ Batiment 712 - Pe 119 91191 GIF Sur YVETTE Cedex Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16 ************************************************************************************ |
From: Francesc A. <fa...@py...> - 2012-05-22 09:18:16
|
On 5/21/12 9:31 PM, Josh Ayers wrote: > Hi Alex, > > Reading a PyTables file in another platform should be easy, as long as > you use a compression library that is supported on both platforms. > The most widely available is likely to be zlib, since it is included > in the pre-built binaries available from the HDF group's website. > There are C and Fortran versions available here: > http://www.hdfgroup.org/HDF5/release/obtain5.html. It looks like > there's also a partial .NET wrapper of the library here: http://hdf5.net/. > > Recent versions of Matlab also have support for HDF5 (the v7.3 > "mat-file" format is based on it). Since I have it available, I just > verified that Matlab R2011b can read PyTables files in uncompressed > and zlib compressed formats, using Matlab's h5read function. It > failed when the PyTables file was compressed with bzip2, lzo, or > blosc. I only tested it with a PyTables table, which is read into > Matlab as a struct. Blosc is prepared to interact with the generic HDF5 library, so your current HDF5 applications can read datasets compressed with it. But you need to re-compile your HDF5 library for getting this support: https://github.com/FrancescAlted/blosc/tree/master/hdf5 > > As far as writing files on another platform and then reading them in > PyTables, that will be a little more difficult. There are certain > HDF5 attributes that are required by PyTables on each group and > dataset. All the details are documented here: > http://pytables.github.com/usersguide/file_format.html. Nope. These attributes are not required, they are optional. PyTables generally makes a good job at accessing HDF5 without this info. FYI, these attributes are a superset of the High Level HDF5 library: http://www.hdfgroup.org/HDF5/hdf5_hl/ -- Francesc Alted |
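Following the portability advice in this thread, a minimal sketch of writing a table with zlib compression so that stock HDF5 tools on other platforms can read it (the file, table and column names are made up):

    import tables

    class Sample(tables.IsDescription):
        time = tables.Float64Col()
        value = tables.Float32Col()

    # zlib and shuffle are built into the stock HDF5 library, so files
    # compressed this way stay readable from C, Fortran, Matlab, etc.
    filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)

    f = tables.openFile('shared_data.h5', 'w')
    tbl = f.createTable('/', 'samples', Sample, filters=filters)
    row = tbl.row
    for i in xrange(1000):
        row['time'] = i * 0.1
        row['value'] = i
        row.append()
    tbl.flush()
    f.close()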
From: Anthony S. <sc...@gm...> - 2012-05-21 20:45:07
|
Hi Alex, In general, HDF5 files are very portable to many platforms and many languages. Indeed, that is sort of the purpose behind the HDF Group. While there are some incompatible edge cases, you sort of have to look for them. Josh did a very good job of outlining the support for HDF5 across the board. However, I would like to add that I have been using HDF5 / PyTables for 3 - 4 years and have never had a compatibility issue. Be Well Anthony On Mon, May 21, 2012 at 2:31 PM, Josh Ayers <jos...@gm...> wrote: > Hi Alex, > > Reading a PyTables file in another platform should be easy, as long as you > use a compression library that is supported on both platforms. The most > widely available is likely to be zlib, since it is included in the > pre-built binaries available from the HDF group's website. There are C and > Fortran versions available here: > http://www.hdfgroup.org/HDF5/release/obtain5.html. It looks like there's > also a partial .NET wrapper of the library here: http://hdf5.net/. > > Recent versions of Matlab also have support for HDF5 (the v7.3 "mat-file" > format is based on it). Since I have it available, I just verified that > Matlab R2011b can read PyTables files in uncompressed and zlib compressed > formats, using Matlab's h5read function. It failed when the PyTables file > was compressed with bzip2, lzo, or blosc. I only tested it with a PyTables > table, which is read into Matlab as a struct. > > As far as writing files on another platform and then reading them in > PyTables, that will be a little more difficult. There are certain HDF5 > attributes that are required by PyTables on each group and dataset. All > the details are documented here: > http://pytables.github.com/usersguide/file_format.html. > > Hope that helps, > > Josh > > > On Mon, May 21, 2012 at 10:12 AM, Alex Liberzon <ale...@gm...>wrote: > >> Dear PyTables developers, >> >> Thanks for the great project. >> >> I would like to suggest to a small scientific community (Lagrangian >> particle tracking velocimetry and numerical simulations of turbulent flows) >> to start using PyTables as a common platform for exchanging of large >> datasets (few gigas to tens of terabytes). The major advantage I see in the >> great query and on-disk analysis capabilities that are not present in the >> original HDF5. However, one major drawback from some of the groups is the >> question of software: people work with C, C#, Fortran, Python, Matlab and >> use a wide range of visualization software platforms. In order to get to a >> common ground we need something that I couldn't find so far: C, Fortran >> libraries to access PyTables HDF files created by this great Python >> library. What are the suggestions? Does somebody have a similar experience >> of sharing data between groups that do not use Python? >> >> Thank you, >> Alex Liberzon >> Turbulence Structure Laboratory >> Tel Aviv University >> >> >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... 
>> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Josh A. <jos...@gm...> - 2012-05-21 19:31:12
|
Hi Alex, Reading a PyTables file in another platform should be easy, as long as you use a compression library that is supported on both platforms. The most widely available is likely to be zlib, since it is included in the pre-built binaries available from the HDF group's website. There are C and Fortran versions available here: http://www.hdfgroup.org/HDF5/release/obtain5.html. It looks like there's also a partial .NET wrapper of the library here: http://hdf5.net/. Recent versions of Matlab also have support for HDF5 (the v7.3 "mat-file" format is based on it). Since I have it available, I just verified that Matlab R2011b can read PyTables files in uncompressed and zlib compressed formats, using Matlab's h5read function. It failed when the PyTables file was compressed with bzip2, lzo, or blosc. I only tested it with a PyTables table, which is read into Matlab as a struct. As far as writing files on another platform and then reading them in PyTables, that will be a little more difficult. There are certain HDF5 attributes that are required by PyTables on each group and dataset. All the details are documented here: http://pytables.github.com/usersguide/file_format.html. Hope that helps, Josh On Mon, May 21, 2012 at 10:12 AM, Alex Liberzon <ale...@gm...>wrote: > Dear PyTables developers, > > Thanks for the great project. > > I would like to suggest to a small scientific community (Lagrangian > particle tracking velocimetry and numerical simulations of turbulent flows) > to start using PyTables as a common platform for exchanging of large > datasets (few gigas to tens of terabytes). The major advantage I see in the > great query and on-disk analysis capabilities that are not present in the > original HDF5. However, one major drawback from some of the groups is the > question of software: people work with C, C#, Fortran, Python, Matlab and > use a wide range of visualization software platforms. In order to get to a > common ground we need something that I couldn't find so far: C, Fortran > libraries to access PyTables HDF files created by this great Python > library. What are the suggestions? Does somebody have a similar experience > of sharing data between groups that do not use Python? > > Thank you, > Alex Liberzon > Turbulence Structure Laboratory > Tel Aviv University > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Alex L. <ale...@gm...> - 2012-05-21 17:12:58
|
Dear PyTables developers, Thanks for the great project. I would like to suggest to a small scientific community (Lagrangian particle tracking velocimetry and numerical simulations of turbulent flows) that it start using PyTables as a common platform for exchanging large datasets (a few gigabytes to tens of terabytes). The major advantage I see is the great query and on-disk analysis capabilities that are not present in the original HDF5. However, one major drawback for some of the groups is the question of software: people work with C, C#, Fortran, Python, Matlab and use a wide range of visualization software platforms. To get to a common ground we need something that I couldn't find so far: C and Fortran libraries to access PyTables HDF files created by this great Python library. What are the suggestions? Does somebody have similar experience of sharing data between groups that do not use Python? Thank you, Alex Liberzon Turbulence Structure Laboratory Tel Aviv University |
From: Anthony S. <sc...@gm...> - 2012-05-21 14:34:41
|
Hi Uwe, Sorry, I wrote this when I was away from my computer and so I couldn't test it. Our documentation is clearly wrong then. However, what you *can* do is take the dtype from a known VideoNode table and then compare using this. known_dtype = f.root.path_to_a_video_node.dtype bar = filter(x.dtype == known_dtype for x in f.walkNodes('/', 'Table')) Note that in your file your file you could create an empty table with the VideoNode description at a specific location just so that you can read out this dtype. Be Well Anthony On Mon, May 21, 2012 at 6:20 AM, Uwe Mayer <uwe...@df...> wrote: > Hi Anthony, > > On 05/19/2012 08:12 PM, Anthony Scopatz wrote: > > Hello Uwe, > > > > Why don't you try something like: > > > > bar = filter(x.description == VideoNode for x in f.walkNodes('/', > 'Table')) > > > > or > > > > bar = filter(x.dtype == VideoNode._v_dtype for x in f.walkNodes('/', > > 'Table')) > > > > to compare the dtype / description directly? > > correction on my behalf, that would be exactly what I needed, but: > > - x.description compares false to a (correct) subclass of > tables.IsDescription > > - a subclass of tables.IsDescription has no property _v_dtype to compare > to x.dtype (from your example above) > > > Any other ideas? > > Thanks in advance, > Uwe > > > > On May 18, 2012 8:00 AM, "Uwe Mayer" <uwe...@df... > > <mailto:uwe...@df...>> wrote: > > > > Hi, > > > > I have several leaf nodes of the same table dtype: > > > > class VideoNode(tables.IsDescription): > > ... > > > > Not all tables in the hdf5 file are of the same type, however. How > > do I iterate > > over all leafes which are tables of the above class, while ignoring > > tables with > > different signatures? > > > > i.e. I'd like to write something like: > > <code> > > f = tables.openFile(...) > > foo = f.walkNodes('/', classname='VideoNode') > > </code> > > which does not work because only the class name is "Table"... > > > > or > > <code> > > bar = filter(isinstance(x, VideoNode) for x in f.walkNodes('/', > > 'Table'))) > > </code> > > > > which does not work, because x is never an instance of VideoNode. > > > > Any ideas? > > > > Thanks in advance > > Uwe > > > > > > > ------------------------------------------------------------------------------ > > Live Security Virtual Conference > > Exclusive live event will cover all the ways today's security and > > threat landscape has changed and how IT managers can respond. > > Discussions > > will include endpoint security, mobile security and the latest in > > malware > > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > <mailto:Pyt...@li...> > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > ------------------------------------------------------------------------------ > > Live Security Virtual Conference > > Exclusive live event will cover all the ways today's security and > > threat landscape has changed and how IT managers can respond. Discussions > > will include endpoint security, mobile security and the latest in malware > > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > > > > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... 
> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
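A self-contained variant of the suggestion above, with the filter() call rewritten as a list comprehension; the file name, node path and VideoNode columns are placeholders:

    import tables

    class VideoNode(tables.IsDescription):   # columns here are placeholders
        frame = tables.Int32Col()
        timestamp = tables.Float64Col()

    f = tables.openFile('videos.h5', 'a')    # hypothetical file

    # Take the reference dtype from a table known to use the VideoNode
    # description; if none exists yet, create a throwaway empty one.
    try:
        known_dtype = f.root.known_video_table.dtype
    except tables.NoSuchNodeError:
        known_dtype = f.createTable('/', 'videonode_template', VideoNode).dtype

    video_tables = [node for node in f.walkNodes('/', classname='Table')
                    if node.dtype == known_dtype]

    f.close()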
From: Uwe M. <uwe...@df...> - 2012-05-21 11:20:50
|
Hi Anthony, On 05/19/2012 08:12 PM, Anthony Scopatz wrote: > Hello Uwe, > > Why don't you try something like: > > bar = filter(x.description == VideoNode for x in f.walkNodes('/', 'Table')) > > or > > bar = filter(x.dtype == VideoNode._v_dtype for x in f.walkNodes('/', > 'Table')) > > to compare the dtype / description directly? correction on my behalf, that would be exactly what I needed, but: - x.description compares false to a (correct) subclass of tables.IsDescription - a subclass of tables.IsDescription has no property _v_dtype to compare to x.dtype (from your example above) Any other ideas? Thanks in advance, Uwe > On May 18, 2012 8:00 AM, "Uwe Mayer" <uwe...@df... > <mailto:uwe...@df...>> wrote: > > Hi, > > I have several leaf nodes of the same table dtype: > > class VideoNode(tables.IsDescription): > ... > > Not all tables in the hdf5 file are of the same type, however. How > do I iterate > over all leafes which are tables of the above class, while ignoring > tables with > different signatures? > > i.e. I'd like to write something like: > <code> > f = tables.openFile(...) > foo = f.walkNodes('/', classname='VideoNode') > </code> > which does not work because only the class name is "Table"... > > or > <code> > bar = filter(isinstance(x, VideoNode) for x in f.walkNodes('/', > 'Table'))) > </code> > > which does not work, because x is never an instance of VideoNode. > > Any ideas? > > Thanks in advance > Uwe > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Uwe M. <uwe...@df...> - 2012-05-21 08:15:04
|
Hi Anthony, On 05/19/2012 08:12 PM, Anthony Scopatz wrote: > Why don't you try something like: > > bar = filter(x.description == VideoNode for x in f.walkNodes('/', 'Table')) > > or > > bar = filter(x.dtype == VideoNode._v_dtype for x in f.walkNodes('/', > 'Table')) > > to compare the dtype / description directly? Oh. *g* This is exactly what I was looking for. I did not know how to use the class in a comparison for this. Thank you! Uwe > On May 18, 2012 8:00 AM, "Uwe Mayer" <uwe...@df... > <mailto:uwe...@df...>> wrote: > > Hi, > > I have several leaf nodes of the same table dtype: > > class VideoNode(tables.IsDescription): > ... > > Not all tables in the hdf5 file are of the same type, however. How > do I iterate > over all leafes which are tables of the above class, while ignoring > tables with > different signatures? > > i.e. I'd like to write something like: > <code> > f = tables.openFile(...) > foo = f.walkNodes('/', classname='VideoNode') > </code> > which does not work because only the class name is "Table"... > > or > <code> > bar = filter(isinstance(x, VideoNode) for x in f.walkNodes('/', > 'Table'))) > </code> > > which does not work, because x is never an instance of VideoNode. > > Any ideas? > > Thanks in advance > Uwe > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Anthony S. <sc...@gm...> - 2012-05-19 18:13:03
|
Hello Uwe, Why don't you try something like: bar = filter(x.description == VideoNode for x in f.walkNodes('/', 'Table')) or bar = filter(x.dtype == VideoNode._v_dtype for x in f.walkNodes('/', 'Table')) to compare the dtype / description directly? Be Well Anthony On May 18, 2012 8:00 AM, "Uwe Mayer" <uwe...@df...> wrote: > Hi, > > I have several leaf nodes of the same table dtype: > > class VideoNode(tables.IsDescription): > ... > > Not all tables in the hdf5 file are of the same type, however. How do I > iterate > over all leafes which are tables of the above class, while ignoring tables > with > different signatures? > > i.e. I'd like to write something like: > <code> > f = tables.openFile(...) > foo = f.walkNodes('/', classname='VideoNode') > </code> > which does not work because only the class name is "Table"... > > or > <code> > bar = filter(isinstance(x, VideoNode) for x in f.walkNodes('/', 'Table'))) > </code> > > which does not work, because x is never an instance of VideoNode. > > Any ideas? > > Thanks in advance > Uwe > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Anthony S. <sc...@gm...> - 2012-05-19 17:57:50
|
Hello Nikola, Thanks for reporting this issue (and sorry about the delayed reply). I have two requests for you: 1. could you come up with a self contained example that reproduces this behaviour? 2. and could you maybe make a github issue related to this problem? #1 is much more important. Thanks a ton! Be Well Anthony On May 18, 2012 5:01 AM, "nikola stevanovic" <nid...@gm...> wrote: > *Hi,* > > Couple days ago, I make some experiments with pytables. I was curious > about reading and writing speed for my future project. > So, I decided make some tests. In my hdf5 files I have only one table > named *Table_1*. I started tests with one million rows and after that > keep continue testing with 100 000 000 and 500 000 000. This is how looks > table structure: > > /Table_1 (Table(500000000,)) '' > description := { > "Device_ID": StringCol(itemsize=14, shape=(), dflt='', pos=0), > "DateTime": Time32Col(shape=(), dflt=0, pos=1), > "Value": Float32Col(shape=(), dflt=0.0, pos=2), > "Status": StringCol(itemsize=10, shape=(), dflt='', pos=3)} > byteorder := 'little' > chunkshape := (2048,0) > autoIndex := True > colindexes := { > "DateTime": Index(9, full, shuffle, zlib(1)).is_CSI=True} > > > I didn't change chunkshape (default from creating table > chunkshape=(2048,0)). Only thing I did is creating index on column > DateTime. Everything worked fine. But, after 500 000 000 rows, I decide > compare this table and table whith chunkshape=(65536). So I copy this table > in other hdf5 file using ptrepack tool: > > ptrepack --chunkshape='(65536,0)' /home/azura/a.h5:/Table_1 > /home/azura/b.h5:/ > > My new table work fine until I create index (CSIndex()) on DateTime > column. Index creation was successful, but calling methods as *where(), > getWhereList()* throws following exception: > > query = '(DateTime > 1293836400.0) & (DateTime < 1297292400.0)' > a = numpy.array([ (x['Device_ID'],x['DateTime'],x['Value']) for x in > tbl.where(query) ]) > Traceback (most recent call last): > File "<pyshell#100>", line 1, in <module> > a = numpy.array([ (x['Device_ID'],x['DateTime'],x['Value']) for x in > tbl.where(query) ]) > File "tableExtension.pyx", line 858, in > tables.tableExtension.Row.__next__ (tables/tableExtension.c:7788) > File "tableExtension.pyx", line 879, in > tables.tableExtension.Row.__next__indexed (tables/tableExtension.c:7922) > AssertionError > > > Then I decide make same table without ptrepack tool. So I created new > table and fill with 500 000 000 rows (same chunkshape, same record > structure). Everythings works fine, so my conclusion is that there is a bug > in ptrepack tool. Note that exception appear in copied table after creating > CS index. I'm just curious about this. What can be wrong? 
> > I'm using Ubuntu 12.04TLS with ext4 > Processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4 > RAM: 4GB > HARD DISK: 500GB > > > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > PyTables version: 2.3.1 > HDF5 version: 1.8.4-patch1 > NumPy version: 1.6.0 > Numexpr version: 2.0.1 (not using Intel's VML/MKL) > Zlib version: 1.2.3.4 (in Python interpreter) > Blosc version: 1.1.2 (2010-11-04) > Cython version: 0.16 > Python version: 2.7.3 (default, Apr 20 2012, 22:44:07) > [GCC 4.6.3] > Platform: linux2-i686 > Byte-ordering: little > Detected cores: 4 > > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= > > File(filename=/home/azura/b.h5, title='', mode='a', rootUEP='/', > filters=Filters(complevel=0, shuffle=False, fletcher32=False)) > / (RootGroup) '' > /Table_1 (Table(500000000,)) '' > description := { > "Device_ID": StringCol(itemsize=14, shape=(), dflt='', pos=0), > "DateTime": Time32Col(shape=(), dflt=0, pos=1), > "Value": Float32Col(shape=(), dflt=0.0, pos=2), > "Status": StringCol(itemsize=10, shape=(), dflt='', pos=3)} > byteorder := 'little' > chunkshape := (65536,) > autoIndex := True > colindexes := { > "DateTime": Index(9, full, shuffle, zlib(1)).is_CSI=True} > > *Cheers!* > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Uwe M. <uwe...@df...> - 2012-05-18 13:00:18
|
Hi, I have several leaf nodes of the same table dtype: class VideoNode(tables.IsDescription): ... Not all tables in the HDF5 file are of the same type, however. How do I iterate over all leaves which are tables of the above class, while ignoring tables with different signatures? i.e. I'd like to write something like: <code> f = tables.openFile(...) foo = f.walkNodes('/', classname='VideoNode') </code> which does not work because only the class name is "Table"... or <code> bar = filter(isinstance(x, VideoNode) for x in f.walkNodes('/', 'Table'))) </code> which does not work, because x is never an instance of VideoNode. Any ideas? Thanks in advance Uwe |
From: nikola s. <nid...@gm...> - 2012-05-18 10:01:10
|
*Hi,* Couple days ago, I make some experiments with pytables. I was curious about reading and writing speed for my future project. So, I decided make some tests. In my hdf5 files I have only one table named *Table_1*. I started tests with one million rows and after that keep continue testing with 100 000 000 and 500 000 000. This is how looks table structure: /Table_1 (Table(500000000,)) '' description := { "Device_ID": StringCol(itemsize=14, shape=(), dflt='', pos=0), "DateTime": Time32Col(shape=(), dflt=0, pos=1), "Value": Float32Col(shape=(), dflt=0.0, pos=2), "Status": StringCol(itemsize=10, shape=(), dflt='', pos=3)} byteorder := 'little' chunkshape := (2048,0) autoIndex := True colindexes := { "DateTime": Index(9, full, shuffle, zlib(1)).is_CSI=True} I didn't change chunkshape (default from creating table chunkshape=(2048,0)). Only thing I did is creating index on column DateTime. Everything worked fine. But, after 500 000 000 rows, I decide compare this table and table whith chunkshape=(65536). So I copy this table in other hdf5 file using ptrepack tool: ptrepack --chunkshape='(65536,0)' /home/azura/a.h5:/Table_1 /home/azura/b.h5:/ My new table work fine until I create index (CSIndex()) on DateTime column. Index creation was successful, but calling methods as *where(), getWhereList()* throws following exception: query = '(DateTime > 1293836400.0) & (DateTime < 1297292400.0)' a = numpy.array([ (x['Device_ID'],x['DateTime'],x['Value']) for x in tbl.where(query) ]) Traceback (most recent call last): File "<pyshell#100>", line 1, in <module> a = numpy.array([ (x['Device_ID'],x['DateTime'],x['Value']) for x in tbl.where(query) ]) File "tableExtension.pyx", line 858, in tables.tableExtension.Row.__next__ (tables/tableExtension.c:7788) File "tableExtension.pyx", line 879, in tables.tableExtension.Row.__next__indexed (tables/tableExtension.c:7922) AssertionError Then I decide make same table without ptrepack tool. So I created new table and fill with 500 000 000 rows (same chunkshape, same record structure). Everythings works fine, so my conclusion is that there is a bug in ptrepack tool. Note that exception appear in copied table after creating CS index. I'm just curious about this. What can be wrong? I'm using Ubuntu 12.04TLS with ext4 Processor: Intel® Core™ i3 CPU M 380 @ 2.53GHz × 4 RAM: 4GB HARD DISK: 500GB -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= PyTables version: 2.3.1 HDF5 version: 1.8.4-patch1 NumPy version: 1.6.0 Numexpr version: 2.0.1 (not using Intel's VML/MKL) Zlib version: 1.2.3.4 (in Python interpreter) Blosc version: 1.1.2 (2010-11-04) Cython version: 0.16 Python version: 2.7.3 (default, Apr 20 2012, 22:44:07) [GCC 4.6.3] Platform: linux2-i686 Byte-ordering: little Detected cores: 4 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= File(filename=/home/azura/b.h5, title='', mode='a', rootUEP='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False)) / (RootGroup) '' /Table_1 (Table(500000000,)) '' description := { "Device_ID": StringCol(itemsize=14, shape=(), dflt='', pos=0), "DateTime": Time32Col(shape=(), dflt=0, pos=1), "Value": Float32Col(shape=(), dflt=0.0, pos=2), "Status": StringCol(itemsize=10, shape=(), dflt='', pos=3)} byteorder := 'little' chunkshape := (65536,) autoIndex := True colindexes := { "DateTime": Index(9, full, shuffle, zlib(1)).is_CSI=True} *Cheers!* |
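A skeleton of the kind of self-contained reproduction requested earlier in the thread; the row count, column values and query bounds are placeholders, and whether this actually triggers the AssertionError presumably depends on the table size and index:

    import subprocess
    import tables

    class Record(tables.IsDescription):
        Device_ID = tables.StringCol(14, pos=0)
        DateTime = tables.Time32Col(pos=1)
        Value = tables.Float32Col(pos=2)
        Status = tables.StringCol(10, pos=3)

    # 1. Build a table like the one in the report (scale N up as needed).
    f = tables.openFile('a.h5', 'w')
    tbl = f.createTable('/', 'Table_1', Record)
    row = tbl.row
    for i in xrange(1000000):
        row['Device_ID'] = 'dev%02d' % (i % 10)
        row['DateTime'] = 1290000000 + i
        row['Value'] = float(i)
        row['Status'] = 'ok'
        row.append()
    tbl.flush()
    f.close()

    # 2. Copy it with a different chunkshape, as in the report.
    subprocess.check_call(['ptrepack', '--chunkshape=(65536,)',
                           'a.h5:/Table_1', 'b.h5:/'])

    # 3. Index the copy and run an indexed query on it.
    f = tables.openFile('b.h5', 'a')
    tbl = f.root.Table_1
    tbl.cols.DateTime.createCSIndex()
    rows = tbl.getWhereList('(DateTime > 1290000100) & (DateTime < 1290000200)')
    print len(rows)
    f.close()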
From: Francesc A. <fa...@py...> - 2012-05-14 21:11:39
|
On 5/14/12 3:12 PM, Anthony Scopatz wrote: > > > On Mon, May 14, 2012 at 3:05 PM, Francesc Alted <fa...@py... > <mailto:fa...@py...>> wrote: > > [snip] > > However, do not expect to use all your cores at full speed in this > cases, as the reductions in numexpr can only make use of one > thread (this is because this has not been implemented yet, not due > to a intrinsic limitation of numexpr). > > > Hello Francesc, > > Not to side track the discussion too much, but is there a ticket open > for this in numexpr? There is: http://code.google.com/p/numexpr/issues/detail?id=73 > It seems that at least for certain reductions (sum, mult, etc), > splitting this up over many cores would be pretty easy. I may to > wrong about this though ;) Apparently should be easy, but the reality proves it to be a bit harder ;) I remember to spent some quality time on this, but did not get able to solve the problem. But it is *solvable* for sure. Anyway, after looking into the ticket above, the next code could be faster: fn_str = '(a - (b + %s))**2' % db expr = Expr(fn_str,uservars=uv) # returning the "sum of squares" return expr.eval().sum() Which is basically what you was suggesting: using numpy.sum(). But definitely, the elegant solution would be to make reductions use multiple cores on numexpr. Francesc > > Be Well > Anthony > > > Francesc > > >> >> I hope this helps. If you need other tips on speeding up the >> sum operation, please let us know. >> >> Be Well >> Anthony >> >> Timer unit: 1e-06 s >> >> File: pytables_expr_test.py >> Function: fn at line 66 >> Total time: 1.63254 s >> >> Line # Hits Time Per Hit % Time Line Contents >> ============================================================== >> 66 def fn(p, h5table): >> 67 ''' >> 68 actual >> function we are going to minimize. It consists of >> 69 the >> pytables Table object and a list of parameters. >> 70 ''' >> 71 1 14 14.0 0.0 uv = >> h5table.colinstances >> 72 >> 73 # store >> parameters in a dict object with names >> 74 # like p0, >> p1, p2, etc. so they can be used in >> 75 # the Expr >> object. >> 76 4 21 5.2 0.0 for i in >> xrange(len(p)): >> 77 3 19 6.3 0.0 k = >> 'p'+str(i) >> 78 3 14 4.7 0.0 uv[k] = p[i] >> 79 >> 80 # systematic >> shift on b is a polynomial in a >> 81 1 4 4.0 0.0 db = 'p0 * >> a*a + p1 * a + p2' >> 82 >> 83 # the >> element-wise function >> 84 1 6 6.0 0.0 fn_str = '(a >> - (b + %s))**2' % db >> 85 >> 86 1 16427 16427.0 1.0 expr = >> Expr(fn_str,uservars=uv) >> 87 1 21438 21438.0 1.3 expr.eval() >> 88 >> 89 # returning >> the "sum of squares" >> 90 1 1594600 1594600.0 97.7 return >> sum(expr) >> >> >> >> >> On Mon, May 14, 2012 at 1:59 PM, Johann Goetz <jg...@uc... >> <mailto:jg...@uc...>> wrote: >> >> SHORT VERSION: >> >> Please take a look at the fn() function in the attached file >> (pasted below). When I run this with 10M events or more I >> notice that the total CPU usage never goes above the >> percentage I get using single-threaded eval(). Am I at some >> other limit or can I improve performance by doing something else? >> >> LONG VERSION: >> >> I have been trying to use the tables.Expr object to speed up >> a sophisticated calculation over an entire dataset (a >> pytables Table object). The calculation took so long that I >> had to create a simple example to make sure I knew what I was >> doing. I apologize in advance for the lengthy code below, but >> I wanted the example to mimic exactly what I'm trying to do >> and to be totally self-contained. 
>> >> I have attached a file (and pasted it below) in which I >> create a hdf5 file with a single large Table of two columns. >> As you can see, I'm not worried about writing speed at all - >> I'm concerned about read speed. >> >> I would like to draw your attention to the fn() function. >> This is where I evaluate a "chi-squared" value on the >> dataset. My strategy is to populate the >> "h5table.colinstances" dict object with several parameters >> which I call p0, p1, etc and then create the Expr object >> using these and the column names from the Table. >> >> If I create 10M rows (77 MB file) in the Table (with the >> command below), the evaluation seems to be CPU bound (one of >> my cores is at 100% - the others are idle) and it takes about >> 7 seconds (about 10 MB/s). Similarly, I get about 70 seconds >> for 100M events. >> >> python pytables_expr_test.py 10000000 >> python pytables_expr_test.py 100000000 >> >> So my question: It seems to me that I am not fully using the >> CPU power available on my computer (see next paragraph). Am I >> missing something or doing something wrong in the fn() >> function below? >> >> A few side-notes: My hard-disk is capable of over 200 MB/s in >> sequential reading (sustained and tested with large files >> using the iozone program), I have two 4-core CPU's on this >> machine but the total CPU usage during eval() never goes >> above the percentage I get using single-threaded mode with >> "numexpr.set_num_threads(1)". >> >> I am using pytables 2.3.1 and numexpr 2.0.1 >> >> -- >> Johann T. Goetz, PhD. >> <http://sites.google.com/site/theodoregoetz/> >> jg...@uc... <mailto:jg...@uc...> >> Nefkens Group, UCLA Dept. of Physics & Astronomy >> Hall-B, Jefferson Lab, Newport News, VA >> >> >> ### BEGIN file: pytables_expr_test.py >> >> from tables import openFile, Expr >> >> ### Control of the number of threads used when issuing the >> ### Expr::eval() command >> #import numexpr >> #numexpr.set_num_threads(2) >> >> def create_ntuple_file(filename, npoints, pmodel): >> ''' >> create an hdf5 file with a single table which contains >> npoints number of rows of type row_t (defined below) >> ''' >> from numpy import random, poly1d >> from tables import IsDescription, Float32Col >> >> class row_t(IsDescription): >> ''' >> the rows of the table to be created >> ''' >> a = Float32Col() >> b = Float32Col() >> >> def append_row(h5row, pmodel): >> ''' >> consider this a single "event" being appended >> to the dataset (table) >> ''' >> h5row['a'] = random.uniform(0,10) >> >> h5row['b'] = h5row['a'] # reality (or model) >> h5row['b'] = h5row['b'] - poly1d(pmodel)(h5row['a']) >> # systematics >> h5row['b'] = h5row['b'] + random.normal(0,0.1) # noise >> >> h5row.append() >> >> h5file = openFile(filename, 'w') >> h5table = h5file.createTable('/', 'table', row_t, "Data") >> h5row = h5table.row >> >> # recording data to file... >> for n in xrange(npoints): >> append_row(h5row, pmodel) >> >> h5file.close() >> >> def create_ntuple_file_if_needed(filename, npoints, pmodel): >> ''' >> looks to see if the file is already there and if so, >> it makes sure its the right size. Otherwise, it >> removes the existing file and creates a new one. 
>> ''' >> from os import path, remove >> >> print 'model parameters:', pmodel >> >> if path.exists(filename): >> h5file = openFile(filename, 'r') >> h5table = h5file.root.table >> if len(h5table) != npoints: >> h5file.close() >> remove(filename) >> >> if not path.exists(filename): >> create_ntuple_file(filename, npoints, pmodel) >> >> def fn(p, h5table): >> ''' >> actual function we are going to minimize. It consists of >> the pytables Table object and a list of parameters. >> ''' >> uv = h5table.colinstances >> >> # store parameters in a dict object with names >> # like p0, p1, p2, etc. so they can be used in >> # the Expr object. >> for i in xrange(len(p)): >> k = 'p'+str(i) >> uv[k] = p[i] >> >> # systematic shift on b is a polynomial in a >> db = 'p0 * a*a + p1 * a + p2' >> >> # the element-wise function >> fn_str = '(a - (b + %s))**2' % db >> >> expr = Expr(fn_str,uservars=uv) >> expr.eval() >> >> # returning the "sum of squares" >> return sum(expr) >> >> if __name__ == '__main__': >> ''' >> usage: >> python pytables_expr_test.py [npoints] >> >> Hint: try this with 10M points >> ''' >> from sys import argv >> from time import time >> >> npoints = 1000000 >> if len(argv) > 1: >> npoints = int(argv[1]) >> >> filename = 'tmp.'+str(npoints)+'.hdf5' >> >> pmodel = [-0.04,0.002,0.001] >> >> print 'creating file (if it doesn\'t exist)...' >> create_ntuple_file_if_needed(filename, npoints, pmodel) >> >> h5file = openFile(filename, 'r') >> h5table = h5file.root.table >> >> print 'evaluating function' >> starttime = time() >> print fn([0.,0.,0.], h5table) >> print 'evaluated file in',time()-starttime,'seconds.' >> >> #EOF >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. >> Discussions >> will include endpoint security, mobile security and the >> latest in malware >> threats. >> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> <mailto:Pyt...@li...> >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats.http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> >> >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... <mailto:Pyt...@li...> >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... 
> <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
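Pulling Francesc's suggestion out of the quoted thread into one place, the corrected fn() would look roughly like the sketch below. The parameter names p0..p2 and the columns a and b come from Johann's script; the only changes are that the evaluated array is actually bound and the reduction is done with NumPy rather than Python's built-in sum().

from tables import Expr

def fn(p, h5table):
    '''Sum of squares of (a - (b + polynomial in a)) over the whole table.'''
    uv = h5table.colinstances
    for i in xrange(len(p)):
        uv['p' + str(i)] = p[i]           # expose p0, p1, p2 to the expression

    db = 'p0 * a*a + p1 * a + p2'         # systematic shift on b
    expr = Expr('(a - (b + %s))**2' % db, uservars=uv)

    # eval() returns a NumPy array; ndarray.sum() reduces it in C, whereas
    # the built-in sum() iterates over it one element at a time.
    return expr.eval().sum()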
From: Anthony S. <sc...@gm...> - 2012-05-14 20:12:56
|
On Mon, May 14, 2012 at 3:05 PM, Francesc Alted <fa...@py...> wrote: [snip] However, do not expect to use all your cores at full speed in this cases, > as the reductions in numexpr can only make use of one thread (this is > because this has not been implemented yet, not due to a intrinsic > limitation of numexpr). > Hello Francesc, Not to side track the discussion too much, but is there a ticket open for this in numexpr? It seems that at least for certain reductions (sum, mult, etc), splitting this up over many cores would be pretty easy. I may to wrong about this though ;) Be Well Anthony > > Francesc > > > > > I hope this helps. If you need other tips on speeding up the > sum operation, please let us know. > > Be Well > Anthony > > Timer unit: 1e-06 s > > File: pytables_expr_test.py > Function: fn at line 66 > Total time: 1.63254 s > > Line # Hits Time Per Hit % Time Line Contents > ============================================================== > 66 def fn(p, h5table): > 67 ''' > 68 actual function > we are going to minimize. It consists of > 69 the pytables > Table object and a list of parameters. > 70 ''' > 71 1 14 14.0 0.0 uv = > h5table.colinstances > 72 > 73 # store parameters in > a dict object with names > 74 # like p0, p1, p2, > etc. so they can be used in > 75 # the Expr object. > 76 4 21 5.2 0.0 for i in > xrange(len(p)): > 77 3 19 6.3 0.0 k = 'p'+str(i) > 78 3 14 4.7 0.0 uv[k] = p[i] > 79 > 80 # systematic shift on > b is a polynomial in a > 81 1 4 4.0 0.0 db = 'p0 * a*a + p1 > * a + p2' > 82 > 83 # the element-wise > function > 84 1 6 6.0 0.0 fn_str = '(a - (b + > %s))**2' % db > 85 > 86 1 16427 16427.0 1.0 expr = > Expr(fn_str,uservars=uv) > 87 1 21438 21438.0 1.3 expr.eval() > 88 > 89 # returning the "sum > of squares" > 90 1 1594600 1594600.0 97.7 return sum(expr) > > > > > On Mon, May 14, 2012 at 1:59 PM, Johann Goetz <jg...@uc...> wrote: > >> SHORT VERSION: >> >> Please take a look at the fn() function in the attached file (pasted >> below). When I run this with 10M events or more I notice that the total CPU >> usage never goes above the percentage I get using single-threaded eval(). >> Am I at some other limit or can I improve performance by doing something >> else? >> >> LONG VERSION: >> >> I have been trying to use the tables.Expr object to speed up a >> sophisticated calculation over an entire dataset (a pytables Table object). >> The calculation took so long that I had to create a simple example to make >> sure I knew what I was doing. I apologize in advance for the lengthy code >> below, but I wanted the example to mimic exactly what I'm trying to do and >> to be totally self-contained. >> >> I have attached a file (and pasted it below) in which I create a hdf5 >> file with a single large Table of two columns. As you can see, I'm not >> worried about writing speed at all - I'm concerned about read speed. >> >> I would like to draw your attention to the fn() function. This is where I >> evaluate a "chi-squared" value on the dataset. My strategy is to populate >> the "h5table.colinstances" dict object with several parameters which I call >> p0, p1, etc and then create the Expr object using these and the column >> names from the Table. >> >> If I create 10M rows (77 MB file) in the Table (with the command below), >> the evaluation seems to be CPU bound (one of my cores is at 100% - the >> others are idle) and it takes about 7 seconds (about 10 MB/s). Similarly, I >> get about 70 seconds for 100M events. 
>> >> python pytables_expr_test.py 10000000 >> python pytables_expr_test.py 100000000 >> >> So my question: It seems to me that I am not fully using the CPU power >> available on my computer (see next paragraph). Am I missing something or >> doing something wrong in the fn() function below? >> >> A few side-notes: My hard-disk is capable of over 200 MB/s in sequential >> reading (sustained and tested with large files using the iozone program), I >> have two 4-core CPU's on this machine but the total CPU usage during eval() >> never goes above the percentage I get using single-threaded mode with >> "numexpr.set_num_threads(1)". >> >> I am using pytables 2.3.1 and numexpr 2.0.1 >> >> -- >> Johann T. Goetz, PhD. <http://sites.google.com/site/theodoregoetz/> >> jg...@uc... >> Nefkens Group, UCLA Dept. of Physics & Astronomy >> Hall-B, Jefferson Lab, Newport News, VA >> >> >> ### BEGIN file: pytables_expr_test.py >> >> from tables import openFile, Expr >> >> ### Control of the number of threads used when issuing the >> ### Expr::eval() command >> #import numexpr >> #numexpr.set_num_threads(2) >> >> def create_ntuple_file(filename, npoints, pmodel): >> ''' >> create an hdf5 file with a single table which contains >> npoints number of rows of type row_t (defined below) >> ''' >> from numpy import random, poly1d >> from tables import IsDescription, Float32Col >> >> class row_t(IsDescription): >> ''' >> the rows of the table to be created >> ''' >> a = Float32Col() >> b = Float32Col() >> >> def append_row(h5row, pmodel): >> ''' >> consider this a single "event" being appended >> to the dataset (table) >> ''' >> h5row['a'] = random.uniform(0,10) >> >> h5row['b'] = h5row['a'] # reality (or model) >> h5row['b'] = h5row['b'] - poly1d(pmodel)(h5row['a']) # systematics >> h5row['b'] = h5row['b'] + random.normal(0,0.1) # noise >> >> h5row.append() >> >> h5file = openFile(filename, 'w') >> h5table = h5file.createTable('/', 'table', row_t, "Data") >> h5row = h5table.row >> >> # recording data to file... >> for n in xrange(npoints): >> append_row(h5row, pmodel) >> >> h5file.close() >> >> def create_ntuple_file_if_needed(filename, npoints, pmodel): >> ''' >> looks to see if the file is already there and if so, >> it makes sure its the right size. Otherwise, it >> removes the existing file and creates a new one. >> ''' >> from os import path, remove >> >> print 'model parameters:', pmodel >> >> if path.exists(filename): >> h5file = openFile(filename, 'r') >> h5table = h5file.root.table >> if len(h5table) != npoints: >> h5file.close() >> remove(filename) >> >> if not path.exists(filename): >> create_ntuple_file(filename, npoints, pmodel) >> >> def fn(p, h5table): >> ''' >> actual function we are going to minimize. It consists of >> the pytables Table object and a list of parameters. >> ''' >> uv = h5table.colinstances >> >> # store parameters in a dict object with names >> # like p0, p1, p2, etc. so they can be used in >> # the Expr object. 
>> for i in xrange(len(p)): >> k = 'p'+str(i) >> uv[k] = p[i] >> >> # systematic shift on b is a polynomial in a >> db = 'p0 * a*a + p1 * a + p2' >> >> # the element-wise function >> fn_str = '(a - (b + %s))**2' % db >> >> expr = Expr(fn_str,uservars=uv) >> expr.eval() >> >> # returning the "sum of squares" >> return sum(expr) >> >> if __name__ == '__main__': >> ''' >> usage: >> python pytables_expr_test.py [npoints] >> >> Hint: try this with 10M points >> ''' >> from sys import argv >> from time import time >> >> npoints = 1000000 >> if len(argv) > 1: >> npoints = int(argv[1]) >> >> filename = 'tmp.'+str(npoints)+'.hdf5' >> >> pmodel = [-0.04,0.002,0.001] >> >> print 'creating file (if it doesn\'t exist)...' >> create_ntuple_file_if_needed(filename, npoints, pmodel) >> >> h5file = openFile(filename, 'r') >> h5table = h5file.root.table >> >> print 'evaluating function' >> starttime = time() >> print fn([0.,0.,0.], h5table) >> print 'evaluated file in',time()-starttime,'seconds.' >> >> #EOF >> >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > Pytables-users mailing lis...@li...https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > -- > Francesc Alted > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
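Until reductions inside numexpr can use several threads, one possible workaround (not proposed in the thread, so treat it as a sketch) is to keep the element-wise arithmetic in numexpr, which is multi-threaded, and accumulate the sum block by block; this also keeps memory bounded for tables that do not fit in RAM. The file name and column names are assumed from Johann's test script.

import numexpr
import tables

h5file = tables.openFile('tmp.10000000.hdf5', 'r')    # name produced by the test script
tbl = h5file.root.table

p0, p1, p2 = 0.0, 0.0, 0.0
total = 0.0
chunk = 1000000                                        # rows held in memory at once
for start in xrange(0, tbl.nrows, chunk):
    block = tbl.read(start, start + chunk)             # structured array with 'a' and 'b'
    # numexpr parallelizes the element-wise part; only the per-block .sum()
    # runs single-threaded in NumPy.
    total += numexpr.evaluate(
        '(a - (b + p0*a*a + p1*a + p2))**2',
        local_dict={'a': block['a'], 'b': block['b'],
                    'p0': p0, 'p1': p1, 'p2': p2}).sum()

print(total)
h5file.close()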
From: Francesc A. <fa...@py...> - 2012-05-14 20:05:13
|
On 5/14/12 2:51 PM, Anthony Scopatz wrote: > Hi Johann, > > Thanks for bring this up. I believe that I have determined that this > is not a PyTables / pthreads issue. Doing some profiling > npoints=1000000, I found that most of the time (97%) was being spent > in the sum() call (see below). This ratio doesn't change much with > different values of npoints. Since there is no implicit parallelism > here, I would recommend using numpy.sum() instead of Python's. Also, I have noticed that Johann is not using tables.Expr optimally, i.e. this code: fn_str = '(a - (b + %s))**2' % db expr = Expr(fn_str,uservars=uv) expr.eval() # [1] # returning the "sum of squares" return sum(expr) performs the evaluation of the expression and returns it as a NumPy object [1], but the result is not bound to any variable, so it is lost. A better version would be: fn_str = 'sum((a - (b + %s))**2)' % db expr = Expr(fn_str,uservars=uv) # returning the "sum of squares" return expr.eval() However, do not expect to use all your cores at full speed in this cases, as the reductions in numexpr can only make use of one thread (this is because this has not been implemented yet, not due to a intrinsic limitation of numexpr). Francesc > > I hope this helps. If you need other tips on speeding up the > sum operation, please let us know. > > Be Well > Anthony > > Timer unit: 1e-06 s > > File: pytables_expr_test.py > Function: fn at line 66 > Total time: 1.63254 s > > Line # Hits Time Per Hit % Time Line Contents > ============================================================== > 66 def fn(p, h5table): > 67 ''' > 68 actual > function we are going to minimize. It consists of > 69 the pytables > Table object and a list of parameters. > 70 ''' > 71 1 14 14.0 0.0 uv = > h5table.colinstances > 72 > 73 # store > parameters in a dict object with names > 74 # like p0, p1, > p2, etc. so they can be used in > 75 # the Expr object. > 76 4 21 5.2 0.0 for i in > xrange(len(p)): > 77 3 19 6.3 0.0 k = 'p'+str(i) > 78 3 14 4.7 0.0 uv[k] = p[i] > 79 > 80 # systematic > shift on b is a polynomial in a > 81 1 4 4.0 0.0 db = 'p0 * a*a + > p1 * a + p2' > 82 > 83 # the > element-wise function > 84 1 6 6.0 0.0 fn_str = '(a - (b > + %s))**2' % db > 85 > 86 1 16427 16427.0 1.0 expr = > Expr(fn_str,uservars=uv) > 87 1 21438 21438.0 1.3 expr.eval() > 88 > 89 # returning the > "sum of squares" > 90 1 1594600 1594600.0 97.7 return sum(expr) > > > > > On Mon, May 14, 2012 at 1:59 PM, Johann Goetz <jg...@uc... > <mailto:jg...@uc...>> wrote: > > SHORT VERSION: > > Please take a look at the fn() function in the attached file > (pasted below). When I run this with 10M events or more I notice > that the total CPU usage never goes above the percentage I get > using single-threaded eval(). Am I at some other limit or can I > improve performance by doing something else? > > LONG VERSION: > > I have been trying to use the tables.Expr object to speed up a > sophisticated calculation over an entire dataset (a pytables Table > object). The calculation took so long that I had to create a > simple example to make sure I knew what I was doing. I apologize > in advance for the lengthy code below, but I wanted the example to > mimic exactly what I'm trying to do and to be totally self-contained. > > I have attached a file (and pasted it below) in which I create a > hdf5 file with a single large Table of two columns. As you can > see, I'm not worried about writing speed at all - I'm concerned > about read speed. > > I would like to draw your attention to the fn() function. 
This is > where I evaluate a "chi-squared" value on the dataset. My strategy > is to populate the "h5table.colinstances" dict object with several > parameters which I call p0, p1, etc and then create the Expr > object using these and the column names from the Table. > > If I create 10M rows (77 MB file) in the Table (with the command > below), the evaluation seems to be CPU bound (one of my cores is > at 100% - the others are idle) and it takes about 7 seconds (about > 10 MB/s). Similarly, I get about 70 seconds for 100M events. > > python pytables_expr_test.py 10000000 > python pytables_expr_test.py 100000000 > > So my question: It seems to me that I am not fully using the CPU > power available on my computer (see next paragraph). Am I missing > something or doing something wrong in the fn() function below? > > A few side-notes: My hard-disk is capable of over 200 MB/s in > sequential reading (sustained and tested with large files using > the iozone program), I have two 4-core CPU's on this machine but > the total CPU usage during eval() never goes above the percentage > I get using single-threaded mode with "numexpr.set_num_threads(1)". > > I am using pytables 2.3.1 and numexpr 2.0.1 > > -- > Johann T. Goetz, PhD. <http://sites.google.com/site/theodoregoetz/> > jg...@uc... <mailto:jg...@uc...> > Nefkens Group, UCLA Dept. of Physics & Astronomy > Hall-B, Jefferson Lab, Newport News, VA > > > ### BEGIN file: pytables_expr_test.py > > from tables import openFile, Expr > > ### Control of the number of threads used when issuing the > ### Expr::eval() command > #import numexpr > #numexpr.set_num_threads(2) > > def create_ntuple_file(filename, npoints, pmodel): > ''' > create an hdf5 file with a single table which contains > npoints number of rows of type row_t (defined below) > ''' > from numpy import random, poly1d > from tables import IsDescription, Float32Col > > class row_t(IsDescription): > ''' > the rows of the table to be created > ''' > a = Float32Col() > b = Float32Col() > > def append_row(h5row, pmodel): > ''' > consider this a single "event" being appended > to the dataset (table) > ''' > h5row['a'] = random.uniform(0,10) > > h5row['b'] = h5row['a'] # reality (or model) > h5row['b'] = h5row['b'] - poly1d(pmodel)(h5row['a']) # > systematics > h5row['b'] = h5row['b'] + random.normal(0,0.1) # noise > > h5row.append() > > h5file = openFile(filename, 'w') > h5table = h5file.createTable('/', 'table', row_t, "Data") > h5row = h5table.row > > # recording data to file... > for n in xrange(npoints): > append_row(h5row, pmodel) > > h5file.close() > > def create_ntuple_file_if_needed(filename, npoints, pmodel): > ''' > looks to see if the file is already there and if so, > it makes sure its the right size. Otherwise, it > removes the existing file and creates a new one. > ''' > from os import path, remove > > print 'model parameters:', pmodel > > if path.exists(filename): > h5file = openFile(filename, 'r') > h5table = h5file.root.table > if len(h5table) != npoints: > h5file.close() > remove(filename) > > if not path.exists(filename): > create_ntuple_file(filename, npoints, pmodel) > > def fn(p, h5table): > ''' > actual function we are going to minimize. It consists of > the pytables Table object and a list of parameters. > ''' > uv = h5table.colinstances > > # store parameters in a dict object with names > # like p0, p1, p2, etc. so they can be used in > # the Expr object. 
> for i in xrange(len(p)): > k = 'p'+str(i) > uv[k] = p[i] > > # systematic shift on b is a polynomial in a > db = 'p0 * a*a + p1 * a + p2' > > # the element-wise function > fn_str = '(a - (b + %s))**2' % db > > expr = Expr(fn_str,uservars=uv) > expr.eval() > > # returning the "sum of squares" > return sum(expr) > > if __name__ == '__main__': > ''' > usage: > python pytables_expr_test.py [npoints] > > Hint: try this with 10M points > ''' > from sys import argv > from time import time > > npoints = 1000000 > if len(argv) > 1: > npoints = int(argv[1]) > > filename = 'tmp.'+str(npoints)+'.hdf5' > > pmodel = [-0.04,0.002,0.001] > > print 'creating file (if it doesn\'t exist)...' > create_ntuple_file_if_needed(filename, npoints, pmodel) > > h5file = openFile(filename, 'r') > h5table = h5file.root.table > > print 'evaluating function' > starttime = time() > print fn([0.,0.,0.], h5table) > print 'evaluated file in',time()-starttime,'seconds.' > > #EOF > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
From: Anthony S. <sc...@gm...> - 2012-05-14 19:51:37
|
Hi Johann, Thanks for bring this up. I believe that I have determined that this is not a PyTables / pthreads issue. Doing some profiling npoints=1000000, I found that most of the time (97%) was being spent in the sum() call (see below). This ratio doesn't change much with different values of npoints. Since there is no implicit parallelism here, I would recommend using numpy.sum() instead of Python's. I hope this helps. If you need other tips on speeding up the sum operation, please let us know. Be Well Anthony Timer unit: 1e-06 s File: pytables_expr_test.py Function: fn at line 66 Total time: 1.63254 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 66 def fn(p, h5table): 67 ''' 68 actual function we are going to minimize. It consists of 69 the pytables Table object and a list of parameters. 70 ''' 71 1 14 14.0 0.0 uv = h5table.colinstances 72 73 # store parameters in a dict object with names 74 # like p0, p1, p2, etc. so they can be used in 75 # the Expr object. 76 4 21 5.2 0.0 for i in xrange(len(p)): 77 3 19 6.3 0.0 k = 'p'+str(i) 78 3 14 4.7 0.0 uv[k] = p[i] 79 80 # systematic shift on b is a polynomial in a 81 1 4 4.0 0.0 db = 'p0 * a*a + p1 * a + p2' 82 83 # the element-wise function 84 1 6 6.0 0.0 fn_str = '(a - (b + %s))**2' % db 85 86 1 16427 16427.0 1.0 expr = Expr(fn_str,uservars=uv) 87 1 21438 21438.0 1.3 expr.eval() 88 89 # returning the "sum of squares" 90 1 1594600 1594600.0 97.7 return sum(expr) On Mon, May 14, 2012 at 1:59 PM, Johann Goetz <jg...@uc...> wrote: > SHORT VERSION: > > Please take a look at the fn() function in the attached file (pasted > below). When I run this with 10M events or more I notice that the total CPU > usage never goes above the percentage I get using single-threaded eval(). > Am I at some other limit or can I improve performance by doing something > else? > > LONG VERSION: > > I have been trying to use the tables.Expr object to speed up a > sophisticated calculation over an entire dataset (a pytables Table object). > The calculation took so long that I had to create a simple example to make > sure I knew what I was doing. I apologize in advance for the lengthy code > below, but I wanted the example to mimic exactly what I'm trying to do and > to be totally self-contained. > > I have attached a file (and pasted it below) in which I create a hdf5 file > with a single large Table of two columns. As you can see, I'm not worried > about writing speed at all - I'm concerned about read speed. > > I would like to draw your attention to the fn() function. This is where I > evaluate a "chi-squared" value on the dataset. My strategy is to populate > the "h5table.colinstances" dict object with several parameters which I call > p0, p1, etc and then create the Expr object using these and the column > names from the Table. > > If I create 10M rows (77 MB file) in the Table (with the command below), > the evaluation seems to be CPU bound (one of my cores is at 100% - the > others are idle) and it takes about 7 seconds (about 10 MB/s). Similarly, I > get about 70 seconds for 100M events. > > python pytables_expr_test.py 10000000 > python pytables_expr_test.py 100000000 > > So my question: It seems to me that I am not fully using the CPU power > available on my computer (see next paragraph). Am I missing something or > doing something wrong in the fn() function below? 
> > A few side-notes: My hard-disk is capable of over 200 MB/s in sequential > reading (sustained and tested with large files using the iozone program), I > have two 4-core CPU's on this machine but the total CPU usage during eval() > never goes above the percentage I get using single-threaded mode with > "numexpr.set_num_threads(1)". > > I am using pytables 2.3.1 and numexpr 2.0.1 > > -- > Johann T. Goetz, PhD. <http://sites.google.com/site/theodoregoetz/> > jg...@uc... > Nefkens Group, UCLA Dept. of Physics & Astronomy > Hall-B, Jefferson Lab, Newport News, VA > > > ### BEGIN file: pytables_expr_test.py > > from tables import openFile, Expr > > ### Control of the number of threads used when issuing the > ### Expr::eval() command > #import numexpr > #numexpr.set_num_threads(2) > > def create_ntuple_file(filename, npoints, pmodel): > ''' > create an hdf5 file with a single table which contains > npoints number of rows of type row_t (defined below) > ''' > from numpy import random, poly1d > from tables import IsDescription, Float32Col > > class row_t(IsDescription): > ''' > the rows of the table to be created > ''' > a = Float32Col() > b = Float32Col() > > def append_row(h5row, pmodel): > ''' > consider this a single "event" being appended > to the dataset (table) > ''' > h5row['a'] = random.uniform(0,10) > > h5row['b'] = h5row['a'] # reality (or model) > h5row['b'] = h5row['b'] - poly1d(pmodel)(h5row['a']) # systematics > h5row['b'] = h5row['b'] + random.normal(0,0.1) # noise > > h5row.append() > > h5file = openFile(filename, 'w') > h5table = h5file.createTable('/', 'table', row_t, "Data") > h5row = h5table.row > > # recording data to file... > for n in xrange(npoints): > append_row(h5row, pmodel) > > h5file.close() > > def create_ntuple_file_if_needed(filename, npoints, pmodel): > ''' > looks to see if the file is already there and if so, > it makes sure its the right size. Otherwise, it > removes the existing file and creates a new one. > ''' > from os import path, remove > > print 'model parameters:', pmodel > > if path.exists(filename): > h5file = openFile(filename, 'r') > h5table = h5file.root.table > if len(h5table) != npoints: > h5file.close() > remove(filename) > > if not path.exists(filename): > create_ntuple_file(filename, npoints, pmodel) > > def fn(p, h5table): > ''' > actual function we are going to minimize. It consists of > the pytables Table object and a list of parameters. > ''' > uv = h5table.colinstances > > # store parameters in a dict object with names > # like p0, p1, p2, etc. so they can be used in > # the Expr object. > for i in xrange(len(p)): > k = 'p'+str(i) > uv[k] = p[i] > > # systematic shift on b is a polynomial in a > db = 'p0 * a*a + p1 * a + p2' > > # the element-wise function > fn_str = '(a - (b + %s))**2' % db > > expr = Expr(fn_str,uservars=uv) > expr.eval() > > # returning the "sum of squares" > return sum(expr) > > if __name__ == '__main__': > ''' > usage: > python pytables_expr_test.py [npoints] > > Hint: try this with 10M points > ''' > from sys import argv > from time import time > > npoints = 1000000 > if len(argv) > 1: > npoints = int(argv[1]) > > filename = 'tmp.'+str(npoints)+'.hdf5' > > pmodel = [-0.04,0.002,0.001] > > print 'creating file (if it doesn\'t exist)...' > create_ntuple_file_if_needed(filename, npoints, pmodel) > > h5file = openFile(filename, 'r') > h5table = h5file.root.table > > print 'evaluating function' > starttime = time() > print fn([0.,0.,0.], h5table) > print 'evaluated file in',time()-starttime,'seconds.' 
> > #EOF > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
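The gap Anthony's profile points to is easy to reproduce in isolation. In the sketch below, a million-element float32 array stands in for what expr.eval() returns; the exact timings will vary by machine, but the built-in sum() is typically orders of magnitude slower than the NumPy reduction.

import numpy as np
from timeit import timeit

values = np.random.normal(size=1000000).astype('float32')   # stand-in for expr.eval()

t_builtin = timeit(lambda: sum(values), number=1)    # Python loops over a million scalars
t_numpy = timeit(lambda: values.sum(), number=1)     # one vectorized C reduction

print('builtin sum: %.4f s, numpy sum: %.4f s' % (t_builtin, t_numpy))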
From: Anthony S. <sc...@gm...> - 2012-05-11 14:24:05
|
Hello Nikola, In general, larger chunk sizes will increase read speed. Additionally, your problem sounds like a perfect place to use compression, which can both decrease storage space and increase read speed (use Blosc compression for this). Please refer to [1] for more information. In general, if you know *a priori* that you have a hard maximum table size that you will never go over, you can simply set your chunksize to this value. On the other hand, if you know a minimum size that you will be removing, and this size is "large enough", then it sometimes makes sense to use this as the chunksize too. Be Well Anthony 1. http://pytables.github.com/usersguide/optimization.html On Fri, May 11, 2012 at 5:15 AM, nikola stevanovic <nid...@gm...> wrote: > *Hi everyone, * > > I'm new member and it's nice to meet you all. > I need some advices about my work with pytables. The problem is next. I'm > working on some kind of database using pytables and of course hdf5 format. > I created table with *six columns, row size 92B*. One column in table is > Time32Col. This column will be *indexed*. Table *will be updated* every > couple days (rows will be appended on existing table). *Between every > update users can create queries on table and consume data*. My question > is how efficiently balance chunksize between updates, because numbers of > rows in table will be start from *0 to 10 000 000 000* during the time? > After this number I will start archiving process, i.e. for example remove > first five billions rows and store in some other table for archiving. Of > course, I need this balance because *reading speed*. So, what is most > efficient way for setting chunksize for my problem? Sorry for my english. > > * > Thanks for advice guys. > Cheers! > Nikola* > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
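A minimal sketch of the kind of setup described above, reusing the column layout from Nikola's table. The file name, compression level and expectedrows figure are placeholders, and whether Blosc plus an expectedrows hint (or an explicit chunkshape) is the best combination would have to be measured on the real data.

import tables

class Reading(tables.IsDescription):
    Device_ID = tables.StringCol(14)
    DateTime = tables.Time32Col()
    Value = tables.Float32Col()
    Status = tables.StringCol(10)

h5file = tables.openFile('readings.h5', 'w')                 # hypothetical file name
filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)

# expectedrows lets PyTables pick a chunkshape suited to a very large table;
# alternatively an explicit chunkshape=(65536,) could be passed.
table = h5file.createTable('/', 'Table_1', Reading, 'sensor readings',
                           filters=filters, expectedrows=1000000000)

table.cols.DateTime.createCSIndex()   # completely sorted index for time-range queries
h5file.close()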
From: nikola s. <nid...@gm...> - 2012-05-11 10:15:18
|
*Hi everyone,* I'm a new member and it's nice to meet you all. I need some advice about my work with PyTables. The problem is as follows. I'm working on some kind of database using PyTables and, of course, the HDF5 format. I created a table with *six columns, row size 92 B*. One column in the table is a Time32Col; this column will be *indexed*. The table *will be updated* every couple of days (rows will be appended to the existing table). *Between updates, users can run queries on the table and consume data*. My question is how to choose an efficient chunksize for this pattern, because the number of rows in the table will grow from *0 to 10 000 000 000* over time. After that I will start an archiving process, i.e. remove, for example, the first five billion rows and store them in another table for archiving. Of course, I need this balance because of *reading speed*. So, what is the most efficient way to set the chunksize for my problem? Sorry for my English. *Thanks for the advice, guys. Cheers! Nikola* |
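For the archiving step sketched in the question (moving the oldest rows out of the live table), the stock Table API is in principle enough; the fragment below only illustrates the calls involved, with invented file and node names, and removeRows() over billions of rows would be slow, so the real procedure would need tuning on the actual data.

import tables

live = tables.openFile('readings.h5', 'a')       # hypothetical live file
archive = tables.openFile('archive.h5', 'a')     # hypothetical archive file

tbl = live.root.Table_1
n_oldest = 5000000000                            # rows to move out of the live table

# copy() honours start/stop, so only the oldest block is written to the archive.
tbl.copy(archive.root, 'Table_1_old', start=0, stop=n_oldest)

# Then drop the same range from the live table.
tbl.removeRows(0, n_oldest)
live.flush()

live.close()
archive.close()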