From: chuck c. <cc...@zi...> - 2004-04-28 17:15:43
I just started using PyTables last week and I hadn't come across HDF5 before, so I'm still getting up to speed. However, I think using PyTables to complement a system I already have is going to work out well.

I am responsible for the backend of a large website. This backend uses Jini, is distributed across 20 machines, and could be classified as a Service-Oriented Architecture. Each machine runs at least one VM (but possibly up to four), and each VM has anywhere from 15-30 services running. Each VM prints out performance statistics (total number of requests; number of successful, failed, or active requests; processing time for each request; time spent waiting for other backend systems to respond; etc.) for every running service at one-minute intervals. As you can imagine, this is a lot of data, but we have a strict SLA which, in addition to specifying uptime requirements, has response-time requirements.

Each month I generate a report based on these files. Initially this was done entirely in Python. Then I moved to loading these files into MySQL, and then into PostgreSQL. Now I've decided to store the actual log files in HDF5 format and use PyTables to compute hourly, daily, weekly, and monthly "roll-ups" of averages and standard deviations of response times. These roll-ups will be stored in PostgreSQL since many people query them; relatively few people query out a log-file entry for a specific minute.

Currently I'm converting the log files from a pipe-delimited format to HDF5 using PyTables. Eventually I'd like to have the application generate the file in HDF5 format to avoid the transform step. Then I plan to use PyTables to compute the "roll-ups".

So far everything is working out well, but I can't say that I've used PyTables enough yet to make any suggestions. As I get deeper into it I'll be sure to post. Also, I'd like to express my thanks for the great documentation that goes along with this project.
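[Editor's note: the pipe-delimited-to-HDF5 conversion step described above can be sketched as follows. The actual field layout of the log files is not shown in the post, so the field names and sample line below are assumptions for illustration only.]

```python
import csv
import io

# Hypothetical pipe-delimited log line: service name, timestamp (epoch
# seconds), total/successful/failed/active request counts, and mean
# processing time in ms. The real schema is not given in the post.
sample = "auth-service|1083170100|240|235|5|12|48.7\n"

reader = csv.reader(io.StringIO(sample), delimiter="|")
for service, ts, total, ok, failed, active, mean_ms in reader:
    # Each dict below corresponds to one row that would be appended
    # to an HDF5 table (string, float, and integer columns).
    record = {
        "service": service,
        "timestamp": float(ts),
        "total": int(total),
        "ok": int(ok),
        "failed": int(failed),
        "active": int(active),
        "mean_ms": float(mean_ms),
    }
    print(record["service"], record["total"], record["mean_ms"])
```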
It is responsible for me getting this far without having to post to the list.

However, I do have a few questions:

Is there a way to get a Column of a table returned as a numarray so I can compute means and standard deviations with NumPy? Or do I just read the column as a Python list and then create a numarray out of it?

One of my fields is a timestamp. I don't see such a datatype in Appendix A. However, upon digging further it appears that timestamps are supported by the HDF5 spec but have not yet been implemented in PyTables. Is this correct? If so, how are other people getting around this? I use the excellent mx.DateTime library and am heading down the path of calling .ticks() on any timestamp fields and storing the result as a Float32Col.

Thanks,
chuck
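[Editor's note: the fallback Chuck describes (read the column as a Python list, then build an array from it) looks like the sketch below. It uses NumPy, numarray's modern successor, since this thread predates NumPy; the column values are made up for illustration.]

```python
import numpy as np

# A column read back as a plain Python list, e.g.
# resp_ms = [row["resp_ms"] for row in table]  (hypothetical field name)
resp_ms = [12.0, 15.5, 9.25, 20.0, 13.25]

col = np.array(resp_ms)
print(col.mean())  # arithmetic mean -> 14.0
print(col.std())   # population standard deviation (ddof=0)
```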
From: Francesc A. <fa...@py...> - 2004-04-28 21:31:17
Hi Chuck,

On Wednesday, 28 April 2004 19:15, chuck clark wrote:

> So far everything is working out well, but I can't say that I've used
> PyTables enough yet to make any suggestions. As I get deeper into it I'll
> be sure to post. Also, I'd like to express my thanks for the great
> documentation that goes along with this project. It is responsible for me
> getting this far without having to post to the list.

Documentation is an area that I very much want to take care of. I'm happy that you like it :).

> However, I do have a few questions:
> Is there a way to get a Column of a table returned as a numarray so I can
> compute means and standard deviations with NumPy? Or do I just read the
> column as a Python list and then create a numarray out of it?

There are several ways to do that. I consider the one introduced in the latest release (0.8) to be the most convenient in this case. Say that your column is named "energy" and you want to compute the mean and standard deviation. Look at this:

>>> fileh = tables.openFile("test.h5", "r")
>>> fileh.root.tuple0.cols.energy
/tuple0.cols.energy (Column(1,), Float64)  # elements of the column are scalars
>>> fileh.root.tuple0.cols.energy[:].nelements()
20000  # 20,000 elements in the column
>>> fileh.root.tuple0.cols.energy[:].mean()
19999.0  # self-explanatory
>>> fileh.root.tuple0.cols.energy[:].stddev()
11547.294055318762  # self-explanatory

This is explained in the documentation too (with some examples) in:

http://pytables.sourceforge.net/html-doc/usersguide4.html#subsection4.5.5

and

http://pytables.sourceforge.net/html-doc/usersguide4.html#subsection4.5.6

> One of my fields is a timestamp. I don't see such a datatype in Appendix
> A. However, upon digging further it appears that timestamps are supported
> by the HDF5 spec but have not yet been implemented. Is this correct?

Yes, it is. I haven't had the (free) time to implement that yet.

> If so, how are other people getting around this?
> I use the excellent mx.DateTime library and am heading down the path of
> calling .ticks() on any timestamp fields and storing that as a Float32Col.

That could be a nice workaround, of course.

Regards,

--
Francesc Alted
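[Editor's note: the .ticks() workaround can be sketched with the standard library alone, since mx.DateTime's .ticks() likewise yields seconds since the epoch as a float. One caveat worth flagging: a 32-bit float cannot represent a ~1.08e9-second epoch value to better than roughly two-minute resolution, so a Float64Col preserves the timestamp exactly where a Float32Col would not.]

```python
import struct
from datetime import datetime, timezone

# A timestamp from one of the log entries (value chosen for illustration).
t = datetime(2004, 4, 28, 17, 15, 43, tzinfo=timezone.utc)
ticks = t.timestamp()  # float seconds since the epoch, storable as a float column

# Round-trip back to a datetime when reading the table.
restored = datetime.fromtimestamp(ticks, tz=timezone.utc)
assert restored == t

# Precision caveat: squeeze the value through a 32-bit float and measure
# the rounding error (float32 spacing near 2**30 is 128 seconds).
as_f32 = struct.unpack("f", struct.pack("f", ticks))[0]
print(abs(as_f32 - ticks))  # tens of seconds of error with float32
```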