From: chuck c. <cc...@zi...> - 2004-04-28 17:15:43
I just started using PyTables last week, and I'd not come across HDF5 before, so I'm still getting up to speed. However, I think using PyTables to complement a system I already have is going to work out well.

I am responsible for the backend of a large website. This backend uses Jini, is distributed across 20 machines, and could be classified as a service-oriented architecture. Each machine runs at least one VM (possibly up to four), and each VM has anywhere from 15-30 services running. Each VM prints out performance statistics (total number of requests; number of successful, failed, or active requests; processing time for each request; time spent waiting for other backend systems to respond; etc.) for every running service at one-minute intervals. As you can imagine, this is a lot of data, but we have a strict SLA which, in addition to specifying uptime requirements, has response-time requirements.

Each month I generate a report based on these files. Initially this was done entirely in Python. Then I moved to loading the files into MySQL, and later into PostgreSQL. Now I've decided to store the actual log files in HDF5 format and use PyTables to compute hourly, daily, weekly, and monthly "roll-ups" of averages and standard deviations of response times. These roll-ups will be stored in PostgreSQL, since many people query them; relatively few people query the log entry for a specific minute.

Currently I'm converting the log files from pipe-delimited text to HDF5 using PyTables. Eventually I'd like the application to generate the files in HDF5 format to avoid the transform step. Then I plan to use PyTables to compute the "roll-ups". So far everything is working out well, but I can't say that I've used PyTables enough yet to make any suggestions. As I get deeper into it, I'll be sure to post.

Also, I'd like to express my thanks for the great documentation that goes along with this project.
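The roll-up step described above can be sketched in plain Python. The pipe-delimited layout, field names like the response-time column, and the sample values here are hypothetical illustrations, not taken from the actual system:

```python
import statistics
from collections import defaultdict

# Hypothetical pipe-delimited log format:
# timestamp|service|total|success|failed|active|resp_ms
SAMPLE = """\
2004-04-28 17:01|auth|120|118|2|5|41.0
2004-04-28 17:02|auth|130|129|1|6|39.5
2004-04-28 18:01|auth|110|110|0|4|44.0
"""

def hourly_rollup(lines):
    """Group response times by (hour, service); return mean and std dev."""
    buckets = defaultdict(list)
    for line in lines:
        ts, service, *_counts, resp = line.split("|")
        hour = ts[:13]                      # "YYYY-MM-DD HH"
        buckets[(hour, service)].append(float(resp))
    return {
        key: (statistics.mean(vals),
              statistics.pstdev(vals))      # population std dev
        for key, vals in buckets.items()
    }

rollup = hourly_rollup(SAMPLE.splitlines())
```

The same grouping generalizes to daily, weekly, and monthly windows by widening the key slice; in practice the per-minute rows would come out of the HDF5 table rather than a text sample.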
It is responsible for me getting this far without having to post to the list. However, I do have a few questions:

1. Is there a way to get a Column of a table returned as a numarray, so I can compute means and standard deviations with NumPy? Or do I just read the column as a Python list and then create a numarray out of it?

2. One of my fields is a timestamp, but I don't see such a datatype in Appendix A. Upon digging further, it appears that timestamps are supported by the HDF5 spec but have not yet been implemented. Is this correct? If so, how are other people getting around this? I use the excellent mx.DateTime library and am heading down the path of calling .ticks() on any timestamp fields and storing the result as a Float32Col.

Thanks,
chuck
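One caveat worth noting on the Float32Col plan (my observation, not something from the thread): a 32-bit float carries only 24 significant bits, so epoch ticks near 1.08e9 cannot be stored to one-second precision; a Float64 (or integer) column round-trips them exactly. A quick stdlib check of the rounding:

```python
import struct

ticks = 1083172543.0  # roughly this post's date as epoch seconds

# Round-trip through 32-bit storage, as a Float32Col would do:
as_f32 = struct.unpack("<f", struct.pack("<f", ticks))[0]
error = abs(as_f32 - ticks)
# Near 1e9 the spacing between adjacent float32 values is
# 2**(30 - 23) = 128 seconds, so the stored value can be off
# by as much as ~64 seconds.

# A 64-bit column stores the same value exactly:
as_f64 = struct.unpack("<d", struct.pack("<d", ticks))[0]
```

So if minute-level (or better) precision matters for the SLA reports, Float64Col or an integer column is the safer home for .ticks() values.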