From: David S. <da...@dr...> - 2004-01-20 17:42:32

Hello,

I am prototyping an application with pytables acting as the storage layer,
and I want to add a simple index of the data being collected in a table.
Data collection for this table will occur over weeks and months. My plan is
to store a timestamp for each day of recorded data in a dictionary attached
to the table, with the day timestamp as the key and the index of the first
row added to the table for that day as the value. The goal is to be able to
look up a timestamp in the dictionary, get back the row index, and then use
it to select or remove the records for particular days.

What I have done is add an attribute like so:

    table.attrs.dayIndex = {}

When I first create the hdf5 file, create the groups and tables, and add
data, the dayIndex attribute functions as a dictionary should. But after I
flush this data to the hdf5 file, close it and reopen it, the attribute
comes back as a string and no longer functions as a dictionary.

My question is: how can I accomplish this, so that after reopening the hdf5
file I can still use Python dictionaries stored as attributes? Are there
other methods that would be as efficient or more so, such as storing
another table, as a leaf of this one, that holds the items I want to index?

Thanks,
David Sumner
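One possible workaround for the question above (a minimal sketch, not from
the thread itself; it uses the modern PyTables API, and the file and table
names are illustrative) is to pickle the dictionary explicitly, so that
what comes back from the reopened file is a byte string that can always be
turned back into a dict. Recent PyTables versions also pickle non-native
attribute values transparently, but an explicit round-trip keeps the
behavior obvious:

    import pickle
    import tables

    # Write: store the day index as an explicitly pickled attribute.
    with tables.open_file("data.h5", "a") as f:
        table = f.root.mytable
        table.attrs.dayIndexPickle = pickle.dumps({1074556800: 0})

    # Read: unpickle the attribute after reopening the file.
    with tables.open_file("data.h5", "r") as f:
        day_index = pickle.loads(f.root.mytable.attrs.dayIndexPickle)
        print(day_index[1074556800])  # -> 0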
From: Nicola L. <ni...@te...> - 2003-12-05 11:12:13

> You can regard this as a temporary workaround

...but a quite nice one, for sure... :^)

> until I find time to implement fully variable-length columns in Tables.

...and that'll be even better! Thanks, Francesc. :^)))

--
"I can think of two reasons for running Windows software. A: Your fascist
employer is a nitwit and approves of nonsense like Lotus Notes, which
oughta be illegal. B: You're running games." -- Helgi Hrafn Gunnarsson

Nicola Larosa - ni...@te...
From: Francesc A. <fa...@op...> - 2003-12-05 08:23:38

On Friday, 5 December 2003 04:37, Joseph Turian wrote:
> Francesc,
>
> > Good news is that VLArray (i.e. variable length arrays) objects will be
> > introduced in the forthcoming 0.8 release (among other niceties, like
> > enlargeable Array objects, support for boolean types and more).
>
> VLAs look like a good workaround, thank you for the reply.
>
> When do you think 0.8 will be released?

Currently I'm having some problems with enlargeable arrays, but I hope to
be able to release 0.8 by the end of the year.

Cheers,

--
Francesc Alted
From: Joseph T. <tu...@cs...> - 2003-12-05 03:37:22

Francesc,

> Good news is that VLArray (i.e. variable length arrays) objects will be
> introduced in the forthcoming 0.8 release (among other niceties, like
> enlargeable Array objects, support for boolean types and more).

VLAs look like a good workaround, thank you for the reply.

When do you think 0.8 will be released?

Joseph
From: Francesc A. <fa...@op...> - 2003-12-04 11:06:46

Hi Joseph,

On Thursday, 4 December 2003 06:35, you wrote:
> Francesc,
>
> I have a PyTables question I was hoping you could help me out with.
> I want to program a table of tables.
>
> Specifically, each row in the top level table will contain another table,
> of variable length.
>
> For example, one table row might contain [9, 2, 60, 30, 29, 54], the
> second table row might contain [2, 6, 3], and the third row might contain
> [1].
>
> In other words, I want to have arrays (or tables) within my top-level
> tables.
>
> The documentation for pytables does not suggest how to do this.
> Nonetheless, is it possible for me to achieve this data structure using
> pytables?

Well, I have both bad and good news for you.

Bad news is that pytables, up to 0.7.2 (and most probably 0.8 as well),
only allows scalars or *fixed* length numarray objects to be elements of
columns, so you can't define a column as *variable* length.

Good news is that VLArray (i.e. variable length array) objects will be
introduced in the forthcoming 0.8 release (among other niceties, like
enlargeable Array objects, support for boolean types and more). In fact,
the VLArray class is already in CVS, with proper unittests in place, and
fairly bug-free, as far as I can tell. So you can start using it today, as
I don't think I'm going to change the API too much for the final version.
However, documentation has not been updated to include this new feature
yet. Meanwhile, you can look into the examples/vlarray1.py small script to
get a feeling for what you can do with this new creature. Just to whet your
appetite, a tiny example follows:

    import tables
    from numarray import *

    # Create a VLArray:
    fileh = tables.openFile("vlarray2.h5", mode = "w")
    root = fileh.root
    vlarray = fileh.createVLArray(root, 'vlarray1', tables.Int32Atom(),
                                  "ragged array of ints")
    vlarray.append(array([5, 6]))
    vlarray.append(array([5, 6, 7]))
    vlarray.append([5, 6, 9, 8])
    vlarray.append(5, 6, 9, 10, 12)
    print "Created VLArray:", repr(vlarray)

    # Now, read it through the use of the iterator:
    for x in vlarray:
        print vlarray.name, "[", vlarray.nrow, "]-->", x

    # Close the file
    fileh.close()

and the output is:

    Created VLArray: /vlarray1 (VLArray(4,)) 'ragged array of ints'
      atom = Atom(type=Int32, shape=1, flavor='Numeric')
      nrows = 4
      flavor = 'Numeric'
      byteorder = 'little'
    vlarray1 [ 0 ]--> [5 6]
    vlarray1 [ 1 ]--> [5 6 7]
    vlarray1 [ 2 ]--> [5 6 9 8]
    vlarray1 [ 3 ]--> [ 5 6 9 10 12]

So, by joining VLArray and Table entities, you can simulate the
variable-length records you want: define a VLArray where you put your
variable-length objects, and add a column to your table holding the row
number in the VLArray where the desired data lives. You can regard this as
a temporary workaround until I find time to implement fully variable-length
columns in Tables.

Cheers,

--
Francesc Alted
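A minimal sketch of the VLArray-plus-Table join Francesc describes, written
against the modern PyTables API (open_file, create_vlarray, and so on)
rather than the 0.8-era names above; the file, node, and column names are
illustrative:

    import tables

    class Record(tables.IsDescription):
        name = tables.StringCol(16)   # some fixed-size payload
        vlrow = tables.Int32Col()     # row number of this record's data
                                      # in the companion VLArray

    with tables.open_file("join.h5", mode="w") as f:
        vla = f.create_vlarray(f.root, "values", tables.Int32Atom(),
                               "ragged ints")
        table = f.create_table(f.root, "records", Record)

        row = table.row
        for name, values in [("a", [5, 6]), ("b", [5, 6, 7]), ("c", [1])]:
            vla.append(values)            # store the ragged data...
            row["name"] = name
            row["vlrow"] = vla.nrows - 1  # ...and remember where it went
            row.append()
        table.flush()

        # Fetch the variable-length values for one record:
        for r in table.where('name == b"b"'):
            print(vla[r["vlrow"]])        # -> [5 6 7]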
From: Francesc A. <fa...@op...> - 2003-11-07 18:38:25

Hi,

PyTables has entered Debian Linux in the unstable (sid) section during this
week. That means that, if everything goes well, pytables will be in the
next stable release of Debian, scheduled to be out in a few months.

For those of you that already have Debian sid installed, just do:

    apt-get update
    apt-get install python-tables

and you will be done!

Enjoy!,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-10-14 19:24:40

On Saturday, 11 October 2003 04:23, you wrote:
> Francesc,
>
> I have implemented your suggestions to the best of my ability, but the
> resulting code is running very slowly. Could you take a quick look at the
> code and see if I am making any obvious mistakes (other than not using
> Psyco, which I may implement). Thank you again for your time.

Your code looks pretty good. No, the problem is the approach, which is very
inefficient. I've just made a new Table.copy() method that can do the job
much more efficiently. This method lets you copy one table to another
location and sort it by any column you want (except for String columns,
which might be implemented later on). Here is an example of use:

    import tables
    fileh = tables.openFile("data.nobackup/test-big.h5", "a")
    if hasattr(fileh.root, "newtable"):
        fileh.removeNode(fileh.root, "newtable")
    fileh.root.tuple0.copy("newtable", "var3")
    fileh.close()

In this example, the /tuple0 table has been copied to /newtable, ordered by
the column "var3" of the source Table.

My preliminary timings show that you can copy & sort a table with 100,000
rows in 1.2 s and one with 1,000,000 rows in 27 s (although this depends on
how shuffled your initial list is). So the increase in time is not linear
(maybe closer to n*sqrt(n)), and if you want to sort a table with
20,000,000 entries, that could take a little more than 40 minutes (on a
processor similar to mine, a P4 @ 2 GHz). Well, it's not fast, but it's
affordable.

You can find the new code in the pytables CVS repository
(http://sourceforge.net/cvs/?group_id=63486). If you are using Windows, you
will need the MSVC 6.0 compiler to install it. Ah!, I almost forgot: you
will also need to update the HDF5 library at least to 1.6.0 post4. In the
original 1.6.0 there were some bugs that prevented the new code from
running correctly.

Cheers,

--
Francesc Alted
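For reference, the sorted copy survives in the modern PyTables API, where
Table.copy() takes a sortby keyword and wants a completely sorted index
(CSI) on the sort column. A minimal sketch, with hypothetical file and
column names:

    import tables

    with tables.open_file("data.h5", "a") as f:
        t = f.root.tuple0
        t.cols.var3.create_csindex()   # sortby needs a CSI on the column
        t.copy(f.root, "newtable", sortby="var3", checkCSI=True)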
From: Francesc A. <fa...@op...> - 2003-10-10 12:53:53

On Thursday, 9 October 2003 23:37, Michael Lefsky wrote:
> PyTables Users,
>
> I hope you can take a moment to answer a question about the optimal way
> to use PyTables for my application.
>
> I have as many as 20 million records to read from one file and write to a
> Pytable, and in the original file these records are not in the order that
> I want them to be. I could write them out in the same order as they
> appear in the original file, and use the Pytable commands to select the
> records I need, but I am afraid that will take a long time if the records
> are not adjacent. I could order the records in memory, but I am trying to
> avoid having all of them in memory. What I want to do is to read in a
> record, determine where it should go in the output file (according to an
> index) and then write the record to the correct spot. I would record the
> first and last index of the records that belong with each code, and then
> retrieve each block of records as a whole. That way, when I need to
> retrieve a block of records, they will all be from the same area of the
> disk, and I assume that will be faster.
>
> Is this possible, and if so, what is the fastest way to do this? Am I
> mis-understanding the problem?

I'm afraid you are asking for something equivalent to indexed fields, so
that you can accelerate row access according to some patterns, isn't it?
Well, until indexing arrives (I would like to implement it before the 1.0
release), you can try the next approach (the final implementation should
follow the same guidelines, more or less).

Let's imagine that you have a field and you want to arrange your rows in
such a way that the values in this field are sorted in, say, ascending
order (if you don't have such a classificatory field, it should be easy to
create *another* table and add it). Now, read the column you want to order
the Table by:

    sort_field = src_table.read(field="your_field")

then, get an array with the indexes of the original array, as if its values
were sorted:

    from numarray import argsort
    neworder = argsort(sort_field)

now, read the rows of the original table in this order and write them to a
new table (choose another file if you don't want to duplicate your data):

    dst_row = dst_table.row
    for i in neworder:
        src_row = src_table.read(i)
        dst_row.field1 = src_row.field('field1')[0]
        <add fields as needed>
        dst_row.fieldn = src_row.field('fieldn')[0]
        dst_row.append()
    dst_table.flush()

This operation may be slow (I have not measured how much), but once you
have it done, your lookups can be accelerated if you have your index at
hand, as you wanted.

Of course, this is supposing that you can put one of your table columns
entirely in memory. The case where this is not possible is left as an
exercise for the reader (hint: use temporary buffers). And please, if you
find a good solution for that, share it!

--
Francesc Alted
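A self-contained version of this reordering recipe against the modern
PyTables API (a sketch; the file, table, and column names are
hypothetical):

    import numpy as np
    import tables

    with tables.open_file("data.h5", "a") as f:
        src = f.root.src_table
        order = np.argsort(src.col("key"))   # whole sort column in memory
        dst = f.create_table(f.root, "sorted_table",
                             description=src.description)
        for i in order:
            # Copy one row at a time, in sorted order.
            dst.append(src.read(int(i), int(i) + 1))
        dst.flush()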
From: Michael L. <le...@cn...> - 2003-10-09 21:37:45

PyTables Users,

I hope you can take a moment to answer a question about the optimal way to
use PyTables for my application.

I have as many as 20 million records to read from one file and write to a
Pytable, and in the original file these records are not in the order that I
want them to be. I could write them out in the same order as they appear in
the original file, and use the Pytable commands to select the records I
need, but I am afraid that will take a long time if the records are not
adjacent. I could order the records in memory, but I am trying to avoid
having all of them in memory. What I want to do is to read in a record,
determine where it should go in the output file (according to an index) and
then write the record to the correct spot. I would record the first and
last index of the records that belong with each code, and then retrieve
each block of records as a whole. That way, when I need to retrieve a block
of records, they will all be from the same area of the disk, and I assume
that will be faster.

Is this possible, and if so, what is the fastest way to do this? Am I
mis-understanding the problem?

M

Michael Lefsky, Assistant Professor
Department of Forest, Rangeland and Watershed Stewardship
Colorado State University, Fort Collins, CO 80523
970/491-0602  970/491-6754 (FAX)
le...@cn...
From: Francesc A. <fa...@op...> - 2003-10-03 16:47:41

Hi,

I've created Debian linux packages for pytables 0.7.2. They are accessible
at http://pytables.sourceforge.net/Debian/. Read the instructions available
there on how to install these packages on your system.

Please note that the package is only available for unstable Debian, as it
depends on library versions only available there (hdf5 1.6.0 or higher and
numarray 0.7 or higher).

Feedback on these packages is welcome. Hopefully, the packages will appear
in the official distribution some day. I'll do my best to achieve that
goal.

Cheers,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-22 17:36:09

Announcing PyTables 0.7.2
-------------------------

As promised, here you have the latest and coolest pytables incarnation! In
this release you will not find any exciting new features. It is mainly a
maintenance release where the following issues have been addressed:

- a memory leak was fixed
- memory consumption has been addressed and lowered
- much faster opening of files
- some important index-pattern cases in table reads have been optimized

More in detail:

What's new
----------

- Fixed a nasty memory leak located in the C libraries (it was happening
  during HDF5 attribute writes). After that, the memory consumption when
  using large object trees has dropped quite a bit. However, there remain
  some small leaks that have been tracked down to the underlying numarray
  library. These leaks have been reported, and hopefully they should be
  fixed sooner rather than later.

- Table buffers are built dynamically now, so if Tables are not accessed
  for reading or writing, this memory will not be allocated. This will help
  to reduce memory consumption.

- The opening of files with lots of nodes has been accelerated by a factor
  of 2 or 3. For example, a file with 10 groups and 3000 tables that takes
  9.3 seconds to open in 0.7.1 now takes only 2.8 seconds.

- The Table.read() method has been refactored and optimized, and some parts
  of its code have been moved to Pyrex. In particular, in the special case
  of step=1, up to a factor 5 of speedup (reaching 160 MB/s on a Pentium4 @
  2 GHz) can now be achieved when reading table contents.

- Done some cosmetic changes in the user manual, but, as no new features
  have been added, you won't need to read the manual again :-)

Enjoy!,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-22 10:08:37

On Saturday, 20 September 2003 09:43, you wrote:
> On a separate note, is there Matlab support for HDF5? I notice that
> they have support for HDF4, but...

Well, I think that some support for HDF5 exists for Octave. No experience
with Matlab itself. Perhaps someone on this list?

Cheers,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-22 09:40:45

On Saturday, 20 September 2003 18:19, Nicola Larosa wrote:
> > - Implement support for Variable Length values (mostly in enlargeable
> > Arrays, but possibly also in Tables).
>
> +5 in Tables

Yeah, I've been thinking about that, and VL types would be most useful in
Tables. Ok, I'll see what I can do...

> > - Improve speed of reading and selecting values in Leaf (Table and
> > Array) objects.
>
> +4

Just wait a bit and check the new improvements in speed for 0.7.2 (some of
them are already present in 0.7.1) :-).

> > - What feature do you miss more (not necessarily listed before)?
>
> Automatic indexes, as in relational databases. Don't even know if it's
> possible to define them in a general way, though.

Yes, that would be very interesting. The problem is time...

Thanks,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-22 09:24:00

On Saturday, 20 September 2003 00:22, st...@st... wrote:
> > - What feature do you miss more (not necessarily listed before)?
>
> I wish I could write root.sfc.sfc_eqang25.lifted_index
> instead of root.sfc.sfc_eqang25.read(field='lifted_index')
> (You may have already implemented this...)

Well, I already *tried* to implement that, but I ran into the following
problem: a column name cannot be the same as a standard Table attribute
(like, for example, "nrows" or "rowsize"). To do that, I would have had to
implement Table attributes as in Group (i.e. with a "_v_" prefix so as not
to collide with column names). Frankly, I do not think it would be worth
the change: I think it's better to continue with the present naming schema.
Do you agree?

> > - Do you prefer seeing pytables become (even) faster or less memory
> > demanding?
>
> Less memory demanding.
>
> NumPy is already optimized for "fast, use much memory".
>
> PyTables complements NumPy best if optimized for "slow, uses little
> memory".
>
> If you write pytables to optimize for space and I need one section
> optimized for speed, I can always read some data into an ordinary
> numpy array, which will be kept in memory. But if you optimize for
> speed and I need one section optimized for space, I will have to do a
> lot of work breaking my dataset into separate files and opening each
> only when needed.

I agree that pytables uses too much memory right now. The addition of the
rootUEP parameter to the openFile() function in version 0.7 is a great
relief for dealing with that. However, I'm thinking about building a new,
lightweight access schema to the objects in files, instead of the current
full-fledged, but heavyweight, object tree.

> > - Which present limitations do you find more annoying?
>
> My code in 0.5.1 leaked memory. (I don't know whether this leak was
> in my code, in pytables, or in numeric or numarray). If this leak
> was in pytables, that was the most annoying limitation.

In 0.5.1 there are memory leaks both in pytables and numarray. The memory
leak in pytables has been fixed now, and the fix will be included in the
next version (0.7.2) that I'll be releasing very soon. The leak in numarray
seems to be unsolved yet, but anyway, pytables 0.7.2 should be quite a bit
less memory demanding than previous releases.

Thanks for your time filling in the questionnaire!,

--
Francesc Alted
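Later PyTables versions sidestepped this naming collision by hanging
columns under a dedicated accessor instead of the table node itself. A
quick sketch against the modern API, reusing the node names from the
message above (the file name is hypothetical):

    import tables

    with tables.open_file("data.h5") as f:
        table = f.root.sfc.sfc_eqang25
        lifted = table.cols.lifted_index[:]   # natural naming, via .cols
        same = table.col('lifted_index')      # equivalent spelling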
From: Nicola L. <ni...@te...> - 2003-09-20 16:20:24

> Following are my plans for the next few months, before releasing the 1.0
> version. Could you respond to these questions by giving a score ranging
> from +5 (I absolutely need that) to 0 (I can pass without this) to these
> planned features?
>
> - Implement enlargeable (i.e. chunked) Array objects.

+2

> - Implement support for Variable Length values (mostly in enlargeable
> Arrays, but possibly also in Tables).

+5 in Tables

> - Implement relationships (apart from the existing hierarchical ones)
> between objects in the object tree.

0

> - Make pytables less memory demanding.

+2

> - Improve speed of opening files in pytables

+2

> - Improve speed of reading and selecting values in Leaf (Table and Array)
> objects.

+4

> Now, the next questions. If you don't want to respond to all of them,
> that's fine.
>
> - What feature do you like more (not necessarily listed before)?

Compactness and speed.

> - What feature do you miss more (not necessarily listed before)?

Automatic indexes, as in relational databases. Don't even know if it's
possible to define them in a general way, though.

> - Do you prefer seeing pytables become (even) faster or less memory
> demanding?

Faster.

> - Which present limitations do you find more annoying?

Fixed size of columns. ;^)

> - In which field of science or business are you using pytables?

Collection and evaluation of network traffic statistics.

> Ok, that's all. I hope some of you will take some of your precious time
> to fill out the questionnaire. Even only one answer would be much better
> than my sole opinion!

It surely was *much* less time than the amount you put into PyTables. :^)

--
"You can't get what you want till you know what you want." -- Joe Jackson

Nicola Larosa - ni...@te...
From: <st...@st...> - 2003-09-19 22:24:37

Francesc Alted <fa...@op...> writes:

> Hi List,
>
> Now that almost a year has passed since the first public release of
> pytables, I need your feedback. So, please, if you are using pytables and
> would like this effort to be continued and improved, take some time to
> respond to this questionnaire about pytables; it will help me decide what
> to do in the next few months.
>
> Following are my plans for the next few months, before releasing the 1.0
> version. Could you respond to these questions by giving a score ranging
> from +5 (I absolutely need that) to 0 (I can pass without this) to these
> planned features?
>
> - Implement enlargeable (i.e. chunked) Array objects.

+0

> - Implement support for Variable Length values (mostly in enlargeable
> Arrays, but possibly also in Tables).

+0

> - Implement relationships (apart from the existing hierarchical ones)
> between objects in the object tree.

+0

> - Make pytables less memory demanding.

+5

> - Improve speed of opening files in pytables

+2

> - Improve speed of reading and selecting values in Leaf (Table and Array)
> objects.

+2

> - What feature do you like more (not necessarily listed before)?

Natural naming.

> - What feature do you miss more (not necessarily listed before)?

I wish I could write root.sfc.sfc_eqang25.lifted_index
instead of root.sfc.sfc_eqang25.read(field='lifted_index')
(You may have already implemented this...)

> - Do you prefer seeing pytables become (even) faster or less memory
> demanding?

Less memory demanding.

NumPy is already optimized for "fast, use much memory". PyTables
complements NumPy best if optimized for "slow, uses little memory".

If you write pytables to optimize for space and I need one section
optimized for speed, I can always read some data into an ordinary numpy
array, which will be kept in memory. But if you optimize for speed and I
need one section optimized for space, I will have to do a lot of work
breaking my dataset into separate files and opening each only when needed.

> - Which present limitations do you find more annoying?

My code in 0.5.1 leaked memory. (I don't know whether this leak was in my
code, in pytables, or in numeric or numarray.) If this leak was in
pytables, that was the most annoying limitation.

> - In which field of science or business are you using pytables?

Accessing hdf(4) data from NASA. (I convert to hdf5 first.)

-- Stan
From: Francesc A. <fa...@op...> - 2003-09-19 19:44:19

Hi List,

Now that almost a year has passed since the first public release of
pytables, I need your feedback. So, please, if you are using pytables and
would like this effort to be continued and improved, take some time to
respond to this questionnaire about pytables; it will help me decide what
to do in the next few months.

Following are my plans for the next few months, before releasing the 1.0
version. Could you respond to these questions by giving a score ranging
from +5 (I absolutely need that) to 0 (I can pass without this) to these
planned features?

- Implement enlargeable (i.e. chunked) Array objects.

- Implement support for Variable Length values (mostly in enlargeable
  Arrays, but possibly also in Tables).

- Implement relationships (apart from the existing hierarchical ones)
  between objects in the object tree.

- Make pytables less memory demanding.

- Improve speed of opening files in pytables.

- Improve speed of reading and selecting values in Leaf (Table and Array)
  objects.

Now, the next questions. If you don't want to respond to all of them,
that's fine.

- What feature do you like more (not necessarily listed before)?

- What feature do you miss more (not necessarily listed before)?

- Do you prefer seeing pytables become (even) faster or less memory
  demanding?

- Which present limitations do you find more annoying?

- In which field of science or business are you using pytables?

Ok, that's all. I hope some of you will take some of your precious time to
fill out the questionnaire. Even only one answer would be much better than
my sole opinion!

Thanks,

--
Francesc Alted
From: <st...@st...> - 2003-09-19 19:38:12

Francesc Alted <fa...@op...> writes:

> As some of you will know, PyTables uses the same table objects that
> HDF5_HL provides, and the NCSA people have asked me for some "real world"
> examples from the pytables users to include in new documentation.

I use PyTables to read data from the NASA/NASDA Tropical Rainfall Measuring
Mission. This is HDF4; I convert that to HDF5, then manipulate the HDF5
using PyTables.

Thank you, Francesc, for this excellent software. For me, PyTables has
proved much easier to use than IDL.

-- Stan
From: Francesc A. <fa...@op...> - 2003-09-18 15:45:23

Hello List,

The NCSA is going to merge the HDF5_HL library
(http://hdf.ncsa.uiuc.edu/HDF5/hdf5_hl/) into the HDF5 library. As some of
you will know, PyTables uses the same table objects that HDF5_HL provides,
and the NCSA people have asked me for some "real world" examples from the
pytables users to include in new documentation.

So if you are willing to provide your examples, I can collect them,
rearrange them and re-send them to the NCSA. Besides, this would help me
realize what purposes you are using pytables for, and so plan features
better.

Cheers,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-18 08:50:12

On Thursday, 18 September 2003 01:10, you wrote:
> I'll update to the latest version and try it out. I'll let you know if I
> notice any slowness or excessive memory use. What do you mean by *lots*
> of memory? The data itself that is written out varies from 10's to 100's
> of megabytes. Is it comparable to that?

By *lots* of memory I meant that, roughly, 6 KB will be allocated for each
single Group object, 6 KB for each Array and 12 KB for each Table. That is
where most of the memory consumption goes: building the structure (i.e. the
object tree) from metadata. So a tree with 1000 Groups, 2000 Arrays and
4000 Tables can account for 66 MB (add 10 MB for the python interpreter and
pytables modules).

Also, the I/O buffers for Tables are larger for large datasets; to give you
an idea of the difference, they can be 5 KB for small datasets (less than
100 KB) and up to 60 KB for larger ones (greater than 200 MB). However,
these buffers are built dynamically (this is new in the CVS version, I
forgot to mention that!), so if you don't access the actual data in a
Table, this memory will not be allocated. I'm pondering now whether I
should release these buffers after use or keep them in memory for possible
use afterwards (it's a problem of balance between CPU and memory
consumption), but I will probably release all buffers after a read or write
operation, at the expense of more CPU consumption.

Array objects are not buffered (you can either read one completely or not
read it at all), so the amount of actual data saved (whether your Array is
1 byte or 1 TB in size) is not going to affect your memory demands too
much, except for the fact that you will need enough memory to hold a large
Array if you want to actually read its data!

> In the files, I am writing time-dependent data produced by my code. I
> write the data out as arrays and not as tables since the data size varies
> from step to step. The data that is written each step is seven arrays all
> the same size and an integer. Would it be more optimal to create a table
> for each step and write the seven arrays as elements of the table? The
> total number of steps is typically on the order of one thousand.

Well, I'm afraid that your best bet would be to use Variable Length Arrays,
like the ones Nicola Larosa was asking for in an earlier message, but these
will take some time to be implemented. In the meantime, if you use Tables,
you would reduce the number of nodes by a factor of seven. On the other
hand, Tables need more memory per node than Arrays (two times more for the
object, and twice more for the internal buffer, if working with small
datasets), so one can conclude that, if you use Tables, you will need 4/7
of the memory you are using now. In addition, Tables are quite a bit more
flexible than Array entities (you can do selections without loading all the
info in memory, or load just parts of the dataset), so I would recommend
you to use Tables with arrays as columns. Keep in mind, too, that Array
entities do not support compression on-the-fly, while Tables do.

Another possibility with Tables is to define several Tables with two
columns: one to store the actual array and another to save the actual
length of the array. You can then set up the series of Tables with
different array column lengths in such a way that your arrays fit well in
one of them (I mean, without wasting too much space). For example, if your
arrays are in the range of (2, 1) to (2, 100), you can set up several
Tables with columns taking the shapes (2, 10), (2, 20), ..., (2, 100). You
can then save each array in the appropriate Table, recording its actual
shape in the other field. After retrieving the arrays, you can use the
length field to strip out the data you are not interested in. I agree that
this solution is a bit contrived, but if you have large numbers of arrays,
it can be your best choice until VLArrays are done.

Cheers,

--
Francesc Alted
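A minimal sketch of that fixed-width-plus-length workaround, in the modern
PyTables API (the file and column names, and the MAXLEN bucket size, are
illustrative): pad each array into a fixed-size column, record its true
width, and strip the padding after reading.

    import numpy as np
    import tables

    MAXLEN = 100  # widest array this table bucket can hold

    class Padded(tables.IsDescription):
        length = tables.Int32Col()                    # true array width
        data = tables.Float64Col(shape=(2, MAXLEN))   # padded storage

    with tables.open_file("padded.h5", "w") as f:
        t = f.create_table(f.root, "step_data", Padded)
        row = t.row

        arr = np.random.rand(2, 37)       # e.g. this step produced (2, 37)
        buf = np.zeros((2, MAXLEN))
        buf[:, :arr.shape[1]] = arr       # pad to the column width
        row["length"] = arr.shape[1]
        row["data"] = buf
        row.append()
        t.flush()

        rec = t[0]
        restored = rec["data"][:, :rec["length"]]   # strip the padding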
From: Francesc A. <fa...@op...> - 2003-09-17 16:53:23

On Tuesday, 16 September 2003 20:45, Nicola Larosa wrote:
> Hi Francesc, thanks for PyTables, it's truly impressive.

Thanks for the compliment :-)

> A feature request: what would it take to implement support for the VL
> (Variable Length) Datatype? I would be more than happy to help with that,
> but need some pointers to move around without breaking anything. :^)
>
> I'm asking since I need to build indices pointing to the rows containing
> each data value, and from the HDF5 docs the VL Datatype seems the best
> solution for this.
>
> Otherwise I would have to build a new table for each distinct data value,
> and there can be thousands of those. :^(

Yeah, I have been thinking lately about how to implement this in a neat
way, but I'm afraid that the solution is not easy. Tables are entities that
need a fixed record length in order to use their I/O buffering
capabilities, and, at the moment, unbuffered I/O is not supported in Tables
(after all, they are supposed to be a high performance implementation).

A better approach would be to implement enlargeable Arrays (my top priority
now), and make them flexible enough to support unbuffered I/O (or use
buffers with variable length records, who knows!). That way the indexes
might be created quite easily. However, I don't know when I will implement
support for VL datatypes.

If you want to do that yourself, you can start by studying the Table.Table
class as well as its ancestor, hdf5Extension.Table. From there on, you can
start building a new class (call it VLArray, for example) and making the
necessary changes to support the VL datatype. You should make the new class
a descendant of Leaf (as the Table and Array classes are), so that it will
inherit the necessary methods to interact with the object tree. Finally,
you only need to create some methods in the File class to create your new
objects. Well, this is a simplified view of what to do. If you get stuck,
please tell me and I'll help you. Don't be afraid: as long as you don't
mess with the other classes, you can be sure that you won't break anything.

Cheers,

--
Francesc Alted
From: Francesc A. <fa...@op...> - 2003-09-17 16:18:10

Hi Dave,

You are lucky, because I was already doing that when I received your
request. I've just uploaded the patches you are interested in to the CVS.
It seems that all the unit tests pass, so, hopefully, it won't break any
existing code. Please feel free to check out the latest CVS version and
tell me if it works for you.

Note that a UserWarning is still issued if the user tries to create more
than 4096 children in a group or 4096 attributes on a node. If you don't
like that, just filter it using the warnings.filterwarnings() function.

Also, I would be interested in your feedback when using large numbers of
children hanging from a group (or large numbers of objects in the whole
object tree, in general). My experience is that when exceeding 8000
children or so in a group, the underlying HDF5 starts to show signs of
*very slow* data access. In addition, the pytables object tree takes *lots*
of memory. If you have a machine with plenty of memory, that should not be
a problem, but I'm mainly worried about the slow data access. Please tell
me if you experience such a thing.

By the way, in the CVS you will find the next improvements apart from the
child limit removal:

- A memory leak was located (after a hunt of two weeks!) and fixed (!). It
  was occurring during attribute writes. As pytables uses lots of
  attributes for storing metadata, that fix is a good relief for its memory
  hungriness. With that, no known memory leaks remain in pytables. However,
  I've tracked down some leaks in the underlying numarray library; I've
  reported them to Todd Miller (such a leak seems to exist from numarray
  0.5 on), and he should be working on fixing it.

- The opening of files with lots of nodes has been accelerated by a factor
  of 2 to 3. For example, a file with 10 groups and 3000 tables that takes
  9.3 seconds to open in 0.7.1 now takes only 2.8 seconds.

- The Table.read() method has been refactored and optimized, and some parts
  of its code have been moved to Pyrex. In particular, in the special case
  of step=1, up to a factor 5 of speedup (reaching 160 MB/s on a Pentium4 @
  2 GHz) can now be achieved when reading table contents.

Enjoy!,

On Tuesday, 16 September 2003 19:39, DPGrote wrote:
> I've come across the MAX_CHILDS_IN_GROUP limit and am wondering if this
> limit is there for some fundamental reason. I notice that it only
> affects the size of temporary arrays in Giterate in utils.c. Can that
> routine be rewritten to dynamically allocate those temporary arrays to
> whatever the size of num_obj is? This would remove that limit
> altogether. The same is also true for the MAX_ATTRS_IN_NODE limit - it
> could be removed in the same way.
> Thanks,
> Dave

--
Francesc Alted
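The warning filter mentioned above, spelled out (a sketch; the exact
warning text is not shown in the thread, so this matches on category and
module only, which are assumptions):

    import warnings

    # Silence UserWarnings coming from the tables package, such as the
    # >4096-children warning described in the message above.
    warnings.filterwarnings("ignore", category=UserWarning, module="tables")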
From: Nicola L. <ni...@te...> - 2003-09-16 18:51:47

Hi Francesc, thanks for PyTables, it's truly impressive.

A feature request: what would it take to implement support for the VL
(Variable Length) Datatype? I would be more than happy to help with that,
but need some pointers to move around without breaking anything. :^)

I'm asking since I need to build indices pointing to the rows containing
each data value, and from the HDF5 docs the VL Datatype seems the best
solution for this.

Otherwise I would have to build a new table for each distinct data value,
and there can be thousands of those. :^(

--
"Programming is to driving like unit tests are to seat belts." -- Tom Bryan

Nicola Larosa - ni...@te...
From: DPGrote <gr...@ll...> - 2003-09-16 17:36:02

I've come across the MAX_CHILDS_IN_GROUP limit and am wondering if this
limit is there for some fundamental reason. I notice that it only affects
the size of the temporary arrays in Giterate in utils.c. Can that routine
be rewritten to dynamically allocate those temporary arrays to whatever the
size of num_obj is? This would remove the limit altogether. The same is
also true for the MAX_ATTRS_IN_NODE limit - it could be removed in the same
way.

Thanks,
Dave
From: Francesc A. <fa...@op...> - 2003-09-01 07:19:12

Hello Berthold!,

On Friday, 22 August 2003 16:17, Berthold Höllmann wrote:
> Hello,
>
> I have some problems with pytables 0.7.1. The file format seems to
> have changed. When I run an old h5dump on a new pytables-generated
> file I get lots of
>
>     h5dump error: unknown object "order"
>
> messages with different object names. With a 1.6 h5dump I get
>
>     ...
>     DATASET "order" {
>        DATATYPE H5T_STRING {
>           STRSIZE 1;
>           STRPAD H5T_STR_NULLTERM;
>           CSET H5T_CSET_ASCII;
>           CTYPE H5T_C_S1;
>        }
>        DATASPACE SIMPLE { ( 9, 36 ) / ( 9, 36 ) }
>     ...
>
> I can't see a difference from the datasets accepted by the old h5dump,
> but there's the error message.

Well, it's perfectly possible that the internal format of HDF5 has changed
from 1.4.x to 1.6.x. The important thing, though, is that the new version
should be *backward* compatible with the old one. So you can install the
new utilities from HDF5 1.6.x and hopefully you will be able to read *both*
formats.

> Another problem I'm still investigating is that HDF5 1.6.0 does not
> accept empty arrays as input. I get
>
>     HDF5-DIAG: Error detected in HDF5 library version: 1.6.0 thread 0.
>     Back trace follows.
>       #000: H5S.c line 1708 in H5Screate_simple(): zero sized dimension
>       for non-unlimited dimension
>         major(01): Function arguments
>         minor(05): Bad value
>
> when trying to save
>
>     zeros((1, 0, 3), 'd')
>
> We had successfully written these arrays with pytables 0.3 or 0.4

Did you? I thought that zero-sized arrays were only possible with chunked
arrays... Well, in any case this should be possible when chunked arrays are
finally implemented, most probably in version 0.8 of PyTables.

> and I haven't found the time to upgrade until now. Unfortunately 0.7.1
> does not work with HDF5 1.4.4. There was a missing symbol when
> importing. Am I stuck with pytables 0.3 or is there another solution?

Wait until chunked arrays are implemented. I hope to be able to do this
real soon.

Cheers,

--
Francesc Alted
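The chunked arrays Francesc refers to did arrive as EArray, which handles
exactly this zero-sized-dimension case: one dimension starts at length zero
and grows with each append. A minimal sketch in the modern PyTables API
(file and node names illustrative):

    import numpy as np
    import tables

    with tables.open_file("earr.h5", "w") as f:
        # shape=(1, 0, 3): the zero-length axis is the enlargeable one.
        ea = f.create_earray(f.root, "a", tables.Float64Atom(),
                             shape=(1, 0, 3))
        ea.append(np.zeros((1, 2, 3)))   # grow along the second axis
        print(ea.shape)                  # -> (1, 2, 3)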