From: Francesc A. <fa...@op...> - 2003-10-10 12:53:53
On Thursday 09 October 2003 23:37, Michael Lefsky wrote:
> PyTables Users,
>
> I hope you can take a moment to answer a question about the optimal way
> to use PyTables for my application.
>
> I have as many as 20 million records to read from one file and write to
> a PyTables table, and in the original file these records are not in the
> order that I want them to be. I could write them out in the same order
> as they appear in the original file, and use the PyTables commands to
> select the records I need, but I am afraid that will take a long time if
> the records are not adjacent. I could order the records in memory, but I
> am trying to avoid having all of them in memory. What I want to do is to
> read in a record, determine where it should go in the output file
> (according to an index) and then write the record to the correct spot. I
> would record the first and last index of the records that belong with
> each code, and then retrieve the block of records as a whole. That way,
> when I need to retrieve a block of records, they will all be from the
> same area of the disk, and I assume that will be faster.
>
> Is this possible, and if so, what is the fastest way to do this? Am I
> misunderstanding the problem?

I'm afraid you are asking for something equivalent to indexed fields, so
that you can accelerate row access according to some access pattern,
right? Well, until indexing arrives (I would like to implement it before
the 1.0 release), you can try the following approach (the final
implementation should follow more or less the same guidelines).

Let's imagine that you have a field and you want to arrange your rows in
such a way that the values in this field are sorted in, say, ascending
order (if you don't have such a classificatory field, it should be easy
to create *another* table and add it).

First, read the column you want to order the table by:

    sort_field = src_table.read(field="your_field")

Then, get an array with the indexes of the original array, as if its
values were sorted:

    import numarray
    neworder = numarray.argsort(sort_field)

Now, read the rows of the original table in this order and write them to
a new table (choose another file if you don't want to duplicate your
data):

    dst_row = dst_table.row
    for i in neworder:
        src_row = src_table.read(i)
        dst_row.field1 = src_row.field('field1')[0]
        # <add fields as needed>
        dst_row.fieldn = src_row.field('fieldn')[0]
        dst_row.append()
    dst_table.flush()

This operation may be slow (I have not measured how much), but once it is
done, your lookups can be accelerated if you keep your index at hand, as
you wanted.

Of course, this assumes that you can fit one of your table columns
entirely in memory. The case where this is not possible is left as an
exercise for the reader (hint: use temporary buffers). And please, if you
find a good solution for that, share it!

--
Francesc Alted
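
For reference, the procedure above can be condensed into a single routine.
This is a minimal sketch using the present-day PyTables API
(tables.open_file, Table.read_coordinates) together with numpy, which
postdates this 2003 thread; the names sorted_copy and your_field and the
/data table location are illustrative, and rows are copied in chunks so
that only the sort column, not the full table, has to fit in memory:

    import numpy as np
    import tables

    def sorted_copy(src_table, dst_table, sort_field, chunksize=10000):
        # Read only the sort column into memory (assumes it fits).
        keys = src_table.read(field=sort_field)
        # Row indices that would put the keys in ascending order.
        order = np.argsort(keys)
        # Fetch the rows in sorted order, one chunk at a time, so at
        # most `chunksize` full records are held in memory at once.
        for start in range(0, len(order), chunksize):
            idx = order[start:start + chunksize]
            dst_table.append(src_table.read_coordinates(idx))
        dst_table.flush()

    # Usage: write a sorted copy of /data into a second file.
    with tables.open_file("src.h5") as fin, \
         tables.open_file("dst.h5", "w") as fout:
        src = fin.root.data
        dst = fout.create_table("/", "data", src.description)
        sorted_copy(src, dst, "your_field")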