From: Francesc A. <fa...@py...> - 2010-10-27 18:07:57
On Wednesday 27 October 2010 15:38:28, Gaetan de Menten wrote:
> Hi,
>
> I have a table with ~300 columns and ~150,000 rows and I need to copy
> it from one file to another.
>
> However, the simplest methods I could find:
> - input_file.copyNode(...)
> - input_file.root.test_table.copy(output_file.root)
> or even:
> - input_file.copyFile(output_path)
>
> are all slow as hell: they take more than 1 min, while a simple:
>
> data = in_table.read()
> out_table.append(data)
> out_table.flush()
>
> takes only 1.88s, and copying in chunks of 10000 rows takes 1.34s.
>
> FWIW, no compression whatsoever is used in any of those cases, and
> using it does not reduce the copy time.
>
> That behavior does not show up with a small number of columns, but
> the problem seems to grow geometrically with the number of columns.
> Is there a setting somewhere that could alleviate this problem, or is
> it a known limitation or a bug?

After investigating this, I have come to the conclusion that the overhead comes from PyTables copying a couple of attributes per column (namely FIELD_N_NAME and FIELD_N_FILL, where N is the column number). I suspect that the ultimate culprit is an inefficiency in how HDF5 deals with these attributes (I should investigate more, though), so in the meantime I have decided not to copy these attributes during `Table.copy()` operations. With this, performance is good now. More info:

http://pytables.org/trac/ticket/304

Anyway, I'm a bit fed up with the FIELD_N_NAME and FIELD_N_FILL attributes, which are not really useful (except in some rare cases). So I'm thinking of removing them completely for PyTables 2.3; see:

http://pytables.org/trac/ticket/305

If anyone is against this, please speak now or forever hold your peace! (I'll announce this in a proper thread as well.)

-- 
Francesc Alted
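For reference, the fast chunked copy Gaetan describes (repeated `in_table.read(start, stop)` / `out_table.append(data)` calls) can be sketched as the loop below. This is a minimal illustration, not the PyTables implementation: the PyTables table objects are stood in for by plain-list callables so the sketch is self-contained and runnable without an HDF5 file.

```python
def copy_in_chunks(read_rows, append_rows, nrows, chunksize=10000):
    """Copy nrows rows in slices of chunksize, mirroring the pattern
    data = in_table.read(start, stop); out_table.append(data)."""
    for start in range(0, nrows, chunksize):
        stop = min(start + chunksize, nrows)
        append_rows(read_rows(start, stop))

# Stand-ins for real PyTables tables: a source list and a destination list.
source = list(range(25000))
dest = []
copy_in_chunks(lambda a, b: source[a:b], dest.extend, len(source))
assert dest == source
```

With real tables, `read_rows` would be `in_table.read` and `append_rows` would be `out_table.append` (followed by a single `out_table.flush()` at the end); the chunked variant avoids materialising all ~150,000 rows at once.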