From: Francesc A. <fa...@py...> - 2010-10-27 18:07:57
On Wednesday 27 October 2010 15:38:28, Gaetan de Menten wrote:
> Hi,
>
> I have a table with ~300 columns and ~150,000 rows and I need to copy
> it from one file to another.
>
> However, the simplest methods I could find:
> - input_file.copyNode(...)
> - input_file.root.test_table.copy(output_file.root)
> or even:
> - input_file.copyFile(output_path)
>
> are all slow as hell: they take more than 1 min, while a simple:
>
> data = in_table.read()
> out_table.append(data)
> out_table.flush()
>
> takes only 1.88s, and copying in chunks of 10000 rows takes 1.34s.
>
> FWIW, no compression whatsoever is used in any of those cases, and
> using it does not reduce the copy time.
>
> That behavior does not show up with a small number of columns, but
> the problem seems to grow geometrically with the number of columns.
> Is there a setting somewhere that could alleviate this problem, or is
> it a known limitation or a bug?

After investigating this, I have come to the conclusion that the overhead comes from PyTables copying a couple of attributes per column (namely FIELD_N_NAME and FIELD_N_FILL, where N is the column number). I suspect that the ultimate culprit is an inefficiency in how HDF5 deals with these attributes (I should investigate more, though), so in the meantime I have decided not to copy these attributes during `Table.copy()` operations. With this, performance is good now. More info:

http://pytables.org/trac/ticket/304

Anyway, I'm a bit fed up with the FIELD_N_NAME and FIELD_N_FILL attributes, which are not really useful (except in some rare cases). So I'm thinking of removing them completely for PyTables 2.3; see:

http://pytables.org/trac/ticket/305

If anyone is against this, please speak now or forever hold your peace! (I'll announce this in a proper thread as well.)

-- 
Francesc Alted
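For reference, the fast chunked copy Gaetan describes (repeated `in_table.read(start, stop)` / `out_table.append(data)` calls) can be sketched as the loop below. This is a minimal illustration, not the PyTables implementation: the PyTables table objects are stood in for by plain-list callables so the sketch is self-contained and runnable without an HDF5 file.

```python
def copy_in_chunks(read_rows, append_rows, nrows, chunksize=10000):
    """Copy nrows rows in slices of chunksize, mirroring the pattern
    data = in_table.read(start, stop); out_table.append(data)."""
    for start in range(0, nrows, chunksize):
        stop = min(start + chunksize, nrows)
        append_rows(read_rows(start, stop))

# Stand-ins for real PyTables tables: a source list and a destination list.
source = list(range(25000))
dest = []
copy_in_chunks(lambda a, b: source[a:b], dest.extend, len(source))
assert dest == source
```

With real tables, `read_rows` would be `in_table.read` and `append_rows` would be `out_table.append` (followed by a single `out_table.flush()` at the end); the chunked variant avoids materialising all ~150,000 rows at once.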