From: Francesco D. D. <ke...@li...> - 2005-09-02 14:09:10
Hi, I have an issue with PyTables performance. This is my Python test code:

---SNIP---
from tables import *
import time

class PytTest(IsDescription):
    string = Col('CharType', 16)
    id = Col('Int32', 1)
    float = Col('Float64', 1)

h5file = openFile('probe.h5', 'a')

try:
    testGroup = h5file.root.testGroup
except NoSuchNodeError:
    testGroup = h5file.createGroup("/", "testGroup", "Test Group")

try:
    tbTest = testGroup.test
except NoSuchNodeError:
    tbTest = h5file.createTable(testGroup, 'test', PytTest, 'Test table')

maxRows = 10**6

### TEST1: assign the fields and append() inside the loop ###
startTime = time.time()
row = tbTest.row
for i in range(0, maxRows):
    row['string'] = '1234567890123456'
    row['id'] = 1
    row['float'] = 1.0/3.0
    row.append()
tbTest.flush()
diffTime = time.time() - startTime
print 'test1: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

### TEST2: assign the fields inside the loop, no append() (no disk use) ###
startTime = time.time()
row = tbTest.row
for i in range(0, maxRows):
    row['string'] = '1234567890123456'
    row['id'] = 1
    row['float'] = 1.0/3.0
diffTime = time.time() - startTime
print 'test2: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

### TEST3: assign the fields once before the loop, append() inside it ###
startTime = time.time()
row = tbTest.row
row['string'] = '1234567890123456'
row['id'] = 1
row['float'] = 1.0/3.0
for i in range(0, maxRows):
    row.append()
tbTest.flush()
diffTime = time.time() - startTime
print 'test3: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

h5file.close()
---SNIP---

This code tries to insert maxRows (10**6) rows into a table. The table is similar to the "small" table used for benchmarking in http://pytables.sourceforge.net/doc/PyCon.html#section4:

    class Small(IsDescription):
        var1 = Col("CharType", 16)
        var2 = Col("Int32", 1)
        var3 = Col("Float64", 1)

As you'll notice, there are 3 tests:

TEST 1: field assignment and append() inside the loop
TEST 2: field assignment inside the loop, no append() (no disk use)
TEST 3: field assignment once before the loop, append() inside the loop

flush() is always outside the loop, at the end.

The testbed is an AMD Athlon(tm) 64 Processor 2800+ with 1 GB RAM and a 5400 rpm disk. I've seen the same results on a dual Xeon machine with 1 GB RAM and a SCSI disk.

testbed:~# python test.py
test1: 1000000 rows in 22.7905650139 seconds (43877.8064252/s)
test2: 1000000 rows in 20.3718218803 seconds (49087.4113211/s)
test3: 1000000 rows in 2.01304578781 seconds (496759.68925/s)

That throughput (40-50 krows/s) is roughly 10 times lower than the one reported for the small table in http://pytables.sourceforge.net/doc/PyCon.html#section4. It seems that the row assignment

    row[fieldName] = value

takes a huge amount of time; comparing test1 and test2, the time spent actually writing to disk is about 10 times smaller than the time spent on assignment. Am I doing something wrong?

I've had a look at the source code, and I've realized that in TableExtension.pyx, in Row.__setitem__ (called when I do row[...] = value), the line

    self._wfields[fieldName][self._unsavednrows] = value

is responsible for the slowness. self._wfields[fieldName] is a numarray array, isn't it? Should a single element assignment really take that much time compared to the disk writes? I can provide an strace of the process if you need it.

I've tried PyTables 1.1 and 1.2-b1 compiled from source, with numarray 1.1.1, 1.3.2 and 1.3.3 (also compiled from source), with the same results.

Is this normal behaviour, in your opinion?

Thanks in advance,
kesko78
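
P.S. To double-check whether a bare numarray element assignment alone can account for the slowdown, I'd time a loop like the one below (an untested sketch; the 10**6 count matches maxRows above):

---SNIP---
import time
import numarray

a = numarray.zeros(10**6, numarray.Float64)   # plain 1-D numarray

startTime = time.time()
for i in range(0, 10**6):
    a[i] = 1.0/3.0   # scalar item assignment, like in Row.__setitem__
diffTime = time.time() - startTime
print 'numarray assignment: %s seconds (%s/s)' % (diffTime, 10**6/diffTime)
---SNIP---

If that loop already runs at ~50 krows/s, the bottleneck is the per-element assignment itself and not anything PyTables does on top of it.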
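
P.P.S. Besides strace, I can also attach Python profiler output if that's more useful. I'd collect it with the standard library hotshot module, roughly like this (a sketch; run_test1 is a hypothetical function wrapping the TEST1 loop above):

---SNIP---
import hotshot, hotshot.stats

prof = hotshot.Profile('pytables_test1.prof')
prof.runcall(run_test1)   # run_test1 would wrap the TEST1 loop above
prof.close()

stats = hotshot.stats.load('pytables_test1.prof')
stats.sort_stats('time')
stats.print_stats(20)     # top 20 entries by internal time
---SNIP---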