From: Francesco D. D. <ke...@li...> - 2005-09-02 14:09:10
Hi, I have an issue with PyTables performance. This is my Python test code:

---SNIP---
from tables import *
import time

class PytTest(IsDescription):
    string = Col('CharType', 16)
    id = Col('Int32', 1)
    float = Col('Float64', 1)

h5file = openFile('probe.h5', 'a')

try:
    testGroup = h5file.root.testGroup
except NoSuchNodeError:
    testGroup = h5file.createGroup("/", "testGroup", "Test Group")

try:
    tbTest = testGroup.test
except NoSuchNodeError:
    tbTest = h5file.createTable(testGroup, 'test', PytTest, 'Test table')

maxRows = 10**6

### TEST1: assign the fields and append() inside the loop ###
startTime = time.time()
row = tbTest.row
for i in range(0, maxRows):
    row['string'] = '1234567890123456'
    row['id'] = 1
    row['float'] = 1.0/3.0
    row.append()
tbTest.flush()
diffTime = time.time() - startTime
print 'test1: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

### TEST2: assign the fields inside the loop, no append() (no disk use) ###
startTime = time.time()
row = tbTest.row
for i in range(0, maxRows):
    row['string'] = '1234567890123456'
    row['id'] = 1
    row['float'] = 1.0/3.0
diffTime = time.time() - startTime
print 'test2: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

### TEST3: assign the fields once before the loop, append() inside it ###
startTime = time.time()
row = tbTest.row
row['string'] = '1234567890123456'
row['id'] = 1
row['float'] = 1.0/3.0
for i in range(0, maxRows):
    row.append()
tbTest.flush()
diffTime = time.time() - startTime
print 'test3: %d rows in %s seconds (%s/s)' % (maxRows, diffTime, maxRows/diffTime)

h5file.close()
---SNIP---

This code tries to insert maxRows (10**6) rows into a table. The table is similar to the "small" table used for benchmarking in http://pytables.sourceforge.net/doc/PyCon.html#section4:

    class Small(IsDescription):
        var1 = Col("CharType", 16)
        var2 = Col("Int32", 1)
        var3 = Col("Float64", 1)

As you'll notice, there are 3 tests:

TEST 1: field assignment and append() inside the loop
TEST 2: field assignment inside the loop, no append() (no disk use)
TEST 3: field assignment once before the loop, append() inside the loop

flush() is always outside the loop, at the end.

The testbed is an AMD Athlon(tm) 64 Processor 2800+ with 1 GB RAM and a 5400 rpm disk. I've seen the same results on a dual Xeon machine with 1 GB RAM and a SCSI disk.

testbed:~# python test.py
test1: 1000000 rows in 22.7905650139 seconds (43877.8064252/s)
test2: 1000000 rows in 20.3718218803 seconds (49087.4113211/s)
test3: 1000000 rows in 2.01304578781 seconds (496759.68925/s)

That throughput (40-50 krows/s) is roughly 10 times lower than the one reported for the small table in http://pytables.sourceforge.net/doc/PyCon.html#section4. It seems that the row assignment

    row[fieldName] = value

takes a huge amount of time; comparing test1 and test2, the time spent actually writing to disk is about 10 times smaller than the time spent on assignment. Am I doing something wrong?

I've had a look at the source code, and I've realized that in TableExtension.pyx, in Row.__setitem__ (called when I do row[...] = value), the line

    self._wfields[fieldName][self._unsavednrows] = value

is responsible for the slowness. self._wfields[fieldName] is a numarray array, isn't it? Should a single element assignment really take that much time compared to the disk writes? I can provide an strace of the process if you need it.

I've tried PyTables 1.1 and 1.2-b1 compiled from source, with numarray 1.1.1, 1.3.2 and 1.3.3 (also compiled from source), with the same results.

Is this normal behaviour, in your opinion?

Thanks in advance,
kesko78
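
P.S. To double-check whether a bare numarray element assignment alone can account for the slowdown, I'd time a loop like the one below (an untested sketch; the 10**6 count matches maxRows above):

---SNIP---
import time
import numarray

a = numarray.zeros(10**6, numarray.Float64)   # plain 1-D numarray

startTime = time.time()
for i in range(0, 10**6):
    a[i] = 1.0/3.0   # scalar item assignment, like in Row.__setitem__
diffTime = time.time() - startTime
print 'numarray assignment: %s seconds (%s/s)' % (diffTime, 10**6/diffTime)
---SNIP---

If that loop already runs at ~50 krows/s, the bottleneck is the per-element assignment itself and not anything PyTables does on top of it.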
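
P.P.S. Besides strace, I can also attach Python profiler output if that's more useful. I'd collect it with the standard library hotshot module, roughly like this (a sketch; run_test1 is a hypothetical function wrapping the TEST1 loop above):

---SNIP---
import hotshot, hotshot.stats

prof = hotshot.Profile('pytables_test1.prof')
prof.runcall(run_test1)   # run_test1 would wrap the TEST1 loop above
prof.close()

stats = hotshot.stats.load('pytables_test1.prof')
stats.sort_stats('time')
stats.print_stats(20)     # top 20 entries by internal time
---SNIP---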