From: Shyam P. K. <sp...@ny...> - 2013-04-11 17:18:05
Hello,

I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables. The execution time for writing the query result to the file is close to 10 hours, which includes querying the database and then writing to the file. When I timed the entire execution, I found that getting the data from the database takes about as much time as writing it to the HDF5 file. Here is a small snippet (P.S.: the execution times noted below are not for the 122 GB data set, but for a small subset of close to 10 GB):

    import tables as tb
    from datetime import datetime

    class ContactClass(tb.IsDescription):
        name      = tb.StringCol(4200)
        address   = tb.StringCol(4200)
        emailAddr = tb.StringCol(180)
        phone     = tb.StringCol(256)

    h5File = tb.openFile(<file name>, mode="a", title="Contacts")
    t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
                           filters=tb.Filters(5, 'blosc'),
                           expectedrows=77806938)

    resultSet = ...  # get data from the database

    currRow = t.row
    print("Before appending data: %s" % str(datetime.now()))
    for attributes in resultSet:
        currRow['name'] = attributes[0]
        currRow['address'] = attributes[1]
        currRow['emailAddr'] = attributes[2]
        currRow['phone'] = attributes[3]
        currRow.append()
    print("After done appending: %s" % str(datetime.now()))
    t.flush()
    print("After done flushing: %s" % str(datetime.now()))

.. which gives me:

    Before appending data: 2013-04-11 10:42:39.903713
    After done appending:  2013-04-11 11:04:10.002712
    After done flushing:   2013-04-11 11:05:50.059893

It seems like append() takes a lot of time. Any suggestions on how to improve this?

Thanks,
Shyam
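
[Editor's note: one pattern often suggested for bulk loads like this is to buffer rows in a Python list and pass whole batches to Table.append(), instead of filling the Row object one record at a time. A minimal sketch follows, assuming t and resultSet as in the snippet above; CHUNK is an illustrative batch size, not a tuned value. Note that PyTables orders columns alphabetically when no pos= is given in the description, so the tuples must follow that order.]

    CHUNK = 50000  # illustrative batch size; tune against your row width

    # With no pos= in ContactClass, the table's column order is
    # alphabetical: address, emailAddr, name, phone.  Each buffered
    # tuple must follow that order for Table.append() to map fields
    # correctly; t.description._v_names shows the actual order.
    buf = []
    for attributes in resultSet:
        name, address, emailAddr, phone = attributes[:4]
        buf.append((address, emailAddr, name, phone))
        if len(buf) >= CHUNK:
            t.append(buf)   # one call writes the whole batch
            buf = []
    if buf:
        t.append(buf)       # write the remainder
    t.flush()

Whether this helps in a given setup is worth profiling, since the post notes the database fetch takes about as long as the HDF5 write; batching only amortizes the per-row Python overhead on the writing side.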