From: Josh A. <jos...@gm...> - 2013-01-03 18:29:40

David,

The change in issue 27 was only for iteration over a tables.Column
instance. To use it, tweak Anthony's code as follows. This will iterate
over the "element" column, as in your original example. Note also that
this will only work with the development version of PyTables available on
github. It will be very slow using the released v2.4.0.

    from itertools import izip

    with tb.openFile(...) as f:
        data = f.root.data.cols.element
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

Hope that helps,
Josh
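For reference, the two-iterators-offset-by-one trick used above is the
standard "pairwise" recipe from the Python 2 itertools documentation; it
works for any iterable, including a PyTables table or column (the usage
comment assumes the file handle f from Josh's snippet):

    from itertools import izip, tee

    def pairwise(iterable):
        "s -> (s0, s1), (s1, s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return izip(a, b)

    # e.g.:
    # for i, j in pairwise(f.root.data.cols.element):
    #     compare(i, j)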
From: Anthony S. <sc...@gm...> - 2013-01-03 17:12:15

Hi David,

Tables and table column iteration have been overhauled fairly recently
[1]. So you might try creating two iterators, offset by one, and then
doing the comparison. I am hacking this out super quick so please forgive
me:

    from itertools import izip

    with tb.openFile(...) as f:
        data = f.root.data
        data_i = iter(data)
        data_j = iter(data)
        data_i.next()  # throw the first value away
        for i, j in izip(data_i, data_j):
            compare(i, j)

You get the idea ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27
From: David R. <dav...@gm...> - 2013-01-03 15:26:06

I was hoping someone could help me out here.

This is from a post I put up on StackOverflow.

I have a fairly large dataset that I store in HDF5 and access using
PyTables. One operation I need to do on this dataset is pairwise
comparisons between each of the elements. This requires 2 loops, one to
iterate over each element, and an inner loop to iterate over every other
element. This operation thus looks at N(N-1)/2 comparisons.

For fairly small sets I found it to be faster to dump the contents into a
multidimensional numpy array and then do my iteration. I run into problems
with large sets because of memory issues and need to access each element
of the dataset at run time.

Putting the elements into an array gives me about 600 comparisons per
second, while operating on the hdf5 data itself gives me about 300
comparisons per second.

Is there a way to speed this process up?

Example follows (this is not my real code, just an example):

*Small Set*:

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data

        N_elements = len(data)
        elements = np.empty((N_elements, int(1e5)))

        for ii, d in enumerate(data):
            elements[ii] = d['element']

        D = np.empty((N_elements, N_elements))
        for ii in xrange(N_elements):
            for jj in xrange(ii+1, N_elements):
                D[ii, jj] = compare(elements[ii], elements[jj])

*Large Set*:

    with tb.openFile(h5_file, 'r') as f:
        data = f.root.data

        N_elements = len(data)

        D = np.empty((N_elements, N_elements))
        for ii in xrange(N_elements):
            for jj in xrange(ii+1, N_elements):
                D[ii, jj] = compare(data[ii]['element'], data[jj]['element'])
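One way to split the difference between the two versions above (a sketch
only, not from the thread: compare and h5_file are assumed from David's
example, the "element" column is as in the replies above, and CHUNK is an
assumed tuning parameter) is to read the column in blocks, so each pair of
blocks is compared in memory while only two blocks are resident at a time:

    import numpy as np
    import tables as tb

    CHUNK = 1000  # rows per block; tune to available memory

    with tb.openFile(h5_file, 'r') as f:
        col = f.root.data.cols.element
        n = len(col)
        D = np.empty((n, n))
        for i0 in xrange(0, n, CHUNK):
            block_i = col[i0:i0 + CHUNK]  # one contiguous read
            for j0 in xrange(i0, n, CHUNK):
                block_j = col[j0:j0 + CHUNK]
                for ii, x in enumerate(block_i):
                    for jj, y in enumerate(block_j):
                        if i0 + ii < j0 + jj:  # upper triangle only
                            D[i0 + ii, j0 + jj] = compare(x, y)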
From: Aquil H. A. <aqu...@gm...> - 2012-12-14 06:54:17

Hello All,

I currently use PyTables to generate a dataset that is indexed by a
timestamp and a symbol. The problem that I have is that the data is stored
at irregular intervals. For example:

    # See below for method ts_from_str
    data = [{'text_ts': '2012-01-04T15:00:00Z', 'symbol': 'APPL', 'price': 689.00,
             'timestamp': ts_from_str('2012-01-04T15:00:00Z')},
            {'text_ts': '2012-01-04T15:11:00Z', 'symbol': 'APPL', 'price': 687.24,
             'timestamp': ts_from_str('2012-01-04T15:11:00Z')},
            {'text_ts': '2012-01-05T15:33:00Z', 'symbol': 'APPL', 'price': 688.32,
             'timestamp': ts_from_str('2012-01-05T15:33:00Z')},
            {'text_ts': '2012-01-04T15:01:00Z', 'symbol': 'MSFT', 'price': 32.30,
             'timestamp': ts_from_str('2012-01-04T15:01:00Z')},
            {'text_ts': '2012-01-04T16:00:00Z', 'symbol': 'MSFT', 'price': 36.44,
             'timestamp': ts_from_str('2012-01-04T16:00:00Z')},
            {'text_ts': '2012-01-05T15:19:00Z', 'symbol': 'MSFT', 'price': 35.89,
             'timestamp': ts_from_str('2012-01-05T15:19:00Z')}]

If I want to look up the price for Apple for January 4, 2012 at 15:01:00,
I will get an empty ndarray. *Is there a way to optimize the search for
data "asof" a specific time other than iterating until you find data?*

I've written my own price_asof method (see code below) that produces the
following output:

    In [63]: price_asof(dt, 'APPL')
    QUERY: (timestamp == 1325707380) & (symbol == "APPL") -- text_ts: 2012-01-04T15:03:00Z
    QUERY: (timestamp == 1325707320) & (symbol == "APPL") -- text_ts: 2012-01-04T15:02:00Z
    QUERY: (timestamp == 1325707260) & (symbol == "APPL") -- text_ts: 2012-01-04T15:01:00Z
    QUERY: (timestamp == 1325707200) & (symbol == "APPL") -- text_ts: 2012-01-04T15:00:00Z
    Out[63]:
    array([(689.0, 'APPL', '2012-01-04T15:00:00Z', 1325707200)],
          dtype=[('price', '<f8'), ('symbol', 'S16'), ('text_ts', 'S26'), ('timestamp', '<i4')])

    # Code to generate data
    import tables
    from datetime import datetime, timedelta
    from time import mktime
    import numpy as np

    def ts_from_str(ts_str, ts_format='%Y-%m-%dT%H:%M:%SZ'):
        """Create a Unix Timestamp from an ISO 8601 timestamp string"""
        dt = datetime.strptime(ts_str, ts_format)
        return mktime(dt.timetuple())

    class PriceData(tables.IsDescription):
        text_ts = tables.StringCol(len('2012-01-01T00:00:00+00:00 '))
        symbol = tables.StringCol(16)
        price = tables.Float64Col()
        timestamp = tables.Time32Col()

    h5f = tables.openFile('test.h5', 'w', title='Price Data For Apple and Microsoft')
    group = h5f.createGroup('/', 'January', 'January Price Data')
    tbl = h5f.createTable(group, 'Prices', PriceData, 'Apple and Microsoft Prices')

    data = [...]  # the same list shown above

    price_data = tbl.row
    for d in data:
        price_data['text_ts'] = d['text_ts']
        price_data['symbol'] = d['symbol']
        price_data['price'] = d['price']
        price_data['timestamp'] = d['timestamp']
        price_data.append()
    tbl.flush()

    # This is my price_asof function
    def price_asof(dt, symbol, max_rec=1000):
        """Return the price as of the time dt"""
        ts = mktime(dt.timetuple())
        query = '(timestamp == %d)' % ts
        if symbol:
            query += ' & (symbol == "%s")' % symbol
        data = np.ndarray(0)
        count = 0
        while len(data) == 0 and count <= max_rec:
            # print "QUERY: %s -- text_ts: %s" % (query, dt.strftime('%Y-%m-%dT%H:%M:%SZ'))
            data = tbl.readWhere(query)
            dt = dt - timedelta(seconds=60)
            ts = mktime(dt.timetuple())
            query = '(timestamp == %d)' % ts
            if symbol:
                query += ' & (symbol == "%s")' % symbol
            count += 1
        return data

    h5f.close()

--
Aquil H. Abdullah
aqu...@gm...
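A possible alternative to probing one minute at a time (a sketch only, not
from the thread: it reuses tbl, mktime, and np from the code above, and
the 60-minute window is an assumed tuning parameter) is to pull every
candidate row in a bounded window with a single in-kernel query and keep
the most recent one:

    def price_asof_window(tbl, dt, symbol, window_minutes=60):
        # Fetch all rows for `symbol` in the window (dt - window, dt]
        # with one query, then keep the latest row at or before dt.
        ts = int(mktime(dt.timetuple()))
        lo = ts - window_minutes * 60
        query = ('(timestamp <= %d) & (timestamp > %d) & (symbol == "%s")'
                 % (ts, lo, symbol))
        rows = tbl.readWhere(query)
        if len(rows) == 0:
            return rows  # nothing in the window; widen it or fall back
        return rows[np.argmax(rows['timestamp'])]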
From: Josh A. <jos...@gm...> - 2012-12-12 17:53:40

Jennifer,

When adding a Python object to a VLArray, PyTables first pickles the
object. It looks like you're trying to add something that can't be
pickled. Check the type of the 'state' variable in the first line of the
stack trace and make sure it's something that can be pickled. See [1] for
more details.

Hope that helps,
Josh

[1]: http://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled
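A quick way to confirm that diagnosis (a sketch; state is the variable
from the traceback in Jennifer's message below) is to attempt the same
pickling call that PyTables makes internally:

    import cPickle

    try:
        cPickle.dumps(state, cPickle.HIGHEST_PROTOCOL)
    except cPickle.PicklingError, e:
        print 'state cannot be pickled:', e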
From: Jennifer F. <jen...@ww...> - 2012-12-12 12:59:24

Hi All,

I'm getting errors of this sort when I use pytables to store data in hdf5
format. Has anyone come across this before? Is there a fix?

Thanks,
Jennifer

    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pymc/database/hdf5.py", line 474, in savestate
      s.append(state)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/vlarray.py", line 462, in append
      sequence = atom.toarray(sequence)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/atom.py", line 1000, in toarray
      buffer_ = self._tobuffer(object_)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tables/atom.py", line 1112, in _tobuffer
      return cPickle.dumps(object_, cPickle.HIGHEST_PROTOCOL)
    PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
From: Jennifer F. <jen...@ww...> - 2012-12-11 12:34:10

Thanks Anthony. I will check it out.

Cheers,
Jennifer
From: Josh A. <jos...@gm...> - 2012-12-11 10:29:30

Alan,

I haven't found the exact problem, but it seems to have something to do
with the node cache. Changing the 'NODE_CACHE_SLOTS' parameter to zero
(which disables the node cache) or to a negative number (which allows the
cache to grow without limit) also eliminates the problem, at least in the
unthreaded version of your code. It can be set permanently in the
parameters.py file, or passed as a parameter to the tables.openFile
function.

I'll open an issue on github about this problem.

Thanks,
Josh
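A minimal sketch of the second option Josh describes, passing the
parameter directly to tables.openFile (the file name is taken from Alan's
test program later in the thread):

    import tables

    # NODE_CACHE_SLOTS=0 disables the node cache entirely;
    # a negative value lets the cache grow without limit.
    h5 = tables.openFile('/data/test.h5', mode='w', NODE_CACHE_SLOTS=0)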
From: Anthony S. <sc...@gm...> - 2012-12-10 22:12:55

Hi Jennifer,

Yeah, that is right, they are not in EPD Free. However, they are in
Anaconda CE (http://continuum.io/downloads.html). Note the CE rather than
the full version.

Be Well
Anthony
From: Jennifer F. <jen...@ww...> - 2012-12-10 22:08:06

Hi Anthony,

Thanks for your reply. I installed HDF5 also from source. The reason I'm
building hdf5 and pytables myself is that they don't seem to be available
through EPD any more (at least in the free version:
http://www.enthought.com/products/epdlibraries.php). They used to both
come bundled in EPD, but not anymore, which is a pain.

Many thanks,
Jennifer
From: Anthony S. <sc...@gm...> - 2012-12-10 19:30:03

Hi Jennifer,

Oh, right, I am sorry. Your end error message looks very similar to
another, more common issue.

How did you install HDF5? On Mac I typically use MacPorts or have to
install it from source. IIRC the MacPorts build fails to make the shared
libraries, so you typically have to configure & compile manually.

Is there a reason you are building PyTables yourself? On Mac, I typically
use EPD or Anaconda. Even when I am making edits to the PyTables (or other
projects') source, I use these distributions as a base and link against
the HDF5 provided in them.

Be Well
Anthony
From: Jennifer F. <jen...@ww...> - 2012-12-10 19:23:48

Hi Anthony,

I'm not in the pytables source dir when I'm running IPython, so I don't
think this is the problem.

Thanks,
Jennifer
From: Alan M. <al...@al...> - 2012-12-10 18:05:41

I think I have found a viable work around.

Previously, I was flushing the whole HDF5 file:

    self.h5.flush()

By replacing this with a flush on just the table of interest:

    table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
    table.flush()

the RuntimeError seems to be gone in all versions of my test program (both
single threaded and threaded with locks). Hope this helps someone else,
and eventually maybe someone will figure out what is wrong with
File.flush().
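For concreteness, a sketch of that workaround applied to the read() method
of the test program quoted in the next message (same names as in Alan's
code; the table-level flush replaces the file-level one):

    def read(self):
        # select a random table, flush just that table, then query it
        table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
        table.flush()
        table.readWhere('a > %d' % (random.randint(0, 100)))
        self.stats['read'] += 1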
From: Alan M. <al...@al...> - 2012-12-10 16:18:19

I'm continuing to fight this error. As a sanity check I rewrote my sample
app as a single thread only. With interleaved read/writes to multiple
tables I still get "RuntimeError: dictionary changed size during
iteration" in flush. I still think there is some underlying problem or
something I don't understand about pytables/hdf5. I'm far from an expert
on either of these so I appreciate any suggestions or even confirmation
that I'm not completely crazy? The following code should work, right?

    import tables
    import random
    import datetime

    # a simple table
    class TableValue(tables.IsDescription):
        a = tables.Int64Col(pos=1)
        b = tables.UInt32Col(pos=2)

    class Test():
        def __init__(self):
            self.stats = {'read': 0,
                          'write': 0,
                          'read_error': 0,
                          'write_error': 0}
            self.h5 = None
            self.h5 = tables.openFile('/data/test.h5', mode='w')
            self.num_groups = 5
            self.num_tables = 5
            # create num_groups
            self.groups = [self.h5.createGroup('/', "group%d" % i) for i in range(self.num_groups)]
            self.tables = []
            # create num_tables in each group we just created
            for group in self.groups:
                tbls = [self.h5.createTable(group, 'table%d' % i, TableValue) for i in range(self.num_tables)]
                self.tables.append(tbls)
                for table in tbls:
                    # add an index for good measure
                    table.cols.a.createIndex()

        def write(self):
            # select a random table and write to it
            x = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)].row
            x['a'] = random.randint(0, 100)
            x['b'] = random.randint(0, 100)
            x.append()
            self.stats['write'] += 1

        def read(self):
            # first flush any cached data
            self.h5.flush()
            # then select a random table
            table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
            # and do some random query
            table.readWhere('a > %d' % (random.randint(0, 100)))
            self.stats['read'] += 1

        def close(self):
            self.h5.close()

    def main():
        t = Test()

        start = datetime.datetime.now()

        # run for 10 seconds
        while (datetime.datetime.now() - start < datetime.timedelta(seconds=10)):
            # randomly do a read or a write
            if random.random() > 0.5:
                t.write()
            else:
                t.read()

        print t.stats
        print "Done"
        t.close()

    if __name__ == "__main__":
        main()

On Thu, Dec 6, 2012 at 9:55 AM, Alan Marchiori <al...@al...> wrote:

Josh,

Thanks for the detailed response. I would like to avoid going through a
separate process if at all possible due to the performance penalty. I have
also tried your last suggestion to create a dedicated pytables thread and
send everything through that but still see the same problem (Runtime error
in flush). This leads me to believe something strange is going on behind
the scenes. ??

Updated test program with dedicated pytables thread reading an input
Queue.Queue:

    import tables
    import threading
    import random
    import time
    import Queue

    # a simple table
    class TableValue(tables.IsDescription):
        a = tables.Int64Col(pos=1)
        b = tables.UInt32Col(pos=2)

    class TablesThread(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self.name = 'HDF5 io thread'
            # create the dummy HDF5 file
            self.h5 = None
            self.h5 = tables.openFile('/data/test.h5', mode='w')
            self.num_groups = 5
            self.num_tables = 5
            self.groups = [self.h5.createGroup('/', "group%d" % i) for i in range(self.num_groups)]
            self.tables = []
            for group in self.groups:
                tbls = [self.h5.createTable(group, 'table%d' % i, TableValue) for i in range(self.num_tables)]
                self.tables.append(tbls)
                for table in tbls:
                    # add an index for good measure
                    table.cols.a.createIndex()
            self.stopEvt = threading.Event()
            self.stoppedEvt = threading.Event()
            self.inputQ = Queue.Queue()

        def run(self):
            try:
                while not self.stopEvt.is_set():
                    # get a command
                    try:
                        cmd, args, result = self.inputQ.get(timeout=0.5)
                    except Queue.Empty:
                        # poll stopEvt so we can shutdown
                        continue

                    # do the command
                    if cmd == 'write':
                        x = self.tables[args[0]][args[1]].row
                        x['a'] = args[2]
                        x['b'] = args[3]
                        x.append()
                    elif cmd == 'read':
                        self.h5.flush()
                        table = self.tables[args[0]][args[1]]
                        result.value = table.readWhere('a > %d' % (args[2]))
                    else:
                        raise Exception("Command not supported: %s" % (cmd,))

                    # signal that the result is ready
                    result.event.set()

            finally:
                # shutdown
                self.h5.close()
                self.stoppedEvt.set()

        def stop(self):
            if not self.stoppedEvt.is_set():
                self.stopEvt.set()
                self.stoppedEvt.wait()

    class ResultEvent():
        def __init__(self):
            self.event = threading.Event()
            self.value = None

    class Test():
        def __init__(self):
            self.tables = TablesThread()
            self.tables.start()
            self.timeout = 5
            self.stats = {'read': 0,
                          'write': 0,
                          'read_error': 0,
                          'write_error': 0}

        def write(self):
            r = ResultEvent()
            self.tables.inputQ.put(('write',
                                    (random.randint(0, self.tables.num_groups-1),
                                     random.randint(0, self.tables.num_tables-1),
                                     random.randint(0, 100),
                                     random.randint(0, 100)),
                                    r))
            r.event.wait(timeout=self.timeout)
            if r.event.is_set():
                self.stats['write'] += 1
            else:
                self.stats['write_error'] += 1

        def read(self):
            r = ResultEvent()
            self.tables.inputQ.put(('read',
                                    (random.randint(0, self.tables.num_groups-1),
                                     random.randint(0, self.tables.num_tables-1),
                                     random.randint(0, 100)),
                                    r))
            r.event.wait(timeout=self.timeout)
            if r.event.is_set():
                self.stats['read'] += 1
                # print 'Query got %d hits' % (len(r.value))
            else:
                self.stats['read_error'] += 1

        def close(self):
            self.tables.stop()

        def __del__(self):
            self.close()

    class Worker(threading.Thread):
        def __init__(self, method):
            threading.Thread.__init__(self)
            self.method = method
            self.stopEvt = threading.Event()

        def run(self):
            while not self.stopEvt.is_set():
                try:
                    self.method()
                except Exception, x:
                    print 'Worker thread failed with: %s' % (x,)
                time.sleep(random.random()/100.0)

        def stop(self):
            self.stopEvt.set()

    def main():
        t = Test()

        threads = [Worker(t.write) for _i in range(10)]
        threads.extend([Worker(t.read) for _i in range(10)])

        for thread in threads:
            thread.start()

        time.sleep(5)

        for thread in threads:
            thread.stop()

        for thread in threads:
            thread.join()

        t.close()

        print t.stats

    if __name__ == "__main__":
        main()

On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...> wrote:

Alan,

Unfortunately, the underlying HDF5 library isn't thread-safe by default.
It can be built in a thread-safe mode that serializes all API calls, but
still doesn't allow actual parallel access to the disk. See [1] for more
details. Here's [2] another interesting discussion concerning whether
multithreaded access is actually beneficial for an I/O limited library
like HDF5. Ultimately, if one thread can read at the disk's maximum
transfer rate, then multiple threads don't provide any benefit.

Beyond the limitations of HDF5, PyTables also maintains global state in
various module-level variables. One example is the _open_file cache in the
file.py module. I made an attempt in the past to work around this to allow
read-only access from multiple threads, but didn't make much progress.

In general, I think your best bet is to serialize all access through a
single process. There is another example in the PyTables/examples
directory that benchmarks different methods of transferring data from
PyTables to another process [3]. It compares Python's
multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
the latter two are 5-10x faster than using a queue.

Another option would be to use multiple threads, but handle all access to
the HDF5 file in one thread. PyTables will release the GIL when making
HDF5 library calls, so the other threads will be able to run. You could
use a Queue.Queue or some other mechanism to transfer data between
threads. No actual copying would be needed since their memory is shared,
which should make it faster than the multi-process techniques.

Hope that helps.

Josh Ayers

[1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
[2]: https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
[3]: https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py

On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...> wrote:

I am trying to allow multiple threads read/write access to pytables data
and found it is necessary to call flush() before any read. If not, the
latest data is not returned. However, this can cause a RuntimeError. I
have tried protecting pytables access with both locks and queues as done
by joshayers
(https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
In either case I still get RuntimeError: dictionary changed size during
iteration when doing the flush. (Incidentally, using the locks appears to
be much faster than using queues in my unscientific tests...)

I have tried versions 2.4 and 2.3.1 with the same results. Interestingly
this only appears to happen if there are multiple tables/groups in the H5
file. To investigate this behavior further I created a test program to
illustrate (below). When run with num_groups = 5, num_tables = 5 (or
greater) I see the runtime error every time. When these values are smaller
than this it doesn't (at least in a short test period).

I might be doing something unexpected with pytables, but this seems pretty
straightforward to me. Any help is appreciated.
From: Anthony S. <sc...@gm...> - 2012-12-10 15:42:26

Try leaving the pytables source dir and then running IPython.
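A generic sanity check for this class of problem (a sketch, not from the
thread): when the import partially succeeds, print where Python found the
package, to confirm the installed copy rather than the source checkout is
being used:

    import tables
    print tables.__file__  # should point into site-packages, not the source tree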
From: Jennifer F. <jen...@ww...> - 2012-12-10 15:25:03

Hi,

I'm trying to install pytables and it's proving difficult (using Mac OS
10.6.4). I have installed in "/usr/local/hdf5" and set the environment
variable $HDF5_DIR to /usr/local/hdf5. When I run setup, I get a warning
about not being able to find the HDF5 runtime.

    ndmmac149:tables-2.4.0 jflegg$ sudo python setup.py install --hdf5="/usr/local/hdf5"
    * Found numpy 1.6.1 package installed.
    * Found numexpr 2.0.1 package installed.
    * Found Cython 0.17.2 package installed.
    * Found HDF5 headers at ``/usr/local/hdf5/include``, library at ``/usr/local/hdf5/lib``.
    .. WARNING:: Could not find the HDF5 runtime.
       The HDF5 shared library was *not* found in the default library
       paths. In case of runtime problems, please remember to install it.
    ld: library not found for -llzo2
    collect2: ld returned 1 exit status
    ld: library not found for -llzo2
    collect2: ld returned 1 exit status
    * Could not find LZO 2 headers and library; disabling support for it.
    ld: library not found for -llzo
    collect2: ld returned 1 exit status
    ld: library not found for -llzo
    collect2: ld returned 1 exit status
    * Could not find LZO 1 headers and library; disabling support for it.
    * Found bzip2 headers at ``/usr/include``, library at ``/usr/lib``.
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.5-i386-2.7
    creating build/lib.macosx-10.5-i386-2.7/tables
    copying tables/__init__.py -> build/lib.macosx-10.5-i386-2.7/tables
    copying tables/array.py -> build/lib.macosx-10.5-i386-2.7/tables

When I import pytables in python, I get the following error message:

    In [1]: import tables
    ---------------------------------------------------------
    ImportError               Traceback (most recent call last)
    /Users/jflegg/<ipython-input-1-389ecae14f10> in <module>()
    ----> 1 import tables

    /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/__init__.py in <module>()
         28
         29 # Necessary imports to get versions stored on the Pyrex extension
    ---> 30 from tables.utilsExtension import getPyTablesVersion, getHDF5Version
         31
         32

    ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so, 2):
    Symbol not found: _H5E_CALLBACK_g
    Referenced from: /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so
    Expected in: flat namespace
    in /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/tables/utilsExtension.so

Any help would be greatly appreciated.
Jennifer
From: Francesc A. <fa...@gm...> - 2012-12-07 19:37:07
|
Please, stop reporting carray problems here. Let's communicate privately if you want. Thanks, Francesc On 12/7/12 8:22 PM, Alvaro Tejero Cantero wrote: > Thanks Francesc, that solved it. Having the disk datastructures load > compressed in memory can be a deal-breaker when you got daily 50Gb+ > datasets to process! > > The carray google group (I had not noticed it) seems unreachable at > the moment. That's why I am going to report a problem here for the > moment. With the following code > > ct0 = ca.ctable((h5f.root.c_000[:],), names=('c_000',), > rootdir=u'/lfpd1/tmp/ctable-1', mode='w', cparams=ca.cparams(5), > dtype='u2', expectedlen=len(h5f.root.c_000)) > > for k in h5f.root._v_children.keys()[:3]: #just some of the HDF5 datasets > try: > col = getattr(h5f.root, k) > ct0.addcol(col[:], name=k, expectedlen=len(col), dtype='u2') > except ValueError: > pass #exists > ct0.flush() > > >>> ct0 > ctable((303390000,), [('c_000', '<u2'), ('c_007', '<u2'), ('c_006', '<u2'), ('c_005', '<u2')]) > nbytes: 2.26 GB; cbytes: 1.30 GB; ratio: 1.73 > cparams := cparams(clevel=5, shuffle=True) > rootdir := '/lfpd1/tmp/ctable-1' > [(312, 37, 65432, 91) (313, 32, 65439, 65) (320, 24, 65433, 66) ..., > (283, 597, 677, 647) (276, 600, 649, 635) (298, 607, 635, 620)] > > The newly-added datasets/columns exist in memory > > >>> ct0['c_007'] > carray((303390000,), uint16) > nbytes: 578.67 MB; cbytes: 333.50 MB; ratio: 1.74 > cparams := cparams(clevel=5, shuffle=True) > [ 37 32 24 ..., 597 600 607] > > but they do not appear in the rootdir, not even after .flush() > > /lfpd1/tmp/ctable-1]$ ls > __attrs__ c_000 __rootdirs__ > > and something seems amiss with __rootdirs__: > /lfpd1/tmp/ctable-1]$ cat __rootdirs__ > {"dirs": {"c_007": null, "c_006": null, "c_005": null, "c_000": > "/lfpd1/tmp/ctable-1/c_000"}, "names": ["c_000", "c_007", "c_006", > "c_005"]} > > >>> ct0.cbytes//1024**2 > 1334 > > vs > /lfpd1/tmp]$ du -h ctable-1 > 12K ctable-1/c_000/meta > 340M ctable-1/c_000/data > 340M ctable-1/c_000 > 340M ctable-1 > > > and, finally, no 'open' > > ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r') > > --------------------------------------------------------------------------- > ValueError Traceback (most recent call last) > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-26-41e1cb01ffe6> in<module>() > ----> 1 ct0_disk= ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r') > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/toplevel.pyc inopen(rootdir, mode) > 104 # Not a carray. 
Now with a ctable > > 105 try: > --> 106 obj= ca.ctable(rootdir=rootdir, mode=mode) > 107 except IOError: > 108 # Not a ctable > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in__init__(self, columns, names, **kwargs) > 193 _new= True > 194 else: > --> 195 self.open_ctable() > 196 _new= False > 197 > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc inopen_ctable(self) > 282 > 283 # Open the ctable by reading the metadata > > --> 284 self.cols.read_meta_and_open() > 285 > 286 # Get the length out of the first column > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc inread_meta_and_open(self) > 40 # Initialize the cols by instatiating the carrays > > 41 for name, dir_in data['dirs'].items(): > ---> 42 self._cols[str(name)] = ca.carray(rootdir=dir_, mode=self.mode) > 43 > 44 def update_meta(self): > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so incarray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:8637)() > > ValueError: You need at least to pass an array or/and a rootdir > > -á. > > > > On 7 December 2012 17:04, Francesc Alted <fa...@gm... > <mailto:fa...@gm...>> wrote: > > Hmm, perhaps cythonizing by hand is your best bet: > > $ cython carray/carrayExtension.pyx > > If you continue having problems, please write to the carray > mailing list. > > Francesc > > On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote: > > I have now similar dependencies as you, except for Numpy 1.7 beta 2. > > > > I wish I could help with the carray flavor. > > > > -- > > Running setup.py install for carray > > * Found Cython 0.17.2 package installed. > > * Found numpy 1.6.2 package installed. > > * Found numexpr 2.0.1 package installed. > > building 'carray.carrayExtension' extension > > C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC > > compile options: '-Iblosc > > > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > > -I/usr/include/python2.7 -c' > > extra options: '-msse2' > > gcc: blosc/blosclz.c > > gcc: carray/carrayExtension.c > > gcc: error: carray/carrayExtension.c: No such file or directory > > gcc: fatal error: no input files > > compilation terminated. > > gcc: error: carray/carrayExtension.c: No such file or directory > > gcc: fatal error: no input files > > compilation terminated. > > error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe > > -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc > > > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > > -I/usr/include/python2.7 -c carray/carrayExtension.c -o > > build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed > > with exit status 4 > > > > > > > > -á. > > > > > > > > On 7 December 2012 12:47, Francesc Alted <fa...@gm... > <mailto:fa...@gm...> > > <mailto:fa...@gm... 
<mailto:fa...@gm...>>> wrote: > > > > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote: > > > Thank you for the comprehensive round-up. I have some > ideas and > > > reports below. > > > > > > What about ctables? The documentation says that it is > specificly > > > column-access optimized, which is what I need in this scenario > > > (sometimes sequential, sometimes random). > > > > Yes, ctables is optimized for column access. > > > > > > > > Unfortunately I could not get the rootdir parameter for > ctables > > > __init__ to work in carray 0.4 and pip-installing 0.5 or > 0.5.1 leads > > > to compilation errors. > > > > Yep, persistence for carray/ctables objects was added in 0.5. > > > > > > > > This is the ctables-to-disk error: > > > > > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > > > rootdir='/tmp/ctable2.ctable') > > > > > > --------------------------------------------------------------------------- > > > TypeError Traceback (most > > recent call last) > > > > > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> > > in<module>() > > > ----> 1 ct2= ca.ctable((np.arange(30000000),), > > names=('range2',), rootdir='/tmp/ctable2.ctable') > > > > > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc > > in__init__(self, cols, names, **kwargs) > > > 158 if column.dtype== np.void: > > > 159 raise ValueError, "`cols` > > elements cannot be of type void" > > > --> 160 column= ca.carray(column, **kwargs) > > > 161 elif ratype: > > > 162 column= ca.carray(cols[name], > **kwargs) > > > > > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so > > incarray.carrayExtension.carray.__cinit__ > > (carray/carrayExtension.c:3917)() > > > > > > TypeError: __cinit__() got an unexpected keyword argument > 'rootdir' > > > > > > > > > And this is cut from the pip output when trying to upgrade > carray. > > > > > > gcc: carray/carrayExtension.c > > > > > > gcc: error: carray/carrayExtension.c: No such file or > directory > > > > Hmm, that's strange, because the carrayExtension should have > been > > cythonized automatically. Here it is part of my install process > > with pip: > > > > Running setup.py install for carray > > * Found Cython 0.17.1 package installed. > > * Found numpy 1.7.0b2 package installed. > > * Found numexpr 2.0.1 package installed. > > cythoning carray/carrayExtension.pyx to > carray/carrayExtension.c > > building 'carray.carrayExtension' extension > > C compiler: gcc -fno-strict-aliasing > > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g > -fwrapv -O3 > > -Wall -Wstrict-prototypes > > > > Hmm, perhaps you need a newer version of Cython? > > > > > > > > > > > Two more notes: > > > > > > * a way was added to check in-disk (compressed) vs in-memory > > > (uncompressed) node sizes. I was unable to find the way to > use it > > > either from the 2.4.0 release notes or from the git issue > > > > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 > > > > You already found the answer. > > > > > > > > * is/will it be possible to load PyTables carrays as in-memory > > carrays > > > without decompression? > > > > Actually, that has been my idea from the very beginning. The > > concept of > > 'flavor' for the returned objects when reading is already > there, so it > > should be relatively easy to add a new 'carray' flavor. > Maybe you can > > contribute this? 
> > > > -- > > Francesc Alted > > > > > > > ------------------------------------------------------------------------------ > > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. > Free Trial > > Remotely access PCs and mobile devices and provide instant > support > > Improve your efficiency, and focus on delivering more value-add > > services > > Discover what IT Professionals Know. Rescue delivers > > http://p.sf.net/sfu/logmein_12329d2d > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > <mailto:Pyt...@li...> > > <mailto:Pyt...@li... > <mailto:Pyt...@li...>> > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > > > ------------------------------------------------------------------------------ > > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > > Remotely access PCs and mobile devices and provide instant support > > Improve your efficiency, and focus on delivering more value-add > services > > Discover what IT Professionals Know. Rescue delivers > > http://p.sf.net/sfu/logmein_12329d2d > > > > > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > <mailto:Pyt...@li...> > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add > services > Discover what IT Professionals Know. Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add services > Discover what IT Professionals Know. Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
From: Alvaro T. C. <al...@mi...> - 2012-12-07 19:22:56
|
Thanks Francesc, that solved it. Having the disk datastructures load compressed in memory can be a deal-breaker when you got daily 50Gb+ datasets to process! The carray google group (I had not noticed it) seems unreachable at the moment. That's why I am going to report a problem here for the moment. With the following code ct0 = ca.ctable((h5f.root.c_000[:],), names=('c_000',), rootdir= u'/lfpd1/tmp/ctable-1', mode='w', cparams=ca.cparams(5), dtype='u2', expectedlen=len(h5f.root.c_000)) for k in h5f.root._v_children.keys()[:3]: #just some of the HDF5 datasets try: col = getattr(h5f.root, k) ct0.addcol(col[:], name=k, expectedlen=len(col), dtype='u2') except ValueError: pass #exists ct0.flush() >>> ct0 ctable((303390000,), [('c_000', '<u2'), ('c_007', '<u2'), ('c_006', '<u2'), ('c_005', '<u2')]) nbytes: 2.26 GB; cbytes: 1.30 GB; ratio: 1.73 cparams := cparams(clevel=5, shuffle=True) rootdir := '/lfpd1/tmp/ctable-1' [(312, 37, 65432, 91) (313, 32, 65439, 65) (320, 24, 65433, 66) ..., (283, 597, 677, 647) (276, 600, 649, 635) (298, 607, 635, 620)] The newly-added datasets/columns exist in memory >>> ct0['c_007'] carray((303390000,), uint16) nbytes: 578.67 MB; cbytes: 333.50 MB; ratio: 1.74 cparams := cparams(clevel=5, shuffle=True) [ 37 32 24 ..., 597 600 607] but they do not appear in the rootdir, not even after .flush() /lfpd1/tmp/ctable-1]$ ls __attrs__ c_000 __rootdirs__ and something seems amiss with __rootdirs__: /lfpd1/tmp/ctable-1]$ cat __rootdirs__ {"dirs": {"c_007": null, "c_006": null, "c_005": null, "c_000": "/lfpd1/tmp/ctable-1/c_000"}, "names": ["c_000", "c_007", "c_006", "c_005"]} >>> ct0.cbytes//1024**2 1334 vs /lfpd1/tmp]$ du -h ctable-1 12K ctable-1/c_000/meta 340M ctable-1/c_000/data 340M ctable-1/c_000 340M ctable-1 and, finally, no 'open' ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r') ---------------------------------------------------------------------------ValueError Traceback (most recent call last)/home/tejero/Dropbox/O/nb/nonridge/<ipython-input-26-41e1cb01ffe6> in <module>()----> 1 ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r') /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/toplevel.pyc in open(rootdir, mode) 104 # Not a carray. Now with a ctable 105 try:--> 106 obj = ca.ctable(rootdir=rootdir, mode=mode) 107 except IOError: 108 # Not a ctable /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, columns, names, **kwargs) 193 _new = True 194 else:--> 195 self.open_ctable() 196 _new = False 197 /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in open_ctable(self) 282 283 # Open the ctable by reading the metadata--> 284 self.cols.read_meta_and_open() 285 286 # Get the length out of the first column /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in read_meta_and_open(self) 40 # Initialize the cols by instatiating the carrays 41 for name, dir_ in data['dirs'].items():---> 42 self._cols[str(name)] = ca.carray(rootdir=dir_, mode=self.mode) 43 44 def update_meta(self): /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:8637)() ValueError: You need at least to pass an array or/and a rootdir -á. On 7 December 2012 17:04, Francesc Alted <fa...@gm...> wrote: > Hmm, perhaps cythonizing by hand is your best bet: > > $ cython carray/carrayExtension.pyx > > If you continue having problems, please write to the carray mailing list. 
> > Francesc > > On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote: > > I have now similar dependencies as you, except for Numpy 1.7 beta 2. > > > > I wish I could help with the carray flavor. > > > > -- > > Running setup.py install for carray > > * Found Cython 0.17.2 package installed. > > * Found numpy 1.6.2 package installed. > > * Found numexpr 2.0.1 package installed. > > building 'carray.carrayExtension' extension > > C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall > > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC > > compile options: '-Iblosc > > > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > > -I/usr/include/python2.7 -c' > > extra options: '-msse2' > > gcc: blosc/blosclz.c > > gcc: carray/carrayExtension.c > > gcc: error: carray/carrayExtension.c: No such file or directory > > gcc: fatal error: no input files > > compilation terminated. > > gcc: error: carray/carrayExtension.c: No such file or directory > > gcc: fatal error: no input files > > compilation terminated. > > error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe > > -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc > > > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > > -I/usr/include/python2.7 -c carray/carrayExtension.c -o > > build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed > > with exit status 4 > > > > > > > > -á. > > > > > > > > On 7 December 2012 12:47, Francesc Alted <fa...@gm... > > <mailto:fa...@gm...>> wrote: > > > > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote: > > > Thank you for the comprehensive round-up. I have some ideas and > > > reports below. > > > > > > What about ctables? The documentation says that it is specificly > > > column-access optimized, which is what I need in this scenario > > > (sometimes sequential, sometimes random). > > > > Yes, ctables is optimized for column access. > > > > > > > > Unfortunately I could not get the rootdir parameter for ctables > > > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 > leads > > > to compilation errors. > > > > Yep, persistence for carray/ctables objects was added in 0.5. 
> > > > > > > > This is the ctables-to-disk error: > > > > > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > > > rootdir='/tmp/ctable2.ctable') > > > > > > --------------------------------------------------------------------------- > > > TypeError Traceback (most > > recent call last) > > > > > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> > > in<module>() > > > ----> 1 ct2= ca.ctable((np.arange(30000000),), > > names=('range2',), rootdir='/tmp/ctable2.ctable') > > > > > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc > > in__init__(self, cols, names, **kwargs) > > > 158 if column.dtype== np.void: > > > 159 raise ValueError, "`cols` > > elements cannot be of type void" > > > --> 160 column= ca.carray(column, **kwargs) > > > 161 elif ratype: > > > 162 column= ca.carray(cols[name], **kwargs) > > > > > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so > > incarray.carrayExtension.carray.__cinit__ > > (carray/carrayExtension.c:3917)() > > > > > > TypeError: __cinit__() got an unexpected keyword argument 'rootdir' > > > > > > > > > And this is cut from the pip output when trying to upgrade carray. > > > > > > gcc: carray/carrayExtension.c > > > > > > gcc: error: carray/carrayExtension.c: No such file or directory > > > > Hmm, that's strange, because the carrayExtension should have been > > cythonized automatically. Here it is part of my install process > > with pip: > > > > Running setup.py install for carray > > * Found Cython 0.17.1 package installed. > > * Found numpy 1.7.0b2 package installed. > > * Found numexpr 2.0.1 package installed. > > cythoning carray/carrayExtension.pyx to carray/carrayExtension.c > > building 'carray.carrayExtension' extension > > C compiler: gcc -fno-strict-aliasing > > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 > > -Wall -Wstrict-prototypes > > > > Hmm, perhaps you need a newer version of Cython? > > > > > > > > > > > Two more notes: > > > > > > * a way was added to check in-disk (compressed) vs in-memory > > > (uncompressed) node sizes. I was unable to find the way to use it > > > either from the 2.4.0 release notes or from the git issue > > > > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 > > > > You already found the answer. > > > > > > > > * is/will it be possible to load PyTables carrays as in-memory > > carrays > > > without decompression? > > > > Actually, that has been my idea from the very beginning. The > > concept of > > 'flavor' for the returned objects when reading is already there, so > it > > should be relatively easy to add a new 'carray' flavor. Maybe you > can > > contribute this? > > > > -- > > Francesc Alted > > > > > > > ------------------------------------------------------------------------------ > > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > > Remotely access PCs and mobile devices and provide instant support > > Improve your efficiency, and focus on delivering more value-add > > services > > Discover what IT Professionals Know. Rescue delivers > > http://p.sf.net/sfu/logmein_12329d2d > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > <mailto:Pyt...@li...> > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > > > ------------------------------------------------------------------------------ > > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. 
Free Trial > > Remotely access PCs and mobile devices and provide instant support > > Improve your efficiency, and focus on delivering more value-add services > > Discover what IT Professionals Know. Rescue delivers > > http://p.sf.net/sfu/logmein_12329d2d > > > > > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add services > Discover what IT Professionals Know. Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
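The null entries in __rootdirs__ above say that the added columns were created as purely in-memory carrays, which is why flush() has nothing to write for them. A hypothetical workaround, assuming ctable.addcol forwards extra keyword arguments such as rootdir to the underlying carray constructor, would be to pass rootdir when the columns are first added (reusing h5f and ct0 from the report; the rootdir keyword here is the assumption):

    import os

    base = '/lfpd1/tmp/ctable-1'
    for k in ('c_007', 'c_006', 'c_005'):
        col = getattr(h5f.root, k)[:]
        # give each column its own directory next to c_000 so it is
        # persisted rather than kept in memory
        ct0.addcol(col, name=k, rootdir=os.path.join(base, k))
    ct0.flush()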
From: Francesc A. <fa...@gm...> - 2012-12-07 17:04:25
|
Hmm, perhaps cythonizing by hand is your best bet: $ cython carray/carrayExtension.pyx If you continue having problems, please write to the carray mailing list. Francesc On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote: > I have now similar dependencies as you, except for Numpy 1.7 beta 2. > > I wish I could help with the carray flavor. > > -- > Running setup.py install for carray > * Found Cython 0.17.2 package installed. > * Found numpy 1.6.2 package installed. > * Found numexpr 2.0.1 package installed. > building 'carray.carrayExtension' extension > C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC > compile options: '-Iblosc > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > -I/usr/include/python2.7 -c' > extra options: '-msse2' > gcc: blosc/blosclz.c > gcc: carray/carrayExtension.c > gcc: error: carray/carrayExtension.c: No such file or directory > gcc: fatal error: no input files > compilation terminated. > gcc: error: carray/carrayExtension.c: No such file or directory > gcc: fatal error: no input files > compilation terminated. > error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe > -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc > -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include > -I/usr/include/python2.7 -c carray/carrayExtension.c -o > build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed > with exit status 4 > > > > -á. > > > > On 7 December 2012 12:47, Francesc Alted <fa...@gm... > <mailto:fa...@gm...>> wrote: > > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote: > > Thank you for the comprehensive round-up. I have some ideas and > > reports below. > > > > What about ctables? The documentation says that it is specificly > > column-access optimized, which is what I need in this scenario > > (sometimes sequential, sometimes random). > > Yes, ctables is optimized for column access. > > > > > Unfortunately I could not get the rootdir parameter for ctables > > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads > > to compilation errors. > > Yep, persistence for carray/ctables objects was added in 0.5. 
> > > > > This is the ctables-to-disk error: > > > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > > rootdir='/tmp/ctable2.ctable') > > > --------------------------------------------------------------------------- > > TypeError Traceback (most > recent call last) > > > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> > in<module>() > > ----> 1 ct2= ca.ctable((np.arange(30000000),), > names=('range2',), rootdir='/tmp/ctable2.ctable') > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc > in__init__(self, cols, names, **kwargs) > > 158 if column.dtype== np.void: > > 159 raise ValueError, "`cols` > elements cannot be of type void" > > --> 160 column= ca.carray(column, **kwargs) > > 161 elif ratype: > > 162 column= ca.carray(cols[name], **kwargs) > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so > incarray.carrayExtension.carray.__cinit__ > (carray/carrayExtension.c:3917)() > > > > TypeError: __cinit__() got an unexpected keyword argument 'rootdir' > > > > > > And this is cut from the pip output when trying to upgrade carray. > > > > gcc: carray/carrayExtension.c > > > > gcc: error: carray/carrayExtension.c: No such file or directory > > Hmm, that's strange, because the carrayExtension should have been > cythonized automatically. Here it is part of my install process > with pip: > > Running setup.py install for carray > * Found Cython 0.17.1 package installed. > * Found numpy 1.7.0b2 package installed. > * Found numexpr 2.0.1 package installed. > cythoning carray/carrayExtension.pyx to carray/carrayExtension.c > building 'carray.carrayExtension' extension > C compiler: gcc -fno-strict-aliasing > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 > -Wall -Wstrict-prototypes > > Hmm, perhaps you need a newer version of Cython? > > > > > > > Two more notes: > > > > * a way was added to check in-disk (compressed) vs in-memory > > (uncompressed) node sizes. I was unable to find the way to use it > > either from the 2.4.0 release notes or from the git issue > > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 > > You already found the answer. > > > > > * is/will it be possible to load PyTables carrays as in-memory > carrays > > without decompression? > > Actually, that has been my idea from the very beginning. The > concept of > 'flavor' for the returned objects when reading is already there, so it > should be relatively easy to add a new 'carray' flavor. Maybe you can > contribute this? > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add > services > Discover what IT Professionals Know. Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add services > Discover what IT Professionals Know. 
Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
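Spelled out, that hand-cythonizing route, run from an unpacked carray source tree, looks like the following (standard Cython and distutils commands; nothing carray-specific is assumed beyond the file name already quoted above):

    $ cython carray/carrayExtension.pyx     # regenerates carray/carrayExtension.c
    $ python setup.py build_ext --inplace
    $ python setup.py install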
From: Alvaro T. C. <al...@mi...> - 2012-12-07 16:30:31
|
I have now similar dependencies as you, except for Numpy 1.7 beta 2. I wish I could help with the carray flavor. -- Running setup.py install for carray * Found Cython 0.17.2 package installed. * Found numpy 1.6.2 package installed. * Found numexpr 2.0.1 package installed. building 'carray.carrayExtension' extension C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC compile options: '-Iblosc -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c' extra options: '-msse2' gcc: blosc/blosclz.c gcc: carray/carrayExtension.c gcc: error: carray/carrayExtension.c: No such file or directory gcc: fatal error: no input files compilation terminated. gcc: error: carray/carrayExtension.c: No such file or directory gcc: fatal error: no input files compilation terminated. error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include -I/usr/include/python2.7 -c carray/carrayExtension.c -o build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed with exit status 4 -á. On 7 December 2012 12:47, Francesc Alted <fa...@gm...> wrote: > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote: > > Thank you for the comprehensive round-up. I have some ideas and > > reports below. > > > > What about ctables? The documentation says that it is specificly > > column-access optimized, which is what I need in this scenario > > (sometimes sequential, sometimes random). > > Yes, ctables is optimized for column access. > > > > > Unfortunately I could not get the rootdir parameter for ctables > > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads > > to compilation errors. > > Yep, persistence for carray/ctables objects was added in 0.5. > > > > > This is the ctables-to-disk error: > > > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > > rootdir='/tmp/ctable2.ctable') > > > --------------------------------------------------------------------------- > > TypeError Traceback (most recent call > last) > > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> > in<module>() > > ----> 1 ct2= ca.ctable((np.arange(30000000),), names=('range2',), > rootdir='/tmp/ctable2.ctable') > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc > in__init__(self, cols, names, **kwargs) > > 158 if column.dtype== np.void: > > 159 raise ValueError, "`cols` elements > cannot be of type void" > > --> 160 column= ca.carray(column, **kwargs) > > 161 elif ratype: > > 162 column= ca.carray(cols[name], **kwargs) > > > > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so > incarray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)() > > > > TypeError: __cinit__() got an unexpected keyword argument 'rootdir' > > > > > > And this is cut from the pip output when trying to upgrade carray. 
> > > > gcc: carray/carrayExtension.c > > > > gcc: error: carray/carrayExtension.c: No such file or directory > > Hmm, that's strange, because the carrayExtension should have been > cythonized automatically. Here it is part of my install process with pip: > > Running setup.py install for carray > * Found Cython 0.17.1 package installed. > * Found numpy 1.7.0b2 package installed. > * Found numexpr 2.0.1 package installed. > cythoning carray/carrayExtension.pyx to carray/carrayExtension.c > building 'carray.carrayExtension' extension > C compiler: gcc -fno-strict-aliasing > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 > -Wall -Wstrict-prototypes > > Hmm, perhaps you need a newer version of Cython? > > > > > > > Two more notes: > > > > * a way was added to check in-disk (compressed) vs in-memory > > (uncompressed) node sizes. I was unable to find the way to use it > > either from the 2.4.0 release notes or from the git issue > > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 > > You already found the answer. > > > > > * is/will it be possible to load PyTables carrays as in-memory carrays > > without decompression? > > Actually, that has been my idea from the very beginning. The concept of > 'flavor' for the returned objects when reading is already there, so it > should be relatively easy to add a new 'carray' flavor. Maybe you can > contribute this? > > -- > Francesc Alted > > > > ------------------------------------------------------------------------------ > LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial > Remotely access PCs and mobile devices and provide instant support > Improve your efficiency, and focus on delivering more value-add services > Discover what IT Professionals Know. Rescue delivers > http://p.sf.net/sfu/logmein_12329d2d > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
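Another route that sidesteps whatever is losing the generated .c file during the pip build is installing from a source checkout, so the cythoning step runs in a tree you control. A sketch, with the repository location assumed:

    $ git clone git://github.com/FrancescAlted/carray.git
    $ cd carray
    $ python setup.py build_ext --inplace   # should print "cythoning carray/carrayExtension.pyx ..."
    $ python setup.py install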
From: Francesc A. <fa...@gm...> - 2012-12-07 12:47:12
|
On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote: > Thank you for the comprehensive round-up. I have some ideas and > reports below. > > What about ctables? The documentation says that it is specificly > column-access optimized, which is what I need in this scenario > (sometimes sequential, sometimes random). Yes, ctables is optimized for column access. > > Unfortunately I could not get the rootdir parameter for ctables > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads > to compilation errors. Yep, persistence for carray/ctables objects was added in 0.5. > > This is the ctables-to-disk error: > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > rootdir='/tmp/ctable2.ctable') > --------------------------------------------------------------------------- > TypeError Traceback (most recent call last) > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> in<module>() > ----> 1 ct2= ca.ctable((np.arange(30000000),), names=('range2',), rootdir='/tmp/ctable2.ctable') > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in__init__(self, cols, names, **kwargs) > 158 if column.dtype== np.void: > 159 raise ValueError, "`cols` elements cannot be of type void" > --> 160 column= ca.carray(column, **kwargs) > 161 elif ratype: > 162 column= ca.carray(cols[name], **kwargs) > > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so incarray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)() > > TypeError: __cinit__() got an unexpected keyword argument 'rootdir' > > > And this is cut from the pip output when trying to upgrade carray. > > gcc: carray/carrayExtension.c > > gcc: error: carray/carrayExtension.c: No such file or directory Hmm, that's strange, because the carrayExtension should have been cythonized automatically. Here it is part of my install process with pip: Running setup.py install for carray * Found Cython 0.17.1 package installed. * Found numpy 1.7.0b2 package installed. * Found numexpr 2.0.1 package installed. cythoning carray/carrayExtension.pyx to carray/carrayExtension.c building 'carray.carrayExtension' extension C compiler: gcc -fno-strict-aliasing -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Hmm, perhaps you need a newer version of Cython? > > > Two more notes: > > * a way was added to check in-disk (compressed) vs in-memory > (uncompressed) node sizes. I was unable to find the way to use it > either from the 2.4.0 release notes or from the git issue > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 You already found the answer. > > * is/will it be possible to load PyTables carrays as in-memory carrays > without decompression? Actually, that has been my idea from the very beginning. The concept of 'flavor' for the returned objects when reading is already there, so it should be relatively easy to add a new 'carray' flavor. Maybe you can contribute this? -- Francesc Alted |
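For readers following along: with carray 0.5 or later, the failing call quoted above is expected to work. A minimal round-trip sketch (path hypothetical; the names attribute is assumed from the ctable metadata shown earlier in the thread):

    import numpy as np
    import carray as ca

    # create a persistent ctable on disk ...
    ct = ca.ctable((np.arange(30000000),), names=('range2',),
                   rootdir='/tmp/ctable2.ctable', mode='w')
    ct.flush()

    # ... and reopen it later without loading everything into memory
    ct2 = ca.open(rootdir='/tmp/ctable2.ctable', mode='r')
    print len(ct2), ct2.names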
From: Alvaro T. C. <al...@mi...> - 2012-12-06 18:30:02
|
I'll answer myself on the size-checking: the right attributes are Leaf.size_in_memory and Leaf.size_on_disk (per http://pytables.github.com/usersguide/libref/hierarchy_classes.html) -á. On 6 December 2012 12:42, Alvaro Tejero Cantero <al...@mi...> wrote: > Thank you for the comprehensive round-up. I have some ideas and reports > below. > > What about ctables? The documentation says that it is specificly > column-access optimized, which is what I need in this scenario (sometimes > sequential, sometimes random). > > Unfortunately I could not get the rootdir parameter for ctables __init__ > to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads to compilation > errors. > > This is the ctables-to-disk error: > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',), > rootdir='/tmp/ctable2.ctable') > > ---------------------------------------------------------------------------TypeError Traceback (most recent call last)/home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> in <module>()----> 1 ct2 = ca.ctable((np.arange(30000000),), names=('range2',), rootdir='/tmp/ctable2.ctable') > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs) 158 if column.dtype == np.void: 159 raise ValueError, "`cols` elements cannot be of type void"--> 160 column = ca.carray(column, **kwargs) 161 elif ratype: 162 column = ca.carray(cols[name], **kwargs) > /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)() > TypeError: __cinit__() got an unexpected keyword argument 'rootdir' > > > > And this is cut from the pip output when trying to upgrade carray. > > gcc: carray/carrayExtension.c > > gcc: error: carray/carrayExtension.c: No such file or directory > > > > Two more notes: > > * a way was added to check in-disk (compressed) vs in-memory > (uncompressed) node sizes. I was unable to find the way to use it either > from the 2.4.0 release notes or from the git issue > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763 > > * is/will it be possible to load PyTables carrays as in-memory carrays > without decompression? > > Best, > > Álvaro > > > > On 6 December 2012 11:49, Francesc Alted <fa...@gm...> wrote: > >> completeness, let's see how fast can perform >> carray (the package, n >> > > |
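Put to use, the comparison is a one-liner per leaf once the file is open; a quick sketch in Python 2 syntax (file and node names hypothetical):

    import tables as tb

    with tb.openFile('test.h5', 'r') as f:
        leaf = f.root.act                     # any compressed Leaf node
        print 'on disk  : %d bytes' % leaf.size_on_disk
        print 'in memory: %d bytes' % leaf.size_in_memory
        print 'ratio    : %.2f' % (leaf.size_in_memory / float(leaf.size_on_disk))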
From: Alan M. <al...@al...> - 2012-12-06 14:55:34
|
Josh,

Thanks for the detailed response. I would like to avoid going through a
separate process if at all possible due to the performance penalty. I have
also tried your last suggestion to create a dedicated pytables thread and
send everything through that, but I still see the same problem (RuntimeError
in flush). This leads me to believe something strange is going on behind the
scenes.

Updated test program with a dedicated pytables thread reading an input
Queue.Queue:

import tables
import threading
import random
import time
import Queue

# a simple table
class TableValue(tables.IsDescription):
    a = tables.Int64Col(pos=1)
    b = tables.UInt32Col(pos=2)

class TablesThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.name = 'HDF5 io thread'
        # create the dummy HDF5 file
        self.h5 = None
        self.h5 = tables.openFile('/data/test.h5', mode='w')
        self.num_groups = 5
        self.num_tables = 5
        self.groups = [self.h5.createGroup('/', "group%d" % i)
                       for i in range(self.num_groups)]
        self.tables = []
        for group in self.groups:
            tbls = [self.h5.createTable(group, 'table%d' % i, TableValue)
                    for i in range(self.num_tables)]
            self.tables.append(tbls)
            for table in tbls:
                # add an index for good measure
                table.cols.a.createIndex()
        self.stopEvt = threading.Event()
        self.stoppedEvt = threading.Event()
        self.inputQ = Queue.Queue()

    def run(self):
        try:
            while not self.stopEvt.is_set():
                # get a command
                try:
                    cmd, args, result = self.inputQ.get(timeout=0.5)
                except Queue.Empty:
                    # poll stopEvt so we can shutdown
                    continue
                # do the command
                if cmd == 'write':
                    x = self.tables[args[0]][args[1]].row
                    x['a'] = args[2]
                    x['b'] = args[3]
                    x.append()
                elif cmd == 'read':
                    self.h5.flush()
                    table = self.tables[args[0]][args[1]]
                    result.value = table.readWhere('a > %d' % (args[2]))
                else:
                    raise Exception("Command not supported: %s" % (cmd,))
                # signal that the result is ready
                result.event.set()
        finally:
            # shutdown
            self.h5.close()
            self.stoppedEvt.set()

    def stop(self):
        if not self.stoppedEvt.is_set():
            self.stopEvt.set()
            self.stoppedEvt.wait()

class ResultEvent():
    def __init__(self):
        self.event = threading.Event()
        self.value = None

class Test():
    def __init__(self):
        self.tables = TablesThread()
        self.tables.start()
        self.timeout = 5
        self.stats = {'read': 0, 'write': 0, 'read_error': 0, 'write_error': 0}

    def write(self):
        r = ResultEvent()
        self.tables.inputQ.put(('write',
                                (random.randint(0, self.tables.num_groups - 1),
                                 random.randint(0, self.tables.num_tables - 1),
                                 random.randint(0, 100),
                                 random.randint(0, 100)), r))
        r.event.wait(timeout=self.timeout)
        if r.event.is_set():
            self.stats['write'] += 1
        else:
            self.stats['write_error'] += 1

    def read(self):
        r = ResultEvent()
        self.tables.inputQ.put(('read',
                                (random.randint(0, self.tables.num_groups - 1),
                                 random.randint(0, self.tables.num_tables - 1),
                                 random.randint(0, 100)), r))
        r.event.wait(timeout=self.timeout)
        if r.event.is_set():
            self.stats['read'] += 1
            #print 'Query got %d hits' % (len(r.value))
        else:
            self.stats['read_error'] += 1

    def close(self):
        self.tables.stop()

    def __del__(self):
        self.close()

class Worker(threading.Thread):
    def __init__(self, method):
        threading.Thread.__init__(self)
        self.method = method
        self.stopEvt = threading.Event()

    def run(self):
        while not self.stopEvt.is_set():
            try:
                self.method()
            except Exception, x:
                print 'Worker thread failed with: %s' % (x,)
            time.sleep(random.random() / 100.0)

    def stop(self):
        self.stopEvt.set()

def main():
    t = Test()
    threads = [Worker(t.write) for _i in range(10)]
    threads.extend([Worker(t.read) for _i in range(10)])
    for thread in threads:
        thread.start()
    time.sleep(5)
    for thread in threads:
        thread.stop()
    for thread in threads:
        thread.join()
    t.close()
    print t.stats

if __name__ == "__main__":
    main()

On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...> wrote:
> Alan,
>
> Unfortunately, the underlying HDF5 library isn't thread-safe by default.
> It can be built in a thread-safe mode that serializes all API calls, but
> still doesn't allow actual parallel access to the disk. See [1] for more
> details. Here's [2] another interesting discussion concerning whether
> multithreaded access is actually beneficial for an I/O limited library
> like HDF5. Ultimately, if one thread can read at the disk's maximum
> transfer rate, then multiple threads don't provide any benefit.
>
> Beyond the limitations of HDF5, PyTables also maintains global state in
> various module-level variables. One example is the _open_file cache in
> the file.py module. I made an attempt in the past to work around this to
> allow read-only access from multiple threads, but didn't make much
> progress.
>
> In general, I think your best bet is to serialize all access through a
> single process. There is another example in the PyTables/examples
> directory that benchmarks different methods of transferring data from
> PyTables to another process [3]. It compares Python's
> multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
> the latter two are 5-10x faster than using a queue.
>
> Another option would be to use multiple threads, but handle all access to
> the HDF5 file in one thread. PyTables will release the GIL when making
> HDF5 library calls, so the other threads will be able to run. You could
> use a Queue.Queue or some other mechanism to transfer data between
> threads. No actual copying would be needed since their memory is shared,
> which should make it faster than the multi-process techniques.
>
> Hope that helps.
>
> Josh Ayers
>
> [1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
>
> [2]: https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
>
> [3]: https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py
>
> On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...> wrote:
>> I am trying to allow multiple threads read/write access to pytables data
>> and found it is necessary to call flush() before any read. If not, the
>> latest data is not returned. However, this can cause a RuntimeError. I
>> have tried protecting pytables access with both locks and queues as done
>> by joshayers
>> (https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
>> In either case I still get RuntimeError: dictionary changed size during
>> iteration when doing the flush. (Incidentally, using the locks appears
>> to be much faster than using queues in my unscientific tests...)
>>
>> I have tried versions 2.4 and 2.3.1 with the same results. Interestingly
>> this only appears to happen if there are multiple tables/groups in the
>> H5 file. To investigate this behavior further I created a test program
>> to illustrate (below). When run with num_groups = 5 and num_tables = 5
>> (or greater) I see the runtime error every time. When these values are
>> smaller than this it doesn't (at least in a short test period).
>>
>> I might be doing something unexpected with pytables, but this seems
>> pretty straightforward to me. Any help is appreciated.
|
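For comparison with the queue design above, a minimal sketch of the lock-based variant Alan mentions trying: one module-level lock guards every PyTables call, and the flush happens in the same critical section as the read so no append can interleave with it (all names here are illustrative, reusing the table layout from the test program):

    import threading

    h5_lock = threading.Lock()   # single lock for *all* PyTables access

    def locked_write(table, a, b):
        with h5_lock:
            row = table.row
            row['a'] = a
            row['b'] = b
            row.append()

    def locked_read(h5, table, threshold):
        with h5_lock:
            h5.flush()   # make pending appends visible to the query
            return table.readWhere('a > %d' % threshold)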
From: Alvaro T. C. <al...@mi...> - 2012-12-06 12:42:57
|
Thank you for the comprehensive round-up. I have some ideas and reports below.

What about ctables? The documentation says that it is specifically
column-access optimized, which is what I need in this scenario (sometimes
sequential, sometimes random).

Unfortunately I could not get the rootdir parameter for ctables __init__ to
work in carray 0.4, and pip-installing 0.5 or 0.5.1 leads to compilation
errors.

This is the ctables-to-disk error:

ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
                rootdir='/tmp/ctable2.ctable')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> in <module>()
----> 1 ct2 = ca.ctable((np.arange(30000000),), names=('range2',), rootdir='/tmp/ctable2.ctable')

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    158             if column.dtype == np.void:
    159                 raise ValueError, "`cols` elements cannot be of type void"
--> 160             column = ca.carray(column, **kwargs)
    161         elif ratype:
    162             column = ca.carray(cols[name], **kwargs)

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)()

TypeError: __cinit__() got an unexpected keyword argument 'rootdir'

And this is cut from the pip output when trying to upgrade carray:

gcc: carray/carrayExtension.c
gcc: error: carray/carrayExtension.c: No such file or directory

Two more notes:

* a way was added to check on-disk (compressed) vs in-memory (uncompressed)
node sizes. I was unable to find the way to use it either from the 2.4.0
release notes or from the git issue
https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763

* is/will it be possible to load PyTables carrays as in-memory carrays
without decompression?

Best,

Álvaro

On 6 December 2012 11:49, Francesc Alted <fa...@gm...> wrote:
> completeness, let's see how fast can perform
> carray (the package, n
|
From: Francesc A. <fa...@gm...> - 2012-12-06 11:49:26
|
On 12/5/12 7:55 PM, Alvaro Tejero Cantero wrote: > My system was benched for reads and writes with Blosc[1]: > > with pt.openFile(paths.braw(block), 'r') as handle: > pt.setBloscMaxThreads(1) > %timeit a = handle.root.raw.c042[:] > pt.setBloscMaxThreads(6) > %timeit a = handle.root.raw.c042[:] > pt.setBloscMaxThreads(11) > %timeit a = handle.root.raw.c042[:] > print handle.root.raw._v_attrs.FILTERS > print handle.root.raw.c042.__sizeof__() > print handle.root.raw.c042 > > gives > > 1 loops, best of 3: 483 ms per loop > 1 loops, best of 3: 782 ms per loop > 1 loops, best of 3: 663 ms per loop > Filters(complevel=5, complib='blosc', shuffle=True, fletcher32=False) > 104 > /raw/c042 (CArray(303390000,), shuffle, blosc(5)) '' > > I can't understand what is going on, for the life of me. These > datasets use int16 atoms and at Blosc complevel=5 used to compress by > a factor of about 2. Even for such low compression ratios there should > be huge differences between single- and multi-threaded reads. > > Do you have any clue? Yeah, welcome to the wonderful art of fine tuning. Fortunately we have a machine which is pretty identical to yours (hey, your computer was too good in Blosc benchmarks so as to ignore it :), so I can reproduce your issue: In [3]: a = ((np.random.rand(3e8))*100).astype('i2') In [4]: f = tb.openFile("test.h5", "w") In [5]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, filters=tb.Filters(5, complib="blosc")) In [6]: act[:] = a In [7]: f.flush() In [8]: ll test.h5 -rw-rw-r-- 1 faltet 301719914 Dec 6 04:55 test.h5 This random set of numbers is close to your array in size (~3e8 elements), and also has a similar compression factor (~2x). Now the timings (using 6 cores by default): In [9]: timeit act[:] 1 loops, best of 3: 441 ms per loop In [11]: tb.setBloscMaxThreads(1) Out[11]: 6 In [12]: timeit act[:] 1 loops, best of 3: 347 ms per loop So yeah, that might seem a bit disappointing. It turns out that the default chunksize for PyTables is tuned so as to balance among sequential and random reads. If what you want is to optimize only for sequential reads (apparently this is what you are after, right?), then it normally helps to increase the chunksize. For example, by doing some quick trials, I determined that a chunksize of 2 MB is pretty optimal for sequential access: In [44]: f.removeNode(f.root.act) In [45]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, filters=tb.Filters(5, complib="blosc"), chunkshape=(2**20,)) In [46]: act[:] = a In [47]: tb.setBloscMaxThreads(1) Out[47]: 6 In [48]: timeit act[:] 1 loops, best of 3: 334 ms per loop In [49]: tb.setBloscMaxThreads(3) Out[49]: 1 In [50]: timeit act[:] 1 loops, best of 3: 298 ms per loop In [51]: tb.setBloscMaxThreads(6) Out[51]: 3 In [52]: timeit act[:] 1 loops, best of 3: 303 ms per loop Also, we see here that the sweet point is using 3 threads, not more (don't ask why). 
However, that does not mean that Blosc is not able to work faster on this machine, and in fact it does: In [59]: import blosc In [60]: sa = a.tostring() In [61]: ac2 = blosc.compress(sa, 2, clevel=5) In [62]: blosc.set_nthreads(6) Out[62]: 6 In [64]: timeit a2 = blosc.decompress(ac2) 10 loops, best of 3: 80.7 ms per loop In [65]: blosc.set_nthreads(1) Out[65]: 6 In [66]: timeit a2 = blosc.decompress(ac2) 1 loops, best of 3: 249 ms per loop So that means that a pure Blosc compression in-memory can only go 4x faster than PyTables + Blosc, and in this is case the latter is reaching an excellent mark of 2 GB/s, which is really good for a read from disk operation. Note how a memcpy() operation in this machine is just about as good as this: In [36]: timeit a.copy() 1 loops, best of 3: 294 ms per loop Now that I'm on this, I'm curious on how other compressors would perform for this scenario: In [6]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, filters=tb.Filters(5, complib="lzo"), chunkshape=(2**20,)) In [7]: act[:] = a In [8]: f.flush() In [9]: ll test.h5 # compression ratio very close to Blosc -rw-rw-r-- 1 faltet 302769510 Dec 6 05:23 test.h5 In [10]: timeit act[:] 1 loops, best of 3: 1.13 s per loop so, the time for LZO is more than 3x slower than Blosc. And a similar thing with zlib: In [12]: f.close() In [13]: f = tb.openFile("test.h5", "w") In [14]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape, filters=tb.Filters(1, complib="zlib"), chunkshape=(2**20,)) In [15]: act[:] = a In [16]: f.flush() In [17]: ll test.h5 # the compression rate is somewhat better -rw-rw-r-- 1 faltet 254821296 Dec 6 05:26 test.h5 In [18]: timeit act[:] 1 loops, best of 3: 2.24 s per loop which is 6x slower than Blosc (although the compression ratio is a bit better). And just for matter of completeness, let's see how fast can perform carray (the package, not the CArray object in PyTables) for a chunked array in-memory: In [19]: import carray as ca In [20]: ac3 = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5)) In [21]: ac3 Out[21]: carray((300000000,), int16) nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98 cparams := cparams(clevel=5, shuffle=True) [59 34 36 ..., 21 58 50] In [22]: timeit ac3[:] 1 loops, best of 3: 254 ms per loop In [23]: ca.set_nthreads(1) Out[23]: 6 In [24]: timeit ac3[:] 1 loops, best of 3: 282 ms per loop So, with 254 ms, it is only marginally faster than PyTables (~298 ms). Now with a carray object on-disk: In [27]: acd = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5), rootdir="test") In [28]: acd Out[28]: carray((300000000,), int16) nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98 cparams := cparams(clevel=5, shuffle=True) rootdir := 'test' [59 34 36 ..., 21 58 50] In [30]: ca.set_nthreads(6) Out[30]: 1 In [31]: timeit acd[:] 1 loops, best of 3: 317 ms per loop In [32]: ca.set_nthreads(1) Out[32]: 6 In [33]: timeit acd[:] 1 loops, best of 3: 361 ms per loop The times in this case are a bit larger than with PyTables (317ms vs 298ms), which speaks a lot how efficiently is implemented I/O in HDF5/PyTables stack. -- Francesc Alted |
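To reproduce this kind of tuning on other hardware, the session above can be condensed into a small script; a sketch along the same lines (array size and candidate chunk lengths are illustrative):

    import time
    import numpy as np
    import tables as tb

    a = (np.random.rand(3e7) * 100).astype('i2')
    for chunklen in (2**16, 2**18, 2**20, 2**22):
        f = tb.openFile('bench.h5', 'w')
        act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
                             filters=tb.Filters(5, complib='blosc'),
                             chunkshape=(chunklen,))
        act[:] = a
        f.flush()
        t0 = time.time()
        act[:]                                # sequential read of the whole array
        print chunklen, round(time.time() - t0, 3), 's'
        f.close()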