From: Josh A. <jos...@gm...> - 2013-01-03 18:29:40
David,
The change in issue 27 was only for iteration over a tables.Column
instance. To use it, tweak Anthony's code as follows. This will iterate
over the "element" column, as in your original example.
Note also that this will only work with the development version of PyTables
available on github. It will be very slow using the released v2.4.0.
from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data.cols.element
    data_i = iter(data)
    data_j = iter(data)
    data_i.next()  # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)
Hope that helps,
Josh
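(Editor's sketch, not part of Josh's message: the two-iterator snippet above compares only consecutive rows. A minimal sketch of the full N(N-1)/2 pairwise loop from David's question, assuming the PyTables 2.x API used in this thread — integer indexing on Table and Table.iterrows(start=...) for the inner scan; h5_file and compare are David's names:

import tables as tb

with tb.openFile(h5_file, 'r') as f:
    t = f.root.data
    for ii in xrange(len(t)):
        x = t[ii]['element']                  # one row read per outer step
        for row in t.iterrows(start=ii + 1):  # sequential scan of the rest
            compare(x, row['element'])
)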
On Thu, Jan 3, 2013 at 9:11 AM, Anthony Scopatz <sc...@gm...> wrote:
> HI David,
>
> Tables and table column iteration have been overhauled fairly recently
> [1]. So you might try creating two iterators, offset by one, and then
> doing the comparison. I am hacking this out super quick so please forgive
> me:
>
> from itertools import izip
>
> with tb.openFile(...) as f:
>     data = f.root.data
>     data_i = iter(data)
>     data_j = iter(data)
>     data_i.next()  # throw the first value away
>     for i, j in izip(data_i, data_j):
>         compare(i, j)
>
> You get the idea ;)
>
> Be Well
> Anthony
>
> 1. https://github.com/PyTables/PyTables/issues/27
>
>
> On Thu, Jan 3, 2013 at 9:25 AM, David Reed <dav...@gm...> wrote:
>
>> I was hoping someone could help me out here.
>>
>> This is from a post I put up on StackOverflow,
>>
>> I am have a fairly large dataset that I store in HDF5 and access using
>> PyTables. One operation I need to do on this dataset are pairwise
>> comparisons between each of the elements. This requires 2 loops, one to
>> iterate over each element, and an inner loop to iterate over every other
>> element. This operation thus looks at N(N-1)/2 comparisons.
>>
>> For fairly small sets I found it to be faster to dump the contents into a
>> multdimensional numpy array and then do my iteration. I run into problems
>> with large sets because of memory issues and need to access each element of
>> the dataset at run time.
>>
>> Putting the elements into an array gives me about 600 comparisons per
>> second, while operating on hdf5 data itself gives me about 300 comparisons
>> per second.
>>
>> Is there a way to speed this process up?
>>
>> Example follows (this is not my real code, just an example):
>>
>> *Small Set*:
>>
>>
>> with tb.openFile(h5_file, 'r') as f:
>>     data = f.root.data
>>
>>     N_elements = len(data)
>>     elements = np.empty((N_elements, 1e5))
>>
>>     for ii, d in enumerate(data):
>>         elements[ii] = d['element']
>>
>> D = np.empty((N_elements, N_elements))
>> for ii in xrange(N_elements):
>>     for jj in xrange(ii+1, N_elements):
>>         D[ii, jj] = compare(elements[ii], elements[jj])
>>
>> *Large Set*:
>>
>>
>> with tb.openFile(h5_file, 'r') as f:
>>     data = f.root.data
>>
>>     N_elements = len(data)
>>
>>     D = np.empty((N_elements, N_elements))
>>     for ii in xrange(N_elements):
>>         for jj in xrange(ii+1, N_elements):
>>             D[ii, jj] = compare(data['element'][ii], data['element'][jj])
From: Anthony S. <sc...@gm...> - 2013-01-03 17:12:15
Hi David,
Tables and table column iteration have been overhauled fairly recently [1].
So you might try creating two iterators, offset by one, and then doing the
comparison. I am hacking this out super quick so please forgive me:
from itertools import izip

with tb.openFile(...) as f:
    data = f.root.data
    data_i = iter(data)
    data_j = iter(data)
    data_i.next()  # throw the first value away
    for i, j in izip(data_i, data_j):
        compare(i, j)
You get the idea ;)
Be Well
Anthony
1. https://github.com/PyTables/PyTables/issues/27
On Thu, Jan 3, 2013 at 9:25 AM, David Reed <dav...@gm...> wrote:
> I was hoping someone could help me out here.
>
> This is from a post I put up on StackOverflow,
>
> I am have a fairly large dataset that I store in HDF5 and access using
> PyTables. One operation I need to do on this dataset are pairwise
> comparisons between each of the elements. This requires 2 loops, one to
> iterate over each element, and an inner loop to iterate over every other
> element. This operation thus looks at N(N-1)/2 comparisons.
>
> For fairly small sets I found it to be faster to dump the contents into a
> multdimensional numpy array and then do my iteration. I run into problems
> with large sets because of memory issues and need to access each element of
> the dataset at run time.
>
> Putting the elements into an array gives me about 600 comparisons per
> second, while operating on hdf5 data itself gives me about 300 comparisons
> per second.
>
> Is there a way to speed this process up?
>
> Example follows (this is not my real code, just an example):
>
> *Small Set*:
>
>
> with tb.openFile(h5_file, 'r') as f:
>     data = f.root.data
>
>     N_elements = len(data)
>     elements = np.empty((N_elements, 1e5))
>
>     for ii, d in enumerate(data):
>         elements[ii] = d['element']
>
> D = np.empty((N_elements, N_elements))
> for ii in xrange(N_elements):
>     for jj in xrange(ii+1, N_elements):
>         D[ii, jj] = compare(elements[ii], elements[jj])
>
> *Large Set*:
>
>
> with tb.openFile(h5_file, 'r') as f:
>     data = f.root.data
>
>     N_elements = len(data)
>
>     D = np.empty((N_elements, N_elements))
>     for ii in xrange(N_elements):
>         for jj in xrange(ii+1, N_elements):
>             D[ii, jj] = compare(data['element'][ii], data['element'][jj])
From: David R. <dav...@gm...> - 2013-01-03 15:26:06
I was hoping someone could help me out here.
This is from a post I put up on StackOverflow.
I have a fairly large dataset that I store in HDF5 and access using
PyTables. One operation I need to do on this dataset is pairwise
comparisons between each of the elements. This requires two loops: one to
iterate over each element, and an inner loop to iterate over every other
element. This operation thus looks at N(N-1)/2 comparisons.
For fairly small sets I found it to be faster to dump the contents into a
multidimensional numpy array and then do my iteration. I run into problems
with large sets because of memory issues and need to access each element of
the dataset at run time.
Putting the elements into an array gives me about 600 comparisons per
second, while operating on hdf5 data itself gives me about 300 comparisons
per second.
Is there a way to speed this process up?
Example follows (this is not my real code, just an example):
*Small Set*:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)
    elements = np.empty((N_elements, 1e5))

    for ii, d in enumerate(data):
        elements[ii] = d['element']

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii+1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])
*Large Set*:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data

    N_elements = len(data)

    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii+1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
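(A possible middle ground, sketched by the editor rather than taken from the thread: stream the column in fixed-size blocks with Table.read(start, stop, field=...) from the PyTables 2.x API, so memory stays bounded while each disk access stays block-sized rather than row-sized. The block size is illustrative; h5_file is David's name.

import tables as tb

block = 10000  # illustrative block size
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    for start in xrange(0, N_elements, block):
        # read one contiguous block of the 'element' column into memory
        chunk = data.read(start, min(start + block, N_elements), field='element')
        # ... run comparisons within / across blocks here ...
)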
From: Aquil H. A. <aqu...@gm...> - 2012-12-14 06:54:17
Hello All,
I currently use PyTables to generate a dataset that is indexed by a
timestamp and a symbol. The problem that I have is that the data is stored
at irregular intervals. For example:
*# See below for method ts_from_str*
data = [{'text_ts': '2012-01-04T15:00:00Z', 'symbol': 'APPL', 'price': 689.00,
         'timestamp': ts_from_str('2012-01-04T15:00:00Z')},
        {'text_ts': '2012-01-04T15:11:00Z', 'symbol': 'APPL', 'price': 687.24,
         'timestamp': ts_from_str('2012-01-04T15:11:00Z')},
        {'text_ts': '2012-01-05T15:33:00Z', 'symbol': 'APPL', 'price': 688.32,
         'timestamp': ts_from_str('2012-01-05T15:33:00Z')},
        {'text_ts': '2012-01-04T15:01:00Z', 'symbol': 'MSFT', 'price': 32.30,
         'timestamp': ts_from_str('2012-01-04T15:01:00Z')},
        {'text_ts': '2012-01-04T16:00:00Z', 'symbol': 'MSFT', 'price': 36.44,
         'timestamp': ts_from_str('2012-01-04T16:00:00Z')},
        {'text_ts': '2012-01-05T15:19:00Z', 'symbol': 'MSFT', 'price': 35.89,
         'timestamp': ts_from_str('2012-01-05T15:19:00Z')}]
If I want to look up the price for Apple for January 4, 2012 at
15:01:00, I will get an empty ndarray. *Is there a way to optimize the
search for data "asof" a specific time other than iterating until you find
data?* I've written my own price_asof method (see code below) that
produces the following output.
*In [63]: price_asof(dt,'APPL')*
*QUERY: (timestamp == 1325707380) & (symbol == "APPL") -- text_ts:
2012-01-04T15:03:00Z*
*QUERY: (timestamp == 1325707320) & (symbol == "APPL") -- text_ts:
2012-01-04T15:02:00Z*
*QUERY: (timestamp == 1325707260) & (symbol == "APPL") -- text_ts:
2012-01-04T15:01:00Z*
*QUERY: (timestamp == 1325707200) & (symbol == "APPL") -- text_ts:
2012-01-04T15:00:00Z*
*Out[63]: *
*array([(689.0, 'APPL', '2012-01-04T15:00:00Z', 1325707200)], *
* dtype=[('price', '<f8'), ('symbol', 'S16'), ('text_ts', 'S26'),
('timestamp', '<i4')])*
*# Code to generate data*
import tables
from datetime import datetime, timedelta
from time import mktime
import numpy as np
def ts_from_str(ts_str, ts_format='%Y-%m-%dT%H:%M:%SZ'):
    """
    Create a Unix timestamp from an ISO 8601 timestamp string
    """
    dt = datetime.strptime(ts_str, ts_format)
    return mktime(dt.timetuple())
class PriceData(tables.IsDescription):
    text_ts = tables.StringCol(len('2012-01-01T00:00:00+00:00 '))
    symbol = tables.StringCol(16)
    price = tables.Float64Col()
    timestamp = tables.Time32Col()

h5f = tables.openFile('test.h5', 'w', title='Price Data For Apple and Microsoft')
group = h5f.createGroup('/', 'January', 'January Price Data')
tbl = h5f.createTable(group, 'Prices', PriceData, 'Apple and Microsoft Prices')
data = [{'text_ts': '2012-01-04T15:00:00Z', 'symbol': 'APPL', 'price': 689.00,
         'timestamp': ts_from_str('2012-01-04T15:00:00Z')},
        {'text_ts': '2012-01-04T15:11:00Z', 'symbol': 'APPL', 'price': 687.24,
         'timestamp': ts_from_str('2012-01-04T15:11:00Z')},
        {'text_ts': '2012-01-05T15:33:00Z', 'symbol': 'APPL', 'price': 688.32,
         'timestamp': ts_from_str('2012-01-05T15:33:00Z')},
        {'text_ts': '2012-01-04T15:01:00Z', 'symbol': 'MSFT', 'price': 32.30,
         'timestamp': ts_from_str('2012-01-04T15:01:00Z')},
        {'text_ts': '2012-01-04T16:00:00Z', 'symbol': 'MSFT', 'price': 36.44,
         'timestamp': ts_from_str('2012-01-04T16:00:00Z')},
        {'text_ts': '2012-01-05T15:19:00Z', 'symbol': 'MSFT', 'price': 35.89,
         'timestamp': ts_from_str('2012-01-05T15:19:00Z')}]
price_data = tbl.row
for d in data:
    price_data['text_ts'] = d['text_ts']
    price_data['symbol'] = d['symbol']
    price_data['price'] = d['price']
    price_data['timestamp'] = d['timestamp']
    price_data.append()
tbl.flush()
*# This is my price_asof function*
def price_asof(dt, symbol, max_rec=1000):
    """
    Return the price as of the time dt
    """
    ts = mktime(dt.timetuple())
    query = '(timestamp == %d)' % ts
    if symbol:
        query += ' & (symbol == "%s")' % symbol
    data = np.ndarray(0)
    count = 0
    while (data.size == 0) and (count <= max_rec):
        # print "QUERY: %s -- text_ts: %s" % (query, dt.strftime('%Y-%m-%dT%H:%M:%SZ'))
        data = tbl.readWhere(query)
        dt = dt - timedelta(seconds=60)
        ts = mktime(dt.timetuple())
        query = '(timestamp == %d)' % ts
        if symbol:
            query += ' & (symbol == "%s")' % symbol
        count += 1
    return data
h5f.close()
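(An editor's sketch, not from Aquil's message: a single inequality query can replace the minute-by-minute probing. price_asof_query is a hypothetical name; tbl, mktime, and the column names come from the code above. Creating an index on the timestamp column should speed the inequality condition up further.

def price_asof_query(tbl, dt, symbol):
    """
    Sketch: fetch everything at or before dt in one query, keep the newest row
    """
    ts = mktime(dt.timetuple())
    query = '(timestamp <= %d) & (symbol == "%s")' % (ts, symbol)
    rows = tbl.readWhere(query)
    if rows.size == 0:
        return rows                          # nothing at or before dt
    return rows[rows['timestamp'].argmax()]  # newest matching row
)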
--
Aquil H. Abdullah
aqu...@gm...
From: Josh A. <jos...@gm...> - 2012-12-12 17:53:40
Jennifer,

When adding a Python object to a VLArray, PyTables first pickles the object.
It looks like you're trying to add something that can't be pickled. Check the
type of the 'state' variable in the first line of the stack trace and make
sure it's something that can be pickled. See [1] for more details.

Hope that helps,
Josh

[1]: http://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled

On Wed, Dec 12, 2012 at 4:58 AM, Jennifer Flegg <jen...@ww...> wrote:
> Hi All,
> I'm getting errors of this sort when I use pytables to store data in hdf5
> format.
> Has anyone come across this before? Is there a fix?
> Thanks,
> Jennifer
>
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
> /lib/python2.7/site-packages/pymc/database/hdf5.py", line 474,
> in savestate s.append(state)
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
> /lib/python2.7/site-packages/tables/vlarray.py", line 462, in append
> sequence = atom.toarray(sequence)
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
> /lib/python2.7/site-packages/tables/atom.py", line 1000, in toarray
> buffer_ = self._tobuffer(object_)
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
> /lib/python2.7/site-packages/tables/atom.py", line 1112, in _tobuffer
> return cPickle.dumps(object_, cPickle.HIGHEST_PROTOCOL)
> PicklingError: Can't pickle <type 'function'>:
> attribute lookup __builtin__.function failed
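(To act on Josh's suggestion, a quick check sketched by the editor, not from the thread; state is the variable named in the traceback. It reproduces the exact cPickle call PyTables makes, in isolation.

import cPickle

try:
    cPickle.dumps(state, cPickle.HIGHEST_PROTOCOL)  # same call as tables/atom.py
except cPickle.PicklingError, err:
    print 'state cannot be pickled:', err
)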
From: Jennifer F. <jen...@ww...> - 2012-12-12 12:59:24
Hi All,
I'm getting errors of this sort when I use pytables to store data in hdf5 format.
Has anyone come across this before? Is there a fix?
Thanks,
Jennifer
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
/lib/python2.7/site-packages/pymc/database/hdf5.py", line 474,
in savestate s.append(state)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
/lib/python2.7/site-packages/tables/vlarray.py", line 462, in append
sequence = atom.toarray(sequence)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
/lib/python2.7/site-packages/tables/atom.py", line 1000, in toarray
buffer_ = self._tobuffer(object_)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7
/lib/python2.7/site-packages/tables/atom.py", line 1112, in _tobuffer
return cPickle.dumps(object_, cPickle.HIGHEST_PROTOCOL)
PicklingError: Can't pickle <type 'function'>:
attribute lookup __builtin__.function failed
From: Jennifer F. <jen...@ww...> - 2012-12-11 12:34:10
Thanks Anthony. I will check it out.
Cheers,
Jennifer
From: Josh A. <jos...@gm...> - 2012-12-11 10:29:30
Alan,
I haven't found the exact problem, but it seems to have something to do with
the node cache. Changing the 'NODE_CACHE_SLOTS' parameter to zero (which
disables the node cache) or to a negative number (which allows the cache to
grow without limit) also eliminates the problem, at least in the unthreaded
version of your code.
It can be set permanently in the parameters.py file, or passed as a
parameter to the tables.openFile function.
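For instance (an editor's sketch, not from Josh's message; the file name is taken from Alan's test program):

import tables

# Override NODE_CACHE_SLOTS for one file via keyword argument (PyTables 2.x);
# 0 disables the node cache, a negative value lets the cache grow unbounded.
h5 = tables.openFile('/data/test.h5', mode='w', NODE_CACHE_SLOTS=0)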
I'll open an issue on github about this problem.
Thanks,
Josh
On Mon, Dec 10, 2012 at 10:05 AM, Alan Marchiori <al...@al...>wrote:
> I think I have found a viable work around.
> Previously, I was flushing the whole HDF5 file:
> self.h5.flush()
>
> By replacing this with a flush on just the table of interest:
> table = self.tables[random.randint(0,
> self.num_groups-1)][random.randint(0, self.num_tables-1)]
> table.flush()
>
> The RuntimeError seems to be gone in all versions of my test program (both
> single threaded and threaded with locks). Hope this helps someone else and
> eventually maybe someone will figure out what is wrong with File.flush().
>
>
> On Mon, Dec 10, 2012 at 10:52 AM, Alan Marchiori <al...@al...>wrote:
>
>> I'm continuing to fight this error. As a sanity check I rewrote my
>> sample app as a single thread only. With interleaved read/writes to
>> multiple tables I still get "RuntimeError: dictionary changed size during
>> iteration" in flush. I still think there is some underlying problem or
>> something I don't understand about pytables/hdf5. I'm far from an expert
>> on either of these so I appreciate any suggestions or even confirmation
>> that I'm not completely crazy? The following code should work, right?
>>
>> import tables
>> import random
>> import datetime
>>
>> # a simple table
>> class TableValue(tables.IsDescription):
>> a = tables.Int64Col(pos=1)
>> b = tables.UInt32Col(pos=2)
>>
>> class Test():
>> def __init__(self):
>> self.stats = {'read': 0,
>> 'write': 0,
>> 'read_error': 0,
>> 'write_error': 0}
>> self.h5 = None
>> self.h5 = tables.openFile('/data/test.h5', mode='w')
>> self.num_groups = 5
>> self.num_tables = 5
>> # create num_groups
>> self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
>> range(self.num_groups)]
>> self.tables = []
>> # create num_tables in each group we just created
>> for group in self.groups:
>> tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
>> for i in range(self.num_tables)]
>> self.tables.append (tbls)
>> for table in tbls:
>> # add an index for good measure
>> table.cols.a.createIndex()
>>
>> def write(self):
>> # select a random table and write to it
>> x = self.tables[random.randint(0,
>> self.num_groups-1)][random.randint(0, self.num_tables-1)].row
>> x['a'] = random.randint(0, 100)
>> x['b'] = random.randint(0, 100)
>> x.append()
>> self.stats['write'] += 1
>>
>> def read(self):
>> # first flush any cached data
>> self.h5.flush()
>> # then select a random table
>> table = self.tables[random.randint(0,
>> self.num_groups-1)][random.randint(0, self.num_tables-1)]
>> # and do some random query
>> table.readWhere('a > %d'%(random.randint(0, 100)))
>> self.stats['read'] += 1
>>
>> def close(self):
>> self.h5.close()
>>
>> def main():
>> t = Test()
>>
>> start = datetime.datetime.now()
>>
>> # run for 10 seconds
>> while (datetime.datetime.now() - start <
>> datetime.timedelta(seconds=10)):
>> # randomly do a read or a write
>> if random.random() > 0.5:
>> t.write()
>> else:
>> t.read()
>>
>> print t.stats
>> print "Done"
>> t.close()
>>
>> if __name__ == "__main__":
>> main()
>>
>>
>> On Thu, Dec 6, 2012 at 9:55 AM, Alan Marchiori <al...@al...>wrote:
>>
>>> Josh,
>>>
>>> Thanks for the detailed response. I would like to avoid going through a
>>> separate process if at all possible due to the performance penalty. I have
>>> also tried your last suggestion to create a dedicated pytables thread and
>>> send everything through that but still see the same problem (Runtime error
>>> in flush). This leads me to believe something strange is going on behind
>>> the scenes. ??
>>>
>>> Updated test program with dedicated pytables thread reading an input
>>> Queue.Queue:
>>>
>>> import tables
>>> import threading
>>> import random
>>> import time
>>> import Queue
>>>
>>> # a simple table
>>> class TableValue(tables.IsDescription):
>>> a = tables.Int64Col(pos=1)
>>> b = tables.UInt32Col(pos=2)
>>>
>>> class TablesThread(threading.Thread):
>>> def __init__(self):
>>> threading.Thread.__init__(self)
>>> self.name = 'HDF5 io thread'
>>> # create the dummy HDF5 file
>>> self.h5 = None
>>> self.h5 = tables.openFile('/data/test.h5', mode='w')
>>> self.num_groups = 5
>>> self.num_tables = 5
>>> self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
>>> range(self.num_groups)]
>>> self.tables = []
>>> for group in self.groups:
>>> tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
>>> for i in range(self.num_tables)]
>>> self.tables.append (tbls)
>>> for table in tbls:
>>> # add an index for good measure
>>> table.cols.a.createIndex()
>>> self.stopEvt = threading.Event()
>>> self.stoppedEvt = threading.Event()
>>> self.inputQ = Queue.Queue()
>>>
>>> def run(self):
>>> try:
>>> while not self.stopEvt.is_set():
>>> # get a command
>>> try:
>>> cmd, args, result = self.inputQ.get(timeout = 0.5)
>>> except Queue.Empty:
>>> # poll stopEvt so we can shutdown
>>> continue
>>>
>>> # do the command
>>> if cmd == 'write':
>>> x = self.tables[args[0]][args[1]].row
>>> x['a'] = args[2]
>>> x['b'] = args[3]
>>> x.append()
>>> elif cmd == 'read':
>>> self.h5.flush()
>>> table = self.tables[args[0]][args[1]]
>>> result.value = table.readWhere('a > %d'%(args[2]))
>>> else:
>>> raise Exception("Command not supported: %s"%(cmd,))
>>>
>>> # signal that the result is ready
>>> result.event.set()
>>>
>>> finally:
>>> # shutdown
>>> self.h5.close()
>>> self.stoppedEvt.set()
>>>
>>> def stop(self):
>>> if not self.stoppedEvt.is_set():
>>> self.stopEvt.set()
>>> self.stoppedEvt.wait()
>>>
>>> class ResultEvent():
>>> def __init__(self):
>>> self.event = threading.Event()
>>> self.value = None
>>>
>>> class Test():
>>> def __init__(self):
>>> self.tables = TablesThread()
>>> self.tables.start()
>>> self.timeout = 5
>>> self.stats = {'read': 0,
>>> 'write': 0,
>>> 'read_error': 0,
>>> 'write_error': 0}
>>>
>>> def write(self):
>>> r = ResultEvent()
>>> self.tables.inputQ.put(('write',
>>> (random.randint(0,
>>> self.tables.num_groups-1),
>>> random.randint(0,
>>> self.tables.num_tables-1),
>>> random.randint(0, 100),
>>> random.randint(0, 100)),
>>> r))
>>> r.event.wait(timeout = self.timeout)
>>> if r.event.is_set():
>>> self.stats['write'] += 1
>>> else:
>>> self.stats['write_error'] += 1
>>>
>>> def read(self):
>>> r = ResultEvent()
>>> self.tables.inputQ.put(('read',
>>> (random.randint(0,
>>> self.tables.num_groups-1),
>>> random.randint(0,
>>> self.tables.num_tables-1),
>>> random.randint(0, 100)),
>>> r))
>>> r.event.wait(timeout = self.timeout)
>>> if r.event.is_set():
>>> self.stats['read'] += 1
>>> #print 'Query got %d hits'%(len(r.value))
>>> else:
>>> self.stats['read_error'] += 1
>>>
>>>
>>> def close(self):
>>> self.tables.stop()
>>>
>>> def __del__(self):
>>> self.close()
>>>
>>> class Worker(threading.Thread):
>>> def __init__(self, method):
>>> threading.Thread.__init__(self)
>>> self.method = method
>>> self.stopEvt = threading.Event()
>>>
>>> def run(self):
>>> while not self.stopEvt.is_set():
>>> try:
>>> self.method()
>>> except Exception, x:
>>> print 'Worker thread failed with: %s'%(x,)
>>> time.sleep(random.random()/100.0)
>>>
>>> def stop(self):
>>> self.stopEvt.set()
>>>
>>> def main():
>>> t = Test()
>>>
>>> threads = [Worker(t.write) for _i in range(10)]
>>> threads.extend([Worker(t.read) for _i in range(10)])
>>>
>>> for thread in threads:
>>> thread.start()
>>>
>>> time.sleep(5)
>>>
>>> for thread in threads:
>>> thread.stop()
>>>
>>> for thread in threads:
>>> thread.join()
>>>
>>> t.close()
>>>
>>> print t.stats
>>>
>>> if __name__ == "__main__":
>>> main()
>>>
>>>
>>> On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...>wrote:
>>>
>>>> Alan,
>>>>
>>>> Unfortunately, the underlying HDF5 library isn't thread-safe by
>>>> default. It can be built in a thread-safe mode that serializes all API
>>>> calls, but still doesn't allow actual parallel access to the disk. See [1]
>>>> for more details. Here's [2] another interesting discussion concerning
>>>> whether multithreaded access is actually beneficial for an I/O limited
>>>> library like HDF5. Ultimately, if one thread can read at the disk's
>>>> maximum transfer rate, then multiple threads don't provide any benefit.
>>>>
>>>> Beyond the limitations of HDF5, PyTables also maintains global state in
>>>> various module-level variables. One example is the _open_file cache in the
>>>> file.py module. I made an attempt in the past to work around this to allow
>>>> read-only access from multiple threads, but didn't make much progress.
>>>>
>>>> In general, I think your best bet is to serialize all access through a
>>>> single process. There is another example in the PyTables/examples
>>>> directory that benchmarks different methods of transferring data from
>>>> PyTables to another process [3]. It compares Python's
>>>> multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
>>>> the latter two are 5-10x faster than using a queue.
>>>>
>>>> Another option would be to use multiple threads, but handle all access
>>>> to the HDF5 file in one thread. PyTables will release the GIL when making
>>>> HDF5 library calls, so the other threads will be able to run. You could
>>>> use a Queue.Queue or some other mechanism to transfer data between
>>>> threads. No actual copying would be needed since their memory is shared,
>>>> which should make it faster than the multi-process techniques.
>>>>
>>>> Hope that helps.
>>>>
>>>> Josh Ayers
>>>>
>>>>
>>>> [1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
>>>>
>>>> [2]:
>>>> https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
>>>>
>>>> [3]:
>>>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py
>>>>
>>>>
>>>> On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...>wrote:
>>>>
>>>>> I am trying to allow multiple threads read/write access to pytables
>>>>> data and found it is necessary to call flush() before any read. If not,
>>>>> the latest data is not returned. However, this can cause a RuntimeError.
>>>>> I have tried protecting pytables access with both locks and queues as done
>>>>> by joshayers (
>>>>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
>>>>> In either case I still get RuntimeError: dictionary changed size during
>>>>> iteration when doing the flush. (incidentally using the Locks appears to
>>>>> be much faster than using queues in my unscientific tests...)
>>>>>
>>>>> I have tried versions 2.4 and 2.3.1 with the same results.
>>>>> Interestingly this only appears to happen if there are multiple
>>>>> tables/groups in the H5 file. To investigate this behavior further I
>>>>> create a test program to illustrate (below). When run with num_groups =
>>>>> 5 num_tables = 5 (or greater) I see the runtime error every time. When
>>>>> these values are smaller than this it doesn't (at least in a short test
>>>>> period).
>>>>>
>>>>> I might be doing something unexpected with pytables, but this seems
>>>>> pretty straight forward to me. Any help is appreciated.
>>>>>
>>>>>
>>>>>
>>
>
>
From: Anthony S. <sc...@gm...> - 2012-12-10 22:12:55
Hi Jennifer,
Yeah, that is right, they are not in EPD Free. However, they are in Anaconda
CE (http://continuum.io/downloads.html). Note the CE rather than the full
version.
Be Well
Anthony
On Mon, Dec 10, 2012 at 4:07 PM, Jennifer Flegg <jen...@ww...> wrote:
> Hi Anthony,
> Thanks for your reply. I installed HDF5 also from source. The
> reason I'm building hdf5 and pytables myself is that they don't
> seem to be available through EPD any more (at least in the free
> version: http://www.enthought.com/products/epdlibraries.php)
> They used to both come bundled in EPD, but not anymore, which is
> a pain.
> Many thanks,
> Jennifer
From: Jennifer F. <jen...@ww...> - 2012-12-10 22:08:06
Hi Anthony,
Thanks for your reply. I installed HDF5 also from source. The reason I'm
building hdf5 and pytables myself is that they don't seem to be available
through EPD any more (at least in the free version:
http://www.enthought.com/products/epdlibraries.php). They used to both come
bundled in EPD, but not anymore, which is a pain.
Many thanks,
Jennifer
From: Anthony S. <sc...@gm...> - 2012-12-10 19:30:03
Hi Jennifer,
Oh, right, I am sorry. Your end error message looks very similar to another,
more common issue. How did you install HDF5? On Mac I typically use MacPorts
or have to install it from source. IIRC the MacPorts build fails to make the
shared libraries and you typically have to configure & compile manually.
Is there a reason you are building PyTables yourself? On Mac, I typically use
EPD or Anaconda. Even when I am making edits to the PyTables (or other
projects') source, I use these distributions as a base and link against the
HDF5 provided in them.
Be Well
Anthony
On Mon, Dec 10, 2012 at 1:23 PM, Jennifer Flegg <jen...@ww...> wrote:
> Hi Anthony,
> I'm not in the pytables source dir when I'm running IPython,
> so I don't think this is the problem.
> Thanks,
> Jennifer
From: Jennifer F. <jen...@ww...> - 2012-12-10 19:23:48
Hi Anthony,
I'm not in the pytables source dir when I'm running IPython, so I don't
think this is the problem.
Thanks,
Jennifer
From: Alan M. <al...@al...> - 2012-12-10 18:05:41
I think I have found a viable work around.
Previously, I was flushing the whole HDF5 file:
self.h5.flush()
By replacing this with a flush on just the table of interest:
table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
table.flush()
The RuntimeError seems to be gone in all versions of my test program (both
single threaded and threaded with locks). Hope this helps someone else and
eventually maybe someone will figure out what is wrong with File.flush().
On Mon, Dec 10, 2012 at 10:52 AM, Alan Marchiori <al...@al...>wrote:
> I'm continuing to fight this error. As a sanity check I rewrote my sample
> app as a single thread only. With interleaved read/writes to multiple
> tables I still get "RuntimeError: dictionary changed size during iteration"
> in flush. I still think there is some underlying problem or something I
> don't understand about pytables/hdf5. I'm far from an expert on either of
> these so I appreciate any suggestions or even confirmation that I'm not
> completely crazy? The following code should work, right?
>
> import tables
> import random
> import datetime
>
> # a simple table
> class TableValue(tables.IsDescription):
> a = tables.Int64Col(pos=1)
> b = tables.UInt32Col(pos=2)
>
> class Test():
> def __init__(self):
> self.stats = {'read': 0,
> 'write': 0,
> 'read_error': 0,
> 'write_error': 0}
> self.h5 = None
> self.h5 = tables.openFile('/data/test.h5', mode='w')
> self.num_groups = 5
> self.num_tables = 5
> # create num_groups
> self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
> range(self.num_groups)]
> self.tables = []
> # create num_tables in each group we just created
> for group in self.groups:
> tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
> for i in range(self.num_tables)]
> self.tables.append (tbls)
> for table in tbls:
> # add an index for good measure
> table.cols.a.createIndex()
>
> def write(self):
> # select a random table and write to it
> x = self.tables[random.randint(0,
> self.num_groups-1)][random.randint(0, self.num_tables-1)].row
> x['a'] = random.randint(0, 100)
> x['b'] = random.randint(0, 100)
> x.append()
> self.stats['write'] += 1
>
> def read(self):
> # first flush any cached data
> self.h5.flush()
> # then select a random table
> table = self.tables[random.randint(0,
> self.num_groups-1)][random.randint(0, self.num_tables-1)]
> # and do some random query
> table.readWhere('a > %d'%(random.randint(0, 100)))
> self.stats['read'] += 1
>
> def close(self):
> self.h5.close()
>
> def main():
> t = Test()
>
> start = datetime.datetime.now()
>
> # run for 10 seconds
> while (datetime.datetime.now() - start <
> datetime.timedelta(seconds=10)):
> # randomly do a read or a write
> if random.random() > 0.5:
> t.write()
> else:
> t.read()
>
> print t.stats
> print "Done"
> t.close()
>
> if __name__ == "__main__":
> main()
>
>
> On Thu, Dec 6, 2012 at 9:55 AM, Alan Marchiori <al...@al...>wrote:
>
>> Josh,
>>
>> Thanks for the detailed response. I would like to avoid going through a
>> separate process if at all possible due to the performance penalty. I have
>> also tried your last suggestion to create a dedicated pytables thread and
>> send everything through that but still see the same problem (Runtime error
>> in flush). This leads me to believe something strange is going on behind
>> the scenes. ??
>>
>> Updated test program with dedicated pytables thread reading an input
>> Queue.Queue:
>>
>> import tables
>> import threading
>> import random
>> import time
>> import Queue
>>
>> # a simple table
>> class TableValue(tables.IsDescription):
>> a = tables.Int64Col(pos=1)
>> b = tables.UInt32Col(pos=2)
>>
>> class TablesThread(threading.Thread):
>> def __init__(self):
>> threading.Thread.__init__(self)
>> self.name = 'HDF5 io thread'
>> # create the dummy HDF5 file
>> self.h5 = None
>> self.h5 = tables.openFile('/data/test.h5', mode='w')
>> self.num_groups = 5
>> self.num_tables = 5
>> self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
>> range(self.num_groups)]
>> self.tables = []
>> for group in self.groups:
>> tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
>> for i in range(self.num_tables)]
>> self.tables.append (tbls)
>> for table in tbls:
>> # add an index for good measure
>> table.cols.a.createIndex()
>> self.stopEvt = threading.Event()
>> self.stoppedEvt = threading.Event()
>> self.inputQ = Queue.Queue()
>>
>> def run(self):
>> try:
>> while not self.stopEvt.is_set():
>> # get a command
>> try:
>> cmd, args, result = self.inputQ.get(timeout = 0.5)
>> except Queue.Empty:
>> # poll stopEvt so we can shutdown
>> continue
>>
>> # do the command
>> if cmd == 'write':
>> x = self.tables[args[0]][args[1]].row
>> x['a'] = args[2]
>> x['b'] = args[3]
>> x.append()
>> elif cmd == 'read':
>> self.h5.flush()
>> table = self.tables[args[0]][args[1]]
>> result.value = table.readWhere('a > %d'%(args[2]))
>> else:
>> raise Exception("Command not supported: %s"%(cmd,))
>>
>> # signal that the result is ready
>> result.event.set()
>>
>> finally:
>> # shutdown
>> self.h5.close()
>> self.stoppedEvt.set()
>>
>> def stop(self):
>> if not self.stoppedEvt.is_set():
>> self.stopEvt.set()
>> self.stoppedEvt.wait()
>>
>> class ResultEvent():
>> def __init__(self):
>> self.event = threading.Event()
>> self.value = None
>>
>> class Test():
>> def __init__(self):
>> self.tables = TablesThread()
>> self.tables.start()
>> self.timeout = 5
>> self.stats = {'read': 0,
>> 'write': 0,
>> 'read_error': 0,
>> 'write_error': 0}
>>
>> def write(self):
>> r = ResultEvent()
>> self.tables.inputQ.put(('write',
>> (random.randint(0,
>> self.tables.num_groups-1),
>> random.randint(0,
>> self.tables.num_tables-1),
>> random.randint(0, 100),
>> random.randint(0, 100)),
>> r))
>> r.event.wait(timeout = self.timeout)
>> if r.event.is_set():
>> self.stats['write'] += 1
>> else:
>> self.stats['write_error'] += 1
>>
>> def read(self):
>> r = ResultEvent()
>> self.tables.inputQ.put(('read',
>> (random.randint(0,
>> self.tables.num_groups-1),
>> random.randint(0,
>> self.tables.num_tables-1),
>> random.randint(0, 100)),
>> r))
>> r.event.wait(timeout = self.timeout)
>> if r.event.is_set():
>> self.stats['read'] += 1
>> #print 'Query got %d hits'%(len(r.value))
>> else:
>> self.stats['read_error'] += 1
>>
>>
>> def close(self):
>> self.tables.stop()
>>
>> def __del__(self):
>> self.close()
>>
>> class Worker(threading.Thread):
>> def __init__(self, method):
>> threading.Thread.__init__(self)
>> self.method = method
>> self.stopEvt = threading.Event()
>>
>> def run(self):
>> while not self.stopEvt.is_set():
>> try:
>> self.method()
>> except Exception, x:
>> print 'Worker thread failed with: %s'%(x,)
>> time.sleep(random.random()/100.0)
>>
>> def stop(self):
>> self.stopEvt.set()
>>
>> def main():
>> t = Test()
>>
>> threads = [Worker(t.write) for _i in range(10)]
>> threads.extend([Worker(t.read) for _i in range(10)])
>>
>> for thread in threads:
>> thread.start()
>>
>> time.sleep(5)
>>
>> for thread in threads:
>> thread.stop()
>>
>> for thread in threads:
>> thread.join()
>>
>> t.close()
>>
>> print t.stats
>>
>> if __name__ == "__main__":
>> main()
>>
>>
>> On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...> wrote:
>>
>>> Alan,
>>>
>>> Unfortunately, the underlying HDF5 library isn't thread-safe by
>>> default. It can be built in a thread-safe mode that serializes all API
>>> calls, but still doesn't allow actual parallel access to the disk. See [1]
>>> for more details. Here's [2] another interesting discussion concerning
>>> whether multithreaded access is actually beneficial for an I/O limited
>>> library like HDF5. Ultimately, if one thread can read at the disk's
>>> maximum transfer rate, then multiple threads don't provide any benefit.
>>>
>>> Beyond the limitations of HDF5, PyTables also maintains global state in
>>> various module-level variables. One example is the _open_file cache in the
>>> file.py module. I made an attempt in the past to work around this to allow
>>> read-only access from multiple threads, but didn't make much progress.
>>>
>>> In general, I think your best bet is to serialize all access through a
>>> single process. There is another example in the PyTables/examples
>>> directory that benchmarks different methods of transferring data from
>>> PyTables to another process [3]. It compares Python's
>>> multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
>>> the latter two are 5-10x faster than using a queue.
>>>
>>> Another option would be to use multiple threads, but handle all access
>>> to the HDF5 file in one thread. PyTables will release the GIL when making
>>> HDF5 library calls, so the other threads will be able to run. You could
>>> use a Queue.Queue or some other mechanism to transfer data between
>>> threads. No actual copying would be needed since their memory is shared,
>>> which should make it faster than the multi-process techniques.
>>>
>>> Hope that helps.
>>>
>>> Josh Ayers
>>>
>>>
>>> [1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
>>>
>>> [2]:
>>> https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
>>>
>>> [3]:
>>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py
>>>
>>>
>>> On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...>wrote:
>>>
>>>> I am trying to allow multiple threads read/write access to pytables
>>>> data and found it is necessary to call flush() before any read. If not,
>>>> the latest data is not returned. However, this can cause a RuntimeError.
>>>> I have tried protecting pytables access with both locks and queues as done
>>>> by joshayers (
>>>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
>>>> In either case I still get RuntimeError: dictionary changed size during
>>>> iteration when doing the flush. (incidentally using the Locks appears to
>>>> be much faster than using queues in my unscientific tests...)
>>>>
>>>> I have tried versions 2.4 and 2.3.1 with the same results.
>>>> Interestingly this only appears to happen if there are multiple
>>>> tables/groups in the H5 file. To investigate this behavior further I
>>>> create a test program to illustrate (below). When run with num_groups =
>>>> 5 num_tables = 5 (or greater) I see the runtime error every time. When
>>>> these values are smaller than this it doesn't (at least in a short test
>>>> period).
>>>>
>>>> I might be doing something unexpected with pytables, but this seems
>>>> pretty straight forward to me. Any help is appreciated.
>>>>
>>>>
>>>>
>
From: Alan M. <al...@al...> - 2012-12-10 16:18:19
I'm continuing to fight this error. As a sanity check I rewrote my sample
app as a single thread only. With interleaved read/writes to multiple
tables I still get "RuntimeError: dictionary changed size during iteration"
in flush. I still think there is some underlying problem or something I
don't understand about pytables/hdf5. I'm far from an expert on either of
these so I appreciate any suggestions or even confirmation that I'm not
completely crazy? The following code should work, right?
import tables
import random
import datetime

# a simple table
class TableValue(tables.IsDescription):
    a = tables.Int64Col(pos=1)
    b = tables.UInt32Col(pos=2)

class Test():
    def __init__(self):
        self.stats = {'read': 0,
                      'write': 0,
                      'read_error': 0,
                      'write_error': 0}
        self.h5 = None
        self.h5 = tables.openFile('/data/test.h5', mode='w')
        self.num_groups = 5
        self.num_tables = 5
        # create num_groups
        self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
                       range(self.num_groups)]
        self.tables = []
        # create num_tables in each group we just created
        for group in self.groups:
            tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
                    for i in range(self.num_tables)]
            self.tables.append(tbls)
            for table in tbls:
                # add an index for good measure
                table.cols.a.createIndex()

    def write(self):
        # select a random table and write to it
        x = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)].row
        x['a'] = random.randint(0, 100)
        x['b'] = random.randint(0, 100)
        x.append()
        self.stats['write'] += 1

    def read(self):
        # first flush any cached data
        self.h5.flush()
        # then select a random table
        table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
        # and do some random query
        table.readWhere('a > %d'%(random.randint(0, 100)))
        self.stats['read'] += 1

    def close(self):
        self.h5.close()

def main():
    t = Test()

    start = datetime.datetime.now()

    # run for 10 seconds
    while (datetime.datetime.now() - start <
           datetime.timedelta(seconds=10)):
        # randomly do a read or a write
        if random.random() > 0.5:
            t.write()
        else:
            t.read()

    print t.stats
    print "Done"
    t.close()

if __name__ == "__main__":
    main()
On Thu, Dec 6, 2012 at 9:55 AM, Alan Marchiori <al...@al...> wrote:
> Josh,
>
> Thanks for the detailed response. I would like to avoid going through a
> separate process if at all possible due to the performance penalty. I have
> also tried your last suggestion to create a dedicated pytables thread and
> send everything through that but still see the same problem (Runtime error
> in flush). This leads me to believe something strange is going on behind
> the scenes. ??
>
> Updated test program with dedicated pytables thread reading an input
> Queue.Queue:
>
> import tables
> import threading
> import random
> import time
> import Queue
>
> # a simple table
> class TableValue(tables.IsDescription):
> a = tables.Int64Col(pos=1)
> b = tables.UInt32Col(pos=2)
>
> class TablesThread(threading.Thread):
> def __init__(self):
> threading.Thread.__init__(self)
> self.name = 'HDF5 io thread'
> # create the dummy HDF5 file
> self.h5 = None
> self.h5 = tables.openFile('/data/test.h5', mode='w')
> self.num_groups = 5
> self.num_tables = 5
> self.groups = [self.h5.createGroup('/', "group%d"%i) for i in
> range(self.num_groups)]
> self.tables = []
> for group in self.groups:
> tbls = [self.h5.createTable(group, 'table%d'%i, TableValue)
> for i in range(self.num_tables)]
> self.tables.append (tbls)
> for table in tbls:
> # add an index for good measure
> table.cols.a.createIndex()
> self.stopEvt = threading.Event()
> self.stoppedEvt = threading.Event()
> self.inputQ = Queue.Queue()
>
> def run(self):
> try:
> while not self.stopEvt.is_set():
> # get a command
> try:
> cmd, args, result = self.inputQ.get(timeout = 0.5)
> except Queue.Empty:
> # poll stopEvt so we can shutdown
> continue
>
> # do the command
> if cmd == 'write':
> x = self.tables[args[0]][args[1]].row
> x['a'] = args[2]
> x['b'] = args[3]
> x.append()
> elif cmd == 'read':
> self.h5.flush()
> table = self.tables[args[0]][args[1]]
> result.value = table.readWhere('a > %d'%(args[2]))
> else:
> raise Exception("Command not supported: %s"%(cmd,))
>
> # signal that the result is ready
> result.event.set()
>
> finally:
> # shutdown
> self.h5.close()
> self.stoppedEvt.set()
>
> def stop(self):
> if not self.stoppedEvt.is_set():
> self.stopEvt.set()
> self.stoppedEvt.wait()
>
> class ResultEvent():
> def __init__(self):
> self.event = threading.Event()
> self.value = None
>
> class Test():
> def __init__(self):
> self.tables = TablesThread()
> self.tables.start()
> self.timeout = 5
> self.stats = {'read': 0,
> 'write': 0,
> 'read_error': 0,
> 'write_error': 0}
>
> def write(self):
> r = ResultEvent()
> self.tables.inputQ.put(('write',
> (random.randint(0,
> self.tables.num_groups-1),
> random.randint(0,
> self.tables.num_tables-1),
> random.randint(0, 100),
> random.randint(0, 100)),
> r))
> r.event.wait(timeout = self.timeout)
> if r.event.is_set():
> self.stats['write'] += 1
> else:
> self.stats['write_error'] += 1
>
> def read(self):
> r = ResultEvent()
> self.tables.inputQ.put(('read',
> (random.randint(0,
> self.tables.num_groups-1),
> random.randint(0,
> self.tables.num_tables-1),
> random.randint(0, 100)),
> r))
> r.event.wait(timeout = self.timeout)
> if r.event.is_set():
> self.stats['read'] += 1
> #print 'Query got %d hits'%(len(r.value))
> else:
> self.stats['read_error'] += 1
>
>
> def close(self):
> self.tables.stop()
>
> def __del__(self):
> self.close()
>
> class Worker(threading.Thread):
> def __init__(self, method):
> threading.Thread.__init__(self)
> self.method = method
> self.stopEvt = threading.Event()
>
> def run(self):
> while not self.stopEvt.is_set():
> try:
> self.method()
> except Exception, x:
> print 'Worker thread failed with: %s'%(x,)
> time.sleep(random.random()/100.0)
>
> def stop(self):
> self.stopEvt.set()
>
> def main():
> t = Test()
>
> threads = [Worker(t.write) for _i in range(10)]
> threads.extend([Worker(t.read) for _i in range(10)])
>
> for thread in threads:
> thread.start()
>
> time.sleep(5)
>
> for thread in threads:
> thread.stop()
>
> for thread in threads:
> thread.join()
>
> t.close()
>
> print t.stats
>
> if __name__ == "__main__":
> main()
>
>
> On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...> wrote:
>
>> Alan,
>>
>> Unfortunately, the underlying HDF5 library isn't thread-safe by default.
>> It can be built in a thread-safe mode that serializes all API calls, but
>> still doesn't allow actual parallel access to the disk. See [1] for more
>> details. Here's [2] another interesting discussion concerning whether
>> multithreaded access is actually beneficial for an I/O limited library like
>> HDF5. Ultimately, if one thread can read at the disk's maximum transfer
>> rate, then multiple threads don't provide any benefit.
>>
>> Beyond the limitations of HDF5, PyTables also maintains global state in
>> various module-level variables. One example is the _open_file cache in the
>> file.py module. I made an attempt in the past to work around this to allow
>> read-only access from multiple threads, but didn't make much progress.
>>
>> In general, I think your best bet is to serialize all access through a
>> single process. There is another example in the PyTables/examples
>> directory that benchmarks different methods of transferring data from
>> PyTables to another process [3]. It compares Python's
>> multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
>> the latter two are 5-10x faster than using a queue.
>>
>> Another option would be to use multiple threads, but handle all access to
>> the HDF5 file in one thread. PyTables will release the GIL when making
>> HDF5 library calls, so the other threads will be able to run. You could
>> use a Queue.Queue or some other mechanism to transfer data between
>> threads. No actual copying would be needed since their memory is shared,
>> which should make it faster than the multi-process techniques.
>>
>> Hope that helps.
>>
>> Josh Ayers
>>
>>
>> [1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
>>
>> [2]:
>> https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
>>
>> [3]:
>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py
>>
>>
>> On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...>wrote:
>>
>>> I am trying to allow multiple threads read/write access to pytables data
>>> and found it is necessary to call flush() before any read. If not, the
>>> latest data is not returned. However, this can cause a RuntimeError. I
>>> have tried protecting pytables access with both locks and queues as done by
>>> joshayers (
>>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
>>> In either case I still get RuntimeError: dictionary changed size during
>>> iteration when doing the flush. (incidentally using the Locks appears to
>>> be much faster than using queues in my unscientific tests...)
>>>
>>> I have tried versions 2.4 and 2.3.1 with the same results.
>>> Interestingly this only appears to happen if there are multiple
>>> tables/groups in the H5 file. To investigate this behavior further I
>>> create a test program to illustrate (below). When run with num_groups =
>>> 5 num_tables = 5 (or greater) I see the runtime error every time. When
>>> these values are smaller than this it doesn't (at least in a short test
>>> period).
>>>
>>> I might be doing something unexpected with pytables, but this seems
>>> pretty straight forward to me. Any help is appreciated.
>>>
>>>
>>>
From: Anthony S. <sc...@gm...> - 2012-12-10 15:42:26
Try leaving the pytables source dir and then running IPython.

On Mon, Dec 10, 2012 at 9:20 AM, Jennifer Flegg <jen...@ww...> wrote:
> Hi,
> I'm trying to install pytables and it's proving difficult (using Mac OS
> 10.6.4).
> I have installed in "/usr/local/hdf5" and set the environment variable
> $HDF5_DIR to /usr/local/hdf5. When I run setup, I get a warning about
> not being able to find the HDF5 runtime.
>
> ndmmac149:tables-2.4.0 jflegg$ sudo python setup.py install
> --hdf5="/usr/local/hdf5"
> * Found numpy 1.6.1 package installed.
> * Found numexpr 2.0.1 package installed.
> * Found Cython 0.17.2 package installed.
> * Found HDF5 headers at ``/usr/local/hdf5/include``,
> library at ``/usr/local/hdf5/lib``.
> .. WARNING:: Could not find the HDF5 runtime.
> The HDF5 shared library was *not* found in the default library
> paths. In case of runtime problems, please remember to install it.
> ld: library not found for -llzo2
> collect2: ld returned 1 exit status
> ld: library not found for -llzo2
> collect2: ld returned 1 exit status
> * Could not find LZO 2 headers and library; disabling support for it.
> ld: library not found for -llzo
> collect2: ld returned 1 exit status
> ld: library not found for -llzo
> collect2: ld returned 1 exit status
> * Could not find LZO 1 headers and library; disabling support for it.
> * Found bzip2 headers at ``/usr/include``, library at ``/usr/lib``.
> running install
> running build
> running build_py
> creating build
> creating build/lib.macosx-10.5-i386-2.7
> creating build/lib.macosx-10.5-i386-2.7/tables
> copying tables/__init__.py -> build/lib.macosx-10.5-i386-2.7/tables
> copying tables/array.py -> build/lib.macosx-10.5-i386-2.7/tables
>
> When I import pytables in python, I get the following error message
>
> In [1]: import tables
> ---------------------------------------------------------
> ImportError Traceback (most recent call last)
> /Users/jflegg/<ipython-input-1-389ecae14f10> in <module>()
> ----> 1 import tables
>
> /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
> packages/tables/__init__.py in <module>()
> 28
> 29 # Necessary imports to get versions stored on the Pyrex extension
> ---> 30 from tables.utilsExtension import getPyTablesVersion,
> getHDF5Version
> 31
> 32
>
> ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/7.3
> /lib/python2.7/site-packages/tables/utilsExtension.so, 2):
> Symbol not found: _H5E_CALLBACK_g Referenced from:
> /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
> packages/tables/utilsExtension.so
> Expected in: flat namespace
> in /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
> packages/tables/utilsExtension.so
>
> Any help would be greatly appreciated.
> Jennifer
From: Jennifer F. <jen...@ww...> - 2012-12-10 15:25:03
Hi,
I'm trying to install pytables and it's proving difficult (using Mac OS 10.6.4).
I have installed in "/usr/local/hdf5" and set the environment variable
$HDF5_DIR to /usr/local/hdf5. When I run setup, I get a warning about
not being able to find the HDF5 runtime.
ndmmac149:tables-2.4.0 jflegg$ sudo python setup.py install
--hdf5="/usr/local/hdf5"
* Found numpy 1.6.1 package installed.
* Found numexpr 2.0.1 package installed.
* Found Cython 0.17.2 package installed.
* Found HDF5 headers at ``/usr/local/hdf5/include``,
library at ``/usr/local/hdf5/lib``.
.. WARNING:: Could not find the HDF5 runtime.
The HDF5 shared library was *not* found in the default library
paths. In case of runtime problems, please remember to install it.
ld: library not found for -llzo2
collect2: ld returned 1 exit status
ld: library not found for -llzo2
collect2: ld returned 1 exit status
* Could not find LZO 2 headers and library; disabling support for it.
ld: library not found for -llzo
collect2: ld returned 1 exit status
ld: library not found for -llzo
collect2: ld returned 1 exit status
* Could not find LZO 1 headers and library; disabling support for it.
* Found bzip2 headers at ``/usr/include``, library at ``/usr/lib``.
running install
running build
running build_py
creating build
creating build/lib.macosx-10.5-i386-2.7
creating build/lib.macosx-10.5-i386-2.7/tables
copying tables/__init__.py -> build/lib.macosx-10.5-i386-2.7/tables
copying tables/array.py -> build/lib.macosx-10.5-i386-2.7/tables
When I import pytables in python, I get the following error message
In [1]: import tables
---------------------------------------------------------
ImportError Traceback (most recent call last)
/Users/jflegg/<ipython-input-1-389ecae14f10> in <module>()
----> 1 import tables
/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
packages/tables/__init__.py in <module>()
28
29 # Necessary imports to get versions stored on the Pyrex extension
---> 30 from tables.utilsExtension import getPyTablesVersion, getHDF5Version
31
32
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/7.3
/lib/python2.7/site-packages/tables/utilsExtension.so, 2):
Symbol not found: _H5E_CALLBACK_g Referenced from:
/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
packages/tables/utilsExtension.so
Expected in: flat namespace
in /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-
packages/tables/utilsExtension.so
Any help would be greatly appreciated.
Jennifer
|
|
From: Francesc A. <fa...@gm...> - 2012-12-07 19:37:07
|
Please, stop reporting carray problems here. Let's communicate
privately if you want.
Thanks,
Francesc
On 12/7/12 8:22 PM, Alvaro Tejero Cantero wrote:
> Thanks Francesc, that solved it. Having the on-disk data structures load
> compressed into memory can be a deal-breaker when you've got daily 50 GB+
> datasets to process!
>
> The carray google group (I had not noticed it) seems unreachable at
> the moment. That's why I am going to report a problem here for the
> moment. With the following code
>
> ct0 = ca.ctable((h5f.root.c_000[:],), names=('c_000',),
> rootdir=u'/lfpd1/tmp/ctable-1', mode='w', cparams=ca.cparams(5),
> dtype='u2', expectedlen=len(h5f.root.c_000))
>
> for k in h5f.root._v_children.keys()[:3]:  # just some of the HDF5 datasets
>     try:
>         col = getattr(h5f.root, k)
>         ct0.addcol(col[:], name=k, expectedlen=len(col), dtype='u2')
>     except ValueError:
>         pass  # exists
> ct0.flush()
>
> >>> ct0
> ctable((303390000,), [('c_000', '<u2'), ('c_007', '<u2'), ('c_006', '<u2'), ('c_005', '<u2')])
> nbytes: 2.26 GB; cbytes: 1.30 GB; ratio: 1.73
> cparams := cparams(clevel=5, shuffle=True)
> rootdir := '/lfpd1/tmp/ctable-1'
> [(312, 37, 65432, 91) (313, 32, 65439, 65) (320, 24, 65433, 66) ...,
> (283, 597, 677, 647) (276, 600, 649, 635) (298, 607, 635, 620)]
>
> The newly-added datasets/columns exist in memory
>
> >>> ct0['c_007']
> carray((303390000,), uint16)
> nbytes: 578.67 MB; cbytes: 333.50 MB; ratio: 1.74
> cparams := cparams(clevel=5, shuffle=True)
> [ 37 32 24 ..., 597 600 607]
>
> but they do not appear in the rootdir, not even after .flush()
>
> /lfpd1/tmp/ctable-1]$ ls
> __attrs__ c_000 __rootdirs__
>
> and something seems amiss with __rootdirs__:
> /lfpd1/tmp/ctable-1]$ cat __rootdirs__
> {"dirs": {"c_007": null, "c_006": null, "c_005": null, "c_000":
> "/lfpd1/tmp/ctable-1/c_000"}, "names": ["c_000", "c_007", "c_006",
> "c_005"]}
>
> >>> ct0.cbytes//1024**2
> 1334
>
> vs
> /lfpd1/tmp]$ du -h ctable-1
> 12K ctable-1/c_000/meta
> 340M ctable-1/c_000/data
> 340M ctable-1/c_000
> 340M ctable-1
>
>
> and, finally, no 'open'
>
> ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r')
>
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-26-41e1cb01ffe6> in <module>()
> ----> 1 ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r')
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/toplevel.pyc in open(rootdir, mode)
>     104     # Not a carray.  Now with a ctable
>     105     try:
> --> 106         obj = ca.ctable(rootdir=rootdir, mode=mode)
>     107     except IOError:
>     108         # Not a ctable
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, columns, names, **kwargs)
>     193             _new = True
>     194         else:
> --> 195             self.open_ctable()
>     196             _new = False
>     197
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in open_ctable(self)
>     282
>     283         # Open the ctable by reading the metadata
> --> 284         self.cols.read_meta_and_open()
>     285
>     286         # Get the length out of the first column
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in read_meta_and_open(self)
>      40         # Initialize the cols by instatiating the carrays
>      41         for name, dir_ in data['dirs'].items():
> ---> 42             self._cols[str(name)] = ca.carray(rootdir=dir_, mode=self.mode)
>      43
>      44     def update_meta(self):
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:8637)()
>
> ValueError: You need at least to pass an array or/and a rootdir
>
> -á.
>
>
>
> On 7 December 2012 17:04, Francesc Alted <fa...@gm...
> <mailto:fa...@gm...>> wrote:
>
> Hmm, perhaps cythonizing by hand is your best bet:
>
> $ cython carray/carrayExtension.pyx
>
> If you continue having problems, please write to the carray
> mailing list.
>
> Francesc
>
> On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote:
> > I now have dependencies similar to yours, except for NumPy 1.7 beta 2.
> >
> > I wish I could help with the carray flavor.
> >
> > --
> > Running setup.py install for carray
> > * Found Cython 0.17.2 package installed.
> > * Found numpy 1.6.2 package installed.
> > * Found numexpr 2.0.1 package installed.
> > building 'carray.carrayExtension' extension
> > C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
> > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC
> > compile options: '-Iblosc
> >
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> > -I/usr/include/python2.7 -c'
> > extra options: '-msse2'
> > gcc: blosc/blosclz.c
> > gcc: carray/carrayExtension.c
> > gcc: error: carray/carrayExtension.c: No such file or directory
> > gcc: fatal error: no input files
> > compilation terminated.
> > gcc: error: carray/carrayExtension.c: No such file or directory
> > gcc: fatal error: no input files
> > compilation terminated.
> > error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe
> > -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc
> >
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> > -I/usr/include/python2.7 -c carray/carrayExtension.c -o
> > build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed
> > with exit status 4
> >
> >
> >
> > -á.
> >
> >
> >
> > On 7 December 2012 12:47, Francesc Alted <fa...@gm...
> <mailto:fa...@gm...>
> > <mailto:fa...@gm... <mailto:fa...@gm...>>> wrote:
> >
> > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote:
> > > Thank you for the comprehensive round-up. I have some
> ideas and
> > > reports below.
> > >
> > > What about ctables? The documentation says that it is
> specifically
> > > column-access optimized, which is what I need in this scenario
> > > (sometimes sequential, sometimes random).
> >
> > Yes, ctables is optimized for column access.
> >
> > >
> > > Unfortunately I could not get the rootdir parameter for
> ctables
> > > __init__ to work in carray 0.4 and pip-installing 0.5 or
> 0.5.1 leads
> > > to compilation errors.
> >
> > Yep, persistence for carray/ctables objects was added in 0.5.
> >
> > >
> > > This is the ctables-to-disk error:
> > >
> > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
> > > rootdir='/tmp/ctable2.ctable')
> > >
> >
> ---------------------------------------------------------------------------
> > > TypeError Traceback (most
> > recent call last)
> > >
> > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b>
> > in<module>()
> > > ----> 1 ct2= ca.ctable((np.arange(30000000),),
> > names=('range2',), rootdir='/tmp/ctable2.ctable')
> > >
> > >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc
> > in__init__(self, cols, names, **kwargs)
> > > 158 if column.dtype== np.void:
> > > 159 raise ValueError, "`cols`
> > elements cannot be of type void"
> > > --> 160 column= ca.carray(column, **kwargs)
> > > 161 elif ratype:
> > > 162 column= ca.carray(cols[name],
> **kwargs)
> > >
> > >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so
> > incarray.carrayExtension.carray.__cinit__
> > (carray/carrayExtension.c:3917)()
> > >
> > > TypeError: __cinit__() got an unexpected keyword argument
> 'rootdir'
> > >
> > >
> > > And this is cut from the pip output when trying to upgrade
> carray.
> > >
> > > gcc: carray/carrayExtension.c
> > >
> > > gcc: error: carray/carrayExtension.c: No such file or
> directory
> >
> > Hmm, that's strange, because the carrayExtension should have
> been
> > cythonized automatically. Here is part of my install process
> > with pip:
> >
> > Running setup.py install for carray
> > * Found Cython 0.17.1 package installed.
> > * Found numpy 1.7.0b2 package installed.
> > * Found numexpr 2.0.1 package installed.
> > cythoning carray/carrayExtension.pyx to
> carray/carrayExtension.c
> > building 'carray.carrayExtension' extension
> > C compiler: gcc -fno-strict-aliasing
> > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g
> -fwrapv -O3
> > -Wall -Wstrict-prototypes
> >
> > Hmm, perhaps you need a newer version of Cython?
> >
> > >
> > >
> > > Two more notes:
> > >
> > > * a way was added to check in-disk (compressed) vs in-memory
> > > (uncompressed) node sizes. I was unable to find the way to
> use it
> > > either from the 2.4.0 release notes or from the git issue
> > >
> https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
> >
> > You already found the answer.
> >
> > >
> > > * is/will it be possible to load PyTables carrays as in-memory
> > carrays
> > > without decompression?
> >
> > Actually, that has been my idea from the very beginning. The
> > concept of
> > 'flavor' for the returned objects when reading is already
> there, so it
> > should be relatively easy to add a new 'carray' flavor.
> Maybe you can
> > contribute this?
> >
> > --
> > Francesc Alted
> >
> >
> >
>
>
> --
> Francesc Alted
>
>
--
Francesc Alted
|
|
From: Alvaro T. C. <al...@mi...> - 2012-12-07 19:22:56
|
Thanks Francesc, that solved it. Having the on-disk data structures load
compressed into memory can be a deal-breaker when you've got daily 50 GB+
datasets to process!
The carray google group (I had not noticed it) seems unreachable at the
moment. That's why I am going to report a problem here for the moment. With
the following code
ct0 = ca.ctable((h5f.root.c_000[:],), names=('c_000',),
                rootdir=u'/lfpd1/tmp/ctable-1', mode='w',
                cparams=ca.cparams(5), dtype='u2',
                expectedlen=len(h5f.root.c_000))

for k in h5f.root._v_children.keys()[:3]:  # just some of the HDF5 datasets
    try:
        col = getattr(h5f.root, k)
        ct0.addcol(col[:], name=k, expectedlen=len(col), dtype='u2')
    except ValueError:
        pass  # exists
ct0.flush()
>>> ct0
ctable((303390000,), [('c_000', '<u2'), ('c_007', '<u2'), ('c_006',
'<u2'), ('c_005', '<u2')])
nbytes: 2.26 GB; cbytes: 1.30 GB; ratio: 1.73
cparams := cparams(clevel=5, shuffle=True)
rootdir := '/lfpd1/tmp/ctable-1'
[(312, 37, 65432, 91) (313, 32, 65439, 65) (320, 24, 65433, 66) ...,
(283, 597, 677, 647) (276, 600, 649, 635) (298, 607, 635, 620)]
The newly-added datasets/columns exist in memory
>>> ct0['c_007']
carray((303390000,), uint16)
nbytes: 578.67 MB; cbytes: 333.50 MB; ratio: 1.74
cparams := cparams(clevel=5, shuffle=True)
[ 37 32 24 ..., 597 600 607]
but they do not appear in the rootdir, not even after .flush()
/lfpd1/tmp/ctable-1]$ ls
__attrs__ c_000 __rootdirs__
and something seems amiss with __rootdirs__:
/lfpd1/tmp/ctable-1]$ cat __rootdirs__
{"dirs": {"c_007": null, "c_006": null, "c_005": null, "c_000":
"/lfpd1/tmp/ctable-1/c_000"}, "names": ["c_000", "c_007", "c_006", "c_005"]}
>>> ct0.cbytes//1024**2
1334
vs
/lfpd1/tmp]$ du -h ctable-1
12K ctable-1/c_000/meta
340M ctable-1/c_000/data
340M ctable-1/c_000
340M ctable-1
and, finally, no 'open'
ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/tejero/Dropbox/O/nb/nonridge/<ipython-input-26-41e1cb01ffe6> in <module>()
----> 1 ct0_disk = ca.open(rootdir='/lfpd1/tmp/ctable-1', mode='r')

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/toplevel.pyc in open(rootdir, mode)
    104     # Not a carray.  Now with a ctable
    105     try:
--> 106         obj = ca.ctable(rootdir=rootdir, mode=mode)
    107     except IOError:
    108         # Not a ctable

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, columns, names, **kwargs)
    193             _new = True
    194         else:
--> 195             self.open_ctable()
    196             _new = False
    197

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in open_ctable(self)
    282
    283         # Open the ctable by reading the metadata
--> 284         self.cols.read_meta_and_open()
    285
    286         # Get the length out of the first column

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in read_meta_and_open(self)
     40         # Initialize the cols by instatiating the carrays
     41         for name, dir_ in data['dirs'].items():
---> 42             self._cols[str(name)] = ca.carray(rootdir=dir_, mode=self.mode)
     43
     44     def update_meta(self):

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:8637)()

ValueError: You need at least to pass an array or/and a rootdir
-á.
On 7 December 2012 17:04, Francesc Alted <fa...@gm...> wrote:
> Hmm, perhaps cythonizing by hand is your best bet:
>
> $ cython carray/carrayExtension.pyx
>
> If you continue having problems, please write to the carray mailing list.
>
> Francesc
>
> On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote:
> > I now have dependencies similar to yours, except for NumPy 1.7 beta 2.
> >
> > I wish I could help with the carray flavor.
> >
> > --
> > Running setup.py install for carray
> > * Found Cython 0.17.2 package installed.
> > * Found numpy 1.6.2 package installed.
> > * Found numexpr 2.0.1 package installed.
> > building 'carray.carrayExtension' extension
> > C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
> > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC
> > compile options: '-Iblosc
> >
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> > -I/usr/include/python2.7 -c'
> > extra options: '-msse2'
> > gcc: blosc/blosclz.c
> > gcc: carray/carrayExtension.c
> > gcc: error: carray/carrayExtension.c: No such file or directory
> > gcc: fatal error: no input files
> > compilation terminated.
> > gcc: error: carray/carrayExtension.c: No such file or directory
> > gcc: fatal error: no input files
> > compilation terminated.
> > error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe
> > -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> > --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> > -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> > -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc
> >
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> > -I/usr/include/python2.7 -c carray/carrayExtension.c -o
> > build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed
> > with exit status 4
> >
> >
> >
> > -á.
> >
> >
> >
> > On 7 December 2012 12:47, Francesc Alted <fa...@gm...
> > <mailto:fa...@gm...>> wrote:
> >
> > On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote:
> > > Thank you for the comprehensive round-up. I have some ideas and
> > > reports below.
> > >
> > > What about ctables? The documentation says that it is specifically
> > > column-access optimized, which is what I need in this scenario
> > > (sometimes sequential, sometimes random).
> >
> > Yes, ctables is optimized for column access.
> >
> > >
> > > Unfortunately I could not get the rootdir parameter for ctables
> > > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1
> leads
> > > to compilation errors.
> >
> > Yep, persistence for carray/ctables objects was added in 0.5.
> >
> > >
> > > This is the ctables-to-disk error:
> > >
> > > ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
> > > rootdir='/tmp/ctable2.ctable')
> > >
> >
> ---------------------------------------------------------------------------
> > > TypeError Traceback (most
> > recent call last)
> > >
> > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b>
> > in<module>()
> > > ----> 1 ct2= ca.ctable((np.arange(30000000),),
> > names=('range2',), rootdir='/tmp/ctable2.ctable')
> > >
> > >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc
> > in__init__(self, cols, names, **kwargs)
> > > 158 if column.dtype== np.void:
> > > 159 raise ValueError, "`cols`
> > elements cannot be of type void"
> > > --> 160 column= ca.carray(column, **kwargs)
> > > 161 elif ratype:
> > > 162 column= ca.carray(cols[name], **kwargs)
> > >
> > >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so
> > incarray.carrayExtension.carray.__cinit__
> > (carray/carrayExtension.c:3917)()
> > >
> > > TypeError: __cinit__() got an unexpected keyword argument 'rootdir'
> > >
> > >
> > > And this is cut from the pip output when trying to upgrade carray.
> > >
> > > gcc: carray/carrayExtension.c
> > >
> > > gcc: error: carray/carrayExtension.c: No such file or directory
> >
> > Hmm, that's strange, because the carrayExtension should have been
> > cythonized automatically. Here is part of my install process
> > with pip:
> >
> > Running setup.py install for carray
> > * Found Cython 0.17.1 package installed.
> > * Found numpy 1.7.0b2 package installed.
> > * Found numexpr 2.0.1 package installed.
> > cythoning carray/carrayExtension.pyx to carray/carrayExtension.c
> > building 'carray.carrayExtension' extension
> > C compiler: gcc -fno-strict-aliasing
> > -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3
> > -Wall -Wstrict-prototypes
> >
> > Hmm, perhaps you need a newer version of Cython?
> >
> > >
> > >
> > > Two more notes:
> > >
> > > * a way was added to check in-disk (compressed) vs in-memory
> > > (uncompressed) node sizes. I was unable to find the way to use it
> > > either from the 2.4.0 release notes or from the git issue
> > >
> https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
> >
> > You already found the answer.
> >
> > >
> > > * is/will it be possible to load PyTables carrays as in-memory
> > carrays
> > > without decompression?
> >
> > Actually, that has been my idea from the very beginning. The
> > concept of
> > 'flavor' for the returned objects when reading is already there, so
> it
> > should be relatively easy to add a new 'carray' flavor. Maybe you
> can
> > contribute this?
> >
> > --
> > Francesc Alted
> >
> >
> >
>
>
> --
> Francesc Alted
>
>
>
|
|
From: Francesc A. <fa...@gm...> - 2012-12-07 17:04:25
|
Hmm, perhaps cythonizing by hand is your best bet:
$ cython carray/carrayExtension.pyx
If you continue having problems, please write to the carray mailing list.
Francesc
On 12/7/12 5:29 PM, Alvaro Tejero Cantero wrote:
> I now have dependencies similar to yours, except for NumPy 1.7 beta 2.
>
> I wish I could help with the carray flavor.
>
> --
> Running setup.py install for carray
> * Found Cython 0.17.2 package installed.
> * Found numpy 1.6.2 package installed.
> * Found numexpr 2.0.1 package installed.
> building 'carray.carrayExtension' extension
> C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC
> compile options: '-Iblosc
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> -I/usr/include/python2.7 -c'
> extra options: '-msse2'
> gcc: blosc/blosclz.c
> gcc: carray/carrayExtension.c
> gcc: error: carray/carrayExtension.c: No such file or directory
> gcc: fatal error: no input files
> compilation terminated.
> gcc: error: carray/carrayExtension.c: No such file or directory
> gcc: fatal error: no input files
> compilation terminated.
> error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe
> -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
> --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC
> -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc
> -I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
> -I/usr/include/python2.7 -c carray/carrayExtension.c -o
> build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed
> with exit status 4
>
>
>
> -á.
>
>
>
> On 7 December 2012 12:47, Francesc Alted <fa...@gm...
> <mailto:fa...@gm...>> wrote:
>
> On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote:
> > Thank you for the comprehensive round-up. I have some ideas and
> > reports below.
> >
> > What about ctables? The documentation says that it is specifically
> > column-access optimized, which is what I need in this scenario
> > (sometimes sequential, sometimes random).
>
> Yes, ctables is optimized for column access.
>
> >
> > Unfortunately I could not get the rootdir parameter for ctables
> > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads
> > to compilation errors.
>
> Yep, persistence for carray/ctables objects was added in 0.5.
>
> >
> > This is the ctables-to-disk error:
> >
> > ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
> > rootdir='/tmp/ctable2.ctable')
> >
> ---------------------------------------------------------------------------
> > TypeError Traceback (most
> recent call last)
> >
> /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b>
> in<module>()
> > ----> 1 ct2= ca.ctable((np.arange(30000000),),
> names=('range2',), rootdir='/tmp/ctable2.ctable')
> >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc
> in__init__(self, cols, names, **kwargs)
> > 158 if column.dtype== np.void:
> > 159 raise ValueError, "`cols`
> elements cannot be of type void"
> > --> 160 column= ca.carray(column, **kwargs)
> > 161 elif ratype:
> > 162 column= ca.carray(cols[name], **kwargs)
> >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so
> incarray.carrayExtension.carray.__cinit__
> (carray/carrayExtension.c:3917)()
> >
> > TypeError: __cinit__() got an unexpected keyword argument 'rootdir'
> >
> >
> > And this is cut from the pip output when trying to upgrade carray.
> >
> > gcc: carray/carrayExtension.c
> >
> > gcc: error: carray/carrayExtension.c: No such file or directory
>
> Hmm, that's strange, because the carrayExtension should have been
> cythonized automatically. Here is part of my install process
> with pip:
>
> Running setup.py install for carray
> * Found Cython 0.17.1 package installed.
> * Found numpy 1.7.0b2 package installed.
> * Found numexpr 2.0.1 package installed.
> cythoning carray/carrayExtension.pyx to carray/carrayExtension.c
> building 'carray.carrayExtension' extension
> C compiler: gcc -fno-strict-aliasing
> -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3
> -Wall -Wstrict-prototypes
>
> Hmm, perhaps you need a newer version of Cython?
>
> >
> >
> > Two more notes:
> >
> > * a way was added to check in-disk (compressed) vs in-memory
> > (uncompressed) node sizes. I was unable to find the way to use it
> > either from the 2.4.0 release notes or from the git issue
> > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
>
> You already found the answer.
>
> >
> > * is/will it be possible to load PyTables carrays as in-memory
> carrays
> > without decompression?
>
> Actually, that has been my idea from the very beginning. The
> concept of
> 'flavor' for the returned objects when reading is already there, so it
> should be relatively easy to add a new 'carray' flavor. Maybe you can
> contribute this?
>
> --
> Francesc Alted
>
>
--
Francesc Alted
|
|
From: Alvaro T. C. <al...@mi...> - 2012-12-07 16:30:31
|
I now have dependencies similar to yours, except for NumPy 1.7 beta 2.
I wish I could help with the carray flavor.
--
Running setup.py install for carray
* Found Cython 0.17.2 package installed.
* Found numpy 1.6.2 package installed.
* Found numexpr 2.0.1 package installed.
building 'carray.carrayExtension' extension
C compiler: gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
-DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic
-D_GNU_SOURCE -fPIC -fwrapv -fPIC
compile options: '-Iblosc
-I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
-I/usr/include/python2.7 -c'
extra options: '-msse2'
gcc: blosc/blosclz.c
gcc: carray/carrayExtension.c
gcc: error: carray/carrayExtension.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
gcc: error: carray/carrayExtension.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: Command "gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
-DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic
-D_GNU_SOURCE -fPIC -fwrapv -fPIC -Iblosc
-I/home/tejero/Local/Envs/test/lib/python2.7/site-packages/numpy/core/include
-I/usr/include/python2.7 -c carray/carrayExtension.c -o
build/temp.linux-x86_64-2.7/carray/carrayExtension.o -msse2" failed with
exit status 4
-á.
On 7 December 2012 12:47, Francesc Alted <fa...@gm...> wrote:
> On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote:
> > Thank you for the comprehensive round-up. I have some ideas and
> > reports below.
> >
> > What about ctables? The documentation says that it is specifically
> > column-access optimized, which is what I need in this scenario
> > (sometimes sequential, sometimes random).
>
> Yes, ctables is optimized for column access.
>
> >
> > Unfortunately I could not get the rootdir parameter for ctables
> > __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads
> > to compilation errors.
>
> Yep, persistence for carray/ctables objects was added in 0.5.
>
> >
> > This is the ctables-to-disk error:
> >
> > ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
> > rootdir='/tmp/ctable2.ctable')
> >
> ---------------------------------------------------------------------------
> > TypeError Traceback (most recent call
> last)
> > /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b>
> in<module>()
> > ----> 1 ct2= ca.ctable((np.arange(30000000),), names=('range2',),
> rootdir='/tmp/ctable2.ctable')
> >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc
> in__init__(self, cols, names, **kwargs)
> > 158 if column.dtype== np.void:
> > 159 raise ValueError, "`cols` elements
> cannot be of type void"
> > --> 160 column= ca.carray(column, **kwargs)
> > 161 elif ratype:
> > 162 column= ca.carray(cols[name], **kwargs)
> >
> >
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so
> incarray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)()
> >
> > TypeError: __cinit__() got an unexpected keyword argument 'rootdir'
> >
> >
> > And this is cut from the pip output when trying to upgrade carray.
> >
> > gcc: carray/carrayExtension.c
> >
> > gcc: error: carray/carrayExtension.c: No such file or directory
>
> Hmm, that's strange, because the carrayExtension should have been
> cythonized automatically. Here is part of my install process with pip:
>
> Running setup.py install for carray
> * Found Cython 0.17.1 package installed.
> * Found numpy 1.7.0b2 package installed.
> * Found numexpr 2.0.1 package installed.
> cythoning carray/carrayExtension.pyx to carray/carrayExtension.c
> building 'carray.carrayExtension' extension
> C compiler: gcc -fno-strict-aliasing
> -I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3
> -Wall -Wstrict-prototypes
>
> Hmm, perhaps you need a newer version of Cython?
>
> >
> >
> > Two more notes:
> >
> > * a way was added to check in-disk (compressed) vs in-memory
> > (uncompressed) node sizes. I was unable to find the way to use it
> > either from the 2.4.0 release notes or from the git issue
> > https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
>
> You already found the answer.
>
> >
> > * is/will it be possible to load PyTables carrays as in-memory carrays
> > without decompression?
>
> Actually, that has been my idea from the very beginning. The concept of
> 'flavor' for the returned objects when reading is already there, so it
> should be relatively easy to add a new 'carray' flavor. Maybe you can
> contribute this?
>
> --
> Francesc Alted
>
>
>
|
|
From: Francesc A. <fa...@gm...> - 2012-12-07 12:47:12
|
On 12/6/12 1:42 PM, Alvaro Tejero Cantero wrote:
> Thank you for the comprehensive round-up. I have some ideas and
> reports below.
>
> What about ctables? The documentation says that it is specifically
> column-access optimized, which is what I need in this scenario
> (sometimes sequential, sometimes random).
Yes, ctables is optimized for column access.
>
> Unfortunately I could not get the rootdir parameter for ctables
> __init__ to work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads
> to compilation errors.
Yep, persistence for carray/ctables objects was added in 0.5.
>
> This is the ctables-to-disk error:
>
> ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
> rootdir='/tmp/ctable2.ctable')
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> /home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> in <module>()
> ----> 1 ct2 = ca.ctable((np.arange(30000000),), names=('range2',), rootdir='/tmp/ctable2.ctable')
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
>     158                 if column.dtype == np.void:
>     159                     raise ValueError, "`cols` elements cannot be of type void"
> --> 160                 column = ca.carray(column, **kwargs)
>     161             elif ratype:
>     162                 column = ca.carray(cols[name], **kwargs)
>
> /home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)()
>
> TypeError: __cinit__() got an unexpected keyword argument 'rootdir'
>
>
> And this is cut from the pip output when trying to upgrade carray.
>
> gcc: carray/carrayExtension.c
>
> gcc: error: carray/carrayExtension.c: No such file or directory
Hmm, that's strange, because the carrayExtension should have been
cythonized automatically. Here is part of my install process with pip:
Running setup.py install for carray
* Found Cython 0.17.1 package installed.
* Found numpy 1.7.0b2 package installed.
* Found numexpr 2.0.1 package installed.
cythoning carray/carrayExtension.pyx to carray/carrayExtension.c
building 'carray.carrayExtension' extension
C compiler: gcc -fno-strict-aliasing
-I/Users/faltet/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3
-Wall -Wstrict-prototypes
Hmm, perhaps you need a newer version of Cython?
>
>
> Two more notes:
>
> * a way was added to check in-disk (compressed) vs in-memory
> (uncompressed) node sizes. I was unable to find the way to use it
> either from the 2.4.0 release notes or from the git issue
> https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
You already found the answer.
>
> * is/will it be possible to load PyTables carrays as in-memory carrays
> without decompression?
Actually, that has been my idea from the very beginning. The concept of
'flavor' for the returned objects when reading is already there, so it
should be relatively easy to add a new 'carray' flavor. Maybe you can
contribute this?
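In the meantime you can approximate it by hand, at the cost of one
decompress/recompress cycle; a minimal sketch using the APIs already shown
in this thread (file and node names are illustrative):

import tables as tb
import carray as ca

# HDF5 decompresses on read and carray recompresses on the way in --
# exactly the double work a native 'carray' flavor would avoid.
with tb.openFile('data.h5', 'r') as f:    # illustrative file name
    node = f.root.raw.c042                # illustrative node name
    cmem = ca.carray(node[:], cparams=ca.cparams(5))
print cmem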
--
Francesc Alted
|
|
From: Alvaro T. C. <al...@mi...> - 2012-12-06 18:30:02
|
I'll answer myself on the size-checking: the right attributes are
Leaf.size_in_memory and Leaf.size_on_disk (per
http://pytables.github.com/usersguide/libref/hierarchy_classes.html).
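For instance, a quick check could look like this (the file name is
illustrative; the node name is the one from my earlier posts):

import tables as tb

with tb.openFile('data.h5', 'r') as f:    # illustrative file name
    leaf = f.root.raw.c042
    print 'size on disk:  ', leaf.size_on_disk
    print 'size in memory:', leaf.size_in_memory

-á.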
|
From: Alan M. <al...@al...> - 2012-12-06 14:55:34
|
Josh,
Thanks for the detailed response. I would like to avoid going through a
separate process if at all possible due to the performance penalty. I have
also tried your last suggestion to create a dedicated pytables thread and
send everything through that, but I still see the same problem (RuntimeError
in flush). This leads me to believe something strange is going on behind
the scenes.
Updated test program with dedicated pytables thread reading an input
Queue.Queue:
import tables
import threading
import random
import time
import Queue
# a simple table
class TableValue(tables.IsDescription):
    a = tables.Int64Col(pos=1)
    b = tables.UInt32Col(pos=2)

class TablesThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.name = 'HDF5 io thread'
        # create the dummy HDF5 file
        self.h5 = None
        self.h5 = tables.openFile('/data/test.h5', mode='w')
        self.num_groups = 5
        self.num_tables = 5
        self.groups = [self.h5.createGroup('/', "group%d" % i)
                       for i in range(self.num_groups)]
        self.tables = []
        for group in self.groups:
            tbls = [self.h5.createTable(group, 'table%d' % i, TableValue)
                    for i in range(self.num_tables)]
            self.tables.append(tbls)
            for table in tbls:
                # add an index for good measure
                table.cols.a.createIndex()
        self.stopEvt = threading.Event()
        self.stoppedEvt = threading.Event()
        self.inputQ = Queue.Queue()

    def run(self):
        try:
            while not self.stopEvt.is_set():
                # get a command
                try:
                    cmd, args, result = self.inputQ.get(timeout=0.5)
                except Queue.Empty:
                    # poll stopEvt so we can shutdown
                    continue
                # do the command
                if cmd == 'write':
                    x = self.tables[args[0]][args[1]].row
                    x['a'] = args[2]
                    x['b'] = args[3]
                    x.append()
                elif cmd == 'read':
                    self.h5.flush()
                    table = self.tables[args[0]][args[1]]
                    result.value = table.readWhere('a > %d' % (args[2]))
                else:
                    raise Exception("Command not supported: %s" % (cmd,))
                # signal that the result is ready
                result.event.set()
        finally:
            # shutdown
            self.h5.close()
            self.stoppedEvt.set()

    def stop(self):
        if not self.stoppedEvt.is_set():
            self.stopEvt.set()
            self.stoppedEvt.wait()

class ResultEvent():
    def __init__(self):
        self.event = threading.Event()
        self.value = None

class Test():
    def __init__(self):
        self.tables = TablesThread()
        self.tables.start()
        self.timeout = 5
        self.stats = {'read': 0,
                      'write': 0,
                      'read_error': 0,
                      'write_error': 0}

    def write(self):
        r = ResultEvent()
        self.tables.inputQ.put(('write',
                                (random.randint(0, self.tables.num_groups - 1),
                                 random.randint(0, self.tables.num_tables - 1),
                                 random.randint(0, 100),
                                 random.randint(0, 100)),
                                r))
        r.event.wait(timeout=self.timeout)
        if r.event.is_set():
            self.stats['write'] += 1
        else:
            self.stats['write_error'] += 1

    def read(self):
        r = ResultEvent()
        self.tables.inputQ.put(('read',
                                (random.randint(0, self.tables.num_groups - 1),
                                 random.randint(0, self.tables.num_tables - 1),
                                 random.randint(0, 100)),
                                r))
        r.event.wait(timeout=self.timeout)
        if r.event.is_set():
            self.stats['read'] += 1
            #print 'Query got %d hits'%(len(r.value))
        else:
            self.stats['read_error'] += 1

    def close(self):
        self.tables.stop()

    def __del__(self):
        self.close()

class Worker(threading.Thread):
    def __init__(self, method):
        threading.Thread.__init__(self)
        self.method = method
        self.stopEvt = threading.Event()

    def run(self):
        while not self.stopEvt.is_set():
            try:
                self.method()
            except Exception, x:
                print 'Worker thread failed with: %s' % (x,)
            time.sleep(random.random() / 100.0)

    def stop(self):
        self.stopEvt.set()

def main():
    t = Test()
    threads = [Worker(t.write) for _i in range(10)]
    threads.extend([Worker(t.read) for _i in range(10)])
    for thread in threads:
        thread.start()
    time.sleep(5)
    for thread in threads:
        thread.stop()
    for thread in threads:
        thread.join()
    t.close()
    print t.stats

if __name__ == "__main__":
    main()
On Wed, Dec 5, 2012 at 10:52 PM, Josh Ayers <jos...@gm...> wrote:
> Alan,
>
> Unfortunately, the underlying HDF5 library isn't thread-safe by default.
> It can be built in a thread-safe mode that serializes all API calls, but
> still doesn't allow actual parallel access to the disk. See [1] for more
> details. Here's [2] another interesting discussion concerning whether
> multithreaded access is actually beneficial for an I/O limited library like
> HDF5. Ultimately, if one thread can read at the disk's maximum transfer
> rate, then multiple threads don't provide any benefit.
>
> Beyond the limitations of HDF5, PyTables also maintains global state in
> various module-level variables. One example is the _open_file cache in the
> file.py module. I made an attempt in the past to work around this to allow
> read-only access from multiple threads, but didn't make much progress.
>
> In general, I think your best bet is to serialize all access through a
> single process. There is another example in the PyTables/examples
> directory that benchmarks different methods of transferring data from
> PyTables to another process [3]. It compares Python's
> multiprocessing.Queue, sockets, and memory-mapped files. In my testing,
> the latter two are 5-10x faster than using a queue.
>
> Another option would be to use multiple threads, but handle all access to
> the HDF5 file in one thread. PyTables will release the GIL when making
> HDF5 library calls, so the other threads will be able to run. You could
> use a Queue.Queue or some other mechanism to transfer data between
> threads. No actual copying would be needed since their memory is shared,
> which should make it faster than the multi-process techniques.
>
> Hope that helps.
>
> Josh Ayers
>
>
> [1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
>
> [2]:
> https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
>
> [3]:
> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py
>
>
> On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...>wrote:
>
>> I am trying to allow multiple threads read/write access to pytables data
>> and found it is necessary to call flush() before any read. If not, the
>> latest data is not returned. However, this can cause a RuntimeError. I
>> have tried protecting pytables access with both locks and queues as done by
>> joshayers (
>> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py).
>> In either case I still get RuntimeError: dictionary changed size during
>> iteration when doing the flush. (incidentally using the Locks appears to
>> be much faster than using queues in my unscientific tests...)
>>
>> I have tried versions 2.4 and 2.3.1 with the same results. Interestingly
>> this only appears to happen if there are multiple tables/groups in the H5
>> file. To investigate this behavior further I create a test program to
>> illustrate (below). When run with num_groups = 5 num_tables = 5 (or
>> greater) I see the runtime error every time. When these values are smaller
>> than this it doesn't (at least in a short test period).
>>
>> I might be doing something unexpected with pytables, but this seems
|
|
From: Alvaro T. C. <al...@mi...> - 2012-12-06 12:42:57
|
Thank you for the comprehensive round-up. I have some ideas and reports
below.
What about ctables? The documentation says that it is specifically
column-access optimized, which is what I need in this scenario (sometimes
sequential, sometimes random).
Unfortunately I could not get the rootdir parameter for ctables __init__ to
work in carray 0.4 and pip-installing 0.5 or 0.5.1 leads to compilation
errors.
This is the ctables-to-disk error:
ct2 = ca.ctable((np.arange(30000000),), names=('range2',),
rootdir='/tmp/ctable2.ctable')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/tejero/Dropbox/O/nb/nonridge/<ipython-input-29-255842877a0b> in <module>()
----> 1 ct2 = ca.ctable((np.arange(30000000),), names=('range2',), rootdir='/tmp/ctable2.ctable')

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    158                 if column.dtype == np.void:
    159                     raise ValueError, "`cols` elements cannot be of type void"
--> 160                 column = ca.carray(column, **kwargs)
    161             elif ratype:
    162                 column = ca.carray(cols[name], **kwargs)

/home/tejero/Local/Envs/test/lib/python2.7/site-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__cinit__ (carray/carrayExtension.c:3917)()

TypeError: __cinit__() got an unexpected keyword argument 'rootdir'
And this is cut from the pip output when trying to upgrade carray.
gcc: carray/carrayExtension.c
gcc: error: carray/carrayExtension.c: No such file or directory
Two more notes:
* a way was added to check in-disk (compressed) vs in-memory (uncompressed)
node sizes. I was unable to find the way to use it either from the 2.4.0
release notes or from the git issue
https://github.com/PyTables/PyTables/issues/141#issuecomment-5018763
* is/will it be possible to load PyTables carrays as in-memory carrays
without decompression?
Best,
Álvaro
On 6 December 2012 11:49, Francesc Alted <fa...@gm...> wrote:
|
|
From: Francesc A. <fa...@gm...> - 2012-12-06 11:49:26
|
On 12/5/12 7:55 PM, Alvaro Tejero Cantero wrote:
> My system was benched for reads and writes with Blosc[1]:
>
> with pt.openFile(paths.braw(block), 'r') as handle:
> pt.setBloscMaxThreads(1)
> %timeit a = handle.root.raw.c042[:]
> pt.setBloscMaxThreads(6)
> %timeit a = handle.root.raw.c042[:]
> pt.setBloscMaxThreads(11)
> %timeit a = handle.root.raw.c042[:]
> print handle.root.raw._v_attrs.FILTERS
> print handle.root.raw.c042.__sizeof__()
> print handle.root.raw.c042
>
> gives
>
> 1 loops, best of 3: 483 ms per loop
> 1 loops, best of 3: 782 ms per loop
> 1 loops, best of 3: 663 ms per loop
> Filters(complevel=5, complib='blosc', shuffle=True, fletcher32=False)
> 104
> /raw/c042 (CArray(303390000,), shuffle, blosc(5)) ''
>
> I can't understand what is going on, for the life of me. These
> datasets use int16 atoms and at Blosc complevel=5 used to compress by
> a factor of about 2. Even for such low compression ratios there should
> be huge differences between single- and multi-threaded reads.
>
> Do you have any clue?
Yeah, welcome to the wonderful art of fine tuning. Fortunately we have
a machine which is pretty much identical to yours (hey, your computer was
too good in the Blosc benchmarks to ignore :), so I can reproduce your
issue:
In [3]: a = ((np.random.rand(3e8))*100).astype('i2')
In [4]: f = tb.openFile("test.h5", "w")
In [5]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
filters=tb.Filters(5, complib="blosc"))
In [6]: act[:] = a
In [7]: f.flush()
In [8]: ll test.h5
-rw-rw-r-- 1 faltet 301719914 Dec 6 04:55 test.h5
This random set of numbers is close to your array in size (~3e8
elements), and also has a similar compression factor (~2x). Now the
timings (using 6 cores by default):
In [9]: timeit act[:]
1 loops, best of 3: 441 ms per loop
In [11]: tb.setBloscMaxThreads(1)
Out[11]: 6
In [12]: timeit act[:]
1 loops, best of 3: 347 ms per loop
So yeah, that might seem a bit disappointing. It turns out that the
default chunksize for PyTables is tuned so as to balance between
sequential and random reads. If what you want is to optimize only for
sequential reads (apparently this is what you are after, right?), then
it normally helps to increase the chunksize. For example, by doing some
quick trials, I determined that a chunksize of 2 MB is pretty optimal
for sequential access:
In [44]: f.removeNode(f.root.act)
In [45]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
filters=tb.Filters(5, complib="blosc"), chunkshape=(2**20,))
In [46]: act[:] = a
In [47]: tb.setBloscMaxThreads(1)
Out[47]: 6
In [48]: timeit act[:]
1 loops, best of 3: 334 ms per loop
In [49]: tb.setBloscMaxThreads(3)
Out[49]: 1
In [50]: timeit act[:]
1 loops, best of 3: 298 ms per loop
In [51]: tb.setBloscMaxThreads(6)
Out[51]: 3
In [52]: timeit act[:]
1 loops, best of 3: 303 ms per loop
Also, we see here that the sweet spot is using 3 threads, not more
(don't ask why). However, that does not mean that Blosc is not able to
work faster on this machine, and in fact it does:
In [59]: import blosc
In [60]: sa = a.tostring()
In [61]: ac2 = blosc.compress(sa, 2, clevel=5)
In [62]: blosc.set_nthreads(6)
Out[62]: 6
In [64]: timeit a2 = blosc.decompress(ac2)
10 loops, best of 3: 80.7 ms per loop
In [65]: blosc.set_nthreads(1)
Out[65]: 6
In [66]: timeit a2 = blosc.decompress(ac2)
1 loops, best of 3: 249 ms per loop
So that means that a pure in-memory Blosc decompression can only go 4x
faster than PyTables + Blosc, and in this case the latter is reaching
an excellent mark of 2 GB/s, which is really good for a read-from-disk
operation. Note how a memcpy() operation in this machine is just about
as good as this:
In [36]: timeit a.copy()
1 loops, best of 3: 294 ms per loop
While I'm at it, I'm curious about how other compressors would perform
for this scenario:
In [6]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
filters=tb.Filters(5, complib="lzo"), chunkshape=(2**20,))
In [7]: act[:] = a
In [8]: f.flush()
In [9]: ll test.h5 # compression ratio very close to Blosc
-rw-rw-r-- 1 faltet 302769510 Dec 6 05:23 test.h5
In [10]: timeit act[:]
1 loops, best of 3: 1.13 s per loop
So the time for LZO is more than 3x that of Blosc. And a similar
thing happens with zlib:
In [12]: f.close()
In [13]: f = tb.openFile("test.h5", "w")
In [14]: act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
filters=tb.Filters(1, complib="zlib"), chunkshape=(2**20,))
In [15]: act[:] = a
In [16]: f.flush()
In [17]: ll test.h5 # the compression rate is somewhat better
-rw-rw-r-- 1 faltet 254821296 Dec 6 05:26 test.h5
In [18]: timeit act[:]
1 loops, best of 3: 2.24 s per loop
which is 6x slower than Blosc (although the compression ratio is a bit
better).
And just for the sake of completeness, let's see how fast carray (the
package, not the CArray object in PyTables) can perform for a chunked
array in memory:
In [19]: import carray as ca
In [20]: ac3 = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5))
In [21]: ac3
Out[21]:
carray((300000000,), int16)
nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98
cparams := cparams(clevel=5, shuffle=True)
[59 34 36 ..., 21 58 50]
In [22]: timeit ac3[:]
1 loops, best of 3: 254 ms per loop
In [23]: ca.set_nthreads(1)
Out[23]: 6
In [24]: timeit ac3[:]
1 loops, best of 3: 282 ms per loop
So, with 254 ms, it is only marginally faster than PyTables (~298 ms).
Now with a carray object on-disk:
In [27]: acd = ca.carray(a, chunklen=2**20, cparams=ca.cparams(5),
rootdir="test")
In [28]: acd
Out[28]:
carray((300000000,), int16)
nbytes: 572.20 MB; cbytes: 289.56 MB; ratio: 1.98
cparams := cparams(clevel=5, shuffle=True)
rootdir := 'test'
[59 34 36 ..., 21 58 50]
In [30]: ca.set_nthreads(6)
Out[30]: 1
In [31]: timeit acd[:]
1 loops, best of 3: 317 ms per loop
In [32]: ca.set_nthreads(1)
Out[32]: 6
In [33]: timeit acd[:]
1 loops, best of 3: 361 ms per loop
The times in this case are a bit larger than with PyTables (317 ms vs
298 ms), which says a lot about how efficiently I/O is implemented in
the HDF5/PyTables stack.
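For reference, here is the chunkshape experiment condensed into one
self-contained script (the array is smaller here so it runs quickly, and
the absolute timings are only illustrative):

import time
import numpy as np
import tables as tb

a = ((np.random.rand(30000000)) * 100).astype('i2')

f = tb.openFile('test.h5', 'w')
for chunkshape in (None, (2**20,)):
    # None lets PyTables pick its default (balanced) chunkshape;
    # (2**20,) int16 elements gives the 2 MB chunks used above.
    act = f.createCArray(f.root, 'act', tb.Int16Atom(), a.shape,
                         filters=tb.Filters(5, complib='blosc'),
                         chunkshape=chunkshape)
    act[:] = a
    f.flush()
    t0 = time.time()
    act[:]
    print chunkshape, '->', round(time.time() - t0, 3), 's'
    f.removeNode(f.root.act)
f.close()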
--
Francesc Alted
|