| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2002 |     |     |     |     |     |     |     |     |     |     | 5   |     |
| 2003 |     | 2   |     | 5   | 11  | 7   | 18  | 5   | 15  | 4   | 1   | 4   |
| 2004 | 5   | 2   | 5   | 8   | 8   | 10  | 4   | 4   | 20  | 11  | 31  | 41  |
| 2005 | 79  | 22  | 14  | 17  | 35  | 24  | 26  | 9   | 57  | 64  | 25  | 37  |
| 2006 | 76  | 24  | 79  | 44  | 33  | 12  | 15  | 40  | 17  | 21  | 46  | 23  |
| 2007 | 18  | 25  | 41  | 66  | 18  | 29  | 40  | 32  | 34  | 17  | 46  | 17  |
| 2008 | 17  | 42  | 23  | 11  | 65  | 28  | 28  | 16  | 24  | 33  | 16  | 5   |
| 2009 | 19  | 25  | 11  | 32  | 62  | 28  | 61  | 20  | 61  | 11  | 14  | 53  |
| 2010 | 17  | 31  | 39  | 43  | 49  | 47  | 35  | 58  | 55  | 91  | 77  | 63  |
| 2011 | 50  | 30  | 67  | 31  | 17  | 83  | 17  | 33  | 35  | 19  | 29  | 26  |
| 2012 | 53  | 22  | 118 | 45  | 28  | 71  | 87  | 55  | 30  | 73  | 41  | 28  |
| 2013 | 19  | 30  | 14  | 63  | 20  | 59  | 40  | 33  | 1   |     |     |     |
From: Josh A. <jos...@gm...> - 2012-12-06 03:52:32
Alan,

Unfortunately, the underlying HDF5 library isn't thread-safe by default. It can be built in a thread-safe mode that serializes all API calls, but still doesn't allow actual parallel access to the disk. See [1] for more details. Here's [2] another interesting discussion concerning whether multithreaded access is actually beneficial for an I/O-limited library like HDF5. Ultimately, if one thread can read at the disk's maximum transfer rate, then multiple threads don't provide any benefit.

Beyond the limitations of HDF5, PyTables also maintains global state in various module-level variables. One example is the _open_file cache in the file.py module. I made an attempt in the past to work around this to allow read-only access from multiple threads, but didn't make much progress.

In general, I think your best bet is to serialize all access through a single process. There is another example in the PyTables/examples directory that benchmarks different methods of transferring data from PyTables to another process [3]. It compares Python's multiprocessing.Queue, sockets, and memory-mapped files. In my testing, the latter two are 5-10x faster than using a queue.

Another option would be to use multiple threads, but handle all access to the HDF5 file in one thread. PyTables will release the GIL when making HDF5 library calls, so the other threads will be able to run. You could use a Queue.Queue or some other mechanism to transfer data between threads. No actual copying would be needed since their memory is shared, which should make it faster than the multi-process techniques.

Hope that helps.

Josh Ayers

[1]: http://www.hdfgroup.org/hdf5-quest.html#mthread
[2]: https://visitbugs.ornl.gov/projects/8/wiki/Multi-threaded_cores_and_HPC-HDF5
[3]: https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_benchmarks.py

On Wed, Dec 5, 2012 at 2:24 PM, Alan Marchiori <al...@al...> wrote:
> I am trying to allow multiple threads read/write access to pytables data
> and found it is necessary to call flush() before any read. If not, the
> latest data is not returned. However, this can cause a RuntimeError. [...]
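A minimal sketch of the dedicated-I/O-thread pattern Josh describes above might look like the following. The queue protocol, the node path `/group0/table0`, and the helper names are assumptions made up for illustration, not anything provided by PyTables itself:

```python
# A minimal sketch of the dedicated-I/O-thread pattern described above.
# The queue protocol, the node path '/group0/table0', and the helper names
# are illustrative assumptions, not part of PyTables itself.
import threading
import Queue          # 'queue' on Python 3
import tables

requests = Queue.Queue()

class Request(object):
    """One read or write handed to the I/O thread."""
    def __init__(self, kind, payload=None):
        self.kind = kind                  # 'write', 'read' or 'stop'
        self.payload = payload
        self.done = threading.Event()
        self.result = None

def io_thread(path):
    """The only thread that ever touches the HDF5 file."""
    h5 = tables.openFile(path, mode='a')  # PyTables 2.x spelling
    table = h5.getNode('/group0/table0')  # assumed layout
    while True:
        req = requests.get()
        if req.kind == 'stop':
            break
        elif req.kind == 'write':
            table.append([req.payload])   # payload: one row tuple
            table.flush()
        elif req.kind == 'read':
            req.result = table.readWhere(req.payload)  # payload: condition string
        req.done.set()
    h5.close()

def read_where(condition):
    """Callable from any worker thread; blocks until the I/O thread answers."""
    req = Request('read', condition)
    requests.put(req)
    req.done.wait()
    return req.result

# Usage: threading.Thread(target=io_thread, args=('test.h5',)).start()
#        rows = read_where('a > 50')
#        requests.put(Request('stop'))
```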
From: Alan M. <al...@al...> - 2012-12-05 22:24:14
I am trying to allow multiple threads read/write access to pytables data and found it is necessary to call flush() before any read. If not, the latest data is not returned. However, this can cause a RuntimeError. I have tried protecting pytables access with both locks and queues as done by joshayers (https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py). In either case I still get `RuntimeError: dictionary changed size during iteration` when doing the flush. (Incidentally, using the locks appears to be much faster than using queues in my unscientific tests...)

I have tried versions 2.4 and 2.3.1 with the same results. Interestingly, this only appears to happen if there are multiple tables/groups in the H5 file. To investigate this behavior further I created a test program to illustrate (below). When run with num_groups = 5 and num_tables = 5 (or greater) I see the runtime error every time. When these values are smaller than this, it doesn't happen (at least in a short test period).

I might be doing something unexpected with pytables, but this seems pretty straightforward to me. Any help is appreciated.

```python
import tables
import threading
import random
import time

lock = threading.Lock()

def synchronized(lock):
    def wrap(f):
        def newFunction(*args, **kw):
            lock.acquire()
            try:
                return f(*args, **kw)
            finally:
                lock.release()
        return newFunction
    return wrap

class TableValue(tables.IsDescription):
    a = tables.Int64Col(pos=1)
    b = tables.UInt32Col(pos=2)

class Test():
    def __init__(self):
        self.h5 = None
        self.h5 = tables.openFile('/data/test.h5', mode='w')
        self.num_groups = 5
        self.num_tables = 5

        self.groups = [self.h5.createGroup('/', "group%d" % i)
                       for i in range(self.num_groups)]
        self.tables = []
        for group in self.groups:
            tbls = [self.h5.createTable(group, 'table%d' % i, TableValue)
                    for i in range(self.num_tables)]
            self.tables.append(tbls)
            for table in tbls:
                table.cols.a.createIndex()

        self.stats = {'read': 0,
                      'write': 0}

    @synchronized(lock)
    def __del__(self):
        if self.h5 != None:
            self.h5.close()
            self.h5 = None

    @synchronized(lock)
    def write(self):
        x = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)].row
        x['a'] = random.randint(0, 100)
        x['b'] = random.randint(0, 100)
        x.append()
        self.stats['write'] += 1

    @synchronized(lock)
    def read(self):
        # flush so we can query latest data
        self.h5.flush()

        table = self.tables[random.randint(0, self.num_groups-1)][random.randint(0, self.num_tables-1)]
        # do some query
        results = table.readWhere('a > %d' % (random.randint(0, 100)))
        #print 'Query got %d hits' % (len(results))

        self.stats['read'] += 1

class Worker(threading.Thread):
    def __init__(self, method):
        threading.Thread.__init__(self)
        self.method = method
        self.stopEvt = threading.Event()

    def run(self):
        while not self.stopEvt.is_set():
            self.method()
            time.sleep(random.random()/100.0)

    def stop(self):
        self.stopEvt.set()

def main():
    t = Test()

    threads = [Worker(t.write) for _i in range(10)]
    threads.extend([Worker(t.read) for _i in range(10)])

    for thread in threads:
        thread.start()

    time.sleep(5)

    for thread in threads:
        thread.stop()

    for thread in threads:
        thread.join()

    print t.stats

if __name__ == "__main__":
    main()
```
From: Alvaro T. C. <al...@mi...> - 2012-12-05 18:56:00
My system was benched for reads and writes with Blosc [1]:

```python
with pt.openFile(paths.braw(block), 'r') as handle:
    pt.setBloscMaxThreads(1)
    %timeit a = handle.root.raw.c042[:]
    pt.setBloscMaxThreads(6)
    %timeit a = handle.root.raw.c042[:]
    pt.setBloscMaxThreads(11)
    %timeit a = handle.root.raw.c042[:]
    print handle.root.raw._v_attrs.FILTERS
    print handle.root.raw.c042.__sizeof__()
    print handle.root.raw.c042
```

gives

```
1 loops, best of 3: 483 ms per loop
1 loops, best of 3: 782 ms per loop
1 loops, best of 3: 663 ms per loop
Filters(complevel=5, complib='blosc', shuffle=True, fletcher32=False)
104
/raw/c042 (CArray(303390000,), shuffle, blosc(5)) ''
```

I can't understand what is going on, for the life of me. These datasets use int16 atoms and at Blosc complevel=5 used to compress by a factor of about 2. Even for such low compression ratios there should be huge differences between single- and multi-threaded reads. Do you have any clue?

-á.

[1] http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks (first two plots)
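For reference, the same comparison can be rerun outside IPython (where `%timeit` is unavailable) with a plain script along these lines; the file name `data.h5` and the node `/raw/c042` are placeholders standing in for the dataset in the message above:

```python
# Rough re-creation of the comparison above as a plain script (no %timeit).
# 'data.h5' and '/raw/c042' are placeholders for the dataset in the message.
import time
import tables as pt

def best_read_time(node, repeats=3):
    """Best wall-clock time of a few full decompressing reads of `node`."""
    best = float('inf')
    for _ in range(repeats):
        start = time.time()
        _ = node[:]
        best = min(best, time.time() - start)
    return best

handle = pt.openFile('data.h5', 'r')
node = handle.getNode('/raw/c042')
for nthreads in (1, 6, 11):
    pt.setBloscMaxThreads(nthreads)
    print nthreads, 'Blosc threads:', best_read_time(node), 's'
handle.close()
```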
From: Jeff R. <jr...@ya...> - 2012-12-03 18:25:46
Thanks, created https://github.com/PyTables/PyTables/issues/198

I can be reached on my cell (917)971-6387

________________________________
From: Anthony Scopatz <sc...@gm...>
To: Jeff Reback <je...@re...>; Discussion list for PyTables <pyt...@li...>
Sent: Monday, December 3, 2012 11:15 AM
Subject: Re: [Pytables-users] variable length strings in tables?
[...]
From: Anthony S. <sc...@gm...> - 2012-12-03 16:16:25
On Sun, Dec 2, 2012 at 2:49 PM, Jeff Reback <jr...@ya...> wrote:

> Pandas uses pytables as a storage backend and has worked out quite well [...]
> I often index these tables by StringCols, which I pre-allocate to the largest
> size I think that I'll need. So, I wanted to think about supporting
> variable-length string columns in the table.
>
> any thoughts on these strategies:
> 1) any way to directly support a variable-length string in a particular
> column? (e.g. VLStringCol doesn't exist but a stand-alone VLStringAtom does)

This is possible as the underlying HDF5 library will support it. However, no one has had the time to write it. Please open an issue (or possibly a pull request) related to this.

> 2) As an alternative, I could store along with the table a VLArray the same
> # of rows as the table and keep string data here -- of course I have to keep
> the synchronization up to date (and this doesn't help with an 'indexing'
> column, just with 'data' columns)

This is what I do in PyTables and HDF5 itself. It works out quite well for me. This has the advantage that the VLString data get compressed separately from the numeric data (if using compression). Yes, it is one more thing to manage, but the file sizes are significantly smaller.

Be Well
Anthony
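A minimal sketch of the second strategy (a VLArray of variable-length strings kept row-for-row in sync with a Table); the node names and the `Record` description are assumptions made up for illustration:

```python
# Sketch of strategy 2: fixed-width columns live in a Table, variable-length
# strings live in a parallel VLArray; row i of one corresponds to row i of
# the other.  All node names here are made up for the example.
import tables

class Record(tables.IsDescription):
    key = tables.Int64Col(pos=1)
    value = tables.Float64Col(pos=2)

h5 = tables.openFile('strings.h5', mode='w')
table = h5.createTable('/', 'data', Record)
labels = h5.createVLArray('/', 'data_labels', tables.VLStringAtom(),
                          "variable-length labels, one per table row")

def append_row(key, value, label):
    row = table.row
    row['key'] = key
    row['value'] = value
    row.append()
    labels.append(label)        # keep the VLArray in step with the table

append_row(1, 3.14, "a short label")
append_row(2, 2.72, "a much longer label that would not fit a small StringCol")
table.flush()

# Read back: the table row index doubles as the index into the VLArray.
for i, r in enumerate(table):
    print r['key'], r['value'], labels[i]

h5.close()
```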
From: Jeff R. <jr...@ya...> - 2012-12-02 20:49:41
Hi,

Pandas uses pytables as a storage backend and has worked out quite well, fyi ... http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables

I have a particular use case where I build a table, then later append to it. Fixed types are no problem. However, I often index these tables by StringCols, which I pre-allocate to the largest size I think that I'll need. So, I wanted to think about supporting variable-length string columns in the table.

Any thoughts on these strategies:

1) any way to directly support a variable-length string in a particular column? (e.g. VLStringCol doesn't exist but a stand-alone VLStringAtom does)

2) As an alternative, I could store along with the table a VLArray the same # of rows as the table and keep string data here -- of course I have to keep the synchronization up to date (and this doesn't help with an 'indexing' column, just with 'data' columns)

thanks,

Jeff
From: dashesy <da...@gm...> - 2012-11-28 16:46:18
Thank you all, now it is much more clear.

On Wed, Nov 28, 2012 at 12:30 AM, Anthony Scopatz <sc...@gm...> wrote:
> Oops, I didn't notice that... Antonio is right, the variable length part of
> this is probably your issue. [...]

I found that issues 48 and 54 on GitHub and 122 on SourceForge refer to the same problem. It seems there is a potential solution using VLTables, which I will look into first. I have created my vl string type using:

```c
hid_t tid_attr_vl_str = H5Tcopy(H5T_C_S1);
ret = H5Tset_size(tid_attr_vl_str, H5T_VARIABLE);
```

I will first look at this (and add a test for it in /tables/tests), but reading the trac it seems there might be other forms of vl strings in the wild. I would appreciate any comments or advice.
From: Anthony S. <sc...@gm...> - 2012-11-28 07:31:19
On Wed, Nov 28, 2012 at 1:03 AM, Antonio Valentino <ant...@ti...> wrote:

> I'm not sure that PyTables is able to handle variable length strings in
> compound data types at the moment.

Oops, I didn't notice that... Antonio is right, the variable length part of this is probably your issue.

> [...]
From: Antonio V. <ant...@ti...> - 2012-11-28 07:04:24
Hi Anthony, hi dashesy,

On 28 Nov 2012, at 00:57, Anthony Scopatz <sc...@gm...> wrote:

> This [1] seems to indicate that this kind of thing should be supported via
> numpy structured arrays. However, I bet that this data set did not start out
> as a numpy structured array. This might explain the problem if the flavor is
> wrong. I would think that a fix should be relatively easy.

I'm not sure that PyTables is able to handle variable length strings in compound data types at the moment.

> On Tue, Nov 27, 2012 at 5:17 PM, dashesy <da...@gm...> wrote:
> I have a file that has attributes with nested compound type, when reading it
> with PyTables 2.4.0 I get this error:
> [...]

Yes, it is not clear.

> Hard to say what exactly happens, just wanted to know if this is not already
> fixed in newer versions I will be more than happy to work on it, any pointers
> as to where to look is appreciated.

I don't think there are changes that can impact this issue. Anyway, you can give a try to the development branch [1].

Any help is very appreciated.

[1] https://github.com/PyTables/PyTables

--
Antonio Valentino
From: Anthony S. <sc...@gm...> - 2012-11-27 23:58:09
This [1] seems to indicate that this kind of thing should be supported via numpy structured arrays. However, I bet that this data set did not start out as a numpy structured array. This might explain the problem if the flavor is wrong. I would think that a fix should be relatively easy.

Be Well
Anthony

1. http://pytables.github.com/usersguide/libref/declarative_classes.html?highlight=attr#the-attributeset-class

On Tue, Nov 27, 2012 at 5:17 PM, dashesy <da...@gm...> wrote:
> I have a file that has attributes with nested compound type, when reading it
> with PyTables 2.4.0 I get this error: [...]
From: dashesy <da...@gm...> - 2012-11-27 23:17:42
I have a file that has attributes with nested compound type; when reading it with PyTables 2.4.0 I get this error:

```
C:\Python27\lib\site-packages\tables\attributeset.py:293: DataTypeWarning:
Unsupported type for attribute 'BmiRoot' in node '/'. Offending HDF5 class: 6
  value = self._g_getAttr(self._v_node, name)
C:\Python27\lib\site-packages\tables\attributeset.py:293: DataTypeWarning:
Unsupported type for attribute 'BmiChanExt' in node 'channel00001'. Offending HDF5 class: 6
  value = self._g_getAttr(self._v_node, name)
```

Hard to say what exactly happens; I just wanted to know if this is not already fixed in newer versions. I will be more than happy to work on it, and any pointers as to where to look are appreciated.

Here is the (partial) dump of the file (for brevity I deleted non-related data parts but can provide the full file if needed):

```
HDF5 "pause5-10-5.ns2.h5" {
GROUP "/" {
   ATTRIBUTE "BmiRoot" {
      DATATYPE "/BmiRootAttr_t"
      DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
      DATA { (0): { 1, 0, 0, 1, "2008-12-02 22:57:02.251000", "1 kS/s", "" } }
   }
   DATATYPE "BmiRootAttr_t" H5T_COMPOUND {
      H5T_STD_U32LE "MajorVersion";
      H5T_STD_U32LE "MinorVersion";
      H5T_STD_U32LE "Flags";
      H5T_STD_U32LE "GroupCount";
      H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } "Date";
      H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } "Application";
      H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } "Comment";
   }
   GROUP "channel" {
      DATATYPE "BmiChanAttr_t" H5T_COMPOUND {
         H5T_STD_U16LE "ID";
         H5T_IEEE_F32LE "Clock";
         H5T_IEEE_F32LE "SampleRate";
         H5T_STD_U8LE "SampleBits";
      }
      DATATYPE "BmiChanExt2Attr_t" H5T_COMPOUND {
         H5T_STD_I32LE "DigitalMin";
         H5T_STD_I32LE "DigitalMax";
         H5T_STD_I32LE "AnalogMin";
         H5T_STD_I32LE "AnalogMax";
         H5T_STRING { STRSIZE 16; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } "AnalogUnit";
      }
      DATATYPE "BmiChanExtAttr_t" H5T_COMPOUND {
         H5T_IEEE_F64LE "NanoVoltsPerLSB";
         H5T_COMPOUND {
            H5T_STD_U32LE "HighPassFreq";
            H5T_STD_U32LE "HighPassOrder";
            H5T_STD_U16LE "HighPassType";
            H5T_STD_U32LE "LowPassFreq";
            H5T_STD_U32LE "LowPassOrder";
            H5T_STD_U16LE "LowPassType";
         } "Filter";
         H5T_STD_U8LE "PhysicalConnector";
         H5T_STD_U8LE "ConnectorPin";
         H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } "Label";
      }
      DATATYPE "BmiChanFiltAttr_t" H5T_COMPOUND {
         H5T_STD_U32LE "HighPassFreq";
         H5T_STD_U32LE "HighPassOrder";
         H5T_STD_U16LE "HighPassType";
         H5T_STD_U32LE "LowPassFreq";
         H5T_STD_U32LE "LowPassOrder";
         H5T_STD_U16LE "LowPassType";
      }
      GROUP "channel00001" {
         ATTRIBUTE "BmiChan" {
            DATATYPE "/channel/BmiChanAttr_t"
            DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
            DATA { (0): { 1, 30000, 1000, 16 } }
         }
         ATTRIBUTE "BmiChanExt" {
            DATATYPE "/channel/BmiChanExtAttr_t"
            DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
            DATA { (0): { 1000, { 750000, 4, 1, 7500, 3, 1 }, 1, 1, "elec1" } }
         }
         ATTRIBUTE "BmiChanExt2" {
            DATATYPE "/channel/BmiChanExt2Attr_t"
            DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
            DATA { (0): { -8191, 8191, -8191, 8191, "uV" } }
         }
         DATASET "continuous_set" {
            DATATYPE H5T_STD_I16LE
            DATASPACE SIMPLE { ( 631 ) / ( H5S_UNLIMITED ) }
            DATA {...
            }
         }
      }
   }
}
}
```
From: Anthony S. <sc...@gm...> - 2012-11-26 01:39:21
On Mon, Nov 19, 2012 at 12:59 PM, Jon Wilson <js...@fn...> wrote:

> I've been fiddling around with Stephen's code a bit, and it looks like the
> best way to do things is to read chunks (whether exactly of table.chunksize
> or not is a matter for optimization) of the data in at a time, and create
> histograms of those chunks. Then combining the histograms is a trivial sum
> operation. [...]
> So, I think it would be extraordinarily helpful to provide a
> chunked-iteration interface for this sort of use case. It can be as simple
> as a wrapper around Table.read(). [...]
> Preliminary tests seem to indicate that, for a table with 1 column and 10M
> rows, reading in "chunks" of 10x chunksize gives the best read-time-per-row.

Hello Jon,

Sorry about the slow reply, but I think that what is proposed in issue #27 [1] would solve the above by default, right? Maybe you could pull Josh's code and test it on the above example to make sure. And then we could go ahead and merge this in :).

> And of course, if implemented by numexpr, it could benefit from the nice
> automatic multithreading there.

This would be nice, but as you point out, not totally necessary here.

> Also, I might dig in a bit and see about extending the "field" argument to
> read so it can read multiple fields at once (to do N-dimensional histograms),
> as you suggested in a previous mail some months ago.

Also super cool, but not immediate ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27
From: Ondřej Č. <ond...@gm...> - 2012-11-24 21:46:32
On Sat, Nov 24, 2012 at 1:42 PM, Ondřej Čertík <ond...@gm...> wrote:
> Hi,
>
> I am using Ubuntu 12.04, numexpr-1.4.2, hdf5-1.8.6 and tables-2.4.0. [...]
> So there must be some trick here, and I didn't figure it out so far.
> Does anyone know what is going on? It looks like some paths are not
> propagated properly in tableExtension.pyx?

Ok, I forgot to manually erase the old tables installation, i.e. this fixed my problem:

```
rm -rf /home/ondrej/repos/qsnake/local/lib/python2.6/site-packages/tables/
```

and then reinstall tables. Sorry about that. This feature of Python is highly annoying: it picks up the old installation and things stop working.

Ondrej
From: Ondřej Č. <ond...@gm...> - 2012-11-24 21:42:26
Hi,

I am using Ubuntu 12.04, numexpr-1.4.2, hdf5-1.8.6 and tables-2.4.0. I am using the latest numpy 1.7.x from the release branch; in particular, I am using the commit 3a52aa0.

After installing everything I get:

```
$ python
Python 2.6.4 (r264:75706, Dec 17 2011, 17:17:12)
[GCC 4.6.1] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> import tables
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ondrej/repos/qsnake/local/lib/python2.6/site-packages/tables/__init__.py", line 64, in <module>
    from tables.file import File, openFile, copyFile
  File "/home/ondrej/repos/qsnake/local/lib/python2.6/site-packages/tables/file.py", line 47, in <module>
    from tables.table import Table
  File "/home/ondrej/repos/qsnake/local/lib/python2.6/site-packages/tables/table.py", line 27, in <module>
    from tables import tableExtension
  File "tableExtension.pyx", line 31, in init tables.tableExtension (tables/tableExtension.c:16214)
  File "/home/ondrej/repos/qsnake/local/lib/python2.6/site-packages/tables/conditions.py", line 31, in <module>
    from numexpr.necompiler import typecode_to_kind
ImportError: No module named necompiler
>>>
```

But numexpr itself seems to be working just fine:

```
>>> from numexpr.necompiler import typecode_to_kind
>>>
```

So there must be some trick here, and I didn't figure it out so far. Does anyone know what is going on? It looks like some paths are not propagated properly in tableExtension.pyx?

Ondrej
From: Jon W. <js...@fn...> - 2012-11-19 20:59:51
Hi Anthony,

On 11/17/2012 11:49 AM, Anthony Scopatz wrote:

> Hi Jon,
>
> Barring changes to numexpr itself, this is exactly what I am suggesting.
> Well, either writing one query expr per bin or (more cleverly) writing one
> expr which, when evaluated for a row, returns the integer bin number (1, 2,
> 3, ...) this row falls in. Then you can simply count() for each bin number.
>
> For example, if you wanted to histogram data which ran from [0,100] into 10
> bins, then the expr "r/10" into a dtype=int would do the trick. This has the
> advantage of only running over the data once. (Also, I am not convinced that
> running over the data multiple times is less efficient than doing row-based
> iteration. You would have to test it on your data to find out.)
>
>> It is a reduction operation, and would greatly benefit from chunking, I
>> expect. Not unlike sum(), which is implemented as a specially supported
>> reduction operation inside numexpr (buggily, last I checked). I suspect
>> that a substantial improvement in histogramming requires direct support
>> from either pytables or from numexpr. I don't suppose that there might be
>> a chunked-reduction interface exposed somewhere that I could hook into?
>
> This is definitively a feature to request from numexpr.

I've been fiddling around with Stephen's code a bit, and it looks like the best way to do things is to read chunks (whether exactly of table.chunksize or not is a matter for optimization) of the data in at a time, and create histograms of those chunks. Then combining the histograms is a trivial sum operation. This type of approach can be generically applied in many cases, I suspect, where row-by-row iteration is prohibitively slow, but the dataset is too large to fit into memory. As I understand, this idea is the primary win of PyTables in the first place!

So, I think it would be extraordinarily helpful to provide a chunked-iteration interface for this sort of use case. It can be as simple as a wrapper around Table.read():

```python
class Table:
    def chunkiter(self, field=None):
        while n*self.chunksize < self.nrows:
            yield self.read(n*self.chunksize, (n+1)*self.chunksize, field=field)
```

Then I can write something like

```python
bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins) for chunk in mytable.chunkiter(myfield))
```

Preliminary tests seem to indicate that, for a table with 1 column and 10M rows, reading in "chunks" of 10x chunksize gives the best read-time-per-row. This is perhaps naive as regards chunksize black magic, though...

And of course, if implemented by numexpr, it could benefit from the nice automatic multithreading there.

Also, I might dig in a bit and see about extending the "field" argument to read so it can read multiple fields at once (to do N-dimensional histograms), as you suggested in a previous mail some months ago.

Best Regards,
Jon
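For completeness, a runnable variant of the chunk-wise idea sketched above (the snippet in the message leaves `n` undefined, and `numpy.histogram` returns a `(counts, edges)` pair, so the bare `sum(...)` would not work as written) might look like the following; the column name and the 10x chunk multiplier are assumptions for illustration:

```python
# Chunk-wise histogramming over an existing PyTables Table, along the lines
# sketched above.  `table` is an open tables.Table, `field` is a numeric
# column name; the 10x multiplier follows the observation in the message.
import numpy as np

def chunked_histogram(table, field, bins):
    """Accumulate np.histogram counts over one column, block by block."""
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    step = 10 * table.chunkshape[0]            # rows per read; tune as needed
    for start in xrange(0, table.nrows, step):
        block = table.read(start, min(start + step, table.nrows), field=field)
        c, _ = np.histogram(block, bins=bins)  # histogram returns (counts, edges)
        counts += c
    return counts

# Example (assuming a float column named 'x'):
# bins = np.linspace(-1.0, 1.0, 101)
# hist = chunked_histogram(mytable, 'x', bins)
```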
From: Shyam <shy...@gm...> - 2012-11-19 02:27:18
Hi,

I am trying a simple example of creating a VLArray as follows:

```python
filename = "test.h5"
h5file = openFile(filename, mode="w", title="Test file")
group = h5file.createGroup("/", 'detector', 'Detector information')
vlarray2 = h5file.createVLArray(group, 'vlarray2', StringAtom(itemsize=20),
                                "ragged array of strings", filters=Filters(1))
vlarray2.flavor = 'python'
vlarray2.append(['5', '66'])
vlarray2.append(['5', '6', '777'])
vlarray2.append(['5', '6', '9', '88'])
```

This seems fine, but if I try to expand the list in one of the VLArray rows:

```python
vlarray2[0] = ['5', '66', '13']
```

I get an error:

```
ValueError: Length of value (3) is larger than number of elements in row (2)
```

So this means that I cannot grow the list at runtime. If so, is there any other way of achieving this?

Regards
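Since a VLArray row cannot be grown in place, one possible workaround (only a sketch, not an official PyTables mechanism) is to append the updated row and keep a small mapping from logical rows to their current physical rows:

```python
# One possible workaround (a sketch, not an official PyTables mechanism):
# a VLArray row cannot grow in place, so append the updated row at the end
# and remember which physical row currently holds each logical entry.
current = {}                       # logical index -> physical row number

def set_row(vlarray, logical_idx, values):
    """'Replace' a logical row by appending a new physical row."""
    vlarray.append(values)
    current[logical_idx] = vlarray.nrows - 1

def get_row(vlarray, logical_idx):
    # Fall back to the original physical row if it was never "replaced".
    return vlarray[current.get(logical_idx, logical_idx)]

# With vlarray2 from the message above:
# set_row(vlarray2, 0, ['5', '66', '13'])
# get_row(vlarray2, 0)   # -> ['5', '66', '13']
```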
From: David W. <so...@av...> - 2012-11-18 23:29:48
Yes _please_ Stephen. It would be much appreciated.

On 19/11/2012, at 8:12 AM, Jon Wilson wrote:

> Hi Stephen,
> This sounds fantastic, and exactly what I'm looking for. I'll take a closer
> look tomorrow.
> Jon
>
> Stephen Simmons <ma...@st...> wrote:
>> [...] I can send a proper Python source package for it if anyone is
>> interested.
>> Regards
>> Stephen

_________________________________________________
experimental polymedia: www.avatar.com.au
Sonic Communications Research Group, University of Canberra: creative.canberra.edu.au/scrg
From: Jon W. <js...@fn...> - 2012-11-18 21:12:47
|
Hi Stephen, This sounds fantastic, and exactly what i'm looking for. I'll take a closer look tomorrow. Jon Stephen Simmons <ma...@st...> wrote: >Back in 2006/07 I wrote an optimized histogram function for pytables + >numpy. The main steps were: - Read in chunksize-sections of the >pytables >array so the HDF5 library just needs to decompress full blocks of data >from disk into memory; eliminates subsequent copying/merging of partial > >data blocks - Modify numpy's bincount function to be more suitable for >high-speed histograms by avoiding data type conversions, eliminate >initial pass to determine bounds, etc. - Also I modified the numpy >histogram function to update existing histogram counts. This meant huge > >pytables datasets could be histogrammed by reading in successive >chunks. >- I also wrote numpy function in C to do weighted averages and simple >joins. Net result of optimising both the pytables data storage and the >numpy histogramming was probably a 50x increase in speed. Certainly I >was getting >1m rows/sec for weighted average histograms, using a 2005 >Dell laptop. I had plans to submit it as a patch to numpy, but work >priorities at the time took me in another direction. One email about it > >with some C code is here: >http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html > >I can send a proper Python source package for it if anyone is >interested. Regards Stephen ------------------------------ Message: 3 >Date: Sat, 17 Nov 2012 23:54:39 +0100 From: Francesc Alted ><fa...@gm...> Subject: Re: [Pytables-users] Histogramming 1000x >too >slow To: Discussion list for PyTables ><pyt...@li...> Message-ID: ><50A...@gm...> Content-Type: text/plain; >charset=ISO-8859-1; >format=flowed On 11/16/12 6:02 PM, Jon Wilson wrote: > >> Hi all, >> I am trying to find the best way to make histograms from large data >> sets. Up to now, I've been just loading entire columns into >in-memory >> numpy arrays and making histograms from those. However, I'm >currently >> working on a handful of datasets where this is prohibitively memory >> intensive (causing an out-of-memory kernel panic on a shared machine >> that you have to open a ticket to have rebooted makes you a little >> gun-shy), so I am now exploring other options. >> >> I know that the Column object is rather nicely set up to act, in some >> circumstances, like a numpy ndarray. So my first thought is to try >just >> creating the histogram out of the Column object directly. This is, >> however, 1000x slower than loading it into memory and creating the >> histogram from the in-memory array. Please see my test notebook at: >> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html >> >> For such a small table, loading into memory is not an issue. For >larger >> tables, though, it is a problem, and I had hoped that pytables was >> optimized so that histogramming directly from disk would proceed no >> slower than loading into memory and histogramming. Is there some >other >> way of accessing the column (or Array or CArray) data that will make >> faster histograms? > >Indeed a 1000x slowness is quite a lot, but it is important to stress >that you are doing an disk operation whenever you are accessing a data >element, and that takes time. Perhaps using Array or CArray would make >times a bit better, but frankly, I don't think this is going to buy you >too much speed. > >The problem here is that you have too many layers, and this makes >access >slower. 
You may have better luck with carray >(https://github.com/FrancescAlted/carray), that supports this sort of >operations, but using a much simpler persistence machinery. At any >rate, the results are far better than PyTables: > >In [6]: import numpy as np > >In [7]: import carray as ca > >In [8]: N = 1e7 > >In [9]: a = np.random.rand(N) > >In [10]: %time h = np.histogram(a) >CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s >Wall time: 0.55 s > >In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray') > >In [12]: %time h = np.histogram(ad) >CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s >Wall time: 5.81 s > >So, the overhead for using a disk-based array is just 10x (not 1000x as >in PyTables). I don't know if a 10x slowdown is acceptable to you, but >in case you need more speed, you could probably implement the histogram >as a method of the carray class in Cython: > >https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651 > >It should not be too difficult to come up with an optimal >implementation >using a chunk-based approach. > >-- Francesc Alted ------------------------------ > > >------------------------------------------------------------------------------ >Monitor your physical, virtual and cloud infrastructure from a single >web console. Get in-depth insight into apps, servers, databases, >vmware, >SAP, cloud infrastructure, etc. Download 30-day Free Trial. >Pricing starts from $795 for 25 servers or applications! >http://p.sf.net/sfu/zoho_dev2dev_nov >_______________________________________________ >Pytables-users mailing list >Pyt...@li... >https://lists.sourceforge.net/lists/listinfo/pytables-users -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. |
From: Stephen S. <ma...@st...> - 2012-11-18 18:51:16
|
Back in 2006/07 I wrote an optimized histogram function for pytables + numpy. The main steps were: - Read in chunksize-sections of the pytables array so the HDF5 library just needs to decompress full blocks of data from disk into memory; eliminates subsequent copying/merging of partial data blocks - Modify numpy's bincount function to be more suitable for high-speed histograms by avoiding data type conversions, eliminate initial pass to determine bounds, etc. - Also I modified the numpy histogram function to update existing histogram counts. This meant huge pytables datasets could be histogrammed by reading in successive chunks. - I also wrote numpy function in C to do weighted averages and simple joins. Net result of optimising both the pytables data storage and the numpy histogramming was probably a 50x increase in speed. Certainly I was getting >1m rows/sec for weighted average histograms, using a 2005 Dell laptop. I had plans to submit it as a patch to numpy, but work priorities at the time took me in another direction. One email about it with some C code is here: http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html I can send a proper Python source package for it if anyone is interested. Regards Stephen ------------------------------ Message: 3 Date: Sat, 17 Nov 2012 23:54:39 +0100 From: Francesc Alted <fa...@gm...> Subject: Re: [Pytables-users] Histogramming 1000x too slow To: Discussion list for PyTables <pyt...@li...> Message-ID: <50A...@gm...> Content-Type: text/plain; charset=ISO-8859-1; format=flowed On 11/16/12 6:02 PM, Jon Wilson wrote: > Hi all, > I am trying to find the best way to make histograms from large data > sets. Up to now, I've been just loading entire columns into in-memory > numpy arrays and making histograms from those. However, I'm currently > working on a handful of datasets where this is prohibitively memory > intensive (causing an out-of-memory kernel panic on a shared machine > that you have to open a ticket to have rebooted makes you a little > gun-shy), so I am now exploring other options. > > I know that the Column object is rather nicely set up to act, in some > circumstances, like a numpy ndarray. So my first thought is to try just > creating the histogram out of the Column object directly. This is, > however, 1000x slower than loading it into memory and creating the > histogram from the in-memory array. Please see my test notebook at: > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > For such a small table, loading into memory is not an issue. For larger > tables, though, it is a problem, and I had hoped that pytables was > optimized so that histogramming directly from disk would proceed no > slower than loading into memory and histogramming. Is there some other > way of accessing the column (or Array or CArray) data that will make > faster histograms? Indeed a 1000x slowness is quite a lot, but it is important to stress that you are doing an disk operation whenever you are accessing a data element, and that takes time. Perhaps using Array or CArray would make times a bit better, but frankly, I don't think this is going to buy you too much speed. The problem here is that you have too many layers, and this makes access slower. You may have better luck with carray (https://github.com/FrancescAlted/carray), that supports this sort of operations, but using a much simpler persistence machinery. 
At any rate, the results are far better than PyTables: In [6]: import numpy as np In [7]: import carray as ca In [8]: N = 1e7 In [9]: a = np.random.rand(N) In [10]: %time h = np.histogram(a) CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s Wall time: 0.55 s In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray') In [12]: %time h = np.histogram(ad) CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s Wall time: 5.81 s So, the overhead for using a disk-based array is just 10x (not 1000x as in PyTables). I don't know if a 10x slowdown is acceptable to you, but in case you need more speed, you could probably implement the histogram as a method of the carray class in Cython: https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651 It should not be too difficult to come up with an optimal implementation using a chunk-based approach. -- Francesc Alted ------------------------------ |
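A minimal sketch of the chunk-and-accumulate approach Stephen describes above: read the column in chunk-sized blocks and keep updating one running set of counts. This is not Stephen's package; the file, group and column names are invented for illustration, and the calls use the PyTables 2.x spellings (openFile, Table.read) that were current when this thread was written.

import numpy as np
import tables

def chunked_histogram(table, colname, edges, blocksize=1000000):
    """Accumulate histogram counts for one column, one block at a time."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, table.nrows, blocksize):
        stop = min(start + blocksize, table.nrows)
        block = table.read(start, stop, field=colname)  # one contiguous read
        counts += np.histogram(block, bins=edges)[0]    # update running counts
    return counts

# Hypothetical usage; 'events.h5', '/events' and 'energy' are made-up names.
h5 = tables.openFile('events.h5', mode='r')
edges = np.linspace(0.0, 100.0, 51)
counts = chunked_histogram(h5.root.events, 'energy', edges)
h5.close()

Picking blocksize as a multiple of the table's chunkshape keeps each read aligned with whole compressed chunks, which is the point of Stephen's first step.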
From: Lukas S. <luk...@gm...> - 2012-11-18 08:11:00
|
2012. 11. 17. 오후 12:46에 <pyt...@li...>님이 작성: > Send Pytables-users mailing list submissions to > pyt...@li... > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.sourceforge.net/lists/listinfo/pytables-users > or, via email, send a message with subject or body 'help' to > pyt...@li... > > You can reach the person managing the list at > pyt...@li... > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Pytables-users digest..." > > > Today's Topics: > > 1. Re: pyTable index from c++ (Jim Knoll) > 2. Store a reference to a dataset (Juan Manuel V?zquez Tovar) > 3. Histogramming 1000x too slow (Jon Wilson) > 4. Re: Histogramming 1000x too slow (Anthony Scopatz) > 5. Re: Histogramming 1000x too slow (Jon Wilson) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 9 Nov 2012 15:26:38 -0600 > From: Jim Knoll <jim...@sp...> > Subject: Re: [Pytables-users] pyTable index from c++ > To: 'Discussion list for PyTables' > <pyt...@li...> > Message-ID: > < > 142...@SP...> > Content-Type: text/plain; charset="us-ascii" > > Thanks for taking the time. > > Most of our tables are very wide lots of col.... and simple conditions > are common.... so that is why in-kernel makes almost no impact for me. > > -----Original Message----- > From: Francesc Alted [mailto:fa...@gm...] > Sent: Friday, November 09, 2012 11:27 AM > To: pyt...@li... > Subject: Re: [Pytables-users] pyTable index from c++ > > Well, expected performance of in-kernel (numexpr powered) queries wrt > regular (python) queries largely depends on where the bottleneck is. If > your table has a lot of columns, then the bottleneck is going to be more > on the I/O side, so you cannot expect a large difference in performance. > However, if your table has a small number of columns, then there is more > likelihood that bottleneck is CPU, and your chances to experiment a > difference are higher. > > Of course, having complex queries (i.e. queries that take conditions > over several columns, or just combinations of conditions in the same > column) makes the query more CPU intensive, and in-kernel normally wins > by a comfortable margin. > > Finally, what indexing is doing is to reduce the number of rows where > the conditions have to be evaluated, so depending on the cardinality of > the query and the associated index, you can get more or less speedup. > > Francesc > > On 11/9/12 5:12 PM, Jim Knoll wrote: > > > > Thanks for the reply. I will put some investigation of C++ access on > > my list for items to look at over the slow holiday season. > > > > For the short term we will store a C++ ready index as a different > > table object in the same h5 file. It will work... just a bit of a waste > > on disk space. > > > > One follow up question > > > > Why would my performance of > > > > for row in node.where('stringField == "SomeString"'): > > > > *not*be noticeably faster than > > > > for row in node: > > > > if row.stringField == "SomeString" : > > > > Specifically when there is no index. I understand and see the speed > > improvement only when I have a index. I expected to see some benefit > > from numexpr even with no index. I expected node.where() to be much > > faster. What I see is identical performance. Is numexpr benefit only > > seen for complex math like (floatField ** intField > otherFloatField) > > I did not see that to be the case on my first attempt.... Seems that I > > only benefit from a index. 
> > > > *From:*Anthony Scopatz [mailto:sc...@gm...] > > *Sent:* Friday, November 09, 2012 12:24 AM > > *To:* Discussion list for PyTables > > *Subject:* Re: [Pytables-users] pyTable index from c++ > > > > On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll > > <jim...@sp... <mailto:jim...@sp...>> > > wrote: > > > > I love the index function and promote the internal use of PyTables at > > my company. The availability of a indexed method to speed the search > > is the main reason why. > > > > We are a mixed shop using c++ to create H5 (just for the raw speed ... > > need to keep up with streaming data) End users start with python > > pyTables to consume the data. (Often after we have created indexes > > from python pytables.col.col1.createIndex()) > > > > Sometimes the users come up with something we want to do thousands of > > times and performance is critical. But then we are falling back to c++ > > We can use our own index method but would like to make dbl use of the > > PyTables index. > > > > I know the python table.where( is implemented in C. > > > > Hi Jim, > > > > This is only kind of true. Querying (ie all of the where*() methods) > > are actually mostly written in Python in the tables.py and > > expressions.py files. However, they make use of numexpr [1]. > > > > Is there a way to access that from c or c++? Don't mind if I need > > to do work to get the result I think in my case the work may be > > worth it. > > > > *PLAN 1:* One possibility is that the parts of PyTables are written in > > Cython. We could maybe try (without making any edits to these files) > > to convert them to Cython. This has the advantage that for Cython > > files, if you write the appropriate C++ header file and link against > > the shared library correctly, it is possible to access certain > > functions from C/C++. BUT, I am not sure how much of speed boost you > > would get out of this since you would still be calling out to the > > Python interpreter to get these result. You are just calling Python's > > virtual machine from C++ rather than calling it from Python (like > > normal). This has the advantage that you would basically get access to > > these functions acting on tables from C++. > > > > *PLAN 2:* Alternatively, numexpr itself is mostly written in C++ > > already. You should be able to call core numexpr functions directly. > > However, you would have to feed it data that you read from the tables > > yourself. These could even be table indexes. On a personal note, if > > you get code working that does this, I would be interested in seeing > > your implementation. (I have another project where I have tables that > > I want to query from C++) > > > > Let us know what route you ultimately end up taking or if you have any > > further questions! > > > > Be Well > > > > Anthony > > > > 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr > > > > > ------------------------------------------------------------------------ > > > > *Jim Knoll** > > *Data Developer** > > > > Spot Trading L.L.C > > 440 South LaSalle St., Suite 2800 > > Chicago, IL 60605 > > Office: 312.362.4550 <tel:312.362.4550> > > Direct: 312-362-4798 <tel:312-362-4798> > > Fax: 312.362.4551 <tel:312.362.4551> > > jim...@sp... <mailto:jim...@sp...> > > www.spottradingllc.com <http://www.spottradingllc.com/> > > > > > ------------------------------------------------------------------------ > > > > The information contained in this message may be privileged and > > confidential and protected from disclosure. 
If the reader of this > > message is not the intended recipient, or an employee or agent > > responsible for delivering this message to the intended recipient, > > you are hereby notified that any dissemination, distribution or > > copying of this communication is strictly prohibited. If you have > > received this communication in error, please notify us immediately > > by replying to the message and deleting it from your computer. > > Thank you. Spot Trading, LLC > > > > > > > ------------------------------------------------------------------------------ > > Everyone hates slow websites. So do we. > > Make your web apps faster with AppDynamics > > Download AppDynamics Lite for free today: > > http://p.sf.net/sfu/appdyn_d2d_nov > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > <mailto:Pyt...@li...> > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > ------------------------------------------------------------------------------ > > Everyone hates slow websites. So do we. > > Make your web apps faster with AppDynamics > > Download AppDynamics Lite for free today: > > http://p.sf.net/sfu/appdyn_d2d_nov > > > > > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------ > > Message: 2 > Date: Sun, 11 Nov 2012 01:39:33 +0100 > From: Juan Manuel V?zquez Tovar <jmv...@gm...> > Subject: [Pytables-users] Store a reference to a dataset > To: Discussion list for PyTables > <pyt...@li...> > Message-ID: > < > CAD...@ma...> > Content-Type: text/plain; charset="iso-8859-1" > > Hello, > > I have to deal in pytables with a very large dataset. The file already > compressed with blosc5 is about 5GB. Is it possible to store objects within > the same file, each of them containing a reference to a certain search over > the dataset? > It is like having a large numpy array and a mask of it in the same pytables > file. > > Thank you, > > Juanma > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > Message: 3 > Date: Fri, 16 Nov 2012 11:02:25 -0600 > From: Jon Wilson <js...@fn...> > Subject: [Pytables-users] Histogramming 1000x too slow > To: pyt...@li... > Message-ID: <50A...@fn...> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi all, > I am trying to find the best way to make histograms from large data > sets. Up to now, I've been just loading entire columns into in-memory > numpy arrays and making histograms from those. However, I'm currently > working on a handful of datasets where this is prohibitively memory > intensive (causing an out-of-memory kernel panic on a shared machine > that you have to open a ticket to have rebooted makes you a little > gun-shy), so I am now exploring other options. > > I know that the Column object is rather nicely set up to act, in some > circumstances, like a numpy ndarray. 
So my first thought is to try just > creating the histogram out of the Column object directly. This is, > however, 1000x slower than loading it into memory and creating the > histogram from the in-memory array. Please see my test notebook at: > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > For such a small table, loading into memory is not an issue. For larger > tables, though, it is a problem, and I had hoped that pytables was > optimized so that histogramming directly from disk would proceed no > slower than loading into memory and histogramming. Is there some other > way of accessing the column (or Array or CArray) data that will make > faster histograms? > Regards, > Jon > > > > ------------------------------ > > Message: 4 > Date: Fri, 16 Nov 2012 17:10:41 -0800 > From: Anthony Scopatz <sc...@gm...> > Subject: Re: [Pytables-users] Histogramming 1000x too slow > To: Discussion list for PyTables > <pyt...@li...> > Message-ID: > < > CAP...@ma...> > Content-Type: text/plain; charset="iso-8859-1" > > On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote: > > > Hi all, > > I am trying to find the best way to make histograms from large data > > sets. Up to now, I've been just loading entire columns into in-memory > > numpy arrays and making histograms from those. However, I'm currently > > working on a handful of datasets where this is prohibitively memory > > intensive (causing an out-of-memory kernel panic on a shared machine > > that you have to open a ticket to have rebooted makes you a little > > gun-shy), so I am now exploring other options. > > > > I know that the Column object is rather nicely set up to act, in some > > circumstances, like a numpy ndarray. So my first thought is to try just > > creating the histogram out of the Column object directly. This is, > > however, 1000x slower than loading it into memory and creating the > > histogram from the in-memory array. Please see my test notebook at: > > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > > > For such a small table, loading into memory is not an issue. For larger > > tables, though, it is a problem, and I had hoped that pytables was > > optimized so that histogramming directly from disk would proceed no > > slower than loading into memory and histogramming. Is there some other > > way of accessing the column (or Array or CArray) data that will make > > faster histograms? > > > > Hi Jon, > > This is not surprising since the column object itself is going to be > iterated > over per row. As you found, reading in each row individually will be > prohibitively expensive as compared to reading in all the data at one. > > To do this in the right way for data that is larger than system memory, you > need to read it in in chunks. Luckily there are tools to help you automate > this process already in PyTables. I would recommend that you use > expressions [1] or queries [2] to do your historgramming more efficiently. > > Be Well > Anthony > > 1. http://pytables.github.com/usersguide/libref/expr_class.html > 2. > > http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying > > > > > Regards, > > Jon > > > > > > > ------------------------------------------------------------------------------ > > Monitor your physical, virtual and cloud infrastructure from a single > > web console. Get in-depth insight into apps, servers, databases, vmware, > > SAP, cloud infrastructure, etc. Download 30-day Free Trial. > > Pricing starts from $795 for 25 servers or applications! 
> > http://p.sf.net/sfu/zoho_dev2dev_nov > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > Message: 5 > Date: Fri, 16 Nov 2012 21:33:46 -0600 > From: Jon Wilson <js...@fn...> > Subject: Re: [Pytables-users] Histogramming 1000x too slow > To: Discussion list for PyTables > <pyt...@li...>, Anthony Scopatz > <sc...@gm...> > Message-ID: <4c2...@em...> > Content-Type: text/plain; charset="utf-8" > > Hi Anthony, > I don't think that either of these help me here (unless I've misunderstood > something). I need to fill the histogram with every row in the table, so > querying doesn't gain me anything. (especially since the query just returns > an iterator over rows) I also don't need (at the moment) to compute any > function of the column data, just count (weighted) entries into various > bins. I suppose I could write one Expr for each bin of my histogram, but > that seems dreadfully inefficient and probably difficult to maintain. > > It is a reduction operation, and would greatly benefit from chunking, I > expect. Not unlike sum(), which is implemented as a specially supported > reduction operation inside numexpr (buggily, last I checked). I suspect > that a substantial improvement in histogramming requires direct support > from either pytables or from numexpr. I don't suppose that there might be > a chunked-reduction interface exposed somewhere that I could hook into? > Jon > > Anthony Scopatz <sc...@gm...> wrote: > > >On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote: > > > >> Hi all, > >> I am trying to find the best way to make histograms from large data > >> sets. Up to now, I've been just loading entire columns into > >in-memory > >> numpy arrays and making histograms from those. However, I'm > >currently > >> working on a handful of datasets where this is prohibitively memory > >> intensive (causing an out-of-memory kernel panic on a shared machine > >> that you have to open a ticket to have rebooted makes you a little > >> gun-shy), so I am now exploring other options. > >> > >> I know that the Column object is rather nicely set up to act, in some > >> circumstances, like a numpy ndarray. So my first thought is to try > >just > >> creating the histogram out of the Column object directly. This is, > >> however, 1000x slower than loading it into memory and creating the > >> histogram from the in-memory array. Please see my test notebook at: > >> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > >> > >> For such a small table, loading into memory is not an issue. For > >larger > >> tables, though, it is a problem, and I had hoped that pytables was > >> optimized so that histogramming directly from disk would proceed no > >> slower than loading into memory and histogramming. Is there some > >other > >> way of accessing the column (or Array or CArray) data that will make > >> faster histograms? > >> > > > >Hi Jon, > > > >This is not surprising since the column object itself is going to be > >iterated > >over per row. As you found, reading in each row individually will be > >prohibitively expensive as compared to reading in all the data at one. > > > >To do this in the right way for data that is larger than system memory, > >you > >need to read it in in chunks. 
Luckily there are tools to help you > >automate > >this process already in PyTables. I would recommend that you use > >expressions [1] or queries [2] to do your historgramming more > >efficiently. > > > >Be Well > >Anthony > > > >1. http://pytables.github.com/usersguide/libref/expr_class.html > >2. > > > http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying > > > > > > > >> Regards, > >> Jon > >> > >> > >> > > >------------------------------------------------------------------------------ > >> Monitor your physical, virtual and cloud infrastructure from a single > >> web console. Get in-depth insight into apps, servers, databases, > >vmware, > >> SAP, cloud infrastructure, etc. Download 30-day Free Trial. > >> Pricing starts from $795 for 25 servers or applications! > >> http://p.sf.net/sfu/zoho_dev2dev_nov > >> _______________________________________________ > >> Pytables-users mailing list > >> Pyt...@li... > >> https://lists.sourceforge.net/lists/listinfo/pytables-users > >> > > > > > >------------------------------------------------------------------------ > > > > >------------------------------------------------------------------------------ > >Monitor your physical, virtual and cloud infrastructure from a single > >web console. Get in-depth insight into apps, servers, databases, > >vmware, > >SAP, cloud infrastructure, etc. Download 30-day Free Trial. > >Pricing starts from $795 for 25 servers or applications! > >http://p.sf.net/sfu/zoho_dev2dev_nov > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >Pytables-users mailing list > >Pyt...@li... > >https://lists.sourceforge.net/lists/listinfo/pytables-users > > -- > Sent from my Android phone with K-9 Mail. Please excuse my brevity. > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > > ------------------------------------------------------------------------------ > Monitor your physical, virtual and cloud infrastructure from a single > web console. Get in-depth insight into apps, servers, databases, vmware, > SAP, cloud infrastructure, etc. Download 30-day Free Trial. > Pricing starts from $795 for 25 servers or applications! > http://p.sf.net/sfu/zoho_dev2dev_nov > > ------------------------------ > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > End of Pytables-users Digest, Vol 78, Issue 6 > ********************************************* > |
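To make the in-kernel query discussion in the quoted digest above more concrete, here is a small illustrative sketch. The file, table and column names are invented, it is not Jim's code, and it uses the PyTables 2.x names (openFile, createIndex) used in the thread.

import tables

# Hypothetical file, table and column names, for illustration only.
h5 = tables.openFile('ticks.h5', mode='a')
table = h5.root.ticks

# Plain Python filtering: every row is materialised as a Row object and the
# comparison runs in the interpreter.
slow = [r['price'] for r in table if r['symbol'] == 'SPY']

# In-kernel query: the condition string is compiled by numexpr and evaluated
# on whole I/O buffers, so only matching rows reach Python.
fast = [r['price'] for r in table.where('symbol == "SPY"')]

# Indexing the column lets where() skip non-matching chunks entirely.
table.cols.symbol.createIndex()
indexed = [r['price'] for r in table.where('symbol == "SPY"')]

h5.close()

As Francesc notes, on very wide tables all three variants spend most of their time reading the untouched columns, so the in-kernel form may not look faster until an index cuts down the rows that have to be read at all.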
From: Francesc A. <fa...@gm...> - 2012-11-17 22:54:49
|
On 11/16/12 6:02 PM, Jon Wilson wrote: > Hi all, > I am trying to find the best way to make histograms from large data > sets. Up to now, I've been just loading entire columns into in-memory > numpy arrays and making histograms from those. However, I'm currently > working on a handful of datasets where this is prohibitively memory > intensive (causing an out-of-memory kernel panic on a shared machine > that you have to open a ticket to have rebooted makes you a little > gun-shy), so I am now exploring other options. > > I know that the Column object is rather nicely set up to act, in some > circumstances, like a numpy ndarray. So my first thought is to try just > creating the histogram out of the Column object directly. This is, > however, 1000x slower than loading it into memory and creating the > histogram from the in-memory array. Please see my test notebook at: > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > For such a small table, loading into memory is not an issue. For larger > tables, though, it is a problem, and I had hoped that pytables was > optimized so that histogramming directly from disk would proceed no > slower than loading into memory and histogramming. Is there some other > way of accessing the column (or Array or CArray) data that will make > faster histograms? Indeed a 1000x slowness is quite a lot, but it is important to stress that you are doing a disk operation whenever you are accessing a data element, and that takes time. Perhaps using Array or CArray would make times a bit better, but frankly, I don't think this is going to buy you too much speed. The problem here is that you have too many layers, and this makes access slower. You may have better luck with carray (https://github.com/FrancescAlted/carray), that supports this sort of operation, but using a much simpler persistence machinery. At any rate, the results are far better than PyTables: In [6]: import numpy as np In [7]: import carray as ca In [8]: N = 1e7 In [9]: a = np.random.rand(N) In [10]: %time h = np.histogram(a) CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s Wall time: 0.55 s In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray') In [12]: %time h = np.histogram(ad) CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s Wall time: 5.81 s So, the overhead for using a disk-based array is just 10x (not 1000x as in PyTables). I don't know if a 10x slowdown is acceptable to you, but in case you need more speed, you could probably implement the histogram as a method of the carray class in Cython: https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651 It should not be too difficult to come up with an optimal implementation using a chunk-based approach. -- Francesc Alted |
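Most of that remaining 10x likely comes from np.histogram stepping through the carray element by element. A chunk-at-a-time loop that decompresses whole blocks into plain numpy arrays usually lands much closer to the in-memory time. This is only a sketch against the carray API of that era (the chunklen attribute and numpy-style slicing), not the Cython method Francesc suggests.

import numpy as np
import carray as ca

def carray_histogram(carr, edges):
    """Histogram a (possibly disk-based) carray one chunk at a time."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    step = carr.chunklen                       # decompress whole chunks at once
    for start in range(0, len(carr), step):
        block = carr[start:start + step]       # slicing returns a numpy array
        counts += np.histogram(block, bins=edges)[0]
    return counts

# Mirrors Francesc's example; point rootdir at a path that does not exist yet.
ad = ca.carray(np.random.rand(int(1e7)), rootdir='/tmp/a.carray')
counts = carray_histogram(ad, np.linspace(0.0, 1.0, 11))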
From: David W. <dav...@gm...> - 2012-11-17 20:31:53
|
I've been using (and recommend) Pandas http://pandas.pydata.org/ along with this book: http://shop.oreilly.com/product/0636920023784.do Good luck, Dave On Fri, Nov 16, 2012 at 11:02 AM, Jon Wilson <js...@fn...> wrote: > Hi all, > I am trying to find the best way to make histograms from large data > sets. Up to now, I've been just loading entire columns into in-memory > numpy arrays and making histograms from those. However, I'm currently > working on a handful of datasets where this is prohibitively memory > intensive (causing an out-of-memory kernel panic on a shared machine > that you have to open a ticket to have rebooted makes you a little > gun-shy), so I am now exploring other options. > > I know that the Column object is rather nicely set up to act, in some > circumstances, like a numpy ndarray. So my first thought is to try just > creating the histogram out of the Column object directly. This is, > however, 1000x slower than loading it into memory and creating the > histogram from the in-memory array. Please see my test notebook at: > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > For such a small table, loading into memory is not an issue. For larger > tables, though, it is a problem, and I had hoped that pytables was > optimized so that histogramming directly from disk would proceed no > slower than loading into memory and histogramming. Is there some other > way of accessing the column (or Array or CArray) data that will make > faster histograms? > Regards, > Jon > > > ------------------------------------------------------------------------------ > Monitor your physical, virtual and cloud infrastructure from a single > web console. Get in-depth insight into apps, servers, databases, vmware, > SAP, cloud infrastructure, etc. Download 30-day Free Trial. > Pricing starts from $795 for 25 servers or applications! > http://p.sf.net/sfu/zoho_dev2dev_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > -- David C. Wilson (612) 460-1329 dav...@gm... http://www.linkedin.com/in/davidcwilson |
From: Anthony S. <sc...@gm...> - 2012-11-17 17:50:05
|
On Fri, Nov 16, 2012 at 7:33 PM, Jon Wilson <js...@fn...> wrote: > Hi Anthony, > I don't think that either of these help me here (unless I've misunderstood > something). I need to fill the histogram with every row in the table, so > querying doesn't gain me anything. (especially since the query just returns > an iterator over rows) I also don't need (at the moment) to compute any > function of the column data, just count (weighted) entries into various > bins. I suppose I could write one Expr for each bin of my histogram, but > that seems dreadfully inefficient and probably difficult to maintain. > Hi Jon, Barring changes to numexpr itself, this is exactly what I am suggesting. Well, either writing one query expr per bin or (more cleverly) writing one expr which when evaluated for a row returns the integer bin number (1, 2, 3,...) this row falls in. Then you can simply count() for each bin number. For example, if you wanted to histogram data which ran from [0,100] into 10 bins, then the expr "r/10" into a dtype=int would do the trick. This has the advantage of only running over the data once. (Also, I am not convinced that running over the data multiple times is less efficient than doing row-based iteration. You would have to test it on your data to find out.) > It is a reduction operation, and would greatly benefit from chunking, I > expect. Not unlike sum(), which is implemented as a specially supported > reduction operation inside numexpr (buggily, last I checked). I suspect > that a substantial improvement in histogramming requires direct support > from either pytables or from numexpr. I don't suppose that there might be a > chunked-reduction interface exposed somewhere that I could hook into? > This is definitely a feature to request from numexpr. Be Well Anthony > Jon > > Anthony Scopatz <sc...@gm...> wrote: >> >> On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote: >> >>> Hi all, >>> I am trying to find the best way to make histograms from large data >>> sets. Up to now, I've been just loading entire columns into in-memory >>> numpy arrays and making histograms from those. However, I'm currently >>> working on a handful of datasets where this is prohibitively memory >>> intensive (causing an out-of-memory kernel panic on a shared machine >>> that you have to open a ticket to have rebooted makes you a little >>> gun-shy), so I am now exploring other options. >>> >>> I know that the Column object is rather nicely set up to act, in some >>> circumstances, like a numpy ndarray. So my first thought is to try just >>> creating the histogram out of the Column object directly. This is, >>> however, 1000x slower than loading it into memory and creating the >>> histogram from the in-memory array. Please see my test notebook at: >>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html >>> >>> For such a small table, loading into memory is not an issue. For larger >>> tables, though, it is a problem, and I had hoped that pytables was >>> optimized so that histogramming directly from disk would proceed no >>> slower than loading into memory and histogramming. Is there some other >>> way of accessing the column (or Array or CArray) data that will make >>> faster histograms? >>> >> >> Hi Jon, >> >> This is not surprising since the column object itself is going to be >> iterated >> over per row. As you found, reading in each row individually will be >> prohibitively expensive as compared to reading in all the data at one. 
>> >> To do this in the right way for data that is larger than system memory, >> you >> need to read it in in chunks. Luckily there are tools to help you >> automate >> this process already in PyTables. I would recommend that you use >> expressions [1] or queries [2] to do your historgramming more efficiently. >> >> Be Well >> Anthony >> >> 1. http://pytables.github.com/usersguide/libref/expr_class.html >> 2. >> http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying >> >> >> >>> Regards, >>> Jon >>> >>> >>> ------------------------------------------------------------------------------ >>> Monitor your physical, virtual and cloud infrastructure from a single >>> web console. Get in-depth insight into apps, servers, databases, vmware, >>> SAP, cloud infrastructure, etc. Download 30-day Free Trial. >>> Pricing starts from $795 for 25 servers or applications! >>> http://p.sf.net/sfu/zoho_dev2dev_nov >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >> >> ------------------------------ >> >> Monitor your physical, virtual and cloud infrastructure from a single >> >> >> web console. Get in-depth insight into apps, servers, databases, vmware, >> SAP, cloud infrastructure, etc. Download 30-day Free Trial. >> Pricing starts from $795 for 25 servers or applications! >> http://p.sf.net/sfu/zoho_dev2dev_nov >> >> ------------------------------ >> >> Pytables-users mailing list >> >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > -- > Sent from my Android phone with K-9 Mail. Please excuse my brevity. > |
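Anthony's single-pass idea (one expression that returns each row's bin number) can be sketched with tables.Expr plus numpy's bincount. The column name and bin range below are invented, and note that the intermediate bin-index array is still one integer per row, so this buys a single pass over the data rather than a fully out-of-core reduction.

import numpy as np
import tables

h5 = tables.openFile('events.h5', mode='r')    # hypothetical file/column names
col = h5.root.events.cols.energy

lo, hi, nbins = 0.0, 100.0, 10
width = (hi - lo) / nbins

# numexpr evaluates the expression buffer by buffer; the result is one
# (float) bin index per row, converted to integers for counting.
expr = tables.Expr('(x - %r) / %r' % (lo, width), uservars={'x': col})
binidx = np.clip(expr.eval().astype(np.intp), 0, nbins - 1)
counts = np.bincount(binidx, minlength=nbins)

h5.close()

The per-bin alternative he mentions first would be one table.getWhereList('(lo <= energy) & (energy < hi)') call per pair of bin edges: simpler, but it re-reads the column once per bin.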
From: Jon W. <js...@fn...> - 2012-11-17 03:46:38
|
Hi Anthony, I don't think that either of these help me here (unless I've misunderstood something). I need to fill the histogram with every row in the table, so querying doesn't gain me anything. (especially since the query just returns an iterator over rows) I also don't need (at the moment) to compute any function of the column data, just count (weighted) entries into various bins. I suppose I could write one Expr for each bin of my histogram, but that seems dreadfully inefficient and probably difficult to maintain. It is a reduction operation, and would greatly benefit from chunking, I expect. Not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either pytables or from numexpr. I don't suppose that there might be a chunked-reduction interface exposed somewhere that I could hook into? Jon Anthony Scopatz <sc...@gm...> wrote: >On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote: > >> Hi all, >> I am trying to find the best way to make histograms from large data >> sets. Up to now, I've been just loading entire columns into >in-memory >> numpy arrays and making histograms from those. However, I'm >currently >> working on a handful of datasets where this is prohibitively memory >> intensive (causing an out-of-memory kernel panic on a shared machine >> that you have to open a ticket to have rebooted makes you a little >> gun-shy), so I am now exploring other options. >> >> I know that the Column object is rather nicely set up to act, in some >> circumstances, like a numpy ndarray. So my first thought is to try >just >> creating the histogram out of the Column object directly. This is, >> however, 1000x slower than loading it into memory and creating the >> histogram from the in-memory array. Please see my test notebook at: >> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html >> >> For such a small table, loading into memory is not an issue. For >larger >> tables, though, it is a problem, and I had hoped that pytables was >> optimized so that histogramming directly from disk would proceed no >> slower than loading into memory and histogramming. Is there some >other >> way of accessing the column (or Array or CArray) data that will make >> faster histograms? >> > >Hi Jon, > >This is not surprising since the column object itself is going to be >iterated >over per row. As you found, reading in each row individually will be >prohibitively expensive as compared to reading in all the data at one. > >To do this in the right way for data that is larger than system memory, >you >need to read it in in chunks. Luckily there are tools to help you >automate >this process already in PyTables. I would recommend that you use >expressions [1] or queries [2] to do your historgramming more >efficiently. > >Be Well >Anthony > >1. http://pytables.github.com/usersguide/libref/expr_class.html >2. >http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying > > > >> Regards, >> Jon >> >> >> >------------------------------------------------------------------------------ >> Monitor your physical, virtual and cloud infrastructure from a single >> web console. Get in-depth insight into apps, servers, databases, >vmware, >> SAP, cloud infrastructure, etc. Download 30-day Free Trial. >> Pricing starts from $795 for 25 servers or applications! 
>> http://p.sf.net/sfu/zoho_dev2dev_nov >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > > >------------------------------------------------------------------------ > >------------------------------------------------------------------------------ >Monitor your physical, virtual and cloud infrastructure from a single >web console. Get in-depth insight into apps, servers, databases, >vmware, >SAP, cloud infrastructure, etc. Download 30-day Free Trial. >Pricing starts from $795 for 25 servers or applications! >http://p.sf.net/sfu/zoho_dev2dev_nov > >------------------------------------------------------------------------ > >_______________________________________________ >Pytables-users mailing list >Pyt...@li... >https://lists.sourceforge.net/lists/listinfo/pytables-users -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. |
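There is no chunked-reduction hook in numexpr to plug into, but the reduction Jon describes, including the weighted counts he mentions, can be hand-rolled with the same kind of block reads sketched earlier in the thread. Column names are invented; this is an illustration, not an existing PyTables or numexpr feature.

import numpy as np
import tables

def chunked_weighted_hist(table, valcol, wcol, edges, blocksize=1000000):
    """Weighted histogram built as a chunked reduction over two columns."""
    counts = np.zeros(len(edges) - 1, dtype=np.float64)
    for start in range(0, table.nrows, blocksize):
        stop = min(start + blocksize, table.nrows)
        x = table.read(start, stop, field=valcol)
        w = table.read(start, stop, field=wcol)
        idx = np.searchsorted(edges, x, side='right') - 1   # bin of each value
        ok = (idx >= 0) & (idx < len(counts))               # drop under/overflow
        counts += np.bincount(idx[ok], weights=w[ok], minlength=len(counts))
    return counts

# Hypothetical usage; 'mass' and 'weight' are made-up column names.
h5 = tables.openFile('events.h5', mode='r')
edges = np.linspace(0.0, 500.0, 101)
counts = chunked_weighted_hist(h5.root.events, 'mass', 'weight', edges)
h5.close()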