From: Alvaro T. C. <al...@mi...> - 2012-04-25 12:05:51
Hi, a minor update on this thread.

>> * a bool array of 10**8 elements with True in two separate slices of
>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>> array). The resulting filesize is 248kb, still far from storing the 4
>> or 6 integer indexes that define the slices (I am experimenting with
>> an approach for scientific databases where this is a concern).
>
> Oh, you were asking for a 8 to 1 compressor (booleans as bits), but
> apparently a 350 to 1 is not enough? :)

Here I expected more from a run-length-like compression scheme. My array would be compressible to the following representation:

(0, x)           : 0
(x, x+10**6)     : 1
(x+10**6, y)     : 0
(y, y+10**6)     : 1
(y+10**6, 10**8) : 0

or just:

(x, x+10**6) : 1
(y, y+10**6) : 1

where x and y are two reasonable integers (i.e. in range and with no overlap).

>> * how blosc chooses the chunklen is black magic for me, but it seems to
>> be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>> 64*1024 when CArraying only one row).
>
> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
> you detail a bit more how you achieve this result? Providing an example
> would be very useful.

I revisited this issue. While in PyTables CArray the guesses are reasonable, the problem is in carray.carray (or in its reporting of chunklen). This is the offender:

carray((64, 15600000), int16)
  nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
  cparams := cparams(clevel=5, shuffle=True)

In [87]: x.chunklen
Out[87]: 1

Could it be that carray is not reporting the second dimension of the chunkshape? (In PyTables, this is 262144.)

The fact that both PyTables' CArray and carray.carray are named carray is a bit confusing.
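The interval representation sketched above can be computed directly from a boolean array with NumPy. A minimal sketch; `true_runs` is an illustrative helper, not part of carray or PyTables:

```python
import numpy as np

def true_runs(mask):
    """Return (start, stop) pairs for each run of True in a 1-D bool array."""
    # Pad with False on both ends so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # False -> True transitions
    stops = np.flatnonzero(edges == -1)   # True -> False transitions
    return [(int(s), int(e)) for s, e in zip(starts, stops)]

# Two True slices in an otherwise-False array, as in the example above
a = np.zeros(10**6, dtype=bool)
a[1000:2000] = True
a[5000:7500] = True
print(true_runs(a))  # [(1000, 2000), (5000, 7500)]
```

Storing just these pairs is the 4-to-6-integer representation discussed in the thread; a chunked block compressor cannot reach that, since it still emits per-block headers.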
> --
> Francesc Alted
>
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
From: Alvaro T. C. <al...@mi...> - 2012-04-25 11:13:37
Hi,

Thanks for the clarification. I retried today both with a normal and a completely sorted index on a blosc-compressed table (complevel 5) and could not reproduce the putative bug either.

-á.

On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
>> > Some complementary info (I copy the details of the tables below)
>> >
>> > timeit vals = numpy.fromiter((x['val'] for x in
>> > my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
>> > 1 loops, best of 3: 30.4 s per loop
>> >
>> > Using the compressed and indexed version, it mysteriously does not
>> > work (output is empty list)
>> >>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
>> >>>> dtype=np.int16)
>> >>>> cvals
>> > array([], dtype=int16)
>>
>> This smells like a bug, but I cannot reproduce it. Could you send a
>> self-contained example reproducing this behavior?
>
> I am not able to reproduce this either...
From: Anthony S. <sc...@gm...> - 2012-04-24 03:40:23
On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
> > Some complementary info (I copy the details of the tables below)
> >
> > timeit vals = numpy.fromiter((x['val'] for x in
> > my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
> > 1 loops, best of 3: 30.4 s per loop
> >
> > Using the compressed and indexed version, it mysteriously does not
> > work (output is empty list)
> >>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
> >>>> dtype=np.int16)
> >>>> cvals
> > array([], dtype=int16)
>
> This smells like a bug, but I cannot reproduce it. Could you send a
> self-contained example reproducing this behavior?

I am not able to reproduce this either...
From: Francesc A. <fa...@py...> - 2012-04-24 02:14:49
On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
> Some complementary info (I copy the details of the tables below)
>
> timeit vals = numpy.fromiter((x['val'] for x in
> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
> 1 loops, best of 3: 30.4 s per loop
>
> Using the compressed and indexed version, it mysteriously does not
> work (output is empty list)
>>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
>>>> cvals
> array([], dtype=int16)

This smells like a bug, but I cannot reproduce it. Could you send a self-contained example reproducing this behavior?

--
Francesc Alted
From: Francesc A. <fa...@py...> - 2012-04-24 02:10:21
On 4/18/12 12:33 PM, Alvaro Tejero Cantero wrote:
> A single array with 312 000 000 int16 values.
>
> Two (uncompressed) ways to store it:
>
> * Array
>>>> wa02[:10]
> array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)
>
> * Table wtab02 (single column, named 'val')
>>>> wtab02[:10]
> array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
>        (338,), (357,)],
>       dtype=[('val', '<i2')])
>
> read time respectively 120 ms, 220 ms.
>
>>>> timeit big=np.nonzero(wa02[:]>1)
> 1 loops, best of 3: 1.66 s per loop
>
>>>> timeit bigtab=wtab02.getWhereList('val>1')
> 1 loops, best of 3: 119 s per loop

Yes, this is expected. The reason one method is much faster than the other is precisely that one is designed for operating out-of-core, while the other operates completely in-memory, and this has a cost. But that does not mean that out-of-core necessarily has to be slower. Look at this:

In [107]: da
Out[107]:
/da (Array(10000000,)) ''
  atom := Int16Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [108]: dra
Out[108]:
/dra (Table(10000000,), shuffle, blosc(5)) ''
  description := {
  "a": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (65536,)

In [127]: time r = np.argwhere(da[:] == 1)
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.10 s

In [111]: time l = dra.getWhereList('a == 1')
CPU times: user 0.10 s, sys: 0.01 s, total: 0.11 s
Wall time: 0.11 s

So, tables' getWhereList() performance is pretty close to NumPy, even if the former is using compression. This is a great achievement. The reason I'm getting very different results than you is this:

In [119]: len(l)
Out[119]: 153

That is, the selectivity of the query is extremely high (153 out of 10 million elements), which is the scenario where queries are designed to shine.

If you use indexing, then you can get even more speed:

In [131]: dra.cols.a.createCSIndex()
Out[131]: 10000000

In [132]: time l = dra.getWhereList('a == 1')
CPU times: user 0.02 s, sys: 0.01 s, total: 0.03 s
Wall time: 0.02 s

In your case, using small selectivities (you are asking for possibly almost 50% of the initial dataset, perhaps less or perhaps more, depending on your data pattern), the data object creation (one per iteration in the loop) in PyTables becomes the big overhead:

In [134]: time r = np.argwhere(da[:] > 1)
CPU times: user 1.03 s, sys: 0.03 s, total: 1.06 s
Wall time: 1.12 s

In [135]: time l = dra.getWhereList('a > 1')
CPU times: user 5.62 s, sys: 0.16 s, total: 5.78 s
Wall time: 5.89 s

Now getWhereList() is more than 5x slower. Removing the index helps a bit here:

In [136]: dra.cols.a.removeIndex()

In [137]: time l = dra.getWhereList('a > 1')
CPU times: user 5.10 s, sys: 0.12 s, total: 5.22 s
Wall time: 5.30 s

But if the internal query machinery in PyTables is the same, why does it take longer? The short answer is object creation (and some data copying). getWhereList() can be expressed like this:

In [165]: time l = np.array([r.nrow for r in dra.where('a > 1')])
CPU times: user 5.54 s, sys: 0.09 s, total: 5.63 s
Wall time: 5.71 s

Now, if we count the time to get the coordinates only:

In [159]: time s = [r.nrow for r in dra.where('a > 1')]
CPU times: user 3.86 s, sys: 0.08 s, total: 3.95 s
Wall time: 4.02 s

This time is a bit long, but that is due to the .nrow implementation (a Cython property of the Row class; I wonder if this could be accelerated somewhat).

In general, the Row iterator can be much faster, for example in getting values:

In [161]: time s = [r['a'] for r in dra.where('a > 1')]
CPU times: user 1.57 s, sys: 0.07 s, total: 1.63 s
Wall time: 1.61 s

and you can notice that this is barely more than the time it takes for a pure list creation:

In [139]: time l = [r for r in xrange(len(l))]
CPU times: user 1.44 s, sys: 0.11 s, total: 1.55 s
Wall time: 1.53 s

So, the 'slow' times that you are seeing are a consequence of the different data object creation and the internal data copies (for building the final NumPy array). NumPy is much faster because all of this is done in pure C. But again, this does not preclude the fact that queries in PyTables are actually fast -- and potentially much faster than NumPy for high selectivities and indexing.

Hope this helps,

--
Francesc Alted
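The selectivity point Francesc makes can be illustrated without PyTables: a query's result must be materialized, so its cost grows with the number of hits. A rough pure-NumPy sketch mimicking the 'a == 1' (rare hits) versus 'a > 1' (bulk hits) queries above; the data here is synthetic, not the datasets from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
# Ten million int16 values in [0, 1000): the value 1 is rare (~1 in 1000),
# while values > 1 cover almost the whole array.
a = rng.integers(0, 1000, size=10_000_000).astype(np.int16)

hits_rare = np.argwhere(a == 1)  # high selectivity: ~10k coordinates
hits_bulk = np.argwhere(a > 1)   # low selectivity: ~99.8% of all rows

print(len(hits_rare), len(hits_bulk))
```

The highly selective query returns a tiny coordinate array, while the low-selectivity one allocates and fills millions of entries; this is the regime where per-row object creation and result copying dominate any query engine, in-core or out-of-core.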
From: Luc K. <luc...@ho...> - 2012-04-23 06:49:51
I've noticed that the solutions to the related issues have been taken up in recent Python versions, for example in the new Python 2.7.3 (http://hg.python.org/cpython/file/d46c1973d3c4/Misc/NEWS). The concerned issues were:

http://bugs.python.org/issue4120
http://bugs.python.org/issue7833

I asked Mark Hammond, who maintains http://pypi.python.org/pypi/pywin32/214, what the next step is in order to get the issues resolved. It appears that PyTables should be rebuilt against Python 2.7.3 if one wishes to use PyTables in combination with pywin32.

--------------------------------------------------
From: "Luc Kesters" <luc...@ho...>
Sent: Wednesday, February 22, 2012 10:52 AM
To: <pyt...@li...>
Subject: Re: [Pytables-users] Installation problems windows

> FYI: I've noticed a new version of win32 and asked what the situation of
> the before-mentioned issues was.
> The new version of win32 doesn't resolve the problem.
> See the reaction of Mark Hammond on Feb 20, 2012, 11:24pm:
> http://python.6.n6.nabble.com/ANN-pywin32-build-217-released-td4463462.html
>
> Message: 4
> Date: Sat, 11 Feb 2012 14:38:15 -0600
> From: Anthony Scopatz <sc...@gm...>
> Subject: Re: [Pytables-users] Pytables-users Digest, Vol 69, Issue 2
> To: Discussion list for PyTables <pyt...@li...>
>
> Hello Luc,
>
> Yes, this does seem to me more along the lines of the issues that you
> are experiencing. Unfortunately, we'll probably just have to wait for
> an upstream solution... Thanks for letting us know about it though.
>
> Be Well
> Anthony
>
> On Fri, Feb 10, 2012 at 2:27 PM, Luc Kesters <luc...@ho...> wrote:
>> The Python path is the same. So no luck, but I think the problem lies
>> elsewhere. Today I had the same problem when upgrading the package
>> pandas.
>> I've posted a question there and looking further I came up with:
>>
>> https://groups.google.com/forum/?fromgroups#!topic/isapi_wsgi-dev/A_orSF7CKB0
>>
>> which mentions the following issues:
>> http://bugs.python.org/issue4120
>> http://bugs.python.org/issue7833
>>
>> I read in the last one that maybe a solution is underway.
From: Anthony S. <sc...@gm...> - 2012-04-19 17:24:02
On Thu, Apr 19, 2012 at 11:46 AM, Alvaro Tejero Cantero <al...@mi...> wrote:
> I have to run, but here's what you requested (I won't be back on this
> computer until Monday)
>
> >>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')],
> dtype=np.int16)
> >>> cvals
> array([], dtype=int16)

Hmmm...

> >>> timeit big=np.argwhere(np.greater(wa02[:], 1))
> 1 loops, best of 3: 15.3 s per loop
>
> this gives me a mask,

argwhere() should not give you a mask. It should give you the coordinates:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.argwhere.html

Also, it seems like np.argwhere(np.greater(wa02[:], 1)) and np.argwhere(wa02[:]>1) should run in the same amount of time. At this point, though, we are just comparing the performance of numpy routines. What we really want is to compare numpy to PyTables. Maybe I'll try playing around with this this weekend.

> that I can get with
>
> >>> big2 = wa02[:]>1
> >>> np.alltrue(big == big2)
> True
>
> and in far less time:
> >>> timeit big2 = wa02[:]>1
> 1 loops, best of 3: 348 ms per loop
>
> -á.
>
> /raw/t0/wa02 (Array(312000000,)) ''
>   atom := Int16Atom(shape=(), dflt=0)
>   maindim := 0
>   flavor := 'numpy'
>   byteorder := 'little'
>   chunkshape := None
>
> On Thu, Apr 19, 2012 at 15:33, Anthony Scopatz <sc...@gm...> wrote:
> > I was interested in how long it takes to iterate, since this is arguably
> > where the majority of the time is spent.
> > [...]
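For reference, the three result types conflated in this exchange differ as follows; a small sketch with a toy array (illustrative only, not the wa02 data):

```python
import numpy as np

a = np.array([306, 345, 0, 1, 356], dtype=np.int16)

mask = a > 1                 # boolean mask, same shape as a
coords = np.argwhere(a > 1)  # (N, 1) array of coordinates for a 1-D input
idx = np.nonzero(a > 1)[0]   # flat index array, same as np.flatnonzero(a > 1)

print(mask)            # [ True  True False False  True]
print(coords.ravel())  # [0 1 4]
print(idx)             # [0 1 4]
```

The mask is one bool per element; argwhere and nonzero both return coordinates of the True entries, differing only in layout. So `big` in the quoted session (from argwhere) could not equal the boolean mask `big2` elementwise, which is what makes the reported np.alltrue result surprising.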
From: Alvaro T. C. <al...@mi...> - 2012-04-19 16:47:07
I have to run, but here's what you requested (I won't be back on this computer until Monday).

>>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')], dtype=np.int16)
>>> cvals
array([], dtype=int16)

>>> timeit big=np.argwhere(np.greater(wa02[:], 1))
1 loops, best of 3: 15.3 s per loop

this gives me a mask, that I can get with

>>> big2 = wa02[:]>1
>>> np.alltrue(big == big2)
True

and in far less time:

>>> timeit big2 = wa02[:]>1
1 loops, best of 3: 348 ms per loop

-á.

/raw/t0/wa02 (Array(312000000,)) ''
  atom := Int16Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

On Thu, Apr 19, 2012 at 15:33, Anthony Scopatz <sc...@gm...> wrote:
> I was interested in how long it takes to iterate, since this is arguably
> where the majority of the time is spent.
> [...]
From: Anthony S. <sc...@gm...> - 2012-04-19 14:33:38
I was interested in how long it takes to iterate, since this is arguably where the majority of the time is spent.

On Thu, Apr 19, 2012 at 8:43 AM, Alvaro Tejero Cantero <al...@mi...> wrote:
> Some complementary info (I copy the details of the tables below)
>
> timeit vals = numpy.fromiter((x['val'] for x in
> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
> 1 loops, best of 3: 30.4 s per loop
>
> Using the compressed and indexed version, it mysteriously does not
> work (output is empty list)
> >>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
> dtype=np.int16)
> >>> cvals
> array([], dtype=int16)

This doesn't work because numpy doesn't accept generators. The following should work:

>>> cvals = np.fromiter([x['val'] for x in wctab02.where('val>1')], dtype=np.int16)

Also, I am a little concerned that np.nonzero() doesn't really compare to Table.getWhereList('val>1'). Testing for all zero bits *should be* a lot faster than a numeric comparison. Could you instead try the same actual operation in numpy as whereList():

>>> timeit big=np.argwhere(np.greater(wa02[:], 1))

Thanks!
Anthony

> But it does if we skip using where (I don't print cvals, but it is
> correct)
> >>> timeit cvals = np.fromiter((x['val'] for x in wctab02 if x['val']>1),
> dtype=np.int16)
> 1 loops, best of 3: 54.8 s per loop
>
> (the version with longer chunklen works fine and times to 30.7 s).
>
> -á.
> [...]
From: Alvaro T. C. <al...@mi...> - 2012-04-19 13:43:59
Some complementary info (I copy the details of the tables below).

timeit vals = numpy.fromiter((x['val'] for x in my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
1 loops, best of 3: 30.4 s per loop

Using the compressed and indexed version, it mysteriously does not work (output is empty list):

>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
>>> cvals
array([], dtype=int16)

But it does if we skip using where (I don't print cvals, but it is correct):

>>> timeit cvals = np.fromiter((x['val'] for x in wctab02 if x['val']>1), dtype=np.int16)
1 loops, best of 3: 54.8 s per loop

(the version with longer chunklen works fine and times to 30.7 s).

-á.

wtab02: not compressed, not indexed, small chunklen:
/raw/t0/wtab02 (Table(312000000,)) ''
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (32768,)

larger chunklen (as calculated from expectedrows=312000000):
/raw/t0/wcetab02 (Table(312000000,)) 'test'
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (131072,)

wctab02: compressed, with CSI index:
/raw/t0/wctab02 (Table(312000000,), shuffle, blosc(9)) 'test'
  description := {
  "val": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (32768,)
  autoIndex := True
  colindexes := {
  "val": Index(9, full, shuffle, zlib(1)).is_CSI=True}

On Thu, Apr 19, 2012 at 12:46, Alvaro Tejero Cantero <al...@mi...> wrote:
> where will give me an iterator over the /values/; in this case I
> wanted the indexes. Plus, it will give me an iterator, so it will be
> trivially fast.
>
> Are you interested in the timings of where + building a list? Or where
> + building an array?
>
> -á.
>
> On Wed, Apr 18, 2012 at 19:02, Anthony Scopatz <sc...@gm...> wrote:
>>
From: Alvaro T. C. <al...@mi...> - 2012-04-19 11:46:29
where will give me an iterator over the /values/; in this case I wanted the indexes. Plus, it will give me an iterator, so it will be trivially fast.

Are you interested in the timings of where + building a list? Or where + building an array?

-á.

On Wed, Apr 18, 2012 at 19:02, Anthony Scopatz <sc...@gm...> wrote:
>
From: Anthony S. <sc...@gm...> - 2012-04-18 18:02:33
Hello Alvaro,

What are the timings using the normal where() method?
http://pytables.github.com/usersguide/libref.html?highlight=where#tables.Table.where

Be Well
Anthony

On Wed, Apr 18, 2012 at 12:33 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
> A single array with 312 000 000 int16 values.
> [...]
From: Alvaro T. C. <al...@mi...> - 2012-04-18 17:33:31
|
A single array with 312 000 000 int16 values. Two (uncompressed) ways to store it:

* Array
>>> wa02[:10]
array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')
>>> wtab02[:10]
array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,), (338,), (357,)], dtype=[('val', '<i2')])

read time respectively 120 ms, 220 ms.

>>> timeit big=np.nonzero(wa02[:]>1)
1 loops, best of 3: 1.66 s per loop

>>> timeit bigtab=wtab02.getWhereList('val>1')
1 loops, best of 3: 119 s per loop

with a Complete Sorted Index on val and blosc9 compression:
1 loops, best of 3: 149 s per loop

indicating expectedrows=312 000 000 (so that chunklen goes from 32K to 132K):
1 loops, best of 3: 119 s per loop

(I wanted to compare getting a boolean mask, but it seems that Tables don't have a .wheretrue like carrays in Francesc's carray package (?). For reference, just the mask takes 344 ms.)

---

Question: is the difference in speed due to in-core vs out-of-core querying?

If so, and if the maximum unit of data fits in memory (even considering loading a few columns to operate among them), is the corollary 'stay in memory at all costs'?

With this exercise, I was trying to find out what is the best structure to hold raw data (just one column in this case), and whether indexing could help in queries.

-á.
|
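[Editor's note: the fast in-core path from the benchmark above, in isolation. A sketch with synthetic data standing in for wa02; only the NumPy side is reproduced, since it carries no PyTables dependency.]

```python
import numpy as np

# Synthetic stand-in for the 312M-element wa02 Array (much smaller so it runs fast).
data = np.random.randint(0, 400, size=1_000_000).astype(np.int16)

# Equivalent of: big = np.nonzero(wa02[:] > 1) -- read everything, query in-core.
idx = np.nonzero(data > 1)[0]

# The boolean-mask variant (the .wheretrue analogue mentioned in the message).
mask = data > 1
```

The 1.66 s vs 119 s gap in the benchmark is between this whole-array-then-NumPy path and getWhereList's chunked out-of-core scan.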
From: Alvaro T. C. <al...@mi...> - 2012-04-16 16:04:26
|
I'm continuing this thread on the dev list. -á. On Fri, Apr 13, 2012 at 21:17, Anthony Scopatz <sc...@gm...> wrote: > > On Fri, Apr 13, 2012 at 12:30 PM, Alvaro Tejero Cantero <al...@mi...> > wrote: >> >> Hi Anthony, >> >> >> >> >> How does hierarchical help here? do you create a 'singer_name'/song >> >> table? or a 'genre name'/song ?. Most of the time the physical layout >> >> in the form of a hierarchy is just an annoyance. >> > >> > I have to say that I disagree. The hierarchical features make it so >> > that >> > the data maps well to both Python objects and file systems. I feel that >> > both of these are more natural to work with than having to construct a >> > query of joins, groupbys, etc which reconstructs my data. So while this >> > is just my opinion, I feel that hierarchies are much more natural to >> > work >> > with. >> > >> >> I try to see the difference; I think I could be helped by an example. >> SQL gives you views to hide the data layout. You will just have 'a >> table' that you can read by rows. Nothing else. What happens in the >> case outlined before if I want to group songs by genre? if I encode >> the relation in the hierarchy, I will have to write the code that >> generates the view, but since there are many ways to create the >> hierarchy, the code will be specific to one data layout... I'd love to >> be wrong here. > > > Yes, this is exactly correct. You do have to write the code that generates > the view based on how your file is organized. However, encoding this > information explicitly in the hierarchy can make look up faster than if you > have to search and join several large tables. > > Because PyTables makes writing this easy in most cases, I don't mind. > >> >> >> > Also, my sense is that there would be a fair bit of overhead in this >> >> > interface >> >> > layer, which might not get you the speed boost you desire. I could >> >> > be >> >> > wrong >> >> > about this though. 
>> >> >> >> I think you're right in the wrapping of the results via the Python >> >> interface to SQLite. I suspect you're not about the queries executed >> >> in the virtual table, because that is left for you to implement and >> >> thus you could turn the query terms (that are handed over to you) into >> >> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) >> > >> > >> > This was informed by my experience with SQLAlchemy which in some >> > situations added an excessively long computation times. With the >> > PyTables >> > infrastructure, we would at least have the option of writing the >> > performance >> > critical parts in C or Cython... >> >> You are talking here about writing specific /queries/ in Cython? just >> for clarification. > > > I am talking about writing the interface code between SQLite and PyTables > in C or (more likely) Cython. The queries themselves still are written in > SQL. > >> >> >> >> > If I saw a proof-of-concept implementation, I may grok better the >> >> > purpose. >> >> > Do you have any code to share? >> >> >> >> No, but I have an example ER diagram which is only part of what I >> >> need. You are welcome to have a look at it[2] >> >> > Sorry the text in this image is too small for me to read. >> >> I uploaded a larger version on the same location. > > > Thanks! > >> >> >> > Writing data-specific relational layers for your applications on top of >> > HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy >> > to perform in-memory manipulations and you have pretty much everything >> > that you need. I think this is why we don't have formal implementation >> > of the SQ Language in PyTables. >> >> Yes, fair enough. There is just no canonical way of doing it. >> >> What do you think of storing in the .attrs something like "This column >> of this table matches (in the sense of foreign key) that column in >> that table" ? 
>> >> Or would you store these relations in a global repository of sorts - a >> specific table? > > > Whenever I have wanted to mimic relation behavior in HDF5 I have used > the second method where you store a table(s) of relations somewhere in your > file and make sure that your 'data' tables have appropriate 'primary key' > columns. > > Attributes are an interesting idea but I would advise against it since space > is > limited [9] and access is slow [10]. > >> >> >> > I guess what i don't understand still is why - if you wanted to do this >> > - >> > would >> > you use the SQLite vtabs? This seems to have the worst of the SQL world >> > in terms of vendor lock in, compatibility with other SQL >> > implementations, >> > etc. >> >> That is true. I did a quick search and couldn't find if/how e.g >> Postgres has such an extensibility mechanism. >> >> At the same time, there's some commonality between SQLite and >> PyTables: single-file, no concurrency approach. If you need a >> full-fledged RDBMS with authorization etc. you are in another league >> and some abstractions may be difficult to map to PyTables. > > > This is a good point that I hadn't considered. > >> >> >> And RDBMS have received, recently, features that are of great interest >> for scientific users - for example, indexes optimized for spatial or >> interval queries[7][8]. >> >> > Instead, why not just write a SQLAlchemy dialect [6] that is backed by >> > PyTables? >> >> I considered this. I don't know how difficult it is. Do you think that >> this would be the way to go for implementing a thin relational layer >> on top of PyTables? >> >> As I have no practical experience with SQLAlchemy, I cannot foresee >> e.g. those performance drops that you were pinpointing above. > > > I am under the impression that a SQLite vtabs implementation would be > faster, but less general, than SQLAlchemy. But this is why I was asking > the question about "Why vtabs?" in the first place. 
I guess it comes down > to whoever implements it ;) > >> >> >> > Yes, this isn't 'self-contained' in that we know have a dependency on >> > SQLAlchemy. >> > However, if done right this would be an *optional* dependency. Are >> > there >> > reasons >> > to not do this that I am missing? I think that including something like >> > this as a >> > subpackage in PyTables is more reasonable than interfacing with SQLite >> > in specific. >> >> More reasonable in a general sense, I don't know. The mirror statement >> would be to say that adding support for Numpy containers to the Python >> database adapter would be a reasonable thing to do. >> >> >> > Thanks for fielding my questions here. >> >> A pleasure. I am trying to wrap my head around all the possibilities >> here. I think a documented PyTables use-case for a moderately complex >> scientific database could do a lot for its story. > > > I agree. I think that having something like what you propose available > would > be really interesting. It would be great to be able to say "And we support > SQL > queries if you need them!" > > However, I am concerned about how this would affect the existing PyTables > code base in terms of maintenance, compatibility with existing objects, > build system dependencies, etc. > > A lot of this decisions get made by the person who actually writes it. Thus > I > was asking if you had any code available. If I saw a partial > implementation > I could review it. > > So to answer your initial question, we would be interested in looking at a > SQL > interface layer for HDF5 using PyTables. At that point we could discuss > what it would take to integrate it back in upstream. However, since I am > not > personally all that interested in SQL, I probably wouldn't be the one to > write > this subpackage ;). > > If you are interested in implementing it in one of the two main ways we > discussed > but don't know which to pursue we can try to work that out here or on the > dev list. 
> If you really want to try vtabs or SQLAlchemy, I encourage you to try and > let us > know how it goes and if you have questions or need help! > > Be Well > Anthony > > [9] http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html#SpecIssues > [10] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame13Attributes.html > >> >> >> Cheers, >> >> Álvaro. >> >> [7] http://www.sqlite.org/rtree.html >> [8] http://www.postgresql.org/docs/8.1/static/gist.html >> >> > Be Well >> > Anthony >> > >> > >> > [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation >> > >> >> >> >> >> >> Cheers, >> >> >> >> Álvaro. >> >> -- >> >> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html >> >> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link >> >> to HDF5 data, or other tables with the real measurements, white tables >> >> are computed). >> >> [3] http://www.scidb.org/ >> >> [4] See p.26-29 and 32 >> >> >> >> >> >> http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf >> >> [5] >> >> >> >> https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 >> >> >> >> >> >> > Be Well >> >> > Anthony >> >> > >> >> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero >> >> > <al...@mi...> >> >> > wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> The topic of introducing some kind of relational management in >> >> >> PyTables comes up with certain frequency. >> >> >> >> >> >> Would it be possible to combine the virtues of RDBMS and hdf5's >> >> >> speed >> >> >> via a mechanism such as SQLite Virtual Tables? >> >> >> >> >> >> http://www.sqlite.org/vtab.html >> >> >> >> >> >> I wonder if the required x* functions could be written for PyTables, >> >> >> or if it being in Python is an obstacle to this kind of interfacing >> >> >> with SQLite. >> >> >> >> >> >> Something like that would be a truly powerful solution in use cases >> >> >> that don't require concurrency. 
>> >> >> Cheers, >> >> >> -á.
|
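[Editor's note: a toy version of the "relations table" approach discussed above -- data tables with explicit primary/foreign key columns, plus hand-written view code. Table and column names are hypothetical; NumPy structured arrays stand in for PyTables Tables so the sketch is self-contained.]

```python
import numpy as np

# 'Data' tables carrying explicit key columns, as in the second method.
genres = np.array([(0, b'rock'), (1, b'jazz')],
                  dtype=[('genre_id', 'i4'), ('name', 'S16')])
songs = np.array([(0, b'song_a', 1), (1, b'song_b', 0), (2, b'song_c', 1)],
                 dtype=[('song_id', 'i4'), ('title', 'S16'), ('genre_id', 'i4')])

# "Group songs by genre": the view code you write yourself for this layout.
jazz_id = genres['genre_id'][genres['name'] == b'jazz'][0]
jazz_songs = songs[songs['genre_id'] == jazz_id]['title']
```

With PyTables, the same lookup over the foreign-key column could be an in-kernel query (e.g. a where condition on genre_id) instead of a full in-memory comparison.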
From: Anthony S. <sc...@gm...> - 2012-04-13 20:18:06
|
On Fri, Apr 13, 2012 at 12:30 PM, Alvaro Tejero Cantero <al...@mi...>wrote: > Hi Anthony, > > > > >> How does hierarchical help here? do you create a 'singer_name'/song > >> table? or a 'genre name'/song ?. Most of the time the physical layout > >> in the form of a hierarchy is just an annoyance. > > > > I have to say that I disagree. The hierarchical features make it so that > > the data maps well to both Python objects and file systems. I feel that > > both of these are more natural to work with than having to construct a > > query of joins, groupbys, etc which reconstructs my data. So while this > > is just my opinion, I feel that hierarchies are much more natural to work > > with. > > > > I try to see the difference; I think I could be helped by an example. > SQL gives you views to hide the data layout. You will just have 'a > table' that you can read by rows. Nothing else. What happens in the > case outlined before if I want to group songs by genre? if I encode > the relation in the hierarchy, I will have to write the code that > generates the view, but since there are many ways to create the > hierarchy, the code will be specific to one data layout... I'd love to > be wrong here. > Yes, this is exactly correct. You do have to write the code that generates the view based on how your file is organized. However, encoding this information explicitly in the hierarchy can make look up faster than if you have to search and join several large tables. Because PyTables makes writing this easy in most cases, I don't mind. > >> > Also, my sense is that there would be a fair bit of overhead in this > >> > interface > >> > layer, which might not get you the speed boost you desire. I could be > >> > wrong > >> > about this though. > >> > >> I think you're right in the wrapping of the results via the Python > >> interface to SQLite. 
I suspect you're not about the queries executed > >> in the virtual table, because that is left for you to implement and > >> thus you could turn the query terms (that are handed over to you) into > >> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) > > > > > > This was informed by my experience with SQLAlchemy which in some > > situations added an excessively long computation times. With the > PyTables > > infrastructure, we would at least have the option of writing the > > performance > > critical parts in C or Cython... > > You are talking here about writing specific /queries/ in Cython? just > for clarification. > I am talking about writing the interface code between SQLite and PyTables in C or (more likely) Cython. The queries themselves still are written in SQL. > > >> > If I saw a proof-of-concept implementation, I may grok better the > >> > purpose. > >> > Do you have any code to share? > >> > >> No, but I have an example ER diagram which is only part of what I > >> need. You are welcome to have a look at it[2] > > > Sorry the text in this image is too small for me to read. > > I uploaded a larger version on the same location. > Thanks! > > > Writing data-specific relational layers for your applications on top of > > HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy > > to perform in-memory manipulations and you have pretty much everything > > that you need. I think this is why we don't have formal implementation > > of the SQ Language in PyTables. > > Yes, fair enough. There is just no canonical way of doing it. > > What do you think of storing in the .attrs something like "This column > of this table matches (in the sense of foreign key) that column in > that table" ? > > Or would you store these relations in a global repository of sorts - a > specific table? 
> Whenever I have wanted to mimic relation behavior in HDF5 I have used the second method where you store a table(s) of relations somewhere in your file and make sure that your 'data' tables have appropriate 'primary key' columns. Attributes are an interesting idea but I would advise against it since space is limited [9] and access is slow [10]. > > > I guess what i don't understand still is why - if you wanted to do this - > > would > > you use the SQLite vtabs? This seems to have the worst of the SQL world > > in terms of vendor lock in, compatibility with other SQL implementations, > > etc. > > That is true. I did a quick search and couldn't find if/how e.g > Postgres has such an extensibility mechanism. > > At the same time, there's some commonality between SQLite and > PyTables: single-file, no concurrency approach. If you need a > full-fledged RDBMS with authorization etc. you are in another league > and some abstractions may be difficult to map to PyTables. > This is a good point that I hadn't considered. > > And RDBMS have received, recently, features that are of great interest > for scientific users - for example, indexes optimized for spatial or > interval queries[7][8]. > > > Instead, why not just write a SQLAlchemy dialect [6] that is backed by > > PyTables? > > I considered this. I don't know how difficult it is. Do you think that > this would be the way to go for implementing a thin relational layer > on top of PyTables? > > As I have no practical experience with SQLAlchemy, I cannot foresee > e.g. those performance drops that you were pinpointing above. > I am under the impression that a SQLite vtabs implementation would be faster, but less general, than SQLAlchemy. But this is why I was asking the question about "Why vtabs?" in the first place. I guess it comes down to whoever implements it ;) > > > Yes, this isn't 'self-contained' in that we know have a dependency on > > SQLAlchemy. > > However, if done right this would be an *optional* dependency. 
Are there > > reasons > > to not do this that I am missing? I think that including something like > > this as a > > subpackage in PyTables is more reasonable than interfacing with SQLite > > in specific. > > More reasonable in a general sense, I don't know. The mirror statement > would be to say that adding support for Numpy containers to the Python > database adapter would be a reasonable thing to do. > > > > Thanks for fielding my questions here. > > A pleasure. I am trying to wrap my head around all the possibilities > here. I think a documented PyTables use-case for a moderately complex > scientific database could do a lot for its story. > I agree. I think that having something like what you propose available would be really interesting. It would be great to be able to say "And we support SQL queries if you need them!" However, I am concerned about how this would affect the existing PyTables code base in terms of maintenance, compatibility with existing objects, build system dependencies, etc. A lot of this decisions get made by the person who actually writes it. Thus I was asking if you had any code available. If I saw a partial implementation I could review it. So to answer your initial question, we would be interested in looking at a SQL interface layer for HDF5 using PyTables. At that point we could discuss what it would take to integrate it back in upstream. However, since I am not personally all that interested in SQL, I probably wouldn't be the one to write this subpackage ;). If you are interested in implementing it in one of the two main ways we discussed but don't know which to pursue we can try to work that out here or on the dev list. If you really want to try vtabs or SQLAlchemy, I encourage you to try and let us know how it goes and if you have questions or need help! Be Well Anthony [9] http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html#SpecIssues [10] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame13Attributes.html > > Cheers, > > Álvaro. 
> > [7] http://www.sqlite.org/rtree.html > [8] http://www.postgresql.org/docs/8.1/static/gist.html > > > Be Well > > Anthony > > > > > > [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation > > > >> > >> > >> Cheers, > >> > >> Álvaro. > >> -- > >> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html > >> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link > >> to HDF5 data, or other tables with the real measurements, white tables > >> are computed). > >> [3] http://www.scidb.org/ > >> [4] See p.26-29 and 32 > >> > >> > http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf > >> [5] > >> > https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 > >> > >> > >> > Be Well > >> > Anthony > >> > > >> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero > >> > <al...@mi...> > >> > wrote: > >> >> > >> >> Hi, > >> >> > >> >> The topic of introducing some kind of relational management in > >> >> PyTables comes up with certain frequency. > >> >> > >> >> Would it be possible to combine the virtues of RDBMS and hdf5's speed > >> >> via a mechanism such as SQLite Virtual Tables? > >> >> > >> >> http://www.sqlite.org/vtab.html > >> >> > >> >> I wonder if the required x* functions could be written for PyTables, > >> >> or if it being in Python is an obstacle to this kind of interfacing > >> >> with SQLite. > >> >> > >> >> Something like that would be a truly powerful solution in use cases > >> >> that don't require concurrency. > >> >> > >> >> Cheers, > >> >> > >> >> -á. > >> >> > >> >> > >> >> > >> >> > ------------------------------------------------------------------------------ > >> >> For Developers, A Lot Can Happen In A Second. > >> >> Boundary is the first to Know...and Tell You. > >> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! 
|
From: Alvaro T. C. <al...@mi...> - 2012-04-13 17:31:27
|
Hi Anthony, >> How does hierarchical help here? do you create a 'singer_name'/song >> table? or a 'genre name'/song ?. Most of the time the physical layout >> in the form of a hierarchy is just an annoyance. > > I have to say that I disagree. The hierarchical features make it so that > the data maps well to both Python objects and file systems. I feel that > both of these are more natural to work with than having to construct a > query of joins, groupbys, etc which reconstructs my data. So while this > is just my opinion, I feel that hierarchies are much more natural to work > with. > I try to see the difference; I think I could be helped by an example. SQL gives you views to hide the data layout. You will just have 'a table' that you can read by rows. Nothing else. What happens in the case outlined before if I want to group songs by genre? if I encode the relation in the hierarchy, I will have to write the code that generates the view, but since there are many ways to create the hierarchy, the code will be specific to one data layout... I'd love to be wrong here. >> > Also, my sense is that there would be a fair bit of overhead in this >> > interface >> > layer, which might not get you the speed boost you desire. I could be >> > wrong >> > about this though. >> >> I think you're right in the wrapping of the results via the Python >> interface to SQLite. I suspect you're not about the queries executed >> in the virtual table, because that is left for you to implement and >> thus you could turn the query terms (that are handed over to you) into >> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) > > > This was informed by my experience with SQLAlchemy which in some > situations added an excessively long computation times. With the PyTables > infrastructure, we would at least have the option of writing the > performance > critical parts in C or Cython... You are talking here about writing specific /queries/ in Cython? just for clarification. 
>> > If I saw a proof-of-concept implementation, I may grok better the >> > purpose. >> > Do you have any code to share? >> >> No, but I have an example ER diagram which is only part of what I >> need. You are welcome to have a look at it[2] > Sorry the text in this image is too small for me to read. I uploaded a larger version on the same location. > Writing data-specific relational layers for your applications on top of > HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy > to perform in-memory manipulations and you have pretty much everything > that you need. I think this is why we don't have formal implementation > of the SQ Language in PyTables. Yes, fair enough. There is just no canonical way of doing it. What do you think of storing in the .attrs something like "This column of this table matches (in the sense of foreign key) that column in that table" ? Or would you store these relations in a global repository of sorts - a specific table? > I guess what i don't understand still is why - if you wanted to do this - > would > you use the SQLite vtabs? This seems to have the worst of the SQL world > in terms of vendor lock in, compatibility with other SQL implementations, > etc. That is true. I did a quick search and couldn't find if/how e.g Postgres has such an extensibility mechanism. At the same time, there's some commonality between SQLite and PyTables: single-file, no concurrency approach. If you need a full-fledged RDBMS with authorization etc. you are in another league and some abstractions may be difficult to map to PyTables. And RDBMS have received, recently, features that are of great interest for scientific users - for example, indexes optimized for spatial or interval queries[7][8]. > Instead, why not just write a SQLAlchemy dialect [6] that is backed by > PyTables? I considered this. I don't know how difficult it is. Do you think that this would be the way to go for implementing a thin relational layer on top of PyTables? 
As I have no practical experience with SQLAlchemy, I cannot foresee e.g. those performance drops that you were pinpointing above. > Yes, this isn't 'self-contained' in that we know have a dependency on > SQLAlchemy. > However, if done right this would be an *optional* dependency. Are there > reasons > to not do this that I am missing? I think that including something like > this as a > subpackage in PyTables is more reasonable than interfacing with SQLite > in specific. More reasonable in a general sense, I don't know. The mirror statement would be to say that adding support for Numpy containers to the Python database adapter would be a reasonable thing to do. > Thanks for fielding my questions here. A pleasure. I am trying to wrap my head around all the possibilities here. I think a documented PyTables use-case for a moderately complex scientific database could do a lot for its story. Cheers, Álvaro. [7] http://www.sqlite.org/rtree.html [8] http://www.postgresql.org/docs/8.1/static/gist.html > Be Well > Anthony > > > [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation > >> >> >> Cheers, >> >> Álvaro. >> -- >> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html >> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link >> to HDF5 data, or other tables with the real measurements, white tables >> are computed). >> [3] http://www.scidb.org/ >> [4] See p.26-29 and 32 >> >> http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf >> [5] >> https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 >> >> >> > Be Well >> > Anthony >> > >> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero >> > <al...@mi...> >> > wrote: >> >> >> >> Hi, >> >> >> >> The topic of introducing some kind of relational management in >> >> PyTables comes up with certain frequency. 
>> >> >> >> Would it be possible to combine the virtues of RDBMS and hdf5's speed >> >> via a mechanism such as SQLite Virtual Tables? >> >> >> >> http://www.sqlite.org/vtab.html >> >> >> >> I wonder if the required x* functions could be written for PyTables, >> >> or if it being in Python is an obstacle to this kind of interfacing >> >> with SQLite. >> >> >> >> Something like that would be a truly powerful solution in use cases >> >> that don't require concurrency. >> >> >> >> Cheers, >> >> >> >> -á. >> >> >> >> >> >> >> >> ------------------------------------------------------------------------------ >> >> For Developers, A Lot Can Happen In A Second. >> >> Boundary is the first to Know...and Tell You. >> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> >> http://p.sf.net/sfu/Boundary-d2dvs2 >> >> _______________________________________________ >> >> Pytables-users mailing list >> >> Pyt...@li... >> >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> > >> > >> > >> > ------------------------------------------------------------------------------ >> > For Developers, A Lot Can Happen In A Second. >> > Boundary is the first to Know...and Tell You. >> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> > http://p.sf.net/sfu/Boundary-d2dvs2 >> > _______________________________________________ >> > Pytables-users mailing list >> > Pyt...@li... >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> >> >> ------------------------------------------------------------------------------ >> For Developers, A Lot Can Happen In A Second. >> Boundary is the first to Know...and Tell You. >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> http://p.sf.net/sfu/Boundary-d2dvs2 >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... 
>> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
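The two conventions floated in the message above (per-column `.attrs` annotations versus one global repository of relations) can be sketched with plain Python stand-ins. Note that the `FK_*` naming scheme and the `/table:column` path format below are invented for this illustration; they are not an existing PyTables convention:

```python
# Illustrative sketch only: plain dicts stand in for PyTables ``.attrs``
# sets, and the FK_* names and "/table:column" paths are made up here.

# Option 1: per-column attributes on the referring table
songs_attrs = {
    "FK_singer": "/singers:id",       # column 'singer' references /singers.id
    "FK_genre": "/genre_songs:song",  # column 'genre' goes through the n:m table
}

# Option 2: a single global repository of relations (one entry per link)
relations = [
    ("/songs:singer", "/singers:id"),
    ("/songs:genre", "/genre_songs:song"),
]

def foreign_keys(attrs):
    """Collect targets declared through the FK_* attribute convention."""
    return {name[3:]: target
            for name, target in attrs.items()
            if name.startswith("FK_")}

print(foreign_keys(songs_attrs))
# {'singer': '/singers:id', 'genre': '/genre_songs:song'}
```

Either layout is readable by generic tooling; the per-column variant keeps the relation next to the data, while the global table makes it easy to enumerate all links at once.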
From: Anthony S. <sc...@gm...> - 2012-04-13 16:29:40
|
On Fri, Apr 13, 2012 at 6:41 AM, Alvaro Tejero Cantero <al...@mi...>wrote: > Hi Anthony, > > > I can see how the virtual table interface could be made to work with > > PyTables, > > but I guess I don't understand why you would want to. It seems like in > this > > case you are querying using SQL rather than the more expressive Python. > > Yes, you'd be querying using SQL. > SQL is a documented declarative syntax for queries over relations. > Python offers many procedural routes to achieve e.g. joins, all of > them custom. If (a == b) | (c==d) is more expressive to you than > WHERE a=b OR c=d , then you can use SQLAlchemy [1], which wraps SQL in > a Pythonic query syntax. > Hello Alvaro, I am quite familiar with SQL and SQLAlchemy, having used these tools both personally and professionally. My initial question was not "What are you trying to do?" but rather "Why would you want to do it?" > > Moreover, you'd be sacrificing all of the 'H' in HDF5 features to obtain > > this. > [snip] > How does hierarchical help here? do you create a 'singer_name'/song > table? or a 'genre name'/song ?. Most of the time the physical layout > in the form of a hierarchy is just an annoyance. > I have to say that I disagree. The hierarchical features make it so that the data maps well to both Python objects and file systems. I feel that both of these are more natural to work with than having to construct a query of joins, groupbys, etc which reconstructs my data. So while this is just my opinion, I feel that hierarchies are much more natural to work with. > > > Also, my sense is that there would be a fair bit of overhead in this > > interface > > layer, which might not get you the speed boost you desire. I could be > wrong > > about this though. > > I think you're right in the wrapping of the results via the Python > interface to SQLite. 
I suspect you're not right about the queries executed > in the virtual table, because that is left for you to implement and > thus you could turn the query terms (that are handed over to you) into > in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) > This was informed by my experience with SQLAlchemy, which in some situations added excessively long computation times. With the PyTables infrastructure, we would at least have the option of writing the performance-critical parts in C or Cython... > > If I saw a proof-of-concept implementation, I may grok better the > purpose. > > Do you have any code to share? > > No, but I have an example ER diagram which is only part of what I > need. You are welcome to have a look at it[2] Sorry, the text in this image is too small for me to read. > and tell me how you'd > manage to support the jungle of relationships there with the H of > HDF5. In SQL I have a syntax to declare all those relationships. In > HDF5 I must decide on one hierarchical cut of those relations and > since it won't be enough, implement the relational layer on top of it, > perhaps using attrs to store paths everywhere. It can be done, but > the support out of the box at this point for this is next to nil > (maybe integrating something like recarray.joinby [5] would be > useful?) > Writing data-specific relational layers for your applications on top of HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy to perform in-memory manipulations and you have pretty much everything that you need. I think this is why we don't have a formal implementation of the SQL language in PyTables. > It looks to me, at this moment, that as soon as the data model gets > complicated HDF5 is in trouble, and as soon as very large, contiguous, > read-only datasets are involved relational RDBMSs are in trouble > (subsetting, speed). 
Since this is not a happy situation, several > people are interested in combining the strengths of both [3][4] and my > e-mail was just highlighting that there may be a way to go that may > make a self-contained, clear, understandable package for the scenarios > where PyTables is most often deployed (single-user). > Reference [4] is particularly interesting (they mention PyTables!) and they also propose basically what you are suggesting in their third option (integrated SQL & HDF5). > Or am I not seeing something obvious? > I guess what I don't understand still is why - if you wanted to do this - would you use the SQLite vtabs? This seems to have the worst of the SQL world in terms of vendor lock-in, compatibility with other SQL implementations, etc. Instead, why not just write a SQLAlchemy dialect [6] that is backed by PyTables? Yes, this isn't 'self-contained' in that we now have a dependency on SQLAlchemy. However, if done right this would be an *optional* dependency. Are there reasons to not do this that I am missing? I think that including something like this as a subpackage in PyTables is more reasonable than interfacing with SQLite specifically. Thanks for fielding my questions here. Be Well Anthony [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation > > Cheers, > > Álvaro. > -- > [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html > [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link > to HDF5 data, or other tables with the real measurements, white tables > are computed). > [3] http://www.scidb.org/ > [4] See p.26-29 and 32 > > http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf > [5] > https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 > > > > Be Well > > Anthony > > > > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero <al...@mi... 
> > > > wrote: > >> > >> Hi, > >> > >> The topic of introducing some kind of relational management in > >> PyTables comes up with certain frequency. > >> > >> Would it be possible to combine the virtues of RDBMS and hdf5's speed > >> via a mechanism such as SQLite Virtual Tables? > >> > >> http://www.sqlite.org/vtab.html > >> > >> I wonder if the required x* functions could be written for PyTables, > >> or if it being in Python is an obstacle to this kind of interfacing > >> with SQLite. > >> > >> Something like that would be a truly powerful solution in use cases > >> that don't require concurrency. > >> > >> Cheers, > >> > >> -á. > >> > >> > >> > ------------------------------------------------------------------------------ > >> For Developers, A Lot Can Happen In A Second. > >> Boundary is the first to Know...and Tell You. > >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > >> http://p.sf.net/sfu/Boundary-d2dvs2 > >> _______________________________________________ > >> Pytables-users mailing list > >> Pyt...@li... > >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > ------------------------------------------------------------------------------ > > For Developers, A Lot Can Happen In A Second. > > Boundary is the first to Know...and Tell You. > > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > > http://p.sf.net/sfu/Boundary-d2dvs2 > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pyt...@li... 
> https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Alvaro T. C. <al...@mi...> - 2012-04-13 11:42:03
|
Hi Anthony, > I can see how the virtual table interface could be made to work with > PyTables, > but I guess I don't understand why you would want to. It seems like in this > case you are querying using SQL rather than the more expressive Python. Yes, you'd be querying using SQL. SQL is a documented declarative syntax for queries over relations. Python offers many procedural routes to achieve e.g. joins, all of them custom. If (a == b) | (c==d) is more expressive to you than WHERE a=b OR c=d, then you can use SQLAlchemy [1], which wraps SQL in a Pythonic query syntax. > Moreover, you'd be sacrificing all of the 'H' in HDF5 features to obtain > this. What is the benefit of 'H'ierarchical that you have in mind? To me, hierarchy seems less expressive than general relations. After all, file systems are hierarchical and you're still going to HDF5 (and losing the panoply of filesystem-based tools with it). So clearly, the differential benefit of HDF5 is not at all in the hierarchical character. Take a list of e.g. songs with a foreign key 'singer' pointing at one row in the table of singers, and a foreign key 'genre' pointing at the genre_songs table, which in turn points to 'genres' (an n:m relationship). How does a hierarchy help here? Do you create a 'singer_name'/song table? Or a 'genre name'/song one? Most of the time the physical layout in the form of a hierarchy is just an annoyance. > Also, my sense is that there would be a fair bit of overhead in this > interface > layer, which might not get you the speed boost you desire. I could be wrong > about this though. I think you're right in the wrapping of the results via the Python interface to SQLite. 
I suspect you're not right about the queries executed in the virtual table, because that is left for you to implement and thus you could turn the query terms (that are handed over to you) into in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) > If I saw a proof-of-concept implementation, I may grok better the purpose. > Do you have any code to share? No, but I have an example ER diagram which is only part of what I need. You are welcome to have a look at it[2] and tell me how you'd manage to support the jungle of relationships there with the H of HDF5. In SQL I have a syntax to declare all those relationships. In HDF5 I must decide on one hierarchical cut of those relations and since it won't be enough, implement the relational layer on top of it, perhaps using attrs to store paths everywhere. It can be done, but the support out of the box at this point for this is next to nil (maybe integrating something like recarray.joinby [5] would be useful?) It looks to me, at this moment, that as soon as the data model gets complicated HDF5 is in trouble, and as soon as very large, contiguous, read-only datasets are involved relational RDBMSs are in trouble (subsetting, speed). Since this is not a happy situation, several people are interested in combining the strengths of both [3][4] and my e-mail was just highlighting that there may be a way to go that may make a self-contained, clear, understandable package for the scenarios where PyTables is most often deployed (single-user). Or am I not seeing something obvious? Cheers, Álvaro. -- [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link to HDF5 data, or other tables with the real measurements, white tables are computed). 
[3] http://www.scidb.org/ [4] See p.26-29 and 32 http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf [5] https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 > Be Well > Anthony > > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero <al...@mi...> > wrote: >> >> Hi, >> >> The topic of introducing some kind of relational management in >> PyTables comes up with certain frequency. >> >> Would it be possible to combine the virtues of RDBMS and hdf5's speed >> via a mechanism such as SQLite Virtual Tables? >> >> http://www.sqlite.org/vtab.html >> >> I wonder if the required x* functions could be written for PyTables, >> or if it being in Python is an obstacle to this kind of interfacing >> with SQLite. >> >> Something like that would be a truly powerful solution in use cases >> that don't require concurrency. >> >> Cheers, >> >> -á. >> >> >> ------------------------------------------------------------------------------ >> For Developers, A Lot Can Happen In A Second. >> Boundary is the first to Know...and Tell You. >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! >> http://p.sf.net/sfu/Boundary-d2dvs2 >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
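The songs/singers/genres example from this exchange can be written down directly as relations. A minimal sqlite3 sketch (schema and names invented for illustration) shows the declarative join that needs no up-front hierarchical cut:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE singers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE genres  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE songs   (id INTEGER PRIMARY KEY, title TEXT,
                          singer INTEGER REFERENCES singers(id));
    -- n:m link table between songs and genres
    CREATE TABLE genre_songs (song  INTEGER REFERENCES songs(id),
                              genre INTEGER REFERENCES genres(id));
""")
con.execute("INSERT INTO singers VALUES (1, 'Ella')")
con.execute("INSERT INTO genres VALUES (1, 'Jazz')")
con.execute("INSERT INTO songs VALUES (1, 'Summertime', 1)")
con.execute("INSERT INTO genre_songs VALUES (1, 1)")

# The declarative join: no hierarchy has to be chosen up front.
rows = con.execute("""
    SELECT s.title, si.name, g.name
    FROM songs s
    JOIN singers si ON s.singer = si.id
    JOIN genre_songs gs ON gs.song = s.id
    JOIN genres g ON g.id = gs.genre
""").fetchall()
print(rows)   # [('Summertime', 'Ella', 'Jazz')]
```

The same data in HDF5 would force one hierarchical cut (songs under singers, or under genres) and leave the other relation to be maintained by hand.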
From: Anthony S. <sc...@gm...> - 2012-04-12 20:04:47
|
Hello Alvaro, I can see how the virtual table interface could be made to work with PyTables, but I guess I don't understand why you would want to. It seems like in this case you are querying using SQL rather than the more expressive Python. Moreover, you'd be sacrificing all of the 'H' in HDF5 features to obtain this. Also, my sense is that there would be a fair bit of overhead in this interface layer, which might not get you the speed boost you desire. I could be wrong about this though. If I saw a proof-of-concept implementation, I may grok better the purpose. Do you have any code to share? Be Well Anthony On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero <al...@mi...>wrote: > Hi, > > The topic of introducing some kind of relational management in > PyTables comes up with certain frequency. > > Would it be possible to combine the virtues of RDBMS and hdf5's speed > via a mechanism such as SQLite Virtual Tables? > > http://www.sqlite.org/vtab.html > > I wonder if the required x* functions could be written for PyTables, > or if it being in Python is an obstacle to this kind of interfacing > with SQLite. > > Something like that would be a truly powerful solution in use cases > that don't require concurrency. > > Cheers, > > -á. > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Alvaro T. C. <al...@mi...> - 2012-04-12 16:03:32
|
Hi, The topic of introducing some kind of relational management in PyTables comes up with a certain frequency. Would it be possible to combine the virtues of an RDBMS and HDF5's speed via a mechanism such as SQLite Virtual Tables? http://www.sqlite.org/vtab.html I wonder if the required x* functions could be written for PyTables, or if it being in Python is an obstacle to this kind of interfacing with SQLite. Something like that would be a truly powerful solution in use cases that don't require concurrency. Cheers, -á. |
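For context on the x* functions mentioned here: the stdlib sqlite3 module cannot register virtual-table modules from Python, but apsw can. The cursor half of such a module might look roughly like the following pure-Python sketch; the class is not actually wired to SQLite, and the callback names only approximate the apsw convention:

```python
# Rough sketch of the cursor-side callbacks a virtual-table module must
# provide (names follow the apsw convention approximately; this class is
# not registered with SQLite here, it only shows the shape of the API).

class PyTablesCursor:
    def __init__(self, rows):
        self.rows = rows          # e.g. rows read from a PyTables Table
        self.pos = 0

    def Filter(self, indexnum, indexname, constraintargs):
        # This is where WHERE-clause terms handed over by SQLite could be
        # translated into in-kernel PyTables conditions.
        self.pos = 0

    def Eof(self):
        return self.pos >= len(self.rows)

    def Column(self, col):
        return self.rows[self.pos][col]

    def Next(self):
        self.pos += 1

    def Rowid(self):
        return self.pos

    def Close(self):
        pass

# Simulate how SQLite would drive the cursor:
cur = PyTablesCursor([(1, "a"), (2, "b")])
cur.Filter(0, None, ())
values = []
while not cur.Eof():
    values.append(cur.Column(1))
    cur.Next()
print(values)   # ['a', 'b']
```

The interesting part is `Filter`, which receives the constraints SQLite pushed down; that is the hook where a PyTables-backed implementation could evaluate them with in-kernel queries instead of scanning.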
From: Josh M. <jos...@gm...> - 2012-04-11 19:41:19
|
There's been a motion [1] by the development team to drop support for HDF 1.6 in the next PyTables release. For details of the work done, see GitHub issue #105 [2]. Support for files written by 1.6 will be maintained, but if there are any other users who are stuck using the 1.6 HDF libraries, now would be a great time to speak up and outline your situation (including OS version, expected length of support for 1.6 that's needed, etc.). All feedback is welcome. ~Josh [1] https://groups.google.com/d/topic/pytables-dev/0Uhovr0lohc/discussion [2] https://github.com/PyTables/PyTables/issues/105 |
From: Francesc A. <fa...@py...> - 2012-04-03 02:05:03
|
On 4/2/12 5:11 PM, Daπid wrote: > I want to report an inaccuracy in this doc: > http://pytables.github.com/usersguide/condition_syntax.html#condition-syntax > > After listing the types supported in the search table.where: > > "Nevertheless, if the type passed is not among the above ones, it will > be silently upcasted, so you don’t need to worry too much about > passing supported types: just pass whatever type you want and the > interpreter will take care of it." > > But if I pass an unsigned 64-bit integer I get: > > NotImplementedError: variable ``N`` refers to a 64-bit unsigned > integer column, not yet supported in conditions, sorry; please use > regular Python selections > > Of course, that type cannot be upcasted to any of the listed ones, so you > truly cannot expect it to work; but the behaviour is not exactly what > it says in the paragraph. Yes, you are right. These small amendments to the docs are best dealt with via a PR. With GitHub this is easy to do, and it is also very convenient for maintainers to keep track of all these requests for improvement. Cheers, -- Francesc Alted |
From: Daπid <dav...@gm...> - 2012-04-02 22:11:40
|
I want to report an inaccuracy in this doc: http://pytables.github.com/usersguide/condition_syntax.html#condition-syntax After listing the types supported in the search table.where: "Nevertheless, if the type passed is not among the above ones, it will be silently upcasted, so you don’t need to worry too much about passing supported types: just pass whatever type you want and the interpreter will take care of it." But if I pass an unsigned 64-bit integer I get: NotImplementedError: variable ``N`` refers to a 64-bit unsigned integer column, not yet supported in conditions, sorry; please use regular Python selections Of course, that type cannot be upcasted to any of the listed ones, so you truly cannot expect it to work; but the behaviour is not exactly what it says in the paragraph. For the record: I am using the latest released version of PyTables, 2.3. Regards, David. |
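As the NotImplementedError suggests, the fallback for an unsupported column type is a regular Python selection. A sketch of what that looks like, with plain dicts standing in for Table rows (a real Table would be iterated the same way when `where()` refuses the column type):

```python
# Plain-Python stand-in for table rows; values above 2**63 - 1 do not
# fit a signed 64-bit integer, which is what trips up the condition
# machinery for unsigned 64-bit columns.
rows = [
    {"N": 2**63 + 5, "value": 1.5},
    {"N": 42,        "value": 2.5},
    {"N": 2**64 - 1, "value": 3.5},
]

threshold = 2**63
# the in-Python equivalent of a ``table.where('N > threshold')`` query
selected = [r["value"] for r in rows if r["N"] > threshold]
print(selected)   # [1.5, 3.5]
```

This is slower than an in-kernel query, since every row crosses into Python, but it works for any column type.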
From: Daπid <dav...@gm...> - 2012-04-02 21:13:52
|
People here say that it is killing the process: http://stackoverflow.com/questions/1261597/eclipsepydev-cleanup-functions-arent-called-when-pressing-stop So there is no way of fixing that, from any language. David. On Mon, Apr 2, 2012 at 10:55 PM, Francesc Alted <fa...@py...> wrote: > I don't know. I personally do not have experience with PyDev. If you > don't see the message about PyTables closing files, then there is a high > probability that it does not do that. In this case, your suggestion on > using try-except-finally block is your best bet, IMO. > > Francesc > > On 4/2/12 2:59 PM, Daπid wrote: >> I noticed that if a program raises an error, it shows a message >> indicating the file is closed, but it doesn't show anything if I >> terminate it from outside (in my case, stop from PyDev). >> >> Is it being flushed? Is there any way of doing that, apart from >> enveloping the whole program in a try-except-finally block? >> >> On Mon, Apr 2, 2012 at 9:48 PM, Francesc Alted<fa...@py...> wrote: >>> On 4/2/12 12:38 PM, Alvaro Tejero Cantero wrote: >>>> Hi, >>>> >>>> should PyTables flush on __exit__ ? >>>> https://github.com/PyTables/PyTables/blob/master/tables/file.py#L2164 >>>> >>>> it is not clear to me if a File.close() call results in automatic >>>> flushing all the nodes, since Node()._f_close() promises only "On >>>> nodes with data, it may be flushed to disk." >>>> https://github.com/PyTables/PyTables/blob/master/tables/node.py#L512 >>> Yup, it does flush. The message should be more explicit on this. >>> >>> -- >>> Francesc Alted >>> >>> >>> ------------------------------------------------------------------------------ >>> This SF email is sponsosred by: >>> Try Windows Azure free for 90 days Click Here >>> http://p.sf.net/sfu/sfd2d-msazure >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... 
>>> https://lists.sourceforge.net/lists/listinfo/pytables-users >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > Better than sec? Nothing is better than sec when it comes to > monitoring Big Data applications. Try Boundary one-second > resolution app monitoring today. Free. > http://p.sf.net/sfu/Boundary-dev2dev > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
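The try/finally pattern discussed in this thread looks like the following; a plain file object stands in for a PyTables File here, since only the control flow matters:

```python
import os
import tempfile

# A plain file stands in for a PyTables File object; the point is the
# control flow: flush/close run even if the work in between raises.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
f = open(path, "w")
try:
    f.write("important payload")    # the work that might raise
finally:
    f.flush()                       # push buffers out explicitly
    f.close()                       # close() also flushes, but be explicit

with open(path) as g:
    print(g.read())                 # important payload
```

As noted above, none of this helps against a hard kill of the interpreter (as PyDev's stop button does); try/finally only covers exceptions and normal exits within the process.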