From: Francesc A. <fa...@py...> - 2012-05-10 19:47:25
On 5/10/12 12:14 PM, Alvaro Tejero Cantero wrote:

> The graphical explanation of the different containers is masterly and,
> I believe, supersedes the table that we had talked about for the
> documentation.
>
> I think the schematics deserve a prominent place on the web page. They
> are a very good symbolic explanation of the basics of PyTables.

Glad that you like it. In fact I think you are right: this is perhaps the first time that schematics have been used for describing the basic objects in PyTables. And my impression from the talk yesterday is that people really get the gist of PyTables very quickly.

> As for the tables.Expr example of an in-kernel query,
>
>   [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
>
> now that there exists, thanks to Josh, a facility to obtain dataset
> sizes, perhaps some interesting things become possible.

I think you are mixing concepts here. tables.Expr is for out-of-core operations. I suppose you mean Numexpr here.

> a) I have always wondered why tables.Expr 'must' be used in an
> iterative context, i.e. pay the price of building the Python list,
> which is not the best container to iterate on afterwards. My
> explanation for it is that you don't know how big the result set will
> be, and thus want to avoid returning a big object in memory. But now
> it would be possible that, if the size of the columns involved fits in
> memory (or, let's say, a configurable fraction of the total RAM),
> PyTables returns a numpy mask or an index array, which are certainly
> very useful for further numpy work. A new function name could be
> provided for this functionality.

Hmm, the Table.where() iterator is very fast already (I can assure you that a lot of optimization and caching work is in there), but I agree that, for the indexed case, there would be situations where returning a mask or an index array would be better (read: faster).

> b) more generally, expanding on this, knowing the size of datasets and
> the available memory, PyTables could eventually decide whether to
> perform operations in memory or in kernel.

In-memory or in-kernel? You probably mean indexed or in-kernel, right? Yes, that's certainly another nice place for further optimizations.

--
Francesc Alted
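A minimal, self-contained sketch of the two query styles being compared here, the where() iterator versus an explicit index array, assuming the 2.x API of the era (openFile, getWhereList, readCoordinates); the file and column names are invented for illustration:

    import numpy as np
    import tables as pt

    h5 = pt.openFile('demo.h5', 'w')
    tbl = h5.createTable('/', 'points',
                         {'c1': pt.Int32Col(), 'c2': pt.Float64Col(),
                          'c3': pt.BoolCol()}, 'toy table')
    tbl.append(zip(range(10), np.linspace(0.0, 4.0, 10), [True] * 10))
    tbl.flush()

    # In-kernel query: iterate over matching rows, collect one column.
    selected = [r['c1'] for r in tbl.where('(c2 > 2.1) & (c3 == True)')]

    # Index-array alternative: fetch the matching coordinates first,
    # then read whole rows from them.
    coords = tbl.getWhereList('(c2 > 2.1) & (c3 == True)')
    rows = tbl.readCoordinates(coords)

    h5.close()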
From: Alvaro T. C. <al...@mi...> - 2012-05-10 17:14:43
The graphical explanation of the different containers is masterly and, I believe, supersedes the table that we had talked about for the documentation.

I think the schematics deserve a prominent place on the web page. They are a very good symbolic explanation of the basics of PyTables.

As for the tables.Expr example of an in-kernel query,

  [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]

now that there exists, thanks to Josh, a facility to obtain dataset sizes, perhaps some interesting things become possible:

a) I have always wondered why tables.Expr 'must' be used in an iterative context, i.e. pay the price of building the Python list, which is not the best container to iterate on afterwards. My explanation for it is that you don't know how big the result set will be, and thus want to avoid returning a big object in memory. But now it would be possible that, if the size of the columns involved fits in memory (or, let's say, a configurable fraction of the total RAM), PyTables returns a numpy mask or an index array, which are certainly very useful for further numpy work. A new function name could be provided for this functionality.

b) more generally, expanding on this, knowing the size of datasets and the available memory, PyTables could eventually decide whether to perform operations in memory or in kernel.

What do you think?

-á.

On Thu, May 10, 2012 at 5:10 PM, Anthony Scopatz <sc...@gm...> wrote:
> Thanks for sharing Francesc!
>
> On Thu, May 10, 2012 at 10:36 AM, Francesc Alted <fa...@py...> wrote:
>> Hey List,
>>
>> Just a few words to inform you that yesterday I gave a quite extensive
>> talk about PyTables at the Austin Python Meetup. I explained not only
>> the basics of PyTables but also its most advanced features
>> (compression, out-of-core computation and querying). People were quite
>> responsive and asked a lot of questions, especially on the compression
>> (Blosc) and query features.
>>
>> You can find the slides here:
>>
>> http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf
>>
>> Cheers,
>>
>> --
>> Francesc Alted
From: Anthony S. <sc...@gm...> - 2012-05-10 16:11:04
Thanks for sharing Francesc!

On Thu, May 10, 2012 at 10:36 AM, Francesc Alted <fa...@py...> wrote:
> Hey List,
>
> Just a few words to inform you that yesterday I gave a quite extensive
> talk about PyTables at the Austin Python Meetup. I explained not only
> the basics of PyTables but also its most advanced features
> (compression, out-of-core computation and querying). People were quite
> responsive and asked a lot of questions, especially on the compression
> (Blosc) and query features.
>
> You can find the slides here:
>
> http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf
>
> Cheers,
>
> --
> Francesc Alted
From: Francesc A. <fa...@py...> - 2012-05-10 15:36:39
Hey List,

Just a few words to inform you that yesterday I gave a quite extensive talk about PyTables at the Austin Python Meetup. I explained not only the basics of PyTables but also its most advanced features (compression, out-of-core computation and querying). People were quite responsive and asked a lot of questions, especially on the compression (Blosc) and query features.

You can find the slides here:

http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf

Cheers,

--
Francesc Alted
From: Anthony S. <sc...@gm...> - 2012-05-06 19:39:54
Hello Christian,

I would probably use the modifyCoordinates() <http://pytables.github.com/usersguide/libref.html#tables.Table.modifyCoordinates>, getWhereList() <http://pytables.github.com/usersguide/libref.html#tables.Table.getWhereList>, and readWhere() <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> methods of Table. Probably the safest way to ensure stability would be something like the following (though if you can make assumptions based on the layout of your data, you could make this faster):

    from itertools import product

    # Distinct values of the four key columns.
    ids = set(oldtable.cols.id)
    years = set(oldtable.cols.year)
    species = set(oldtable.cols.species)
    ages = set(oldtable.cols.age)

    for id, year, specie, age in product(ids, years, species, ages):
        # Parentheses are required around each comparison in a numexpr
        # condition, and {2!r} quotes the string-valued species.
        cond = "(id == {0}) & (year == {1}) & (species == {2!r}) & (age == {3})".format(
            id, year, specie, age)
        newrows = newtable.readWhere(cond)
        oldcoords = oldtable.getWhereList(cond)
        assert len(oldcoords) == len(newrows)
        oldtable.modifyCoordinates(oldcoords, newrows)

As I said, there are probably faster ways, but this would certainly work. If you needed to, you could always sort newrows too.

I hope this helps!

Be Well
Anthony

On Sun, May 6, 2012 at 1:57 PM, PyTables Org <pyt...@go...> wrote:
> Forwarding,
> ~Josh
>
> On Apr 30, 2012, at 5:04 PM, pyt...@li... wrote:
>> From: Christian Werner <fre...@go...>
>> Subject: How would I update specific rows from a table with rows of a
>> second table?
>>
>> [...]
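One possible instance of the "faster ways" hinted at above: loop over the (much smaller) new table instead of sweeping the full Cartesian product of key values. This is only a sketch; it assumes the column names from the question and that the four key columns identify exactly one row:

    for newrow in newtable.iterrows():
        cond = "(id == {0}) & (year == {1}) & (species == {2!r}) & (age == {3})".format(
            newrow['id'], newrow['year'], newrow['species'], newrow['age'])
        coords = oldtable.getWhereList(cond)  # one coordinate per unique key
        oldtable.modifyCoordinates(coords,
                                   newtable.readCoordinates([newrow.nrow]))
    oldtable.flush()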
From: PyTables O. <pyt...@go...> - 2012-05-06 18:57:47
Forwarding,
~Josh

On Apr 30, 2012, at 5:04 PM, pyt...@li... wrote:

> From: Christian Werner <fre...@go...>
> Date: April 30, 2012 5:03:59 PM GMT+02:00
> To: pyt...@li...
> Subject: How would I update specific rows from a table with rows of a second table?
>
> Hi group.
>
> Please consider this scenario. I have produced a large h5 file holding the outcome of some simulations (the hdf file holds some groups and about 5 tables, but for my question this does not matter). Now I had to rerun my simulation, but only for certain fields, and thus I have a second h5 file with the same structure but only for certain fields of the original file (old age classes, see below).
>
> Or to be more verbose:
>
> Table (original)
> id  year  species  age  data1  data2  data3 ...
> 1   1990  spruce    75  1.2    3.2    3.3   ...
> 1   1991  spruce    75  1.3    3.1    2.2   ...
> ...
> 1   1990  spruce   125  1.1    2.1    1.5   ...
> ...
> 1   1990  pine     145  1.1    2.1    1.5   ...
> ...
> 2   1990  spruce    45  1.2    3.2    3.3   ...
> 2   1991  spruce    55  1.3    3.1    2.2   ...
> ...
>
> I had to rerun my simulations for old vegetation classes, so this table only contains the following lines (e.g., spruce age > 80):
>
> Table (new)
> id  year  species  age  data1  data2  data3 ...
> ...
> 1   1990  spruce   125  1.1    2.1    1.5   ...
> ...
> 1   1990  pine     145  1.1    2.1    1.5   ...
>
> So basically I need to use the rows of the new table and overwrite the corresponding rows of the old table to form the updated one. I do need to retain the original order (sort order id > year > species > age). Those columns combined give a unique row identifier...
>
> The final table would have the same size and order as the original table, with only the rows containing data for old age classes updated with the new simulation results...
>
> Can anyone give me a hint how to achieve this?
>
> I guess I need to introduce a new (unique index) column first in both tables (e.g., combining id + year + species + age so I can match those rows by this identifier), right? Does anyone have some example code for me?
>
> Problem is, the 2 h5 files are about 7 GB each.
>
> Cheers,
> Christian
From: Alvaro T. C. <al...@mi...> - 2012-05-02 06:13:37
Hi,

Just for clarification: when you say "simultaneously", do you mean 'all three tables concatenated in the order given in the list' or 'one curve per table, all three in the same plot'? I am going to assume the former.

From the point of view of organizing your data, you really do have the option of concatenating them on disk, because queries and slicing off disk are so fast (you can store indexes to the relevant slices in another table -- my preferred solution -- or dedicate a column to indicate whether a sample belongs in 1, 2, or 3). This is assuming that the file boundary can be smashed, which it probably can't, from the way you formulate your question.

Beware that if you have long time series to plot (i.e. what is your sampling rate?) the bottleneck may be in matplotlib, which may be too slow for the rendering. In that case what you probably want is to couple the panning events to lazy loading of preceding/succeeding data chunks (I would also be interested in this, btw, if anybody has a recipe).

My 2c,

-á.

On Wed, May 2, 2012 at 2:25 AM, le...@cn... <le...@gm...> wrote:
> I am just starting to work with python and pytables and have a question
> for the group.
>
> We have multiple HDF5 files, each of which contains 30 seconds of data
> from a longer sequence. Rather than merge the data, we'd like to be able
> to create a list (or array) of pytables objects and operate on them
> simultaneously.
>
> [...]
>
> but this doesn't work. Is there a syntax that will work for this?
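A minimal sketch of the "concatenate and plot" option assumed above: pull the same array out of several files and join them end to end before plotting. The file names are invented; the node path follows the question below:

    import numpy as np
    import matplotlib.pyplot as plt
    import tables

    fnames = ['seg1.h5', 'seg2.h5', 'seg3.h5']  # hypothetical names
    all_data = [tables.openFile(f) for f in fnames]

    # One array per file, joined in list order.
    elev = np.concatenate([f.root.photon.channel001.elev[:] for f in all_data])
    plt.plot(elev)
    plt.show()

    for f in all_data:
        f.close()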
From: Anthony S. <sc...@gm...> - 2012-05-02 02:04:45
Hello M,

On Tue, May 1, 2012 at 9:25 PM, le...@cn... <le...@gm...> wrote:
> but this doesn't work. Is there a syntax that will work for this?

The short answer is "No." Basically, right now you have to write your own wrapper class which knows how to dispatch various operations in the correct way. This is something that a few of us are looking into writing. You can imagine how this can get rather complicated in more general cases (parallel operations on a distributed file system). If you are interested in helping out on such infrastructure, or have something that you want to share, please let us know!

For now -- and what I currently do -- using lots of list comprehensions and map() will get you 90% of the way there in a serial environment.

Though it doesn't use PyTables, and I have trouble getting it working, and it is mostly for visualization, you may want to look into VisIt if you have to have an out-of-the-box solution.

Be Well
Anthony
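A toy sketch of the kind of wrapper class mentioned above, dispatching one node read across a list of open files. The class and file names are invented; getNode() is the 2.x spelling:

    import tables

    class MultiFile(object):
        """Dispatch the same node lookup across several open HDF5 files."""
        def __init__(self, filenames):
            self.files = [tables.openFile(f) for f in filenames]

        def read(self, path):
            # One numpy array per file, in list order.
            return [f.getNode(path).read() for f in self.files]

        def close(self):
            for f in self.files:
                f.close()

    # Usage sketch:
    # mf = MultiFile(['seg1.h5', 'seg2.h5', 'seg3.h5'])
    # for elev in mf.read('/photon/channel001/elev'):
    #     plt.plot(elev)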
From: <le...@cn...> - 2012-05-02 01:25:27
I am just starting to work with python and pytables and have a question for the group.

We have multiple HDF5 files, each of which contains 30 seconds of data from a longer sequence. Rather than merge the data, we'd like to be able to create a list (or array) of pytables objects and operate on them simultaneously. For instance, we have three h5 data files, each of which can be opened with the openFile command:

  data_1 = tables.openFile(fname_1)
  plt.plot(data_1.root.photon.channel001.elev[:])

If we define the following list:

  all_data = [data_1, data_2, data_3]

then we can do the following:

  plt.plot(all_data[0].root.photon.channel001.elev[:])

What we'd like to do is plot the data from all three (or more generally, N) hdf5 files simultaneously:

  plt.plot(all_data[0].root.photon.channel001.elev[:])

but this doesn't work. Is there a syntax that will work for this?

M
--
Michael Lefsky
Center for Ecological Applications of Lidar
College of Natural Resources
Colorado State University
http://www.researcherid.com/rid/A-7224-2009

If I were creating the world I wouldn't mess about with butterflies and daffodils. I would have started with lasers, eight o'clock, Day One! - Time Bandits
From: Alvaro T. C. <al...@mi...> - 2012-05-01 08:52:13
Ok, I think I know what happened: during interactive manipulation, at some point I must have forgotten to add '[:]' when assigning to the column, i.e.

  mytable.cols.mycolumn = somearray

instead of

  mytable.cols.mycolumn[:] = somearray

This is related to https://github.com/PyTables/PyTables/issues/145

In-memory assignments can shadow access to the object in the file. IMHO this should not be allowed (in fact, why not make the first assignment behave like the second?).

-á.

On Mon, Apr 30, 2012 at 20:24, Alvaro Tejero Cantero <al...@mi...> wrote:
> I am now on another computer (no access to the one where I reported
> the problem until tomorrow; uses 2.3.1 AFAIK) and here I have the
> expected behaviour (2.3.1 as well). I hope that I made a mistake;
> will update tomorrow.
>
> [...]
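For reference, a sketch of two spellings that do push a whole column to disk under the 2.x API, modifyColumn() being the explicit one; the names mytable/somearray/mycolumn follow the message above:

    # Slice assignment updates the column on disk:
    mytable.cols.mycolumn[:] = somearray

    # Equivalent explicit call:
    mytable.modifyColumn(column=somearray, colname='mycolumn')
    mytable.flush()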
From: Alvaro T. C. <al...@mi...> - 2012-04-30 19:24:57
I am now on another computer (no access to the one where I reported the problem until tomorrow; uses 2.3.1 AFAIK) and here I have the expected behaviour (2.3.1 as well). I hope that I made a mistake; will update tomorrow.

-á.

On Mon, Apr 30, 2012 at 20:09, Francesc Alted <fa...@py...> wrote:
> On 4/30/12 12:08 PM, Alvaro Tejero Cantero wrote:
>> Hi all,
>>
>> I created a table:
>>
>>>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
>>
>> I populated it:
>>
>>>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))
>>
>> Now if I do
>>
>>>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)
>>
>> I can see the updated value (all ones instead of zeros).
>>
>> If I fetch all the table in memory,
>>
>>>>> joins.root.spikes[:]
>>
>> I still see only the original zeros, i.e. the column update has not propagated.
>
> Hmm, I cannot reproduce this:
>
> [...]
>
> Which versions are you using?
>
> --
> Francesc Alted
From: Francesc A. <fa...@py...> - 2012-04-30 19:09:12
On 4/30/12 12:08 PM, Alvaro Tejero Cantero wrote:
> Hi all,
>
> I created a table:
>
>>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
>
> I populated it:
>
>>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))
>
> Now if I do
>
>>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)
>
> I can see the updated value (all ones instead of zeros).
>
> If I fetch all the table in memory,
>
>>>> joins.root.spikes[:]
>
> I still see only the original zeros, i.e. the column update has not propagated.

Hmm, I cannot reproduce this:

In [18]: joins = pt.openFile("joins.h5", "w")

In [19]: joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
Out[19]:
/spikes (Table(0,)) 'Spike times'
  description := {
  "t20k": Int32Col(shape=(), dflt=0, pos=0),
  "tetrode": UInt8Col(shape=(), dflt=0, pos=1),
  "unit": UInt8Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (10922,)

In [20]: joins.root.spikes.append(zip(np.arange(10),np.zeros(10), 3*np.ones(10)))

In [22]: joins.root.spikes[:]
Out[22]:
array([(0, 0, 3), (1, 0, 3), (2, 0, 3), (3, 0, 3), (4, 0, 3), (5, 0, 3),
       (6, 0, 3), (7, 0, 3), (8, 0, 3), (9, 0, 3)],
      dtype=[('t20k', '<i4'), ('tetrode', 'u1'), ('unit', 'u1')])

In [23]: joins.root.spikes.cols.tetrode[:] = np.ones(10)

In [24]: joins.root.spikes[:]
Out[24]:
array([(0, 1, 3), (1, 1, 3), (2, 1, 3), (3, 1, 3), (4, 1, 3), (5, 1, 3),
       (6, 1, 3), (7, 1, 3), (8, 1, 3), (9, 1, 3)],
      dtype=[('t20k', '<i4'), ('tetrode', 'u1'), ('unit', 'u1')])

Which versions are you using?

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-04-30 17:09:20
Hi all,

I created a table:

>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')

I populated it:

>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))

Now if I do

>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)

I can see the updated value (all ones instead of zeros).

If I fetch all the table in memory,

>>> joins.root.spikes[:]

I still see only the original zeros, i.e. the column update has not propagated.

If I add another 100 rows like above and check joins.root.spikes.cols.tetrode, then I only see the original ones, and not the added zeros.

In between I .flush() table and file abundantly; that changes nothing.

I did this following the instructions in http://www.pytables.org/moin/HintsForSQLUsers (search for the line: tbl.cols.temperature[6:13:3] = cols[0]).

Is this the intended behaviour? It's driving me crazy. However,

>>> joins.root.spikes.col('tetrode')

sees the same as joins.root.spikes[:], but cannot be assigned to.

Álvaro.

-á.
From: Francesc A. <fa...@py...> - 2012-04-29 22:41:06
On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the
>> blocking technique (in the sense of [1]). For a compressor that works
>> with blocks, you need to add some metainformation for each block, and
>> that takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB
>> blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
>
> Absolutely!
>
> Blocking seems a good approach for most data, where the a priori many
> possible values degrade very fast the potential compression gains of a
> run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x penalty in storage,
> would be an excellent candidate for RLE. Boolean arrays are also an
> interesting way to encode attributes by 'bit-vectors': instead of
> storing an enum column 'car color' with values in {red, green, blue},
> you store three boolean arrays 'red', 'green', 'blue'. Where this gets
> interesting is in allowing more generality, because you don't need a
> taxonomy, i.e. red and green need not be exclusive if they are tags on
> a genetic sequence (or, in my case, an electrophysiological recording).
> To compute ANDs and ORs you just have to perform the corresponding
> bit-wise operations if you reconstruct the bit-vector, or you can use
> some smart algorithm on the intervals themselves (as mentioned in
> another mail, I think R*-trees or Nested Containment Lists are two
> viable candidates).
>
> I don't know whether it's possible to have such a specialization for
> compression of boolean arrays in PyTables. Maybe a simple, alternative
> route is to make the chunklength dependent on the likelihood of
> repeated data (i.e. the range of the type domain), or at the very
> least to special-case chunklength estimation for booleans to be
> somewhat higher than for other datatypes. This, again, I think is an
> exception that would do justice to the main use case of PyTables.

Yes, I think you raised a good point here. Well, there are quite a few possibilities to reduce the space taken by highly redundant data, and the first should be to add a special case in Blosc so that, before passing control to blosclz, it first checks whether the data is identical across the whole block, and if so, collapses everything to a counter and a value. This would require a bit more CPU effort at compression time (so it could be active only at higher compression levels), but would lead to far better compression ratios.

Another possibility is to add code to deal directly with compressed data, but that should be done more at the PyTables (or carray, the package) level, with some help from the Blosc compressor. In particular, it would be very interesting to implement interval algebra on top of such extremely compressed interval data.

>> Yup, agreed. Don't know what to do here. carray was more a
>> proof-of-concept than anything else, but if development for it
>> continues in the future, I should ponder changing the names.
>
> It's a neat package and I hope it gets the appreciation and support it
> deserves!

Thanks, I also think it can be useful for some situations. But before it gets wider use, more work should be put into the range of operations supported. Also, defining a C API and being able to use it straight from C could help to spread package adoption too.

--
Francesc Alted
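To put ratios like the 350-to-1 above in context, a sketch of how such a measurement could be taken with a Blosc-compressed CArray. This assumes the size_on_disk facility mentioned elsewhere in this thread; the file name is invented:

    import numpy as np
    import tables as pt

    mask = np.zeros(10**8, dtype=bool)
    mask[10**6:2 * 10**6] = True
    mask[5 * 10**7:5 * 10**7 + 10**6] = True

    h5 = pt.openFile('mask.h5', 'w')
    ca = h5.createCArray('/', 'mask', pt.BoolAtom(), mask.shape,
                         filters=pt.Filters(complevel=5, complib='blosc'))
    ca[:] = mask
    h5.flush()
    print(ca.size_on_disk)  # far smaller than the 10**8 bytes in memory
    h5.close()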
From: Anthony S. <sc...@gm...> - 2012-04-28 23:40:51
Hello Alvaro,

On Sat, Apr 28, 2012 at 5:54 AM, Alvaro Tejero Cantero <al...@mi...> wrote:
> a) what is the reason to bind methods such as createTable & so on to
> the File object instead of putting the respective functions on the
> tables module?

I think the idea is that data set creation only ever happens on the file object or on group objects. Since this task is well encapsulated by such objects, and since you would always have to pass these file/group objects in to the functions you propose, we just follow this common object-oriented design pattern here.

> b) why is it necessary to explicitly dereference links via __call__?

I don't use links, so I'll defer to someone who knows ;)

Be Well
Anthony
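A short sketch of the pattern described above: creation always goes through the file handle, with 'where' given either as a path string or as a Group object (file and table names invented):

    import tables

    h5 = tables.openFile('api.h5', 'w')
    grp = h5.createGroup('/', 'raw')

    # 'where' as a path string:
    t1 = h5.createTable('/raw', 't1', {'x': tables.Int32Col()})
    # 'where' as a Group object:
    t2 = h5.createTable(grp, 't2', {'x': tables.Int32Col()})

    h5.close()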
From: Alvaro T. C. <al...@mi...> - 2012-04-28 10:54:45
Hi,

There are two things about the design of the PyTables API that I don't understand:

a) What is the reason to bind methods such as createTable & so on to the File object, instead of putting the respective functions in the tables module?

Rationale: tables.createTable(where*, ...) could do the same job, where* being 'where' prepended with the file path or a group object. This frees the namespace so as to have myfile.mydataset or myfile.mygroup available without needing to go through root.

b) Why is it necessary to explicitly dereference links via __call__?

Rationale: if it worked like 'mount', then your applications would not need to know whether a given dataset is local to the file or 'mounted' from another file. As it is, the physical design of your database escalates up to the application code, i.e. the latter depends on how you arrange your tables across files.

Derived question: what is the best way to achieve this transparency in application code? E.g. is it a good idea to capture the exceptions that occur when indexing mytable.mylink[:], and try mytable.mylink()[:] before really giving up?

Cheers,

-á.
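For context, a sketch of the explicit dereference in question, using the 2.3-era link API (file and node names invented):

    import tables

    h5 = tables.openFile('links.h5', 'w')
    arr = h5.createArray('/', 'a', range(10))
    h5.createSoftLink('/', 'alink', '/a')

    # Natural naming returns the link object itself; __call__
    # dereferences it to the target node:
    data = h5.root.alink()[:]

    h5.close()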
From: Alvaro T. C. <al...@mi...> - 2012-04-26 11:47:17
Hi Umit,

Thanks for commenting.

> I do think that scientific/hierarchical file formats like HDF5 and
> RDBMS systems have their specific use cases and I don't think it makes
> sense to replace one with the other.

Right. But data modeling is notoriously difficult to get right when complex and changing schemas are involved, and here we have three paradigms involved (relational, object, hierarchical), so I am trying to spell out what is different. In my particular case the DB holds both recorded immutable data and analyses that are continuously created by expensive functions and want to be cached and tightly integrated with the experimental recordings.

> I do also think that you shouldn't try to apply RDBMS principles to
> HDF5-like formats, and also vice versa. Working with NoSQL DBs
> (key/value stores) or graph DBs is different than working with an
> RDBMS. The same applies to HDF5 and RDBMS.

Got it. My e-mail is about pinpointing those differences.

> PyTables introduces some RDBMS-like concepts (tables) but in the end
> it is based on HDF5, and to get the best performance you have to have
> some knowledge about the underlying HDF5 file structure and its
> concepts. The way you should store data in HDF5 really depends on how
> you will access it (if you want to get the best performance).
> Sometimes this means that you have to store the data redundantly in
> two different ways if you have two orthogonal ways to access it.

This is a clever comment that I find important to include in the document. This redundancy is unacceptable in a relational context because things may get out of sync very fast, especially in a multi-user context (which is not supported by HDF5), but here in PyTables, where columns are normally written by a single user in a single operation, it is less of a problem. So having different 'views' of the data in different parts of the tree is a pattern of usage in HDF5 that does not cause many problems, whereas it is strongly discouraged in SQL. On the other hand, there is no analog of stored queries in the form of views for HDF5, which complicates the independence of physical and logical data layout.

Does all of the above make sense?

> However nobody said you can't combine these different storage systems.
> For example you can use an RDBMS to store meta-information and make
> use of relationships, constraints, foreign keys, etc., but store the
> "raw data" (that is not suited for an RDBMS) in HDF5 or PyTables
> respectively, and just relate them by using unique identifiers.

Yes, this is what many people are doing.

> Some databases like PostgreSQL even support retrieving data from
> non-SQL sources (flat files, XML, etc.)
> (http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers) which
> might be of interest.

Yes, we had a discussion about a similar mechanism with SQLite's 'virtual tables' a few days ago.

> P.S.: AFAIK PostgreSQL supports schemas, which might be comparable to
> groups in HDF5.

Would you care to elaborate a bit more on that?

Cheers,

Álvaro.

> On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
>> Hello list,
>>
>> The relational model has a strong foundation, and I have spent a few
>> hours thinking about what in PyTables is structurally different from
>> it. Here are my thoughts. I would be delighted if you could
>> add/comment/correct these ideas. This could eventually help people
>> with a relational/SQL background who want to know how to best use the
>> idioms of PyTables for their data modeling.
>>
>> ---
>>
>> I make a distinction between relational and SQL (see C.J. Date's "SQL
>> and Relational Theory" for more on that).
>>
>> From a purely structural point of view, the following differences are
>> apparent:
>>
>> 1. Relations vs. sequences. Relations are sets (i.e. not ordered) of
>> tuples (again, not ordered).
>>
>>    - rows: In PyTables, every container has an implicit row number,
>>      meaning there is always a candidate key and order matters.
>>      Although strictly an implementation-level concern, row numbers
>>      in PyTables are not stored but computed, thanks to the on-disk
>>      adjacency of the records. This is important for large datasets,
>>      where storing row numbers would mean roughly a doubling of disk
>>      space.
>>    - columns: In PyTables, columns are ordered. That is not the case
>>      in a purely relational system, but it is the case in SQL.
>>
>> 2. Flat tablespace vs. hierarchical tablespace. SQL tables live in a
>> global namespace. PyTables objects can be put inside Groups. Each
>> approach can be mapped onto the other by name mangling. Groups in
>> PyTables are like tables of tables -- for each node in a group there
>> is a full table (or another group...). This introduces a possible
>> ambiguity in data modeling:
>>
>> Consider a table of car parts: one column is Part ID and the other is
>> Model ID, indicating in which car models a particular part is built.
>> In PyTables you can construct that single table /or/ create a /models
>> group and create one table per model consisting of a single column of
>> Part IDs, e.g. /models/sedan, /models/cabrio... etc. The same is
>> possible in a relational setting (dividing the tables according to
>> one attribute, and naming them according to the attribute value, e.g.
>> model_sedan, model_cabrio...). The defining difference is that in SQL
>> the interface to manipulate either listing is the same (it is a
>> table), whereas in PyTables one listing is a Table object and the
>> other is a list of Nodes, and the API for each is a bit different.
>>
>> 3. Attributes of tables and integrity. Any Node (Groups and Tables
>> included) can receive a limited amount of metadata in PyTables, via
>> the attached attribute set. In SQL, metadata is limited to some
>> keywords, used upon table creation, that establish constraints on the
>> columns with a functional significance. A prime example of this is
>> identifying foreign keys. SQL allows this information to be used
>> structurally at the time of joins, whereas in PyTables one is free to
>> implement this or any other navigational scheme in a customized way
>> using the attributes of the table.
>>
>> When designing such a scheme it has to be remembered that PyTables
>> tables always have an implicit column containing the row numbers, and
>> this is likely to be used as a key.
>>
>> NOTE: I intentionally excluded implementation issues whenever they
>> are not related to structural ones, e.g. SQL tables do not play well
>> with NumPy containers and are thus ill-suited for big data with
>> Python. Another example would be all the features related to
>> transactions/concurrency and authorization, which are orthogonal to
>> the data model.
>>
>> -á.
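As a toy illustration of the modeling ambiguity described in the quoted write-up, a sketch of the two layouts for the car-parts example (invented names):

    import tables as pt

    h5 = pt.openFile('parts.h5', 'w')

    # Option 1: one table, with the model as an ordinary column.
    h5.createTable('/', 'parts',
                   {'part_id': pt.Int32Col(), 'model': pt.StringCol(16)})

    # Option 2: a /models group holding one single-column table per model.
    models = h5.createGroup('/', 'models')
    for name in ('sedan', 'cabrio'):
        h5.createTable(models, name, {'part_id': pt.Int32Col()})

    h5.close()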
From: Alvaro T. C. <al...@mi...> - 2012-04-26 11:36:50
Hi Anthony,

On Wed, Apr 25, 2012 at 23:19, Anthony Scopatz <sc...@gm...> wrote:
> Hello Alvaro,
>
> Thanks for writing this up. I think this would go nicely in our docs
> if you are willing to let us add it ;).

Of course! Let us polish it together -- maybe you want to add it then to the existing document at http://www.pytables.org/moin/HintsForSQLUsers?

> My one comment is that in your NOTE you say that "SQL tables do not
> play well with Numpy containers." I think that this would be better
> phrased as saying that when converting SQL tables to/from NumPy record
> arrays or PyTables Tables you are adding or removing the ordering of
> rows and columns. Additionally, going from NumPy / PyTables to SQL
> requires that you perform a unique() or set() operation on the rows.

What you say would be correct if SQL didn't allow duplicate rows (i.e. it is correct for the pure relational system). But I was thinking, rather (and expressing it rather informally, because it was not the objective of the document), of what happens when you load an array through the DB-API: it has to be done through an iterator, building first a Python list and then converting to a recarray, which happens to be a major bottleneck even if the underlying database is fast. Francesc did a comprehensive study here: http://mail.scipy.org/pipermail/numpy-discussion/2006-November/024732.html. I am not sure if it is completely current, but since the main problem is with the DB-API's 'casts', it is unlikely that the situation is better for other RDBMSs.

Also, loading data in bulk fashion from numpy containers is not completely straightforward (you have to create the right binary representation): http://stackoverflow.com/questions/8144002/use-binary-copy-table-from-with-psycopg2

> So it isn't that they *can't* play nicely together, but rather you
> have to understand how they do. Thanks again.
>
> Be Well
> Anthony
>
> [...]
From: Alvaro T. C. <al...@mi...> - 2012-04-26 10:47:53
Hi,

I tried again, also with different chunklens, and couldn't reproduce it. Unfortunately the session where I had this result was killed by a power outage, and the history buffer does not go back that far, so I can't find out what exactly triggered it.

-á.

On Thu, Apr 26, 2012 at 04:10, Francesc Alted <fa...@py...> wrote:
> On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
>> Hi,
>>
>> Thanks for the clarification.
>>
>> I retried today both with a normal and a completely sorted index on a
>> blosc-compressed table (complevel 5) and could not reproduce the
>> putative bug either.
>
> So could you please confirm if you can reproduce the problem with
> blosc level 9?
>
> Thanks!
>
>> On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
>>> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
>>>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
>>>>> Some complementary info (I copy the details of the tables below)
>>>>>
>>>>> timeit vals = numpy.fromiter((x['val'] for x in
>>>>> my.root.raw.t0.wtab02.where('val>1')), dtype=np.int16)
>>>>> 1 loops, best of 3: 30.4 s per loop
>>>>>
>>>>> Using the compressed and indexed version, it mysteriously does not
>>>>> work (the output is an empty array):
>>>>>
>>>>>>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
>>>>>>>> cvals
>>>>> array([], dtype=int16)
>>>>
>>>> This smells like a bug, but I cannot reproduce it. Could you send a
>>>> self-contained example reproducing this behavior?
>>>
>>> I am not able to reproduce this either...
From: Alvaro T. C. <al...@mi...> - 2012-04-26 09:05:29
|
On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fa...@py...> wrote:
> On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
>> Hi, a minor update on this thread
>>
>>>> * a bool array of 10**8 elements with True in two separate slices of
>>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>>> array). The resulting filesize is 248kb, still far from storing the 4
>>>> or 6 integer indexes that define the slices (I am experimenting with
>>>> an approach for scientific databases where this is a concern).
>>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>>> apparently a 350 to 1 is not enough? :)
>> Here I expected more from a run-length-like compression scheme. My
>> array would be compressible to the following representation:
>>
>> (0, x) : 0
>> (x, x+10**6) : 1
>> (x+10**6, y) : 0
>> (y, y+10**6) : 1
>> (y+10**6, 10**8) : 0
>>
>> or just:
>>
>> (x, x+10**6) : 1
>> (y, y+10**6) : 1
>>
>> where x and y are two reasonable integers (i.e. in range and with no
>> overlap).
>
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]). For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

Absolutely! Blocking seems a good approach for most data, where the large
number of possible values quickly erodes the potential gains of a
run-length-encoding (RLE) based scheme. But boolean arrays, which are used
extremely often as masks in scientific applications and already suffer an
8x storage penalty, would be an excellent candidate for RLE.

Boolean arrays are also an interesting way to encode attributes as
'bit-vectors': instead of storing an enum column 'car color' with values
in {red, green, blue}, you store three boolean arrays 'red', 'green',
'blue'. Where this gets interesting is in allowing more generality,
because you don't need a taxonomy -- red and green need not be exclusive
if they are tags on a genetic sequence (or, in my case, an
electrophysiological recording). To compute ANDs and ORs you just perform
the corresponding bit-wise operations if you reconstruct the bit-vectors,
or you can use some smart algorithm on the intervals themselves (as
mentioned in another mail, I think R*-trees and Nested Containment Lists
are two viable candidates). A minimal sketch of the bit-vector idea is at
the end of this message.

I don't know whether it is possible to have such a specialization for
compression of boolean arrays in PyTables. Maybe a simple alternative
route is to make the chunk length dependent on the likelihood of repeated
data (i.e. the range of the type domain), or at the very least to
special-case chunk-length estimation for booleans to be somewhat higher
than for other datatypes. This, again, is an exception that I think would
do justice to the main use case of PyTables.

>>>> * how blosc chooses the chunklen is black magic for me, but it seems
>>>> to be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>>> 64*1024 when CArraying only one row).
>>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>>> you detail a bit more how you achieve this result? Providing an
>>> example would be very useful.
>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen).
>>
>> This is the offender:
>> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>> cparams := cparams(clevel=5, shuffle=True)
>>
>> In [87]: x.chunklen
>> Out[87]: 1
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (in PyTables, this is 262144)
>
> Ah yes, this is it. The carray package is not as sophisticated as HDF5,
> and it only blocks in the leading dimension. In this case, it is saying
> that the block is a complete row. So this is the intended behaviour.

Ok, it makes sense, and in my particular use case the rows do fit in
memory, so there is no need for further chunking.

>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed. Don't know what to do here. carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder changing the names.

It's a neat package and I hope it gets the appreciation and support it
deserves!

Cheers,

Álvaro.
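P.S. A minimal sketch of the bit-vector tagging idea in plain NumPy
(sizes and names are illustrative; the last two lines assume the carray
package's wheretrue() iterator mentioned above):

import numpy as np
import carray as ca

n = 10**6                        # number of samples (illustrative)
red = np.zeros(n, dtype=bool)    # one boolean 'tag' array per attribute
green = np.zeros(n, dtype=bool)
red[1000:2000] = True            # tag two overlapping stretches
green[1500:2500] = True

both = red & green               # tags combine with plain bit-wise ops
either = red | green
idx = np.nonzero(both)[0]        # indices carrying both tags

cboth = ca.carray(both)          # compressed counterpart of the mask
cidx = list(cboth.wheretrue())   # same indices via carray's iterator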
From: Ümit S. <uem...@gm...> - 2012-04-26 08:16:43
|
Good points. Just some additional comments:

I do think that scientific/hierarchical file formats like HDF5 and RDBMS
systems have their specific use cases, and I don't think it makes sense
to replace one with the other. I also think that you shouldn't try to
apply RDBMS principles to HDF5-like formats, and vice versa. Working with
NoSQL DBs (key/value stores) or graph DBs is different from working with
an RDBMS. The same applies to HDF5 and RDBMS. PyTables introduces some
RDBMS-like concepts (tables), but in the end it is based on HDF5, and to
get the best performance you have to have some knowledge of the
underlying HDF5 file structure and its concepts.

The way you should store data in HDF5 really depends on how you will
access it (if you want the best performance). Sometimes this means that
you have to store the data redundantly, in two different layouts, if you
have two orthogonal ways of accessing it.

However, nobody said you can't combine these different storage systems.
For example, you can use an RDBMS to store meta-information and make use
of relationships, constraints, foreign keys, etc., but store the "raw
data" (which is not suited to an RDBMS) in HDF5 or PyTables respectively,
and relate the two through a unique identifier (see the sketch at the end
of this message). Some databases like PostgreSQL even support retrieving
data from non-SQL sources (flat files, XML, etc.)
(http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers), which
might be of interest.

cheers
Ümit

P.S.: AFAIK PostgreSQL supports schemas, which might be comparable to
groups in HDF5.

On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
> Hello list,
>
> The relational model has a strong foundation and I have spent a few
> hours thinking about what in PyTables is structurally different from
> it. Here are my thoughts. I would be delighted if you could
> add/comment/correct on these ideas.
> [...]
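A minimal sketch of that combination, with sqlite3 holding the relational
metadata and PyTables the bulk samples (PyTables 2.x API; all file, table
and node names are illustrative):

import sqlite3
import tables

# Relational side: metadata, relationships, constraints.
conn = sqlite3.connect('meta.db')
conn.execute("CREATE TABLE IF NOT EXISTS recordings "
             "(id INTEGER PRIMARY KEY, subject TEXT, node TEXT)")

# HDF5 side: the bulk numerical data, chunked and compressed.
h5 = tables.openFile('raw.h5', 'a')
arr = h5.createCArray(h5.root, 'rec42', tables.Int16Atom(),
                      shape=(64, 2000000),
                      filters=tables.Filters(complevel=5, complib='blosc'))

# The unique identifier ties a metadata row to an HDF5 node path.
conn.execute("INSERT INTO recordings VALUES (42, 'subj01', '/rec42')")
conn.commit()

# Later: resolve the node path in SQL, then read a slab from HDF5.
(path,) = conn.execute("SELECT node FROM recordings WHERE id=42").fetchone()
slab = h5.getNode(path)[:, :1000]

h5.close()
conn.close()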
From: Francesc A. <fa...@py...> - 2012-04-26 03:10:22
|
On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
> Hi,
>
> Thanks for the clarification.
>
> I retried today both with a normal and a completely sorted index on a
> blosc-compressed table (complevel 5) and could not reproduce the
> putative bug either.

So could you please confirm whether you can reproduce the problem with
blosc level 9? A self-contained script along the lines of the sketch at
the end of this message would be a good starting point. Thanks!

> -á.
>
> On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
>> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
>>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
>>>> Some complementary info (I copy the details of the tables below)
>>>>
>>>> timeit vals = numpy.fromiter((x['val'] for x in
>>>> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
>>>> 1 loops, best of 3: 30.4 s per loop
>>>>
>>>> Using the compressed and indexed version, it mysteriously does not
>>>> work (output is an empty array):
>>>>>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
>>>>>>> dtype=np.int16)
>>>>>>> cvals
>>>> array([], dtype=int16)
>>> This smells like a bug, but I cannot reproduce it. Could you send a
>>> self-contained example reproducing this behavior?
>>
>> I am not able to reproduce this either...

-- 
Francesc Alted
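For reference, an untested sketch of such a script (PyTables 2.x API;
sizes and names are illustrative):

import numpy as np
import tables

f = tables.openFile('repro.h5', 'w')
filters = tables.Filters(complevel=9, complib='blosc')
t = f.createTable('/', 'wctab', {'val': tables.Int16Col()},
                  filters=filters)

# Fill the table with random small integers.
data = np.zeros(10**6, dtype=[('val', np.int16)])
data['val'] = np.random.randint(-5, 5, 10**6)
t.append(data)
t.flush()

t.cols.val.createCSIndex()   # completely sorted index on 'val'

vals = np.fromiter((r['val'] for r in t.where('val > 1')),
                   dtype=np.int16)
print(len(vals))             # should equal (data['val'] > 1).sum()
f.close()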
From: Francesc A. <fa...@py...> - 2012-04-26 03:07:54
|
On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
> Hi, a minor update on this thread
>
>>> * a bool array of 10**8 elements with True in two separate slices of
>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>> array). The resulting filesize is 248kb, still far from storing the 4
>>> or 6 integer indexes that define the slices (I am experimenting with
>>> an approach for scientific databases where this is a concern).
>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>> apparently a 350 to 1 is not enough? :)
> Here I expected more from a run-length-like compression scheme. My
> array would be compressible to the following representation:
>
> (0, x) : 0
> (x, x+10**6) : 1
> (x+10**6, y) : 0
> (y, y+10**6) : 1
> (y+10**6, 10**8) : 0
>
> or just:
>
> (x, x+10**6) : 1
> (y, y+10**6) : 1
>
> where x and y are two reasonable integers (i.e. in range and with no
> overlap).

Sure, but this is not the spirit of a compressor adapted to the blocking
technique (in the sense of [1]). For a compressor that works with blocks,
you need to add some metainformation for each block, and that takes
space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.

[1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

>>> * how blosc chooses the chunklen is black magic for me, but it seems
>>> to be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>> 64*1024 when CArraying only one row).
>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>> you detail a bit more how you achieve this result? Providing an
>> example would be very useful.
> I revisited this issue. While in PyTables CArray the guesses are
> reasonable, the problem is in carray.carray (or in its reporting of
> chunklen).
>
> This is the offender:
> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
> cparams := cparams(clevel=5, shuffle=True)
>
> In [87]: x.chunklen
> Out[87]: 1
>
> Could it be that carray is not reporting the second dimension of the
> chunkshape? (in PyTables, this is 262144)

Ah yes, this is it. The carray package is not as sophisticated as HDF5,
and it only blocks in the leading dimension. In this case, it is saying
that the block is a complete row, so this is the intended behaviour (the
sketch at the end of this message contrasts the two schemes).

> The fact that both PyTables' CArray and carray.carray are named carray
> is a bit confusing.

Yup, agreed. Don't know what to do here. carray was more a
proof-of-concept than anything else, but if development for it continues
in the future, I should ponder changing the names.

-- 
Francesc Alted
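A small sketch contrasting the two blocking schemes (2012-era carray and
PyTables 2.x APIs assumed; the shape is scaled down from the example
above, and the printed chunk sizes are indicative only):

import numpy as np
import carray as ca
import tables

a = np.zeros((64, 2000000), dtype=np.int16)

# carray blocks only along the leading dimension:
c = ca.carray(a, cparams=ca.cparams(clevel=5, shuffle=True))
print(c.chunklen)     # 1 -> each chunk is one complete row

# HDF5/PyTables chunks are multidimensional and can split a row:
f = tables.openFile('chunks.h5', 'w')
x = f.createCArray(f.root, 'x', tables.Int16Atom(), a.shape,
                   filters=tables.Filters(complevel=5, complib='blosc'))
print(x.chunkshape)   # something like (1, 262144)
f.close()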
From: Anthony S. <sc...@gm...> - 2012-04-25 22:19:53
|
Hello Alvaro,

Thanks for writing this up. I think this would go nicely in our docs if
you are willing to let us add it ;).

My one comment is that in your NOTE you say that "SQL tables do not play
well with Numpy containers." I think this would be better phrased by
saying that when converting SQL tables to/from NumPy record arrays or
PyTables Tables you are adding or removing the ordering of rows and
columns. Additionally, going from NumPy / PyTables to SQL requires that
you perform a unique() or set() operation on the rows (see the sketch at
the end of this message). So it isn't that they *can't* play nicely
together, but rather that you have to understand how they do.

Thanks again.

Be Well
Anthony

On Wed, Apr 25, 2012 at 4:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
> Hello list,
>
> The relational model has a strong foundation and I have spent a few
> hours thinking about what in PyTables is structurally different from
> it. Here are my thoughts. I would be delighted if you could
> add/comment/correct on these ideas.
> [...]
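A minimal sketch of that deduplication step (plain NumPy and sqlite3;
names are illustrative):

import sqlite3
import numpy as np

rows = np.array([(1, 'sedan'), (2, 'sedan'), (1, 'sedan')],
                dtype=[('part_id', np.int32), ('model', 'S6')])

# A relation is a *set* of tuples: drop duplicate rows (and with them
# any reliance on row order) before handing the data to SQL.
unique_rows = set(rows.tolist())   # np.unique(rows) is an alternative

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE parts (part_id INTEGER, model TEXT)")
conn.executemany("INSERT INTO parts VALUES (?, ?)", unique_rows)
conn.commit()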
From: Alvaro T. C. <al...@mi...> - 2012-04-25 21:41:40
|
Hello list,

The relational model has a strong foundation and I have spent a few hours
thinking about what in PyTables is structurally different from it. Here
are my thoughts. I would be delighted if you could add/comment/correct on
these ideas. This could eventually help people with a relational/SQL
background who want to know how best to use the idioms of PyTables for
their data modeling.

---

I make a distinction between relational and SQL (see C. J. Date's "SQL
and Relational Theory" for more on that). From a purely structural point
of view, the following differences are apparent:

1. Relations vs. sequences. Relations are sets (i.e. not ordered) of
   tuples (again, not ordered).

   1a. Rows: in PyTables, every container has an implicit row number,
       meaning there is always a candidate key and order matters.
       Although strictly an implementation-level concern, row numbers in
       PyTables are not stored but computed, thanks to the on-disk
       adjacency of the records. This is important for large datasets,
       where storing row numbers would roughly double the disk space.

   1b. Columns: in PyTables, columns are ordered. That is not the case in
       a purely relational system, but it is the case in SQL.

2. Flat vs. hierarchical tablespace. SQL tables live in a global
   namespace; PyTables objects can be put inside Groups. Each approach
   can be mapped onto the other by name mangling. Groups in PyTables are
   like tables of tables -- for each node in a group there is a full
   table (or another group...). This introduces a possible ambiguity in
   data modeling.

   Consider a table of car parts: one column is Part ID and the other is
   Model ID, indicating in which car models a particular part is used. In
   PyTables you can construct that single table /or/ create a /models
   group with one table per model, each consisting of a single column of
   Part IDs, e.g. /models/sedan, /models/cabrio, etc. The same split is
   possible in a relational setting (dividing the tables according to one
   attribute and naming them after the attribute value, e.g. model_sedan,
   model_cabrio...). The defining difference is that in the relational
   setting the interface to manipulate either layout is the same (it is a
   table), whereas in PyTables one layout is a Table object and the other
   is a list of Nodes, and the APIs for the two differ slightly. (A
   sketch of both layouts is at the end of this message.)

3. Attributes of tables and integrity. Any Node (Groups and Tables
   included) can receive a limited amount of metadata in PyTables, via
   the attached attribute set. In SQL, metadata is limited to some
   keywords, used at table-creation time, that establish constraints on
   columns with a functional significance; a prime example is identifying
   foreign keys. SQL can use this information structurally at join time,
   whereas in PyTables one is free to implement this or any other
   navigational scheme in a customized way using the attributes of the
   table.

   When designing such a scheme, remember that PyTables tables always
   have an implicit column containing the row numbers, and this is likely
   to be used as a key.

NOTE: I intentionally excluded implementation issues whenever they are
not related to structural ones, e.g. SQL tables do not play well with
NumPy containers and are thus ill-suited for big data with Python.
Another example would be all the features related to
transactions/concurrency and authorization, which are orthogonal to the
data model.

-á.
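A sketch of the two layouts from the car-parts example (PyTables 2.x API;
all names are illustrative):

import tables

class Part(tables.IsDescription):
    part_id = tables.Int32Col(pos=0)   # pos fixes the column order
    model_id = tables.Int32Col(pos=1)

class PartOnly(tables.IsDescription):
    part_id = tables.Int32Col()

f = tables.openFile('parts.h5', 'w')

# Layout 1: one flat table, relational style.
flat = f.createTable('/', 'parts', Part)
flat.append([(1, 0), (2, 0), (1, 1)])  # (part_id, model_id) pairs

# Layout 2: a /models group with one single-column table per model.
models = f.createGroup('/', 'models')
for name, parts in [('sedan', [(1,), (2,)]), ('cabrio', [(1,)])]:
    t = f.createTable(models, name, PartOnly)
    t.append(parts)

# The query APIs differ: a where() condition on the flat table versus
# navigating to a group's child node.
sedan1 = [r['part_id'] for r in flat.where('model_id == 0')]
sedan2 = f.getNode('/models/sedan').cols.part_id[:]
f.close()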