From: Francesc A. <fa...@py...> - 2012-05-10 19:47:25
On 5/10/12 12:14 PM, Alvaro Tejero Cantero wrote:

> The graphical explanation of the different containers is masterly and,
> I believe, supersedes the table that we had talked about for the
> documentation.
>
> I think the schematics deserve a prominent place on the web page. They
> are a very good symbolic explanation of the basics of PyTables.

Glad that you like it. In fact I think you are right: this is perhaps the first time that schematics have been used for describing the basic objects in PyTables. And my impression from the talk yesterday is that people really get the gist of PyTables very quickly.

> As for the tables.Expr example of an in-kernel query,
>
>   [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
>
> now that there exists, thanks to Josh, a facility to obtain dataset
> sizes, perhaps some interesting things become possible.

I think you are mixing concepts here. tables.Expr is for out-of-core operations. I suppose you mean Numexpr here.

> a) I have always wondered why tables.Expr 'must' be used in an
> iterative context, i.e. pay the price of building the Python list,
> which is not the best container to iterate on afterwards. My
> explanation for it is that you don't know how big the result set will
> be, and thus want to avoid returning a big object in memory. But now
> it would be possible that, if the size of the columns involved fits in
> memory (or, let's say, a configurable fraction of the total RAM),
> PyTables returns a numpy mask or an index array, which are certainly
> very useful for further numpy work. A new function name could be
> provided for this functionality.

Hmm, the Table.where() iterator is very fast already (I can assure you that a lot of optimization and caching work is in there), but I agree that, for the indexed case, there would be situations where returning a mask or an index array would be better (read: faster).

> b) more generally, expanding on this, knowing the size of datasets and
> the available memory, PyTables could eventually decide whether to
> perform operations in memory or in kernel.

In-memory or in-kernel? You probably mean indexed or in-kernel, right? Yes, that's certainly another nice place for further optimizations.

--
Francesc Alted
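A minimal, self-contained sketch of the two query styles being compared here, the where() iterator versus an explicit index array, assuming the 2.x API of the era (openFile, getWhereList, readCoordinates); the file and column names are invented for illustration:

    import numpy as np
    import tables as pt

    h5 = pt.openFile('demo.h5', 'w')
    tbl = h5.createTable('/', 'points',
                         {'c1': pt.Int32Col(), 'c2': pt.Float64Col(),
                          'c3': pt.BoolCol()}, 'toy table')
    tbl.append(zip(range(10), np.linspace(0.0, 4.0, 10), [True] * 10))
    tbl.flush()

    # In-kernel query: iterate over matching rows, collect one column.
    selected = [r['c1'] for r in tbl.where('(c2 > 2.1) & (c3 == True)')]

    # Index-array alternative: fetch the matching coordinates first,
    # then read whole rows from them.
    coords = tbl.getWhereList('(c2 > 2.1) & (c3 == True)')
    rows = tbl.readCoordinates(coords)

    h5.close()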
From: Alvaro T. C. <al...@mi...> - 2012-05-10 17:14:43
The graphical explanation of the different containers is masterly and, I believe, supersedes the table that we had talked about for the documentation.

I think the schematics deserve a prominent place on the web page. They are a very good symbolic explanation of the basics of PyTables.

As for the tables.Expr example of an in-kernel query,

  [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]

now that there exists, thanks to Josh, a facility to obtain dataset sizes, perhaps some interesting things become possible:

a) I have always wondered why tables.Expr 'must' be used in an iterative context, i.e. pay the price of building the Python list, which is not the best container to iterate on afterwards. My explanation for it is that you don't know how big the result set will be, and thus want to avoid returning a big object in memory. But now it would be possible that, if the size of the columns involved fits in memory (or, let's say, a configurable fraction of the total RAM), PyTables returns a numpy mask or an index array, which are certainly very useful for further numpy work. A new function name could be provided for this functionality.

b) more generally, expanding on this, knowing the size of datasets and the available memory, PyTables could eventually decide whether to perform operations in memory or in kernel.

What do you think?

-á.

On Thu, May 10, 2012 at 5:10 PM, Anthony Scopatz <sc...@gm...> wrote:
> Thanks for sharing Francesc!
>
> On Thu, May 10, 2012 at 10:36 AM, Francesc Alted <fa...@py...> wrote:
>> Hey List,
>>
>> Just a few words to inform you that yesterday I gave a quite extensive
>> talk about PyTables at the Austin Python Meetup. I explained not only
>> the basics of PyTables but also its most advanced features
>> (compression, out-of-core computation and querying). People were quite
>> responsive and asked a lot of questions, especially on the compression
>> (Blosc) and query features.
>>
>> You can find the slides here:
>>
>> http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf
>>
>> Cheers,
>>
>> --
>> Francesc Alted
From: Anthony S. <sc...@gm...> - 2012-05-10 16:11:04
Thanks for sharing Francesc!

On Thu, May 10, 2012 at 10:36 AM, Francesc Alted <fa...@py...> wrote:
> Hey List,
>
> Just a few words to inform you that yesterday I gave a quite extensive
> talk about PyTables at the Austin Python Meetup. I explained not only
> the basics of PyTables but also its most advanced features
> (compression, out-of-core computation and querying). People were quite
> responsive and asked a lot of questions, especially on the compression
> (Blosc) and query features.
>
> You can find the slides here:
>
> http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf
>
> Cheers,
>
> --
> Francesc Alted
From: Francesc A. <fa...@py...> - 2012-05-10 15:36:39
Hey List,

Just a few words to inform you that yesterday I gave a quite extensive talk about PyTables at the Austin Python Meetup. I explained not only the basics of PyTables but also its most advanced features (compression, out-of-core computation and querying). People were quite responsive and asked a lot of questions, especially on the compression (Blosc) and query features.

You can find the slides here:

http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf

Cheers,

--
Francesc Alted
From: Anthony S. <sc...@gm...> - 2012-05-06 19:39:54
Hello Christian,

I would probably use the modifyCoordinates() <http://pytables.github.com/usersguide/libref.html#tables.Table.modifyCoordinates>, getWhereList() <http://pytables.github.com/usersguide/libref.html#tables.Table.getWhereList>, and readWhere() <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> methods of Table. Probably the safest way to ensure stability would be something like the following (though if you can make assumptions based on the layout of your data, you could make this faster):

    from itertools import product

    # Distinct values of the four key columns.
    ids = set(oldtable.cols.id)
    years = set(oldtable.cols.year)
    species = set(oldtable.cols.species)
    ages = set(oldtable.cols.age)

    for id, year, specie, age in product(ids, years, species, ages):
        # Parentheses are required around each comparison in a numexpr
        # condition, and {2!r} quotes the string-valued species.
        cond = "(id == {0}) & (year == {1}) & (species == {2!r}) & (age == {3})".format(
            id, year, specie, age)
        newrows = newtable.readWhere(cond)
        oldcoords = oldtable.getWhereList(cond)
        assert len(oldcoords) == len(newrows)
        oldtable.modifyCoordinates(oldcoords, newrows)

As I said, there are probably faster ways, but this would certainly work. If you needed to, you could always sort newrows too.

I hope this helps!

Be Well
Anthony

On Sun, May 6, 2012 at 1:57 PM, PyTables Org <pyt...@go...> wrote:
> Forwarding,
> ~Josh
>
> On Apr 30, 2012, at 5:04 PM, pyt...@li... wrote:
>> From: Christian Werner <fre...@go...>
>> Subject: How would I update specific rows from a table with rows of a
>> second table?
>>
>> [...]
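One possible instance of the "faster ways" hinted at above: loop over the (much smaller) new table instead of sweeping the full Cartesian product of key values. This is only a sketch; it assumes the column names from the question and that the four key columns identify exactly one row:

    for newrow in newtable.iterrows():
        cond = "(id == {0}) & (year == {1}) & (species == {2!r}) & (age == {3})".format(
            newrow['id'], newrow['year'], newrow['species'], newrow['age'])
        coords = oldtable.getWhereList(cond)  # one coordinate per unique key
        oldtable.modifyCoordinates(coords,
                                   newtable.readCoordinates([newrow.nrow]))
    oldtable.flush()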
From: PyTables O. <pyt...@go...> - 2012-05-06 18:57:47
Forwarding,
~Josh

On Apr 30, 2012, at 5:04 PM, pyt...@li... wrote:

> From: Christian Werner <fre...@go...>
> Date: April 30, 2012 5:03:59 PM GMT+02:00
> To: pyt...@li...
> Subject: How would I update specific rows from a table with rows of a second table?
>
> Hi group.
>
> Please consider this scenario. I have produced a large h5 file holding the outcome of some simulations (the hdf file holds some groups and about 5 tables, but for my question this does not matter). Now I had to rerun my simulation, but only for certain fields, and thus I have a second h5 file with the same structure but only for certain fields of the original file (old age classes, see below).
>
> Or to be more verbose:
>
> Table (original)
> id  year  species  age  data1  data2  data3 ...
> 1   1990  spruce    75  1.2    3.2    3.3   ...
> 1   1991  spruce    75  1.3    3.1    2.2   ...
> ...
> 1   1990  spruce   125  1.1    2.1    1.5   ...
> ...
> 1   1990  pine     145  1.1    2.1    1.5   ...
> ...
> 2   1990  spruce    45  1.2    3.2    3.3   ...
> 2   1991  spruce    55  1.3    3.1    2.2   ...
> ...
>
> I had to rerun my simulations for old vegetation classes, so this table only contains the following lines (e.g., spruce age > 80):
>
> Table (new)
> id  year  species  age  data1  data2  data3 ...
> ...
> 1   1990  spruce   125  1.1    2.1    1.5   ...
> ...
> 1   1990  pine     145  1.1    2.1    1.5   ...
>
> So basically I need to use the rows of the new table and overwrite the corresponding rows of the old table to form the updated one. I do need to retain the original order (sort order id > year > species > age). Those columns combined give a unique row identifier...
>
> The final table would have the same size and order as the original table, with only the rows containing data for old age classes updated with the new simulation results...
>
> Can anyone give me a hint how to achieve this?
>
> I guess I need to introduce a new (unique index) column first in both tables (e.g., combining id + year + species + age so I can match those rows by this identifier), right? Does anyone have some example code for me?
>
> Problem is, the 2 h5 files are about 7 GB each.
>
> Cheers,
> Christian
From: Alvaro T. C. <al...@mi...> - 2012-05-02 06:13:37
Hi,

Just for clarification: when you say "simultaneously", do you mean 'all three tables concatenated in the order given in the list' or 'one curve per table, all three in the same plot'? I am going to assume the former.

From the point of view of organizing your data, you really do have the option of concatenating them on disk, because queries and slicing off disk are so fast (you can store indexes to the relevant slices in another table -- my preferred solution -- or dedicate a column to indicate whether a sample belongs in 1, 2, or 3). This is assuming that the file boundary can be smashed, which it probably can't, from the way you formulate your question.

Beware that if you have long time series to plot (i.e. what is your sampling rate?) the bottleneck may be in matplotlib, which may be too slow for the rendering. In that case what you probably want is to couple the panning events to lazy loading of preceding/succeeding data chunks (I would also be interested in this, btw, if anybody has a recipe).

My 2c,

-á.

On Wed, May 2, 2012 at 2:25 AM, le...@cn... <le...@gm...> wrote:
> I am just starting to work with python and pytables and have a question
> for the group.
>
> We have multiple HDF5 files, each of which contains 30 seconds of data
> from a longer sequence. Rather than merge the data, we'd like to be able
> to create a list (or array) of pytables objects and operate on them
> simultaneously.
>
> [...]
>
> but this doesn't work. Is there a syntax that will work for this?
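A minimal sketch of the "concatenate and plot" option assumed above: pull the same array out of several files and join them end to end before plotting. The file names are invented; the node path follows the question below:

    import numpy as np
    import matplotlib.pyplot as plt
    import tables

    fnames = ['seg1.h5', 'seg2.h5', 'seg3.h5']  # hypothetical names
    all_data = [tables.openFile(f) for f in fnames]

    # One array per file, joined in list order.
    elev = np.concatenate([f.root.photon.channel001.elev[:] for f in all_data])
    plt.plot(elev)
    plt.show()

    for f in all_data:
        f.close()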
From: Anthony S. <sc...@gm...> - 2012-05-02 02:04:45
Hello M,

On Tue, May 1, 2012 at 9:25 PM, le...@cn... <le...@gm...> wrote:
> but this doesn't work. Is there a syntax that will work for this?

The short answer is "No." Basically, right now you have to write your own wrapper class which knows how to dispatch various operations in the correct way. This is something that a few of us are looking into writing. You can imagine how this can get rather complicated in more general cases (parallel operations on a distributed file system). If you are interested in helping out on such infrastructure, or have something that you want to share, please let us know!

For now -- and what I currently do -- using lots of list comprehensions and map() will get you 90% of the way there in a serial environment.

Though it doesn't use PyTables, and I have trouble getting it working, and it is mostly for visualization, you may want to look into VisIt if you have to have an out-of-the-box solution.

Be Well
Anthony
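A toy sketch of the kind of wrapper class mentioned above, dispatching one node read across a list of open files. The class and file names are invented; getNode() is the 2.x spelling:

    import tables

    class MultiFile(object):
        """Dispatch the same node lookup across several open HDF5 files."""
        def __init__(self, filenames):
            self.files = [tables.openFile(f) for f in filenames]

        def read(self, path):
            # One numpy array per file, in list order.
            return [f.getNode(path).read() for f in self.files]

        def close(self):
            for f in self.files:
                f.close()

    # Usage sketch:
    # mf = MultiFile(['seg1.h5', 'seg2.h5', 'seg3.h5'])
    # for elev in mf.read('/photon/channel001/elev'):
    #     plt.plot(elev)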
From: <le...@cn...> - 2012-05-02 01:25:27
I am just starting to work with python and pytables and have a question for the group.

We have multiple HDF5 files, each of which contains 30 seconds of data from a longer sequence. Rather than merge the data, we'd like to be able to create a list (or array) of pytables objects and operate on them simultaneously. For instance, we have three h5 data files, each of which can be opened with the openFile command:

  data_1 = tables.openFile(fname_1)
  plt.plot(data_1.root.photon.channel001.elev[:])

If we define the following list:

  all_data = [data_1, data_2, data_3]

then we can do the following:

  plt.plot(all_data[0].root.photon.channel001.elev[:])

What we'd like to do is plot the data from all three (or more generally, N) hdf5 files simultaneously:

  plt.plot(all_data[0].root.photon.channel001.elev[:])

but this doesn't work. Is there a syntax that will work for this?

M
--
Michael Lefsky
Center for Ecological Applications of Lidar
College of Natural Resources
Colorado State University
http://www.researcherid.com/rid/A-7224-2009

If I were creating the world I wouldn't mess about with butterflies and daffodils. I would have started with lasers, eight o'clock, Day One! - Time Bandits
From: Alvaro T. C. <al...@mi...> - 2012-05-01 08:52:13
Ok, I think I know what happened: during interactive manipulation, at some point I must have forgotten to add '[:]' when assigning to the column, i.e.

  mytable.cols.mycolumn = somearray

instead of

  mytable.cols.mycolumn[:] = somearray

This is related to https://github.com/PyTables/PyTables/issues/145

In-memory assignments can shadow access to the object in the file. IMHO this should not be allowed (in fact, why not make the first assignment behave like the second?).

-á.

On Mon, Apr 30, 2012 at 20:24, Alvaro Tejero Cantero <al...@mi...> wrote:
> I am now on another computer (no access to the one where I reported
> the problem until tomorrow; uses 2.3.1 AFAIK) and here I have the
> expected behaviour (2.3.1 as well). I hope that I made a mistake;
> will update tomorrow.
>
> [...]
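For reference, a sketch of two spellings that do push a whole column to disk under the 2.x API, modifyColumn() being the explicit one; the names mytable/somearray/mycolumn follow the message above:

    # Slice assignment updates the column on disk:
    mytable.cols.mycolumn[:] = somearray

    # Equivalent explicit call:
    mytable.modifyColumn(column=somearray, colname='mycolumn')
    mytable.flush()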
From: Alvaro T. C. <al...@mi...> - 2012-04-30 19:24:57
I am now on another computer (no access to the one where I reported the problem until tomorrow; uses 2.3.1 AFAIK) and here I have the expected behaviour (2.3.1 as well). I hope that I made a mistake; will update tomorrow.

-á.

On Mon, Apr 30, 2012 at 20:09, Francesc Alted <fa...@py...> wrote:
> On 4/30/12 12:08 PM, Alvaro Tejero Cantero wrote:
>> Hi all,
>>
>> I created a table:
>>
>>>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
>>
>> I populated it:
>>
>>>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))
>>
>> Now if I do
>>
>>>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)
>>
>> I can see the updated value (all ones instead of zeros).
>>
>> If I fetch all the table in memory,
>>
>>>>> joins.root.spikes[:]
>>
>> I still see only the original zeros, i.e. the column update has not propagated.
>
> Hmm, I cannot reproduce this:
>
> [...]
>
> Which versions are you using?
>
> --
> Francesc Alted
From: Francesc A. <fa...@py...> - 2012-04-30 19:09:12
On 4/30/12 12:08 PM, Alvaro Tejero Cantero wrote:
> Hi all,
>
> I created a table:
>
>>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
>
> I populated it:
>
>>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))
>
> Now if I do
>
>>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)
>
> I can see the updated value (all ones instead of zeros).
>
> If I fetch all the table in memory,
>
>>>> joins.root.spikes[:]
>
> I still see only the original zeros, i.e. the column update has not propagated.

Hmm, I cannot reproduce this:

In [18]: joins = pt.openFile("joins.h5", "w")

In [19]: joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')
Out[19]:
/spikes (Table(0,)) 'Spike times'
  description := {
  "t20k": Int32Col(shape=(), dflt=0, pos=0),
  "tetrode": UInt8Col(shape=(), dflt=0, pos=1),
  "unit": UInt8Col(shape=(), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (10922,)

In [20]: joins.root.spikes.append(zip(np.arange(10),np.zeros(10), 3*np.ones(10)))

In [22]: joins.root.spikes[:]
Out[22]:
array([(0, 0, 3), (1, 0, 3), (2, 0, 3), (3, 0, 3), (4, 0, 3), (5, 0, 3),
       (6, 0, 3), (7, 0, 3), (8, 0, 3), (9, 0, 3)],
      dtype=[('t20k', '<i4'), ('tetrode', 'u1'), ('unit', 'u1')])

In [23]: joins.root.spikes.cols.tetrode[:] = np.ones(10)

In [24]: joins.root.spikes[:]
Out[24]:
array([(0, 1, 3), (1, 1, 3), (2, 1, 3), (3, 1, 3), (4, 1, 3), (5, 1, 3),
       (6, 1, 3), (7, 1, 3), (8, 1, 3), (9, 1, 3)],
      dtype=[('t20k', '<i4'), ('tetrode', 'u1'), ('unit', 'u1')])

Which versions are you using?

--
Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-04-30 17:09:20
Hi all,

I created a table:

>>> joins.createTable('/','spikes',{'t20k':pt.Int32Col(),'tetrode':pt.UInt8Col(), 'unit':pt.UInt8Col()},'Spike times')

I populated it:

>>> joins.root.spikes.append(zip(np.arange(100),np.zeros(100), 3*np.ones(100)))

Now if I do

>>> joins.root.spikes.cols.tetrode[:] = np.ones(100)

I can see the updated value (all ones instead of zeros).

If I fetch all the table in memory,

>>> joins.root.spikes[:]

I still see only the original zeros, i.e. the column update has not propagated.

If I add another 100 rows like above and check joins.root.spikes.cols.tetrode, then I only see the original ones, and not the added zeros.

In between I .flush() table and file abundantly; that changes nothing.

I did this following the instructions in http://www.pytables.org/moin/HintsForSQLUsers (search for the line: tbl.cols.temperature[6:13:3] = cols[0]).

Is this the intended behaviour? It's driving me crazy. However,

>>> joins.root.spikes.col('tetrode')

sees the same as joins.root.spikes[:], but cannot be assigned to.

Álvaro.

-á.
From: Francesc A. <fa...@py...> - 2012-04-29 22:41:06
On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the
>> blocking technique (in the sense of [1]). For a compressor that works
>> with blocks, you need to add some metainformation for each block, and
>> that takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB
>> blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
>
> Absolutely!
>
> Blocking seems a good approach for most data, where the a priori many
> possible values degrade very fast the potential compression gains of a
> run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x penalty in storage,
> would be an excellent candidate for RLE. Boolean arrays are also an
> interesting way to encode attributes by 'bit-vectors': instead of
> storing an enum column 'car color' with values in {red, green, blue},
> you store three boolean arrays 'red', 'green', 'blue'. Where this gets
> interesting is in allowing more generality, because you don't need a
> taxonomy, i.e. red and green need not be exclusive if they are tags on
> a genetic sequence (or, in my case, an electrophysiological recording).
> To compute ANDs and ORs you just have to perform the corresponding
> bit-wise operations if you reconstruct the bit-vector, or you can use
> some smart algorithm on the intervals themselves (as mentioned in
> another mail, I think R*-trees or Nested Containment Lists are two
> viable candidates).
>
> I don't know whether it's possible to have such a specialization for
> compression of boolean arrays in PyTables. Maybe a simple, alternative
> route is to make the chunklength dependent on the likelihood of
> repeated data (i.e. the range of the type domain), or at the very
> least to special-case chunklength estimation for booleans to be
> somewhat higher than for other datatypes. This, again, I think is an
> exception that would do justice to the main use case of PyTables.

Yes, I think you raised a good point here. Well, there are quite a few possibilities to reduce the space taken by highly redundant data, and the first should be to add a special case in Blosc so that, before passing control to blosclz, it first checks whether the data is identical across the whole block, and if so, collapses everything to a counter and a value. This would require a bit more CPU effort at compression time (so it could be active only at higher compression levels), but would lead to far better compression ratios.

Another possibility is to add code to deal directly with compressed data, but that should be done more at the PyTables (or carray, the package) level, with some help from the Blosc compressor. In particular, it would be very interesting to implement interval algebra on top of such extremely compressed interval data.

>> Yup, agreed. Don't know what to do here. carray was more a
>> proof-of-concept than anything else, but if development for it
>> continues in the future, I should ponder changing the names.
>
> It's a neat package and I hope it gets the appreciation and support it
> deserves!

Thanks, I also think it can be useful for some situations. But before it gets wider use, more work should be put into the range of operations supported. Also, defining a C API and being able to use it straight from C could help to spread package adoption too.

--
Francesc Alted
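To put ratios like the 350-to-1 above in context, a sketch of how such a measurement could be taken with a Blosc-compressed CArray. This assumes the size_on_disk facility mentioned elsewhere in this thread; the file name is invented:

    import numpy as np
    import tables as pt

    mask = np.zeros(10**8, dtype=bool)
    mask[10**6:2 * 10**6] = True
    mask[5 * 10**7:5 * 10**7 + 10**6] = True

    h5 = pt.openFile('mask.h5', 'w')
    ca = h5.createCArray('/', 'mask', pt.BoolAtom(), mask.shape,
                         filters=pt.Filters(complevel=5, complib='blosc'))
    ca[:] = mask
    h5.flush()
    print(ca.size_on_disk)  # far smaller than the 10**8 bytes in memory
    h5.close()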
From: Anthony S. <sc...@gm...> - 2012-04-28 23:40:51
Hello Alvaro,

On Sat, Apr 28, 2012 at 5:54 AM, Alvaro Tejero Cantero <al...@mi...> wrote:
> a) what is the reason to bind methods such as createTable & so on to
> the File object instead of putting the respective functions on the
> tables module?

I think the idea is that data set creation only ever happens on the file object or on group objects. Since this task is well encapsulated by such objects, and since you would always have to pass these file/group objects in to the functions you propose, we just follow this common object-oriented design pattern here.

> b) why is it necessary to explicitly dereference links via __call__?

I don't use links, so I'll defer to someone who knows ;)

Be Well
Anthony
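A short sketch of the pattern described above: creation always goes through the file handle, with 'where' given either as a path string or as a Group object (file and table names invented):

    import tables

    h5 = tables.openFile('api.h5', 'w')
    grp = h5.createGroup('/', 'raw')

    # 'where' as a path string:
    t1 = h5.createTable('/raw', 't1', {'x': tables.Int32Col()})
    # 'where' as a Group object:
    t2 = h5.createTable(grp, 't2', {'x': tables.Int32Col()})

    h5.close()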
From: Alvaro T. C. <al...@mi...> - 2012-04-28 10:54:45
Hi,

There are two things about the design of the PyTables API that I don't understand:

a) What is the reason to bind methods such as createTable & so on to the File object, instead of putting the respective functions in the tables module?

Rationale: tables.createTable(where*, ...) could do the same job, where* being 'where' prepended with the file path or a group object. This frees the namespace so as to have myfile.mydataset or myfile.mygroup available without needing to go through root.

b) Why is it necessary to explicitly dereference links via __call__?

Rationale: if it worked like 'mount', then your applications would not need to know whether a given dataset is local to the file or 'mounted' from another file. As it is, the physical design of your database escalates up to the application code, i.e. the latter depends on how you arrange your tables across files.

Derived question: what is the best way to achieve this transparency in application code? E.g. is it a good idea to capture the exceptions that occur when indexing mytable.mylink[:], and try mytable.mylink()[:] before really giving up?

Cheers,

-á.
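For context, a sketch of the explicit dereference in question, using the 2.3-era link API (file and node names invented):

    import tables

    h5 = tables.openFile('links.h5', 'w')
    arr = h5.createArray('/', 'a', range(10))
    h5.createSoftLink('/', 'alink', '/a')

    # Natural naming returns the link object itself; __call__
    # dereferences it to the target node:
    data = h5.root.alink()[:]

    h5.close()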
From: Alvaro T. C. <al...@mi...> - 2012-04-26 11:47:17
Hi Umit,

Thanks for commenting.

> I do think that scientific/hierarchical file formats like HDF5 and
> RDBMS systems have their specific use cases and I don't think it makes
> sense to replace one with the other.

Right. But data modeling is notoriously difficult to get right when complex and changing schemas are involved, and here we have three paradigms involved (relational, object, hierarchical), so I am trying to spell out what is different. In my particular case the DB holds both recorded immutable data and analyses that are continuously created by expensive functions and want to be cached and tightly integrated with the experimental recordings.

> I do also think that you shouldn't try to apply RDBMS principles to
> HDF5-like formats, and also vice versa. Working with NoSQL DBs
> (key/value stores) or graph DBs is different than working with an
> RDBMS. The same applies to HDF5 and RDBMS.

Got it. My e-mail is about pinpointing those differences.

> PyTables introduces some RDBMS-like concepts (tables) but in the end
> it is based on HDF5, and to get the best performance you have to have
> some knowledge about the underlying HDF5 file structure and its
> concepts. The way you should store data in HDF5 really depends on how
> you will access it (if you want to get the best performance).
> Sometimes this means that you have to store the data redundantly in
> two different ways if you have two orthogonal ways to access it.

This is a clever comment that I find important to include in the document. This redundancy is unacceptable in a relational context because things may get out of sync very fast, especially in a multi-user context (which is not supported by HDF5), but here in PyTables, where columns are normally written by a single user in a single operation, it is less of a problem. So having different 'views' of the data in different parts of the tree is a pattern of usage in HDF5 that does not cause many problems, whereas it is strongly discouraged in SQL. On the other hand, there is no analog of stored queries in the form of views for HDF5, which complicates the independence of physical and logical data layout.

Does all of the above make sense?

> However nobody said you can't combine these different storage systems.
> For example you can use an RDBMS to store meta-information and make
> use of relationships, constraints, foreign keys, etc., but store the
> "raw data" (that is not suited for an RDBMS) in HDF5 or PyTables
> respectively, and just relate them by using unique identifiers.

Yes, this is what many people are doing.

> Some databases like PostgreSQL even support retrieving data from
> non-SQL sources (flat files, XML, etc.)
> (http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers) which
> might be of interest.

Yes, we had a discussion about a similar mechanism with SQLite's 'virtual tables' a few days ago.

> P.S.: AFAIK PostgreSQL supports schemas, which might be comparable to
> groups in HDF5.

Would you care to elaborate a bit more on that?

Cheers,

Álvaro.

> On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
>> Hello list,
>>
>> The relational model has a strong foundation, and I have spent a few
>> hours thinking about what in PyTables is structurally different from
>> it. Here are my thoughts. I would be delighted if you could
>> add/comment/correct these ideas. This could eventually help people
>> with a relational/SQL background who want to know how to best use the
>> idioms of PyTables for their data modeling.
>>
>> ---
>>
>> I make a distinction between relational and SQL (see C.J. Date's "SQL
>> and Relational Theory" for more on that).
>>
>> From a purely structural point of view, the following differences are
>> apparent:
>>
>> 1. Relations vs. sequences. Relations are sets (i.e. not ordered) of
>> tuples (again, not ordered).
>>
>>    - rows: In PyTables, every container has an implicit row number,
>>      meaning there is always a candidate key and order matters.
>>      Although strictly an implementation-level concern, row numbers
>>      in PyTables are not stored but computed, thanks to the on-disk
>>      adjacency of the records. This is important for large datasets,
>>      where storing row numbers would mean roughly a doubling of disk
>>      space.
>>    - columns: In PyTables, columns are ordered. That is not the case
>>      in a purely relational system, but it is the case in SQL.
>>
>> 2. Flat tablespace vs. hierarchical tablespace. SQL tables live in a
>> global namespace. PyTables objects can be put inside Groups. Each
>> approach can be mapped onto the other by name mangling. Groups in
>> PyTables are like tables of tables -- for each node in a group there
>> is a full table (or another group...). This introduces a possible
>> ambiguity in data modeling:
>>
>> Consider a table of car parts: one column is Part ID and the other is
>> Model ID, indicating in which car models a particular part is built.
>> In PyTables you can construct that single table /or/ create a /models
>> group and create one table per model consisting of a single column of
>> Part IDs, e.g. /models/sedan, /models/cabrio... etc. The same is
>> possible in a relational setting (dividing the tables according to
>> one attribute, and naming them according to the attribute value, e.g.
>> model_sedan, model_cabrio...). The defining difference is that in SQL
>> the interface to manipulate either listing is the same (it is a
>> table), whereas in PyTables one listing is a Table object and the
>> other is a list of Nodes, and the API for each is a bit different.
>>
>> 3. Attributes of tables and integrity. Any Node (Groups and Tables
>> included) can receive a limited amount of metadata in PyTables, via
>> the attached attribute set. In SQL, metadata is limited to some
>> keywords, used upon table creation, that establish constraints on the
>> columns with a functional significance. A prime example of this is
>> identifying foreign keys. SQL allows this information to be used
>> structurally at the time of joins, whereas in PyTables one is free to
>> implement this or any other navigational scheme in a customized way
>> using the attributes of the table.
>>
>> When designing such a scheme it has to be remembered that PyTables
>> tables always have an implicit column containing the row numbers, and
>> this is likely to be used as a key.
>>
>> NOTE: I intentionally excluded implementation issues whenever they
>> are not related to structural ones, e.g. SQL tables do not play well
>> with NumPy containers and are thus ill-suited for big data with
>> Python. Another example would be all the features related to
>> transactions/concurrency and authorization, which are orthogonal to
>> the data model.
>>
>> -á.
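As a toy illustration of the modeling ambiguity described in the quoted write-up, a sketch of the two layouts for the car-parts example (invented names):

    import tables as pt

    h5 = pt.openFile('parts.h5', 'w')

    # Option 1: one table, with the model as an ordinary column.
    h5.createTable('/', 'parts',
                   {'part_id': pt.Int32Col(), 'model': pt.StringCol(16)})

    # Option 2: a /models group holding one single-column table per model.
    models = h5.createGroup('/', 'models')
    for name in ('sedan', 'cabrio'):
        h5.createTable(models, name, {'part_id': pt.Int32Col()})

    h5.close()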
From: Alvaro T. C. <al...@mi...> - 2012-04-26 11:36:50
Hi Anthony,

On Wed, Apr 25, 2012 at 23:19, Anthony Scopatz <sc...@gm...> wrote:
> Hello Alvaro,
>
> Thanks for writing this up. I think this would go nicely in our docs
> if you are willing to let us add it ;).

Of course! Let us polish it together -- maybe you want to add it then to the existing document at http://www.pytables.org/moin/HintsForSQLUsers?

> My one comment is that in your NOTE you say that "SQL tables do not
> play well with Numpy containers." I think that this would be better
> phrased as saying that when converting SQL tables to/from NumPy record
> arrays or PyTables Tables you are adding or removing the ordering of
> rows and columns. Additionally, going from NumPy / PyTables to SQL
> requires that you perform a unique() or set() operation on the rows.

What you say would be correct if SQL didn't allow duplicate rows (i.e. it is correct for the pure relational system). But I was thinking, rather (and expressing it rather informally, because it was not the objective of the document), of what happens when you load an array through the DB-API: it has to be done through an iterator, building first a Python list and then converting to a recarray, which happens to be a major bottleneck even if the underlying database is fast. Francesc did a comprehensive study here: http://mail.scipy.org/pipermail/numpy-discussion/2006-November/024732.html. I am not sure if it is completely current, but since the main problem is with the DB-API's 'casts', it is unlikely that the situation is better for other RDBMSs.

Also, loading data in bulk fashion from numpy containers is not completely straightforward (you have to create the right binary representation): http://stackoverflow.com/questions/8144002/use-binary-copy-table-from-with-psycopg2

> So it isn't that they *can't* play nicely together, but rather you
> have to understand how they do. Thanks again.
>
> Be Well
> Anthony
>
> [...]
From: Alvaro T. C. <al...@mi...> - 2012-04-26 10:47:53
Hi,

I tried again, also with different chunklens, and couldn't reproduce it. Unfortunately the session where I had this result was killed by a power outage, and the history buffer does not go back that far, so I can't find out what exactly triggered it.

-á.

On Thu, Apr 26, 2012 at 04:10, Francesc Alted <fa...@py...> wrote:
> On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
>> Hi,
>>
>> Thanks for the clarification.
>>
>> I retried today both with a normal and a completely sorted index on a
>> blosc-compressed table (complevel 5) and could not reproduce the
>> putative bug either.
>
> So could you please confirm if you can reproduce the problem with
> blosc level 9?
>
> Thanks!
>
>> On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
>>> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
>>>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
>>>>> Some complementary info (I copy the details of the tables below)
>>>>>
>>>>> timeit vals = numpy.fromiter((x['val'] for x in
>>>>> my.root.raw.t0.wtab02.where('val>1')), dtype=np.int16)
>>>>> 1 loops, best of 3: 30.4 s per loop
>>>>>
>>>>> Using the compressed and indexed version, it mysteriously does not
>>>>> work (the output is an empty array):
>>>>>
>>>>>>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')), dtype=np.int16)
>>>>>>>> cvals
>>>>> array([], dtype=int16)
>>>>
>>>> This smells like a bug, but I cannot reproduce it. Could you send a
>>>> self-contained example reproducing this behavior?
>>>
>>> I am not able to reproduce this either...
From: Alvaro T. C. <al...@mi...> - 2012-04-26 09:05:29
|
On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fa...@py...> wrote:
> On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
>> Hi, a minor update on this thread
>>
>>>> * a bool array of 10**8 elements with True in two separate slices of
>>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>>> array). The resulting filesize is 248kb, still far from storing the 4
>>>> or 6 integer indexes that define the slices (I am experimenting with
>>>> an approach for scientific databases where this is a concern).
>>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>>> apparently a 350 to 1 is not enough? :)
>> Here I expected more from a run-length-like compression scheme. My
>> array would be compressible to the following representation:
>>
>> (0, x) : 0
>> (x, x+10**6) : 1
>> (x+10**6, y) : 0
>> (y, y+10**6) : 1
>> (y+10**6, 10**8) : 0
>>
>> or just:
>>
>> (x, x+10**6) : 1
>> (y, y+10**6) : 1
>>
>> where x and y are two reasonable integers (i.e. in range and with no
>> overlap).
>
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]). For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

Absolutely! Blocking seems a good approach for most data, where the large
number of possible values quickly erodes the potential gains of a
run-length-encoding (RLE) based scheme. But boolean arrays, which are used
extremely often as masks in scientific applications and already suffer an
8x storage penalty, would be an excellent candidate for RLE.

Boolean arrays are also an interesting way to encode attributes as
'bit-vectors': instead of storing an enum column 'car color' with values
in {red, green, blue}, you store three boolean arrays 'red', 'green',
'blue'. Where this gets interesting is in allowing more generality,
because you don't need a taxonomy -- red and green need not be exclusive
if they are tags on a genetic sequence (or, in my case, an
electrophysiological recording). To compute ANDs and ORs you just perform
the corresponding bit-wise operations if you reconstruct the bit-vectors,
or you can use some smart algorithm on the intervals themselves (as
mentioned in another mail, I think R*-trees and Nested Containment Lists
are two viable candidates). A minimal sketch of the bit-vector idea is at
the end of this message.

I don't know whether it is possible to have such a specialization for
compression of boolean arrays in PyTables. Maybe a simple alternative
route is to make the chunk length dependent on the likelihood of repeated
data (i.e. the range of the type domain), or at the very least to
special-case chunk-length estimation for booleans to be somewhat higher
than for other datatypes. This, again, is an exception that I think would
do justice to the main use case of PyTables.

>>>> * how blosc chooses the chunklen is black magic for me, but it seems
>>>> to be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>>> 64*1024 when CArraying only one row).
>>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>>> you detail a bit more how you achieve this result? Providing an
>>> example would be very useful.
>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen).
>>
>> This is the offender:
>> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>> cparams := cparams(clevel=5, shuffle=True)
>>
>> In [87]: x.chunklen
>> Out[87]: 1
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (in PyTables, this is 262144)
>
> Ah yes, this is it. The carray package is not as sophisticated as HDF5,
> and it only blocks in the leading dimension. In this case, it is saying
> that the block is a complete row. So this is the intended behaviour.

Ok, it makes sense, and in my particular use case the rows do fit in
memory, so there is no need for further chunking.

>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed. Don't know what to do here. carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder changing the names.

It's a neat package and I hope it gets the appreciation and support it
deserves!

Cheers,

Álvaro.
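P.S. A minimal sketch of the bit-vector tagging idea in plain NumPy
(sizes and names are illustrative; the last two lines assume the carray
package's wheretrue() iterator mentioned above):

import numpy as np
import carray as ca

n = 10**6                        # number of samples (illustrative)
red = np.zeros(n, dtype=bool)    # one boolean 'tag' array per attribute
green = np.zeros(n, dtype=bool)
red[1000:2000] = True            # tag two overlapping stretches
green[1500:2500] = True

both = red & green               # tags combine with plain bit-wise ops
either = red | green
idx = np.nonzero(both)[0]        # indices carrying both tags

cboth = ca.carray(both)          # compressed counterpart of the mask
cidx = list(cboth.wheretrue())   # same indices via carray's iterator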
From: Ümit S. <uem...@gm...> - 2012-04-26 08:16:43
|
Good points. Just some additional comments:

I do think that scientific/hierarchical file formats like HDF5 and RDBMS
systems have their specific use cases, and I don't think it makes sense
to replace one with the other. I also think that you shouldn't try to
apply RDBMS principles to HDF5-like formats, and vice versa. Working with
NoSQL DBs (key/value stores) or graph DBs is different from working with
an RDBMS. The same applies to HDF5 and RDBMS. PyTables introduces some
RDBMS-like concepts (tables), but in the end it is based on HDF5, and to
get the best performance you have to have some knowledge of the
underlying HDF5 file structure and its concepts.

The way you should store data in HDF5 really depends on how you will
access it (if you want the best performance). Sometimes this means that
you have to store the data redundantly, in two different layouts, if you
have two orthogonal ways of accessing it.

However, nobody said you can't combine these different storage systems.
For example, you can use an RDBMS to store meta-information and make use
of relationships, constraints, foreign keys, etc., but store the "raw
data" (which is not suited to an RDBMS) in HDF5 or PyTables respectively,
and relate the two through a unique identifier (see the sketch at the end
of this message). Some databases like PostgreSQL even support retrieving
data from non-SQL sources (flat files, XML, etc.)
(http://en.wikipedia.org/wiki/PostgreSQL#Foreign_Data_Wrappers), which
might be of interest.

cheers
Ümit

P.S.: AFAIK PostgreSQL supports schemas, which might be comparable to
groups in HDF5.

On Wed, Apr 25, 2012 at 11:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
> Hello list,
>
> The relational model has a strong foundation and I have spent a few
> hours thinking about what in PyTables is structurally different from
> it. Here are my thoughts. I would be delighted if you could
> add/comment/correct on these ideas.
> [...]
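A minimal sketch of that combination, with sqlite3 holding the relational
metadata and PyTables the bulk samples (PyTables 2.x API; all file, table
and node names are illustrative):

import sqlite3
import tables

# Relational side: metadata, relationships, constraints.
conn = sqlite3.connect('meta.db')
conn.execute("CREATE TABLE IF NOT EXISTS recordings "
             "(id INTEGER PRIMARY KEY, subject TEXT, node TEXT)")

# HDF5 side: the bulk numerical data, chunked and compressed.
h5 = tables.openFile('raw.h5', 'a')
arr = h5.createCArray(h5.root, 'rec42', tables.Int16Atom(),
                      shape=(64, 2000000),
                      filters=tables.Filters(complevel=5, complib='blosc'))

# The unique identifier ties a metadata row to an HDF5 node path.
conn.execute("INSERT INTO recordings VALUES (42, 'subj01', '/rec42')")
conn.commit()

# Later: resolve the node path in SQL, then read a slab from HDF5.
(path,) = conn.execute("SELECT node FROM recordings WHERE id=42").fetchone()
slab = h5.getNode(path)[:, :1000]

h5.close()
conn.close()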
From: Francesc A. <fa...@py...> - 2012-04-26 03:10:22
|
On 4/25/12 6:13 AM, Alvaro Tejero Cantero wrote:
> Hi,
>
> Thanks for the clarification.
>
> I retried today both with a normal and a completely sorted index on a
> blosc-compressed table (complevel 5) and could not reproduce the
> putative bug either.

So could you please confirm whether you can reproduce the problem with
blosc level 9? A self-contained script along the lines of the sketch at
the end of this message would be a good starting point. Thanks!

> -á.
>
> On Tue, Apr 24, 2012 at 04:39, Anthony Scopatz <sc...@gm...> wrote:
>> On Mon, Apr 23, 2012 at 9:14 PM, Francesc Alted <fa...@py...> wrote:
>>> On 4/19/12 8:43 AM, Alvaro Tejero Cantero wrote:
>>>> Some complementary info (I copy the details of the tables below)
>>>>
>>>> timeit vals = numpy.fromiter((x['val'] for x in
>>>> my.root.raw.t0.wtab02.where('val>1')),dtype=np.int16)
>>>> 1 loops, best of 3: 30.4 s per loop
>>>>
>>>> Using the compressed and indexed version, it mysteriously does not
>>>> work (output is an empty array):
>>>>>>> cvals = np.fromiter((x['val'] for x in wctab02.where('val>1')),
>>>>>>> dtype=np.int16)
>>>>>>> cvals
>>>> array([], dtype=int16)
>>> This smells like a bug, but I cannot reproduce it. Could you send a
>>> self-contained example reproducing this behavior?
>>
>> I am not able to reproduce this either...

-- 
Francesc Alted
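For reference, an untested sketch of such a script (PyTables 2.x API;
sizes and names are illustrative):

import numpy as np
import tables

f = tables.openFile('repro.h5', 'w')
filters = tables.Filters(complevel=9, complib='blosc')
t = f.createTable('/', 'wctab', {'val': tables.Int16Col()},
                  filters=filters)

# Fill the table with random small integers.
data = np.zeros(10**6, dtype=[('val', np.int16)])
data['val'] = np.random.randint(-5, 5, 10**6)
t.append(data)
t.flush()

t.cols.val.createCSIndex()   # completely sorted index on 'val'

vals = np.fromiter((r['val'] for r in t.where('val > 1')),
                   dtype=np.int16)
print(len(vals))             # should equal (data['val'] > 1).sum()
f.close()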
From: Francesc A. <fa...@py...> - 2012-04-26 03:07:54
|
On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
> Hi, a minor update on this thread
>
>>> * a bool array of 10**8 elements with True in two separate slices of
>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>> array). The resulting filesize is 248kb, still far from storing the 4
>>> or 6 integer indexes that define the slices (I am experimenting with
>>> an approach for scientific databases where this is a concern).
>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>> apparently a 350 to 1 is not enough? :)
> Here I expected more from a run-length-like compression scheme. My
> array would be compressible to the following representation:
>
> (0, x) : 0
> (x, x+10**6) : 1
> (x+10**6, y) : 0
> (y, y+10**6) : 1
> (y+10**6, 10**8) : 0
>
> or just:
>
> (x, x+10**6) : 1
> (y, y+10**6) : 1
>
> where x and y are two reasonable integers (i.e. in range and with no
> overlap).

Sure, but this is not the spirit of a compressor adapted to the blocking
technique (in the sense of [1]). For a compressor that works with blocks,
you need to add some metainformation for each block, and that takes
space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.

[1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

>>> * how blosc chooses the chunklen is black magic for me, but it seems
>>> to be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>> 64*1024 when CArraying only one row).
>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>> you detail a bit more how you achieve this result? Providing an
>> example would be very useful.
> I revisited this issue. While in PyTables CArray the guesses are
> reasonable, the problem is in carray.carray (or in its reporting of
> chunklen).
>
> This is the offender:
> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
> cparams := cparams(clevel=5, shuffle=True)
>
> In [87]: x.chunklen
> Out[87]: 1
>
> Could it be that carray is not reporting the second dimension of the
> chunkshape? (in PyTables, this is 262144)

Ah yes, this is it. The carray package is not as sophisticated as HDF5,
and it only blocks in the leading dimension. In this case, it is saying
that the block is a complete row, so this is the intended behaviour (the
sketch at the end of this message contrasts the two schemes).

> The fact that both PyTables' CArray and carray.carray are named carray
> is a bit confusing.

Yup, agreed. Don't know what to do here. carray was more a
proof-of-concept than anything else, but if development for it continues
in the future, I should ponder changing the names.

-- 
Francesc Alted
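A small sketch contrasting the two blocking schemes (2012-era carray and
PyTables 2.x APIs assumed; the shape is scaled down from the example
above, and the printed chunk sizes are indicative only):

import numpy as np
import carray as ca
import tables

a = np.zeros((64, 2000000), dtype=np.int16)

# carray blocks only along the leading dimension:
c = ca.carray(a, cparams=ca.cparams(clevel=5, shuffle=True))
print(c.chunklen)     # 1 -> each chunk is one complete row

# HDF5/PyTables chunks are multidimensional and can split a row:
f = tables.openFile('chunks.h5', 'w')
x = f.createCArray(f.root, 'x', tables.Int16Atom(), a.shape,
                   filters=tables.Filters(complevel=5, complib='blosc'))
print(x.chunkshape)   # something like (1, 262144)
f.close()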
From: Anthony S. <sc...@gm...> - 2012-04-25 22:19:53
|
Hello Alvaro,

Thanks for writing this up. I think this would go nicely in our docs if
you are willing to let us add it ;).

My one comment is that in your NOTE you say that "SQL tables do not play
well with Numpy containers." I think this would be better phrased by
saying that when converting SQL tables to/from NumPy record arrays or
PyTables Tables you are adding or removing the ordering of rows and
columns. Additionally, going from NumPy / PyTables to SQL requires that
you perform a unique() or set() operation on the rows (see the sketch at
the end of this message). So it isn't that they *can't* play nicely
together, but rather that you have to understand how they do.

Thanks again.

Be Well
Anthony

On Wed, Apr 25, 2012 at 4:41 PM, Alvaro Tejero Cantero <al...@mi...> wrote:
> Hello list,
>
> The relational model has a strong foundation and I have spent a few
> hours thinking about what in PyTables is structurally different from
> it. Here are my thoughts. I would be delighted if you could
> add/comment/correct on these ideas.
> [...]
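A minimal sketch of that deduplication step (plain NumPy and sqlite3;
names are illustrative):

import sqlite3
import numpy as np

rows = np.array([(1, 'sedan'), (2, 'sedan'), (1, 'sedan')],
                dtype=[('part_id', np.int32), ('model', 'S6')])

# A relation is a *set* of tuples: drop duplicate rows (and with them
# any reliance on row order) before handing the data to SQL.
unique_rows = set(rows.tolist())   # np.unique(rows) is an alternative

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE parts (part_id INTEGER, model TEXT)")
conn.executemany("INSERT INTO parts VALUES (?, ?)", unique_rows)
conn.commit()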
From: Alvaro T. C. <al...@mi...> - 2012-04-25 21:41:40
|
Hello list,

The relational model has a strong foundation and I have spent a few hours
thinking about what in PyTables is structurally different from it. Here
are my thoughts. I would be delighted if you could add/comment/correct on
these ideas. This could eventually help people with a relational/SQL
background who want to know how best to use the idioms of PyTables for
their data modeling.

---

I make a distinction between relational and SQL (see C. J. Date's "SQL
and Relational Theory" for more on that). From a purely structural point
of view, the following differences are apparent:

1. Relations vs. sequences. Relations are sets (i.e. not ordered) of
   tuples (again, not ordered).

   1a. Rows: in PyTables, every container has an implicit row number,
       meaning there is always a candidate key and order matters.
       Although strictly an implementation-level concern, row numbers in
       PyTables are not stored but computed, thanks to the on-disk
       adjacency of the records. This is important for large datasets,
       where storing row numbers would roughly double the disk space.

   1b. Columns: in PyTables, columns are ordered. That is not the case in
       a purely relational system, but it is the case in SQL.

2. Flat vs. hierarchical tablespace. SQL tables live in a global
   namespace; PyTables objects can be put inside Groups. Each approach
   can be mapped onto the other by name mangling. Groups in PyTables are
   like tables of tables -- for each node in a group there is a full
   table (or another group...). This introduces a possible ambiguity in
   data modeling.

   Consider a table of car parts: one column is Part ID and the other is
   Model ID, indicating in which car models a particular part is used. In
   PyTables you can construct that single table /or/ create a /models
   group with one table per model, each consisting of a single column of
   Part IDs, e.g. /models/sedan, /models/cabrio, etc. The same split is
   possible in a relational setting (dividing the tables according to one
   attribute and naming them after the attribute value, e.g. model_sedan,
   model_cabrio...). The defining difference is that in the relational
   setting the interface to manipulate either layout is the same (it is a
   table), whereas in PyTables one layout is a Table object and the other
   is a list of Nodes, and the APIs for the two differ slightly. (A
   sketch of both layouts is at the end of this message.)

3. Attributes of tables and integrity. Any Node (Groups and Tables
   included) can receive a limited amount of metadata in PyTables, via
   the attached attribute set. In SQL, metadata is limited to some
   keywords, used at table-creation time, that establish constraints on
   columns with a functional significance; a prime example is identifying
   foreign keys. SQL can use this information structurally at join time,
   whereas in PyTables one is free to implement this or any other
   navigational scheme in a customized way using the attributes of the
   table.

   When designing such a scheme, remember that PyTables tables always
   have an implicit column containing the row numbers, and this is likely
   to be used as a key.

NOTE: I intentionally excluded implementation issues whenever they are
not related to structural ones, e.g. SQL tables do not play well with
NumPy containers and are thus ill-suited for big data with Python.
Another example would be all the features related to
transactions/concurrency and authorization, which are orthogonal to the
data model.

-á.
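A sketch of the two layouts from the car-parts example (PyTables 2.x API;
all names are illustrative):

import tables

class Part(tables.IsDescription):
    part_id = tables.Int32Col(pos=0)   # pos fixes the column order
    model_id = tables.Int32Col(pos=1)

class PartOnly(tables.IsDescription):
    part_id = tables.Int32Col()

f = tables.openFile('parts.h5', 'w')

# Layout 1: one flat table, relational style.
flat = f.createTable('/', 'parts', Part)
flat.append([(1, 0), (2, 0), (1, 1)])  # (part_id, model_id) pairs

# Layout 2: a /models group with one single-column table per model.
models = f.createGroup('/', 'models')
for name, parts in [('sedan', [(1,), (2,)]), ('cabrio', [(1,)])]:
    t = f.createTable(models, name, PartOnly)
    t.append(parts)

# The query APIs differ: a where() condition on the flat table versus
# navigating to a group's child node.
sedan1 = [r['part_id'] for r in flat.where('model_id == 0')]
sedan2 = f.getNode('/models/sedan').cols.part_id[:]
f.close()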