Thread: [Pytables-users] What is the result of calling createIndex() on multiple columns?

From: Aquil H. A. <aqu...@gm...> - 2012-06-26 21:19:43

Hello All,  

In my newbie state, I called createIndex on two columns in one of my tables:

import tables

table_desc = {
    'timestamp': tables.Time32Col(),
    'symbol': tables.StringCol(8),
    'observation': tables.Float32Col(),
}
h5f = tables.openFile('test.h5', mode='w')
group = h5f.createGroup('/', 'data')
table = h5f.createTable(group, 'test', table_desc, 'Test Table')
# Create one index per column
table.cols.timestamp.createIndex()
table.cols.symbol.createIndex()
…

Now from what I've been able to find on the internet an index is only associated with one column:

class tables.Index
    Represents the index of a column in a table.

    This class is used to keep the indexing information for columns in a Table dataset (see The Table class). It is actually the descendant of the  
    Group class (see The Group class), with some added functionality. An Index is always associated with one and only one column in a table.

- PyTables 2.3.1 User's Guide - Library Reference/The Index Class http://pytables.github.com/usersguide/libref.html#indexclassdescr
- Efficient way to verify that records are unique in Python/PyTables http://stackoverflow.com/questions/1315129/efficient-way-to-verify-that-records-are-unique-in-python-pytables
- Hints For SQL Users (Creating an index) http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex

So how does PyTables interpret a table with multiple column indices?  The best solution that I've found is creating a hash from the two fields that I am interested in indexing and then indexing that table on that hash.

The other solution would be to shard my data by symbol and then index each symbol table by timestamp.

Can anyone explain what effect two indexed columns have on PyTables?
Also, can anyone tell me if they've come up with a better solution for dealing with tables that require multiple indices than any that I've mentioned?

Regards,

--  
Aquil H. Abdullah

Re: [Pytables-users] What is the result of calling createIndex() on multiple columns?

From: Anthony S. <sc...@gm...> - 2012-06-26 21:30:43

On Tue, Jun 26, 2012 at 4:19 PM, Aquil H. Abdullah <aqu...@gm...> wrote:

> So how does PyTables interpret a table with multiple column indices?  The
> best solution that I've found is creating a hash from the two fields that I
> am interested in indexing and then indexing that table on that hash.
>
> The other solution would be to shard my data by symbol and then index each
> symbol table by timestamp.
>
> Can anyone explain what effect two index columns has on Pytables?
> Also, can anyone tell me if they've come up with a better solution for
> dealing with tables that require multiple indices than any that I've
> mentioned?
>

I don't have a lot of time right now, but maybe create a nested column or a
column with a compound data type that is just a tuple of the two data types
you are interested in.  Then index against the super column.  Storing a
hash in another column is probably not the greatest way to do this...

Hopefully someone else can jump in and answer this one.
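
A minimal sketch of the combined-column idea, in plain Python rather than PyTables. It packs (symbol, timestamp) into one fixed-width byte string whose sort order matches the tuple's sort order, so a single sorted key can answer "this symbol, this time range", which a hash cannot do because a hash only supports equality lookups. The names composite_key and count_in_range are hypothetical, and the sketch assumes ASCII symbols of at most 8 bytes and non-negative timestamps. Whether PyTables can index such a column for a given use case is not something this sketch shows.

```python
import struct
from bisect import bisect_left, bisect_right

SYMBOL_WIDTH = 8  # assumption: same width as StringCol(8) in the question


def composite_key(symbol, timestamp):
    """Pack (symbol, timestamp) into 12 bytes that sort like the tuple."""
    raw = symbol.encode("ascii")
    if len(raw) > SYMBOL_WIDTH:
        raise ValueError("symbol longer than %d bytes" % SYMBOL_WIDTH)
    # Null-pad the symbol to a fixed width, then append the timestamp as a
    # big-endian unsigned int so that byte order equals numeric order.
    return raw.ljust(SYMBOL_WIDTH, b"\0") + struct.pack(">I", timestamp)


def count_in_range(keys, symbol, t_start, t_stop):
    """Count sorted keys for `symbol` with t_start <= timestamp <= t_stop."""
    lo = bisect_left(keys, composite_key(symbol, t_start))
    hi = bisect_right(keys, composite_key(symbol, t_stop))
    return hi - lo


rows = [("MSFT", 30), ("AAPL", 20), ("AAPL", 10), ("MSFT", 5), ("AAPL", 30)]
keys = sorted(composite_key(s, t) for s, t in rows)

print(count_in_range(keys, "AAPL", 10, 20))  # 2
```

The big-endian packing is the design choice that matters here: with a little-endian or hashed encoding, the keys for one symbol would no longer sit next to each other in timestamp order.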


Re: [Pytables-users] What is the result of calling createIndex() on multiple columns?

From: Francesc A. <fa...@py...> - 2012-06-27 08:43:59

On 6/26/12 11:19 PM, Aquil H. Abdullah wrote:
> So how does PyTables interpret a table with multiple column indices?

If a table has multiple indices, PyTables will use its internal query 
optimizer to try to use these in your queries. It is not always possible 
for PyTables to use all indexes though. Please see:

http://pytables.github.com/usersguide/optimization.html#indexed-searches

for a series of examples where different indexes can be used.
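
To make "using more than one index in a query" concrete, here is a conceptual sketch in plain Python. It is not PyTables code and says nothing about how the PyTables optimizer works internally. It only illustrates the general idea that two independent single-column indexes can each narrow down the candidate rows, and the results can then be intersected. The names rows, build_index and lookup are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy table: (timestamp, symbol, observation)
rows = [
    (100, "AAPL", 1.0),
    (101, "MSFT", 2.0),
    (102, "AAPL", 3.0),
    (103, "AAPL", 4.0),
    (104, "MSFT", 5.0),
]


def build_index(rows, col):
    """Return a sorted list of (value, row_number) pairs for one column."""
    return sorted((row[col], n) for n, row in enumerate(rows))


def lookup(index, low, high):
    """Return the row numbers whose indexed value v has low <= v <= high."""
    values = [v for v, _ in index]
    start = bisect_left(values, low)
    stop = bisect_right(values, high)
    return {n for _, n in index[start:stop]}


ts_index = build_index(rows, 0)   # one index on the timestamp column
sym_index = build_index(rows, 1)  # a separate index on the symbol column

# (symbol == "AAPL") & (101 <= timestamp <= 103)
hits = lookup(sym_index, "AAPL", "AAPL") & lookup(ts_index, 101, 103)
matching = [rows[n] for n in sorted(hits)]
print(matching)  # [(102, 'AAPL', 3.0), (103, 'AAPL', 4.0)]
```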

> The best solution that I've found is creating a hash from the two 
> fields that I am interested in indexing and then indexing that table 
> on that hash.

If several indexes cannot be used in your case, that could be an 
alternative solution for what you are trying to do, yes.

>
> The other solution would be to shard my data by symbol and then index 
> each symbol table by timestamp.

The range of possibilities is really large, yes, but I'd try to avoid 
sharding because it is normally harder to set up and manage. You are 
of course free to try whatever approach you feel is best for you.

HTH,

-- 
Francesc Alted

Re: [Pytables-users] What is the result of calling createIndex() on multiple columns?

From: Aquil H. A. <aqu...@gm...> - 2012-06-27 13:55:58

Hello Francesc,  

Thank you for your response! I guess I need to read the User's Guide cover to cover.  

--  
Aquil H. Abdullah

