From: Luke L. <dur...@gm...> - 2012-09-19 13:38:00
Hi all,

I'm attempting to optimize my HDF5/PyTables application for reading entire columns at a time, and I was wondering what the best way to go about this is.

My HDF5 file has the following properties:

- 400,000+ rows
- 25 columns
- 147 MB in total size
- 1 string column of size 12
- 1 column of type 'Float'
- 23 columns of type 'Float64'

My access pattern for this data is generally to read an entire column out at a time. So, I want to minimize the number of disk accesses this takes and store data contiguously by column. I think the proper way to do this via HDF5 is to use 'chunking.' I'm creating my HDF5 files via PyTables, so I guess using the 'chunkshape' parameter during creation is the correct way to do this?

All of the HDF5 documentation I read discusses 'chunksize' in terms of rows and columns. However, the PyTables 'chunkshape' parameter only takes a single number. I looked through the source and see that I can in fact pass a tuple, which I assume is (row, column) as the HDF5 documentation would suggest. Is it best to use the 'expectedrows' parameter instead of 'chunkshape', or use both?

I have done some debugging/profiling and discovered that my default chunkshape is 321 for this dataset. I have increased this to 1000 and see quite a bit better speeds. I'm sure I could keep changing these numbers and find what is best for this particular dataset. However, I'm seeking a bit more knowledge on how PyTables uses each of these parameters, how they relate to the HDF5 'chunking' concept, and best practices. This will help me to understand how to optimize in the future instead of just for this particular dataset. Is there any documentation on best practices for using the 'expectedrows' and 'chunkshape' parameters?

Thank you for your time,
Luke Lee
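For reference, a minimal sketch of how both parameters are passed at creation time. It uses the modern PyTables method names (open_file, create_table); the 2.x API current at the time spelled these openFile and createTable. The two-column description is made up for illustration, not Luke's actual schema:

import tables

class Row(tables.IsDescription):
    name = tables.StringCol(12)    # the 12-byte string column
    value = tables.Float64Col()    # stand-in for the Float64 columns

with tables.open_file("data.h5", mode="w") as h5:
    # expectedrows lets PyTables choose a sensible chunkshape on its own...
    t1 = h5.create_table("/", "auto", Row, expectedrows=400000)
    # ...while chunkshape forces it explicitly.  For a Table it is a
    # 1-tuple of rows per chunk, because a Table is a one-dimensional
    # dataset with a compound dtype (see Francesc's reply below).
    t2 = h5.create_table("/", "manual", Row, chunkshape=(1000,))
    print(t1.chunkshape, t2.chunkshape)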
From: Luke L. <dur...@gm...> - 2012-09-24 13:28:02
Thanks for the information, guys. I have joined the dev group on Google Groups. I'm sure I can learn a lot just by watching the discussions.

Also, I think for my current situation I'm going to stick with PyTables CArrays. We already have PyTables as a dependency, and we are using it for some other stuff in the project as well. I will definitely keep the stand-alone carray project in mind for the future, though.

I guess by using PyTables CArrays I'm just losing the ability to query, etc.? Are there any other downsides in a PyTables CArray vs. PyTables Table comparison?
From: Ümit S. <uem...@gm...> - 2012-09-24 13:35:21
With CArrays you can only have one specific type for the whole array (int, float, etc.), whereas with a Table each column can have a different type (string, float, etc.). If you want to replicate this with CArrays, you have to create multiple CArrays, one per column type.

I think for storing numerical data where querying isn't that important, CArrays are just fine. But even if you have to query, you can replicate the indexing behavior, for example by adding a second CArray holding the values you want to index.
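A concrete sketch of the one-CArray-per-column layout Ümit describes (the file name, column names, and row count below are hypothetical):

import numpy as np
import tables

nrows = 400000

with tables.open_file("columns.h5", mode="w") as h5:
    # One CArray per column, each with its own atom type.
    name = h5.create_carray("/", "name", tables.StringAtom(12), shape=(nrows,))
    price = h5.create_carray("/", "price", tables.Float64Atom(), shape=(nrows,))
    price[:] = np.random.rand(nrows)

    # Reading a whole column is now one contiguous, column-order read.
    col = price[:]

    # Poor man's query: filter in NumPy instead of Table.where().
    hits = np.nonzero(col > 0.5)[0]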
From: Luke L. <dur...@gm...> - 2012-09-27 16:02:18
Are there any performance issues with relatively large CArrays? For example, say I have a CArray with 300,000 float64s in it. Is there some threshold where I could expect performance to degrade, or anything like that?

I think I remember seeing there was a performance limit with tables of more than 255 columns. I can't find a reference to that, so it's possible I made it up. However, I was wondering if CArrays have some limitation like that.
From: Anthony S. <sc...@gm...> - 2012-09-27 18:11:23
On Thu, Sep 27, 2012 at 11:02 AM, Luke Lee <dur...@gm...> wrote:
> Are there any performance issues with relatively large CArrays? For
> example, say I have a CArray with 300,000 float64s in it. Is there
> some threshold where I could expect performance to degrade?

Hello Luke,

The breakdowns happen when you have too many chunks. However, you are well away from this threshold (which is ~20,000). I believe that PyTables will issue a warning or error when you reach this point anyway.

> I think I remember seeing there was a performance limit with tables
> of more than 255 columns.

Tables are a different kind of dataset. The issue with tables is that the column metadata (names, etc.) needs to fit in the attribute space, whose size is statically limited to 64 KB. In my experience, this limit is reached in the thousands of columns (not hundreds). On the other hand, CArrays don't carry much of any column metadata. CArrays should scale to an infinite number of columns without any issue.

Be Well
Anthony
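A quick back-of-the-envelope check against that ~20,000-chunk threshold, using numbers from earlier in the thread (illustrative, since the real chunkshape depends on how the file was created):

nrows = 300000       # float64 values in the CArray
chunk_rows = 1000    # rows per chunk, as tuned earlier in the thread
nchunks = nrows // chunk_rows
print(nchunks)       # 300 chunks -- two orders of magnitude below ~20,000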
From: Francesc A. <fa...@py...> - 2012-09-28 07:46:23
On 9/27/12 8:10 PM, Anthony Scopatz wrote:
> Tables are a different kind of dataset. The issue with tables is that
> the column metadata (names, etc.) needs to fit in the attribute space,
> whose size is statically limited to 64 KB.

For the record, the PerformanceWarning issued by PyTables has nothing to do with the attribute space, but rather with the fact that putting too many columns in the same table means that you have to retrieve much more data even if you are retrieving only one single column. Also, the internal I/O buffers have to be much larger, and compressors tend to work much less efficiently too.

> On the other hand, CArrays don't carry much of any column metadata.
> CArrays should scale to an infinite number of columns without any issue.

Yeah, they should scale better, although saying they can reach infinite scalability is a bit audacious :) All the CArrays are datasets that have to be saved internally by HDF5, and that requires quite a few resources to keep track of them.

-- Francesc Alted
From: Anthony S. <sc...@gm...> - 2012-09-28 14:19:03
On Fri, Sep 28, 2012 at 2:46 AM, Francesc Alted <fa...@py...> wrote:
> Yeah, they should scale better, although saying they can reach infinite
> scalability is a bit audacious :) All the CArrays are datasets that have
> to be saved internally by HDF5, and that requires quite a few resources
> to keep track of them.

True, but I would argue that this is effectively infinite if you set your chunksize appropriately large. I have never run into an issue with HDF5 where the number of rows or columns on its own becomes too large for arrays. However, it is relatively easy to reach this limit with tables (both in PyTables and in the HDF5 high-level interface). So maybe I should have said "effectively infinite" ;)
From: Francesc A. <fa...@py...> - 2012-09-19 17:55:52
On 9/19/12 3:37 PM, Luke Lee wrote:
> My access pattern for this data is generally to read an entire column
> out at a time. So, I want to minimize the number of disk accesses
> this takes and store data contiguously by column.

To start with, you must be aware that the Table object stores data in row order, not column order. In practice, that means that whenever you want to access a single column, you will need to traverse the *entire* table. I always wished to implement a column-order table in PyTables, but that did not happen in the end.

> I think the proper way to do this via HDF5 is to use 'chunking.' I'm
> creating my HDF5 files via Pytables so I guess using the 'chunkshape'
> parameter during creation is the correct way to do this?

Yes, it is.

> All of the HDF5 documentation I read discusses 'chunksize' in terms of
> rows and columns. However, the Pytables 'chunkshape' parameter only
> takes a single number. I looked through the source and see that I can
> in fact pass a tuple, which I assume is (row, column) as the HDF5
> documentation would suggest.

Not quite. The Table object is actually a one-dimensional beast, but with a 'compound' datatype (which in some ways can be regarded as another dimension, but it is not a 'true' dimension).

> Is it best to use the 'expectedrows' parameter instead of the
> 'chunkshape' or use both?

You can try both. The `expectedrows` parameter was introduced to ease the life of users, and it 'optimizes' the `chunkshape` for 'normal' usage. For specific requirements, playing directly with the `chunkshape` normally gives better results.

> Is there any documentation on best practices for using the
> 'expectedrows' and 'chunkshape' parameters?

Well, there is:

http://pytables.github.com/usersguide/optimization.html

but I'm sure you already know this.

Frankly, if you want to enhance the speed of column retrieval, you are going to need an object that is stored in column order. In this sense, you may want to experiment with the ctable object in the carray package (https://github.com/FrancescAlted/carray). It supports roughly the same capabilities as the Table object, but column-order storage is implemented properly, so a ctable will probably buy you a nice speed-up.

Hope this helps,

-- Francesc Alted
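A short sketch of the ctable approach Francesc suggests, assuming the carray package's API around version 0.5 (the constructor signature may differ between releases, and the column names are illustrative):

import numpy as np
import carray as ca  # https://github.com/FrancescAlted/carray

price = np.random.rand(400000)
volume = np.random.rand(400000)

# A ctable stores each column as its own compressed carray, so reading
# one column only touches that column's chunks (true column order).
ct = ca.ctable((price, volume), names=['price', 'volume'])

col = ct['price'][:]  # decompresses only the 'price' column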
From: Josh A. <jos...@gm...> - 2012-09-20 19:26:15
Depending on your use case, you may be able to get around this by storing each column in its own table. That will effectively store the data in column-first order. Instead of creating a table, you would create a group, which then contains a separate table for each column.

If you want, you can wrap all the functionality you need in a single object that hides the complexity and makes it act just like a single table (see the sketch after this message). I did something similar to this recently and it's worked well. However, I wasn't too concerned with exactly matching the Table API or implementing all of its features.

Creating a more general version that does duplicate the Table class interface and can be included in PyTables is definitely possible and is something I'd like to do, but I've never had the necessary time to dedicate to it.

Hope that helps,
Josh
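A minimal sketch of the layout Josh describes. He suggests one table per column; this version uses CArrays instead for brevity, which gives the same column-contiguous storage. The ColumnStore wrapper and its tiny API are entirely hypothetical:

import tables

class ColumnStore(object):
    """Wrap a group of per-column arrays so it acts a bit like one table."""

    def __init__(self, h5, where, name):
        self.h5 = h5
        self.group = h5.create_group(where, name)

    def add_column(self, name, atom, nrows):
        return self.h5.create_carray(self.group, name, atom, shape=(nrows,))

    def read_column(self, name):
        # One contiguous read per column, independent of the other columns.
        return self.h5.get_node(self.group, name)[:]

with tables.open_file("colstore.h5", mode="w") as h5:
    store = ColumnStore(h5, "/", "data")
    price = store.add_column("price", tables.Float64Atom(), 400000)
    price[:] = 0.0
    print(store.read_column("price")[:5])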
From: Anthony S. <sc...@gm...> - 2012-09-21 00:15:21
Luke,

I'd also like to mention that if you don't want to wait for us to implement this, we will gladly take contributions ;). If you need help getting started, or throughout the process, we are happy to provide that too. Please sign up for PyTables Dev (pyt...@go...) so that we can move implementation discussions away from the users list. Clearly, people would benefit from you taking this upon yourself, should you choose to accept this mission!

Be Well
Anthony
From: Alvaro T. C. <al...@mi...> - 2012-09-21 09:50:34
Hi!

You may want to have a look at, reuse, or combine your approach with the one implemented in pandas (pandas.io.pytables.HDFStore):

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py

(see the _write_array method). A certain liberality in pandas with dtypes (partly induced by the missing-data problem) often leads to VLArrays being created, which might not be the most performant solution. But if the types of the columns in the data frames are guessed right, then CArrays embedded in groups will be used, as far as I understand (as suggested above).

Best,
-á.
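For context, the pandas interface Alvaro mentions looks roughly like this from the user's side (the file path and store key are hypothetical):

import pandas as pd

df = pd.DataFrame({'price': [1.0, 2.0], 'volume': [3.0, 4.0]})

store = pd.HDFStore('frames.h5')
store['quotes'] = df       # dispatches to CArray/VLArray writers internally
roundtrip = store['quotes']
store.close()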
From: Luke L. <dur...@gm...> - 2012-09-21 14:50:10
Hi again,

I haven't been getting the updates via email, so I'm attempting to post again to respond. Thanks everyone for the suggestions. I have a few questions:

1. What is the benefit of using the stand-alone carray project (https://github.com/FrancescAlted/carray) vs. Pytables.carray?

2. I realized my code base never uses the query functionality of a Table. So, I changed all my columns to be just Pytables.carray objects instead. They are all sitting at the top of the hierarchy, just below root. Is this a good idea? I see a big speed increase from this, obviously, because now everything is stored contiguously. However, are there any downsides to doing this? I suppose I could also use EArray, but we never actually change the data once it is stored in HDF5.

3. Is compression automatically happening with the CArray? I know the documentation says that compression is supported, but what do I need to do to enable it? Maybe it's already happening and this is contributing to my big speed improvement.

4. I would certainly love to take a look at contributing something like this in my free time. I don't have a whole lot at this time, so the changes could take a while. I'm sure I need to learn a lot more about the codebase before really giving it a try. I'm going to take a look at this though; thanks for the suggestion!

5. How do I subscribe to the dev mailing list? I only see announcements and users.

6. Any idea why I'm not getting the emails from the list? I signed up 2 days ago and didn't get any of your replies via email.

Thanks!
From: Anthony S. <sc...@gm...> - 2012-09-21 20:08:14
On Fri, Sep 21, 2012 at 10:49 AM, Luke Lee <dur...@gm...> wrote:
> 1. What is the benefit of using the stand-alone carray project
> (https://github.com/FrancescAlted/carray) vs. Pytables.carray?

Hello Luke,

carrays are in-memory, not on disk.

> 2. I realized my code base never uses the query functionality of a
> Table. So, I changed all my columns to be just Pytables.carray objects
> instead. [...] However, are there any downsides to doing this?

If it works for you, then great!

> 3. Is compression automatically happening with the CArray?

For compression to be enabled, you need to define the appropriate filter [1] on either the node or the file.

> 4. I would certainly love to take a look at contributing something
> like this in my free time.

No problem ;)

> 5. How do I subscribe to the dev mailing list?

Here is the dev list site: https://groups.google.com/forum/?fromgroups#!forum/pytables-dev

> 6. Any idea why I'm not getting the emails from the list?

We have been having problems with this list. I think it might be time to transition...

Be Well
Anthony

1. http://pytables.github.com/usersguide/libref/helper_classes.html?highlight=filter#tables.Filters
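To make the answer to question 3 concrete, a minimal sketch of enabling compression through a Filters instance (the file names and the complevel/complib choices are just examples):

import tables

filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)

# Per-node: pass filters when creating the dataset...
with tables.open_file("compressed.h5", mode="w") as h5:
    arr = h5.create_carray("/", "col", tables.Float64Atom(),
                           shape=(400000,), filters=filters)

# ...or file-wide: nodes created afterwards inherit this default.
with tables.open_file("compressed2.h5", mode="w", filters=filters) as h5:
    arr = h5.create_carray("/", "col", tables.Float64Atom(), shape=(400000,))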
From: Francesc A. <fa...@gm...> - 2012-09-21 20:55:20
On 9/21/12 10:07 PM, Anthony Scopatz wrote:
>> 1. What is the benefit of using the stand-alone carray project
>> (https://github.com/FrancescAlted/carray) vs. Pytables.carray?
>
> carrays are in-memory, not on disk.

Well, that was true until version 0.5, where disk persistence was introduced. Now, carray supports both in-memory and on-disk objects, and they work in exactly the same way.

-- Francesc Alted
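A sketch of the on-disk mode Francesc refers to, assuming the rootdir-based persistence API introduced in carray 0.5 (exact function and parameter names may vary by release):

import numpy as np
import carray as ca

# Writing: pass a rootdir to persist the chunks to disk as they fill up.
c = ca.carray(np.arange(1e6), rootdir='mydata.carray')
c.flush()  # make sure the metadata hits the disk

# Reading: re-open later; the object behaves like its in-memory twin.
c2 = ca.open(rootdir='mydata.carray')
print(c2[:5])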
From: Anthony S. <sc...@gm...> - 2012-09-21 21:01:05
On Fri, Sep 21, 2012 at 4:55 PM, Francesc Alted <fa...@gm...> wrote:
> Well, that was true until version 0.5, where disk persistence was
> introduced. Now, carray supports both in-memory and on-disk objects,
> and they work in exactly the same way.

Sorry for not being exactly up to date ;)