From: Francesc A. <fa...@gm...> - 2013-04-14 20:19:42
============================
Announcing Numexpr 2.1RC1
============================

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It is multi-threaded and also supports Intel's VML library, which allows for squeezing the last drop of performance out of your multi-core processors.

What's new
==========

This version adds compatibility with Python 3. A big bunch of thanks to Antonio Valentino for his excellent work on this. I apologize for taking so long to release his contributions.

If you want to know in more detail what has changed in this version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at Google Code:

http://code.google.com/p/numexpr/

This is release candidate 1, so it will not be available in the PyPI repository. I'll post it there when the final version is released.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy!

-- Francesc Alted
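A minimal sketch of the kind of expression Numexpr accelerates (the array names `a` and `b` and their sizes are arbitrary, not taken from the announcement):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# evaluated chunk-by-chunk in numexpr's virtual machine, using all cores,
# without materializing the temporaries that plain NumPy would create
c = ne.evaluate("3*a + 4*b")

print(np.allclose(c, 3*a + 4*b))  # True
```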
From: Anthony S. <sc...@gm...> - 2013-04-11 23:02:48
On Thu, Apr 11, 2013 at 5:16 PM, Shyam Parimal Katti <sp...@ny...> wrote: > Hello Anthony, > > Thank you for replying back with suggestions. > > In response to your suggestions, I am *not reading the data from a file > in the first step, but instead a database*. > Hello Shyam, To put too fine a point on it, hdf5 databases are files. And reading from any kind of file incurs the same disk read overhead. > I did try out your 1st suggestion of doing a table.append(list of > tuples), which took a little more than the executed time I got with the > original code. Can you please guide me in how to chunk the data (that I got > from database and stored as a list of tuples in Python) ? > Ahh, so you should not be using list of tuples. These are Pythonic types and conversion between HDF5 types and Python types is what is slowing you down. You should be passing a numpy structured array into append(). Numpy types are very similar (and often exactly the same as) HDF5 types. For large, continuous, structured data you want to avoid the Python interpreter as much as possible. Use Python here as the glue code to compose a series of fast operations using the APIs exposed by numpy, pytables, etc. Be Well Anthony > > > Thanks, > Shyam > > > Hi Shyam, > > The pattern that you are using to write to a table is basically one for > writing Python data to HDF5. However, your data is already in a machine / > HDF5 native format. Thus what you are doing here is an excessive amount of > work: read data from file -> convert to Python data structures -> covert > back to HDF5 data structures -> write to file. > > When reading from a table you get back a numpy structured array (look them > up on the numpy website). > > Then instead of using rows to write back the data, just use Table.append() > [1] which lets you pass in a bunch of rows simultaneously. (Note: that you > data in this case is too large to fit into memory, so you may have to spit > it up into chunks or use the new iterators which are in the development > branch.) > > Additionally, if all you are doing is copying a table wholesale, you should > use the Table.copy(). [2] Or if you only want to copy some subset based on > a conditional you provide, use whereAppend() [3]. > > Finally, if you want to do math or evaluate expressions on one table to > create a new table, use the Expr class [4]. > > All of these will be waaaaay faster than what you are doing right now. > > Be Well > Anthony > > 1.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append > 2.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy > 3.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend > 4. http://pytables.github.io/usersguide/libref/expr_class.html > > > > > On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <sp...@ny...>wrote: > >> Hello, >> >> I am writing a lot of data(close to 122GB ) to a hdf5 file using >> PyTables. The execution time for writing the query result to the file is >> close to 10 hours, which includes querying the database and then writing to >> the file. When I timed the entire execution, I found that it takes as much >> time to get the data from the database as it takes to write to the hdf5 >> file. 
Here is the small snippet(P.S: the execution time noted below is not >> for 122GB data, but a small subset close to 10GB): >> >> class ContactClass(table.IsDescription): >> name= tb.StringCol(4200) >> address= tb.StringCol(4200) >> emailAddr= tb.StringCol(180) >> phone= tb.StringCol(256) >> >> h5File= table.openFile(<file name>, mode="a", title= "Contacts") >> t= h5File.createTable(h5File.root, 'ContactClass', ContactClass, >> filters=table.Filters(5, 'blosc'), expectedrows=77806938) >> >> resultSet= get data from database >> currRow= t.row >> print("Before appending data: %s" % str(datetime.now())) >> for (attributes ..) in resultSet: >> currRow['name']= attribute[0] >> currRow['address']= attribute[1] >> currRow['emailAddr']= attribute[2] >> currRow['phone']= attribute[3] >> currRow.append() >> print("After done appending: %s" % str(datetime.now())) >> t.flush() >> print("After done flushing: %s" % str(datetime.now())) >> >> .. gives me: >> *Before appending data 2013-04-11 10:42:39.903713 * >> *After done appending: 2013-04-11 11:04:10.002712* >> *After done flushing: 2013-04-11 11:05:50.059893* >> * >> * >> it seems like append() takes a lot of time. Any suggestions on how to >> improve this? >> >> Thanks, >> Shyam >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
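A minimal sketch of the chunked, structured-array append described above, using the ContactClass column sizes quoted in the thread. The file name and the DB-API cursor (`cursor`) are placeholders, and the query is assumed to return columns in the order name, address, emailAddr, phone; this uses the PyTables 2.x camelCase API as in the thread:

```python
import numpy as np
import tables

class ContactClass(tables.IsDescription):
    name      = tables.StringCol(4200, pos=0)
    address   = tables.StringCol(4200, pos=1)
    emailAddr = tables.StringCol(180, pos=2)
    phone     = tables.StringCol(256, pos=3)

# NumPy dtype matching the table description above (same field order)
contact_dtype = np.dtype([('name', 'S4200'), ('address', 'S4200'),
                          ('emailAddr', 'S180'), ('phone', 'S256')])

h5file = tables.openFile('contacts.h5', mode='a', title='Contacts')
t = h5file.createTable(h5file.root, 'ContactClass', ContactClass,
                       filters=tables.Filters(5, 'blosc'),
                       expectedrows=77806938)

chunksize = 100000  # rows per append; tune to the available memory
while True:
    # 'cursor' is the (assumed) DB-API cursor holding the query result;
    # fetchmany() returns a list of tuples, which a structured dtype expects
    rows = cursor.fetchmany(chunksize)
    if not rows:
        break
    # one list-of-tuples -> structured-array conversion per chunk,
    # instead of one Row field assignment per record
    t.append(np.array(rows, dtype=contact_dtype))

t.flush()
h5file.close()
```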
From: Shyam P. K. <sp...@ny...> - 2013-04-11 22:16:45
Hello Anthony, Thank you for replying back with suggestions. In response to your suggestions, I am *not reading the data from a file in the first step, but instead a database*. I did try out your 1st suggestion of doing a table.append(list of tuples), which took a little more than the executed time I got with the original code. Can you please guide me in how to chunk the data (that I got from database and stored as a list of tuples in Python) ? Thanks, Shyam Hi Shyam, The pattern that you are using to write to a table is basically one for writing Python data to HDF5. However, your data is already in a machine / HDF5 native format. Thus what you are doing here is an excessive amount of work: read data from file -> convert to Python data structures -> covert back to HDF5 data structures -> write to file. When reading from a table you get back a numpy structured array (look them up on the numpy website). Then instead of using rows to write back the data, just use Table.append() [1] which lets you pass in a bunch of rows simultaneously. (Note: that you data in this case is too large to fit into memory, so you may have to spit it up into chunks or use the new iterators which are in the development branch.) Additionally, if all you are doing is copying a table wholesale, you should use the Table.copy(). [2] Or if you only want to copy some subset based on a conditional you provide, use whereAppend() [3]. Finally, if you want to do math or evaluate expressions on one table to create a new table, use the Expr class [4]. All of these will be waaaaay faster than what you are doing right now. Be Well Anthony 1.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append 2.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy 3.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend 4. http://pytables.github.io/usersguide/libref/expr_class.html On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <sp...@ny...>wrote: > Hello, > > I am writing a lot of data(close to 122GB ) to a hdf5 file using PyTables. > The execution time for writing the query result to the file is close to 10 > hours, which includes querying the database and then writing to the file. > When I timed the entire execution, I found that it takes as much time to > get the data from the database as it takes to write to the hdf5 file. Here > is the small snippet(P.S: the execution time noted below is not for 122GB > data, but a small subset close to 10GB): > > class ContactClass(table.IsDescription): > name= tb.StringCol(4200) > address= tb.StringCol(4200) > emailAddr= tb.StringCol(180) > phone= tb.StringCol(256) > > h5File= table.openFile(<file name>, mode="a", title= "Contacts") > t= h5File.createTable(h5File.root, 'ContactClass', ContactClass, > filters=table.Filters(5, 'blosc'), expectedrows=77806938) > > resultSet= get data from database > currRow= t.row > print("Before appending data: %s" % str(datetime.now())) > for (attributes ..) in resultSet: > currRow['name']= attribute[0] > currRow['address']= attribute[1] > currRow['emailAddr']= attribute[2] > currRow['phone']= attribute[3] > currRow.append() > print("After done appending: %s" % str(datetime.now())) > t.flush() > print("After done flushing: %s" % str(datetime.now())) > > .. 
gives me: > *Before appending data 2013-04-11 10:42:39.903713 * > *After done appending: 2013-04-11 11:04:10.002712* > *After done flushing: 2013-04-11 11:05:50.059893* > * > * > it seems like append() takes a lot of time. Any suggestions on how to > improve this? > > Thanks, > Shyam > > |
From: Anthony S. <sc...@gm...> - 2013-04-11 18:14:41
Hi Shyam,

The pattern that you are using to write to a table is basically one for writing Python data to HDF5. However, your data is already in a machine / HDF5 native format. Thus what you are doing here is an excessive amount of work: read data from file -> convert to Python data structures -> convert back to HDF5 data structures -> write to file.

When reading from a table you get back a numpy structured array (look them up on the numpy website).

Then, instead of using rows to write back the data, just use Table.append() [1], which lets you pass in a bunch of rows simultaneously. (Note that your data in this case is too large to fit into memory, so you may have to split it up into chunks or use the new iterators which are in the development branch.)

Additionally, if all you are doing is copying a table wholesale, you should use Table.copy() [2]. Or if you only want to copy some subset based on a conditional you provide, use whereAppend() [3].

Finally, if you want to do math or evaluate expressions on one table to create a new table, use the Expr class [4].

All of these will be waaaaay faster than what you are doing right now.

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
4. http://pytables.github.io/usersguide/libref/expr_class.html

On Thu, Apr 11, 2013 at 11:23 AM, Shyam Parimal Katti <sp...@ny...> wrote:

> Hello,
>
> I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables.
> The execution time for writing the query result to the file is close to 10
> hours, which includes querying the database and then writing to the file.
> When I timed the entire execution, I found that it takes as much time to
> get the data from the database as it takes to write to the HDF5 file. Here
> is a small snippet (P.S.: the execution time noted below is not for the
> 122 GB of data, but a small subset close to 10 GB):
>
>     class ContactClass(table.IsDescription):
>         name= tb.StringCol(4200)
>         address= tb.StringCol(4200)
>         emailAddr= tb.StringCol(180)
>         phone= tb.StringCol(256)
>
>     h5File= table.openFile(<file name>, mode="a", title= "Contacts")
>     t= h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>         filters=table.Filters(5, 'blosc'), expectedrows=77806938)
>
>     resultSet= get data from database
>     currRow= t.row
>     print("Before appending data: %s" % str(datetime.now()))
>     for (attributes ..) in resultSet:
>         currRow['name']= attribute[0]
>         currRow['address']= attribute[1]
>         currRow['emailAddr']= attribute[2]
>         currRow['phone']= attribute[3]
>         currRow.append()
>     print("After done appending: %s" % str(datetime.now()))
>     t.flush()
>     print("After done flushing: %s" % str(datetime.now()))
>
> .. which gives me:
>
>     Before appending data: 2013-04-11 10:42:39.903713
>     After done appending:  2013-04-11 11:04:10.002712
>     After done flushing:   2013-04-11 11:05:50.059893
>
> It seems like append() takes a lot of time. Any suggestions on how to
> improve this?
>
> Thanks,
> Shyam
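For the copy-style cases in [2] and [3], a minimal sketch; the file and table names and the e-mail condition are placeholders, and the empty clone via copy(stop=0) is just one convenient way to reuse the source table's description:

```python
import tables

h5file = tables.openFile('contacts.h5', mode='a')
src = h5file.root.ContactClass

# 1. wholesale copy: stays inside HDF5, no Python-level row loop
backup = src.copy(h5file.root, 'ContactsBackup')

# 2. conditional copy: clone an empty table, then append only matching rows
gmail = src.copy(h5file.root, 'GmailContacts', stop=0)      # empty clone
src.whereAppend(gmail, 'emailAddr == "someone@gmail.com"')  # hypothetical condition

h5file.close()
```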
From: Shyam P. K. <sp...@ny...> - 2013-04-11 17:18:05
Hello,

I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables. The execution time for writing the query result to the file is close to 10 hours, which includes querying the database and then writing to the file. When I timed the entire execution, I found that it takes as much time to get the data from the database as it takes to write to the HDF5 file. Here is a small snippet (P.S.: the execution time noted below is not for the 122 GB of data, but a small subset close to 10 GB):

```python
class ContactClass(table.IsDescription):
    name= tb.StringCol(4200)
    address= tb.StringCol(4200)
    emailAddr= tb.StringCol(180)
    phone= tb.StringCol(256)

h5File= table.openFile(<file name>, mode="a", title= "Contacts")
t= h5File.createTable(h5File.root, 'ContactClass', ContactClass,
    filters=table.Filters(5, 'blosc'), expectedrows=77806938)

resultSet= get data from database
currRow= t.row
print("Before appending data: %s" % str(datetime.now()))
for (attributes ..) in resultSet:
    currRow['name']= attribute[0]
    currRow['address']= attribute[1]
    currRow['emailAddr']= attribute[2]
    currRow['phone']= attribute[3]
    currRow.append()
print("After done appending: %s" % str(datetime.now()))
t.flush()
print("After done flushing: %s" % str(datetime.now()))
```

.. which gives me:

    Before appending data: 2013-04-11 10:42:39.903713
    After done appending:  2013-04-11 11:04:10.002712
    After done flushing:   2013-04-11 11:05:50.059893

It seems like append() takes a lot of time. Any suggestions on how to improve this?

Thanks,
Shyam
From: Anthony S. <sc...@gm...> - 2013-04-11 16:15:12
Thanks for bringing this up, Julio. Hmm I don't think that this exists currently, but since there are readWhere() and readSorted() it shouldn't be too hard to implement. I have opened issue #225 to this effect. Pull requests welcome! https://github.com/PyTables/PyTables/issues/225 Be Well Anthony On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker <lou...@no...>wrote: > I am also interested in the this capability, if it exists in some way... > > Lou > > On Apr 10, 2013, at 12:35 PM, Julio Trevisan <jul...@gm...> > wrote: > > > Hi, > > > > Is there a way that I could have the ability of readWhere (i.e., specify > condition, and fast result) but also using a CSIndex so that the rows come > sorted in a particular order? > > > > I checked readSorted() but it is iterative and does not allow to specify > a condition. > > > > Julio > > > ------------------------------------------------------------------------------ > > Precog is a next-generation analytics platform capable of advanced > > analytics on semi-structured data. The platform includes APIs for > building > > apps and a phenomenal toolset for data science. Developers can use > > our toolset for easy data analysis & visualization. Get a free account! > > > http://www2.precog.com/precogplatform/slashdotnewsletter_______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > ---------------------------------------------------------------------------- > | Dr. Louis J. Wicker > | NSSL/WRDD Rm 4366 > | National Weather Center > | 120 David L. Boren Boulevard, Norman, OK 73072 > | > | E-mail: Lou...@no... > | HTTP: http://www.nssl.noaa.gov/~lwicker > | Phone: (405) 325-6340 > | Fax: (405) 325-6780 > | > | > I For every complex problem, there is a solution that is simple, > | neat, and wrong. > | > | -- H. L. Mencken > | > > ---------------------------------------------------------------------------- > | "The contents of this message are mine personally and > | do not reflect any position of the Government or NOAA." > > ---------------------------------------------------------------------------- > > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
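Until something like that lands, one workable sketch is to do the filtered read and then sort the resulting structured array in memory; `table` is assumed to be an already-open Table, and the column and condition names here are made up:

```python
import numpy as np

# filtered read (fast, can use indexes), then an in-memory sort on the key column
rows = table.readWhere('(price > 10.0) & (volume > 0)')
rows_sorted = rows[np.argsort(rows['timestamp'], kind='mergesort')]  # stable sort
```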
From: Dr. L. W. <lou...@no...> - 2013-04-10 18:03:02
I am also interested in the this capability, if it exists in some way... Lou On Apr 10, 2013, at 12:35 PM, Julio Trevisan <jul...@gm...> wrote: > Hi, > > Is there a way that I could have the ability of readWhere (i.e., specify condition, and fast result) but also using a CSIndex so that the rows come sorted in a particular order? > > I checked readSorted() but it is iterative and does not allow to specify a condition. > > Julio > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter_______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users ---------------------------------------------------------------------------- | Dr. Louis J. Wicker | NSSL/WRDD Rm 4366 | National Weather Center | 120 David L. Boren Boulevard, Norman, OK 73072 | | E-mail: Lou...@no... | HTTP: http://www.nssl.noaa.gov/~lwicker | Phone: (405) 325-6340 | Fax: (405) 325-6780 | | I For every complex problem, there is a solution that is simple, | neat, and wrong. | | -- H. L. Mencken | ---------------------------------------------------------------------------- | "The contents of this message are mine personally and | do not reflect any position of the Government or NOAA." ---------------------------------------------------------------------------- |
From: Julio T. <jul...@gm...> - 2013-04-10 17:35:59
Hi,

Is there a way to get the behavior of readWhere() (i.e., specify a condition and get the result quickly) while also using a CSIndex, so that the rows come back sorted in a particular order?

I checked readSorted(), but it is iterative and does not allow specifying a condition.

Julio
From: Julio T. <jul...@gm...> - 2013-04-10 17:12:01
Thanks again :) On Wed, Apr 10, 2013 at 1:53 PM, Anthony Scopatz <sc...@gm...> wrote: > On Wed, Apr 10, 2013 at 11:40 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi Anthony >> >> Thanks again.* *If it is a problem related to floating-point precision, >> I might use an Int64Col instead, since I don't need the timestamp >> miliseconds. >> > > Another good plan since integers are exact ;) > > >> >> >> Julio >> >> >> >> >> On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...>wrote: >> >>> On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm... >>> > wrote: >>> >>>> Hi, >>>> >>>> I am using a Time64Col called "timestamp" in a condition, and I noticed >>>> that the condition does not work (i.e., no rows are selected) if I write >>>> something as: >>>> >>>> for row in node.where("timestamp == %f" % t): >>>> ... >>>> >>>> However, I had this idea of dividing the values by, say 1000, and it >>>> does work: >>>> >>>> for row in node.where("timestamp/1000 == %f" % t/1000): >>>> ... >>>> >>>> However, this doesn't seem to be an elegant solution. Please could >>>> someone point out a better solution to this? >>>> >>> >>> Hello Julio, >>> >>> While this may not be the most elegant solution it is probably one of >>> the most appropriate. The problem here likely stems from the fact that >>> floating point numbers (which are how Time64Cols are stored) are not exact >>> representations of the desired value. For example: >>> >>> In [1]: 1.1 + 2.2 >>> Out[1]: 3.3000000000000003 >>> >>> So when you divide my some constant order of magnitude, you are chopping >>> off the error associated with floating point precision. You are creating >>> a bin of this constant's size around the target value that is "close >>> enough" to count as equivalent. There are other mechanisms for alleviating >>> this issue: dividing and multiplying back (x/10)*10 == y, right shifting >>> (platform dependent), taking the difference and have it be less than some >>> tolerance x - y <= t. You get the idea. You have to mitigate this effect >>> some how. >>> >>> For more information please refer to: >>> http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html >>> >>> >>>> Could this be related to the fact that my column name is "timestamp"? I >>>> ask this because I use a program called HDFView to brose the HDF5 file. >>>> This program refuses to show the first column when it is called >>>> "timestamp", but shows it when it is called "id". I don't know if the facts >>>> are related or not. >>>> >>> >>> This is probably unrelated. >>> >>> Be Well >>> Anthony >>> >>> >>>> >>>> I don't know if this is useful information, but the conversion of a >>>> typical "t" to string gives something like this: >>>> >>>> >> print "%f" % t >>>> 1365597435.000000 >>>> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> Precog is a next-generation analytics platform capable of advanced >>>> analytics on semi-structured data. The platform includes APIs for >>>> building >>>> apps and a phenomenal toolset for data science. Developers can use >>>> our toolset for easy data analysis & visualization. Get a free account! >>>> http://www2.precog.com/precogplatform/slashdotnewsletter >>>> _______________________________________________ >>>> Pytables-users mailing list >>>> Pyt...@li... 
>>>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> >>>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Precog is a next-generation analytics platform capable of advanced >>> analytics on semi-structured data. The platform includes APIs for >>> building >>> apps and a phenomenal toolset for data science. Developers can use >>> our toolset for easy data analysis & visualization. Get a free account! >>> http://www2.precog.com/precogplatform/slashdotnewsletter >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-04-10 16:54:22
On Wed, Apr 10, 2013 at 11:40 AM, Julio Trevisan <jul...@gm...>wrote: > Hi Anthony > > Thanks again.* *If it is a problem related to floating-point precision, I > might use an Int64Col instead, since I don't need the timestamp miliseconds. > Another good plan since integers are exact ;) > > > Julio > > > > > On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...>wrote: > >> On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...>wrote: >> >>> Hi, >>> >>> I am using a Time64Col called "timestamp" in a condition, and I noticed >>> that the condition does not work (i.e., no rows are selected) if I write >>> something as: >>> >>> for row in node.where("timestamp == %f" % t): >>> ... >>> >>> However, I had this idea of dividing the values by, say 1000, and it >>> does work: >>> >>> for row in node.where("timestamp/1000 == %f" % t/1000): >>> ... >>> >>> However, this doesn't seem to be an elegant solution. Please could >>> someone point out a better solution to this? >>> >> >> Hello Julio, >> >> While this may not be the most elegant solution it is probably one of the >> most appropriate. The problem here likely stems from the fact that >> floating point numbers (which are how Time64Cols are stored) are not exact >> representations of the desired value. For example: >> >> In [1]: 1.1 + 2.2 >> Out[1]: 3.3000000000000003 >> >> So when you divide my some constant order of magnitude, you are chopping >> off the error associated with floating point precision. You are creating >> a bin of this constant's size around the target value that is "close >> enough" to count as equivalent. There are other mechanisms for alleviating >> this issue: dividing and multiplying back (x/10)*10 == y, right shifting >> (platform dependent), taking the difference and have it be less than some >> tolerance x - y <= t. You get the idea. You have to mitigate this effect >> some how. >> >> For more information please refer to: >> http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html >> >> >>> Could this be related to the fact that my column name is "timestamp"? I >>> ask this because I use a program called HDFView to brose the HDF5 file. >>> This program refuses to show the first column when it is called >>> "timestamp", but shows it when it is called "id". I don't know if the facts >>> are related or not. >>> >> >> This is probably unrelated. >> >> Be Well >> Anthony >> >> >>> >>> I don't know if this is useful information, but the conversion of a >>> typical "t" to string gives something like this: >>> >>> >> print "%f" % t >>> 1365597435.000000 >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Precog is a next-generation analytics platform capable of advanced >>> analytics on semi-structured data. The platform includes APIs for >>> building >>> apps and a phenomenal toolset for data science. Developers can use >>> our toolset for easy data analysis & visualization. Get a free account! >>> http://www2.precog.com/precogplatform/slashdotnewsletter >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. 
Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Julio T. <jul...@gm...> - 2013-04-10 16:40:47
Hi Anthony Thanks again.* *If it is a problem related to floating-point precision, I might use an Int64Col instead, since I don't need the timestamp miliseconds. Julio On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...> wrote: > On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi, >> >> I am using a Time64Col called "timestamp" in a condition, and I noticed >> that the condition does not work (i.e., no rows are selected) if I write >> something as: >> >> for row in node.where("timestamp == %f" % t): >> ... >> >> However, I had this idea of dividing the values by, say 1000, and it does >> work: >> >> for row in node.where("timestamp/1000 == %f" % t/1000): >> ... >> >> However, this doesn't seem to be an elegant solution. Please could >> someone point out a better solution to this? >> > > Hello Julio, > > While this may not be the most elegant solution it is probably one of the > most appropriate. The problem here likely stems from the fact that > floating point numbers (which are how Time64Cols are stored) are not exact > representations of the desired value. For example: > > In [1]: 1.1 + 2.2 > Out[1]: 3.3000000000000003 > > So when you divide my some constant order of magnitude, you are chopping > off the error associated with floating point precision. You are creating > a bin of this constant's size around the target value that is "close > enough" to count as equivalent. There are other mechanisms for alleviating > this issue: dividing and multiplying back (x/10)*10 == y, right shifting > (platform dependent), taking the difference and have it be less than some > tolerance x - y <= t. You get the idea. You have to mitigate this effect > some how. > > For more information please refer to: > http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html > > >> Could this be related to the fact that my column name is "timestamp"? I >> ask this because I use a program called HDFView to brose the HDF5 file. >> This program refuses to show the first column when it is called >> "timestamp", but shows it when it is called "id". I don't know if the facts >> are related or not. >> > > This is probably unrelated. > > Be Well > Anthony > > >> >> I don't know if this is useful information, but the conversion of a >> typical "t" to string gives something like this: >> >> >> print "%f" % t >> 1365597435.000000 >> >> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... 
> https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-04-10 16:17:40
On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...> wrote:

> Hi,
>
> I am using a Time64Col called "timestamp" in a condition, and I noticed
> that the condition does not work (i.e., no rows are selected) if I write
> something as:
>
>     for row in node.where("timestamp == %f" % t):
>         ...
>
> However, I had this idea of dividing the values by, say 1000, and it does
> work:
>
>     for row in node.where("timestamp/1000 == %f" % t/1000):
>         ...
>
> However, this doesn't seem to be an elegant solution. Please could someone
> point out a better solution to this?

Hello Julio,

While this may not be the most elegant solution, it is probably one of the most appropriate. The problem here likely stems from the fact that floating point numbers (which are how Time64Cols are stored) are not exact representations of the desired value. For example:

    In [1]: 1.1 + 2.2
    Out[1]: 3.3000000000000003

So when you divide by some constant order of magnitude, you are chopping off the error associated with floating point precision. You are creating a bin of this constant's size around the target value that is "close enough" to count as equivalent. There are other mechanisms for alleviating this issue: dividing and multiplying back ((x/10)*10 == y), right shifting (platform dependent), or taking the difference and requiring it to be less than some tolerance (x - y <= t). You get the idea. You have to mitigate this effect somehow.

For more information please refer to: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

> Could this be related to the fact that my column name is "timestamp"? I
> ask this because I use a program called HDFView to browse the HDF5 file.
> This program refuses to show the first column when it is called
> "timestamp", but shows it when it is called "id". I don't know if the
> facts are related or not.

This is probably unrelated.

Be Well
Anthony

> I don't know if this is useful information, but the conversion of a
> typical "t" to string gives something like this:
>
>     >> print "%f" % t
>     1365597435.000000
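In terms of Julio's snippet (with `node` and `t` as defined there), the tolerance approach could look like this; the half-millisecond tolerance is an arbitrary choice for data that only has one-second resolution:

```python
tol = 0.0005  # seconds; anything well below the data's real resolution works
cond = "(timestamp > %.6f) & (timestamp < %.6f)" % (t - tol, t + tol)
matches = [row['timestamp'] for row in node.where(cond)]
```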
From: Julio T. <jul...@gm...> - 2013-04-10 12:44:20
Hi,

I am using a Time64Col called "timestamp" in a condition, and I noticed that the condition does not work (i.e., no rows are selected) if I write something like:

```python
for row in node.where("timestamp == %f" % t):
    ...
```

However, I had the idea of dividing the values by, say, 1000, and that does work:

```python
for row in node.where("timestamp/1000 == %f" % (t/1000)):
    ...
```

However, this doesn't seem to be an elegant solution. Could someone please point out a better solution to this?

Could this be related to the fact that my column name is "timestamp"? I ask because I use a program called HDFView to browse the HDF5 file. This program refuses to show the first column when it is called "timestamp", but shows it when it is called "id". I don't know if the facts are related or not.

I don't know if this is useful information, but the conversion of a typical "t" to a string gives something like this:

    >> print "%f" % t
    1365597435.000000
From: Anthony S. <sc...@gm...> - 2013-04-08 18:57:03
I am glad =) On Apr 8, 2013 12:44 PM, "Julio Trevisan" <jul...@gm...> wrote: > Hey Anthony > > Thanks a lot for this. Your method with map() works around 30000 times > faster! > > > BEFORE: > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.096931 seconds to do > everything else > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.780372 seconds to ZIP > > > AFTER: > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.073058 seconds to do > everything else > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.000024 seconds to ZIP > > > > > > On Fri, Mar 22, 2013 at 12:35 PM, Anthony Scopatz <sc...@gm...>wrote: > >> On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: >> >>> Hi, >>> >>> I just joined this list, I am using PyTables for my project and it works >>> great and fast. >>> >>> I am just trying to optimize some parts of the program and I noticed >>> that zipping the tuples to get one tuple per column takes much longer than >>> reading the data itself. The thing is that readWhere() returns one tuple >>> per row, whereas I I need one tuple per column, so I have to use the zip() >>> function to achieve this. Is there a way to skip this zip() operation? >>> Please see below: >>> >>> >>> def quote_GetData(self, period, name, dt1, dt2): >>> """Returns timedata.Quotes object. >>> >>> Arguments: >>> period -- value from within infogetter.QuotePeriod >>> name -- quote symbol >>> dt1, dt2 -- datetime.datetime or timestamp values >>> >>> """ >>> t = time.time() >>> node = self.quote_GetNode(period, name) >>> ts1 = misc.datetime2timestamp(dt1) >>> ts2 = misc.datetime2timestamp(dt2) >>> >>> L = node.readWhere( \ >>> "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ >>> (ts1/1000, ts2/1000)) >>> rowNum = len(L) >>> Q = timedata.Quotes() >>> print "%s: took %f seconds to do everything else" % (name, >>> time.time()-t) >>> >>> t = time.time() >>> if rowNum > 0: >>> (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ >>> Q.numTrades) = zip(*L) >>> print "%s: took %f seconds to ZIP" % (name, time.time()-t) >>> return Q >>> >>> *And the printout:* >>> BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else >>> BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP >>> >> >> Hi Julio, >> >> The problem here isn't zip (packing and un-packing are generally >> fast operations -- they happen *all* the time in Python). Nor is the >> problem specifically with PyTables. Rather this is an issue with how you >> are using numpy structured arrays (look them up). Basically, this is slow >> because you are creating a list of column tuples where every element is a >> Python object of the corresponding type. For example upcasting every >> 32-bit integer to a Python int is very expensive! >> >> What you *should* be doing is keeping the columns as numpy arrays, which >> keeps the memory layout small, continuous, fast, and if done right does not >> require a copy (which you are doing now). >> >> The value of L here is a structured array. 
So say I have some >> other structured array with 4 fields, the right way to do this is to pull >> out each field individually by indexing >> >> a, b, c, d = x['a'], x['b'], x['c'], x['d'] >> >> or more generally (for all fields): >> >> a, b, c, d = map(lambda x: i[x], i.dtype.names) >> >> or for some list of fields: >> >> a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) >> >> Timing both your original method and the new one gives: >> >> In [47]: timeit a, b, c, d = zip(*i) >> 1000 loops, best of 3: 1.3 ms per loop >> >> In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) >> 100000 loops, best of 3: 2.3 µs per loop >> >> So the method I propose is 500x-1000x times faster. Using numpy >> idiomatically is very important! >> >> Be Well >> Anthony >> >> >>> >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Everyone hates slow websites. So do we. >>> Make your web apps faster with AppDynamics >>> Download AppDynamics Lite for free today: >>> http://p.sf.net/sfu/appdyn_d2d_mar >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Minimize network downtime and maximize team effectiveness. > Reduce network management and security costs.Learn how to hire > the most talented Cisco Certified professionals. Visit the > Employer Resources Portal > http://www.cisco.com/web/learning/employer_resources/index.html > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Julio T. <jul...@gm...> - 2013-04-08 17:43:52
Hey Anthony Thanks a lot for this. Your method with map() works around 30000 times faster! BEFORE: (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.096931 seconds to do everything else (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.780372 seconds to ZIP AFTER: (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.073058 seconds to do everything else (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.000024 seconds to ZIP On Fri, Mar 22, 2013 at 12:35 PM, Anthony Scopatz <sc...@gm...> wrote: > On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi, >> >> I just joined this list, I am using PyTables for my project and it works >> great and fast. >> >> I am just trying to optimize some parts of the program and I noticed that >> zipping the tuples to get one tuple per column takes much longer than >> reading the data itself. The thing is that readWhere() returns one tuple >> per row, whereas I I need one tuple per column, so I have to use the zip() >> function to achieve this. Is there a way to skip this zip() operation? >> Please see below: >> >> >> def quote_GetData(self, period, name, dt1, dt2): >> """Returns timedata.Quotes object. >> >> Arguments: >> period -- value from within infogetter.QuotePeriod >> name -- quote symbol >> dt1, dt2 -- datetime.datetime or timestamp values >> >> """ >> t = time.time() >> node = self.quote_GetNode(period, name) >> ts1 = misc.datetime2timestamp(dt1) >> ts2 = misc.datetime2timestamp(dt2) >> >> L = node.readWhere( \ >> "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ >> (ts1/1000, ts2/1000)) >> rowNum = len(L) >> Q = timedata.Quotes() >> print "%s: took %f seconds to do everything else" % (name, >> time.time()-t) >> >> t = time.time() >> if rowNum > 0: >> (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ >> Q.numTrades) = zip(*L) >> print "%s: took %f seconds to ZIP" % (name, time.time()-t) >> return Q >> >> *And the printout:* >> BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else >> BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP >> > > Hi Julio, > > The problem here isn't zip (packing and un-packing are generally > fast operations -- they happen *all* the time in Python). Nor is the > problem specifically with PyTables. Rather this is an issue with how you > are using numpy structured arrays (look them up). Basically, this is slow > because you are creating a list of column tuples where every element is a > Python object of the corresponding type. For example upcasting every > 32-bit integer to a Python int is very expensive! > > What you *should* be doing is keeping the columns as numpy arrays, which > keeps the memory layout small, continuous, fast, and if done right does not > require a copy (which you are doing now). > > The value of L here is a structured array. So say I have some > other structured array with 4 fields, the right way to do this is to pull > out each field individually by indexing > > a, b, c, d = x['a'], x['b'], x['c'], x['d'] > > or more generally (for all fields): > > a, b, c, d = map(lambda x: i[x], i.dtype.names) > > or for some list of fields: > > a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) > > Timing both your original method and the new one gives: > > In [47]: timeit a, b, c, d = zip(*i) > 1000 loops, best of 3: 1.3 ms per loop > > In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) > 100000 loops, best of 3: 2.3 µs per loop > > So the method I propose is 500x-1000x times faster. Using numpy > idiomatically is very important! 
> > Be Well > Anthony > > >> >> >> >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-03-22 15:35:36
On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: > Hi, > > I just joined this list, I am using PyTables for my project and it works > great and fast. > > I am just trying to optimize some parts of the program and I noticed that > zipping the tuples to get one tuple per column takes much longer than > reading the data itself. The thing is that readWhere() returns one tuple > per row, whereas I I need one tuple per column, so I have to use the zip() > function to achieve this. Is there a way to skip this zip() operation? > Please see below: > > > def quote_GetData(self, period, name, dt1, dt2): > """Returns timedata.Quotes object. > > Arguments: > period -- value from within infogetter.QuotePeriod > name -- quote symbol > dt1, dt2 -- datetime.datetime or timestamp values > > """ > t = time.time() > node = self.quote_GetNode(period, name) > ts1 = misc.datetime2timestamp(dt1) > ts2 = misc.datetime2timestamp(dt2) > > L = node.readWhere( \ > "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ > (ts1/1000, ts2/1000)) > rowNum = len(L) > Q = timedata.Quotes() > print "%s: took %f seconds to do everything else" % (name, > time.time()-t) > > t = time.time() > if rowNum > 0: > (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ > Q.numTrades) = zip(*L) > print "%s: took %f seconds to ZIP" % (name, time.time()-t) > return Q > > *And the printout:* > BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else > BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP > Hi Julio, The problem here isn't zip (packing and un-packing are generally fast operations -- they happen *all* the time in Python). Nor is the problem specifically with PyTables. Rather this is an issue with how you are using numpy structured arrays (look them up). Basically, this is slow because you are creating a list of column tuples where every element is a Python object of the corresponding type. For example upcasting every 32-bit integer to a Python int is very expensive! What you *should* be doing is keeping the columns as numpy arrays, which keeps the memory layout small, continuous, fast, and if done right does not require a copy (which you are doing now). The value of L here is a structured array. So say I have some other structured array with 4 fields, the right way to do this is to pull out each field individually by indexing a, b, c, d = x['a'], x['b'], x['c'], x['d'] or more generally (for all fields): a, b, c, d = map(lambda x: i[x], i.dtype.names) or for some list of fields: a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) Timing both your original method and the new one gives: In [47]: timeit a, b, c, d = zip(*i) 1000 loops, best of 3: 1.3 ms per loop In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) 100000 loops, best of 3: 2.3 µs per loop So the method I propose is 500x-1000x times faster. Using numpy idiomatically is very important! Be Well Anthony > > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
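A self-contained illustration of the field-extraction idiom above, with a small structured array standing in for what readWhere() returns (the field names are made up):

```python
import numpy as np

# stand-in for the structured array that readWhere() returns
L = np.zeros(5, dtype=[('timestamp', 'f8'), ('open', 'f8'),
                       ('close', 'f8'), ('volume', 'i8')])

# each name indexes out one column as a NumPy array view -- no per-element
# Python objects are created, unlike zip(*L)
timestamp, open_, close, volume = (L[name] for name in L.dtype.names)
```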
From: Julio T. <jul...@gm...> - 2013-03-22 12:11:17
Hi,

I just joined this list. I am using PyTables for my project and it works great and fast.

I am trying to optimize some parts of the program, and I noticed that zipping the tuples to get one tuple per column takes much longer than reading the data itself. The thing is that readWhere() returns one tuple per row, whereas I need one tuple per column, so I have to use the zip() function to achieve this. Is there a way to skip this zip() operation? Please see below:

```python
def quote_GetData(self, period, name, dt1, dt2):
    """Returns timedata.Quotes object.

    Arguments:
    period -- value from within infogetter.QuotePeriod
    name -- quote symbol
    dt1, dt2 -- datetime.datetime or timestamp values
    """
    t = time.time()
    node = self.quote_GetNode(period, name)
    ts1 = misc.datetime2timestamp(dt1)
    ts2 = misc.datetime2timestamp(dt2)

    L = node.readWhere(
        "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" %
        (ts1/1000, ts2/1000))
    rowNum = len(L)
    Q = timedata.Quotes()
    print "%s: took %f seconds to do everything else" % (name, time.time()-t)

    t = time.time()
    if rowNum > 0:
        (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume,
         Q.numTrades) = zip(*L)
    print "%s: took %f seconds to ZIP" % (name, time.time()-t)
    return Q
```

And the printout:

    BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else
    BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP
From: Anthony S. <sc...@gm...> - 2013-03-18 20:49:42
On Fri, Mar 15, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...>wrote: > Looks similar to https://github.com/PyTables/PyTables/issues/206 Yes this is the same problem. I proposed a test on the issue if someone wants to try working on it, that would be great! > > -- > Thadeus > > > > On Fri, Mar 15, 2013 at 6:58 PM, Dmitry Fedorov <fe...@ec...>wrote: > >> Hi, >> >> I'm trying to index a simple uint32 column and search if a value is >> there. Although during the in-kernel query i get an error like this: >> >> Traceback (most recent call last): >> File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> >> r = [ row.nrow for row in it ] >> File "tableExtension.pyx", line 858, in >> tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) >> File "tableExtension.pyx", line 879, in >> tables.tableExtension.Row.__next__indexed >> (tables\tableExtension.c:7922) >> AssertionError >> >> The second time I perform the same query I get no errors but also I >> got no results... >> >> I get this on python 2.7, pytables 2.4.0 and 2.3.1 under windows 64 (7 >> and 8). I'm using pytables distro from >> <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> >> >> I'm sure I've had code like this running just fine on older pytables >> 2.2.X (with pro) and older python... >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Here's table creation code: >> >> import tables >> import numpy as np >> import random >> >> #initalizes column types >> class Columns(tables.IsDescription): >> imageid = tables.UInt32Col(pos=1) >> feature = tables.Float64Col(pos=2, shape=(30,)) >> >> #creates Table >> h5file=tables.openFile('features.h5','a', title='features') >> table = h5file.createTable('/', 'values', Columns, >> expectedrows=1000000000) >> table.flush() >> >> Columns = table.row >> >> #appends values to table >> for i in xrange(100): >> Columns['imageid'] = i >> Columns['feature'] = [np.random.random_sample((30,))] >> Columns.append() >> >> table.flush() >> >> table.cols.imageid.createCSIndex() >> h5file.close() >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Here's table query code: >> >> import time >> import tables >> >> #opens tables >> h5file=tables.openFile('features.h5','a') >> table=h5file.root.values >> >> it = table.where('imageid == 8') >> r = [ row.nrow for row in it ] >> >> print r >> >> ----- >> Thank you, >> Dmitry >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Thadeus B. <tha...@th...> - 2013-03-16 00:27:34
|
Looks similar to https://github.com/PyTables/PyTables/issues/206 -- Thadeus On Fri, Mar 15, 2013 at 6:58 PM, Dmitry Fedorov <fe...@ec...>wrote: > Hi, > > I'm trying to index a simple uint32 column and search if a value is > there. Although during the in-kernel query i get an error like this: > > Traceback (most recent call last): > File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> > r = [ row.nrow for row in it ] > File "tableExtension.pyx", line 858, in > tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) > File "tableExtension.pyx", line 879, in > tables.tableExtension.Row.__next__indexed > (tables\tableExtension.c:7922) > AssertionError > > The second time I perform the same query I get no errors but also I > got no results... > > I get this on python 2.7, pytables 2.4.0 and 2.3.1 under windows 64 (7 > and 8). I'm using pytables distro from > <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> > > I'm sure I've had code like this running just fine on older pytables > 2.2.X (with pro) and older python... > > <<<<<<<<<<<<<<<<<<<<<<<<<<<< > Here's table creation code: > > import tables > import numpy as np > import random > > #initalizes column types > class Columns(tables.IsDescription): > imageid = tables.UInt32Col(pos=1) > feature = tables.Float64Col(pos=2, shape=(30,)) > > #creates Table > h5file=tables.openFile('features.h5','a', title='features') > table = h5file.createTable('/', 'values', Columns, expectedrows=1000000000) > table.flush() > > Columns = table.row > > #appends values to table > for i in xrange(100): > Columns['imageid'] = i > Columns['feature'] = [np.random.random_sample((30,))] > Columns.append() > > table.flush() > > table.cols.imageid.createCSIndex() > h5file.close() > > <<<<<<<<<<<<<<<<<<<<<<<<<<<< > Here's table query code: > > import time > import tables > > #opens tables > h5file=tables.openFile('features.h5','a') > table=h5file.root.values > > it = table.where('imageid == 8') > r = [ row.nrow for row in it ] > > print r > > ----- > Thank you, > Dmitry > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Dmitry F. <fe...@ec...> - 2013-03-15 23:59:17
|
Hi, I'm trying to index a simple uint32 column and search for whether a value is there. However, during the in-kernel query I get an error like this: Traceback (most recent call last): File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> r = [ row.nrow for row in it ] File "tableExtension.pyx", line 858, in tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) File "tableExtension.pyx", line 879, in tables.tableExtension.Row.__next__indexed (tables\tableExtension.c:7922) AssertionError The second time I perform the same query I get no errors, but I also get no results... I get this on Python 2.7, PyTables 2.4.0 and 2.3.1 under Windows 64 (7 and 8). I'm using the PyTables distro from <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> I'm sure I've had code like this running just fine on older PyTables 2.2.X (with Pro) and older Python... <<<<<<<<<<<<<<<<<<<<<<<<<<<< Here's the table creation code: import tables import numpy as np import random #initializes column types class Columns(tables.IsDescription): imageid = tables.UInt32Col(pos=1) feature = tables.Float64Col(pos=2, shape=(30,)) #creates Table h5file=tables.openFile('features.h5','a', title='features') table = h5file.createTable('/', 'values', Columns, expectedrows=1000000000) table.flush() Columns = table.row #appends values to table for i in xrange(100): Columns['imageid'] = i Columns['feature'] = [np.random.random_sample((30,))] Columns.append() table.flush() table.cols.imageid.createCSIndex() h5file.close() <<<<<<<<<<<<<<<<<<<<<<<<<<<< Here's the table query code: import time import tables #opens tables h5file=tables.openFile('features.h5','a') table=h5file.root.values it = table.where('imageid == 8') r = [ row.nrow for row in it ] print r ----- Thank you, Dmitry |
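A quick diagnostic to try, sketched below with the PyTables 2.x method names: ask the table whether the condition will actually use the index, and drop and rebuild the index to rule out a stale one. This is only a guess at a workaround, not a confirmed fix for the AssertionError.

import tables

h5file = tables.openFile('features.h5', 'a')
table = h5file.root.values

# Report which column indexes (if any) this condition would use;
# an empty result means the query falls back to a plain in-kernel scan.
print table.willQueryUseIndexing('imageid == 8')

# Drop and recreate the completely sorted index in case it is stale.
table.cols.imageid.removeIndex()
table.cols.imageid.createCSIndex()

h5file.close()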
From: Tim B. <tim...@ma...> - 2013-03-11 22:24:14
|
The netCDF library gives me a masked array so I have to explicitly transform that into a regular numpy array. Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;) I think the netCDF3 functionality has been taken out or at least deprecated (https://github.com/PyTables/PyTables/issues/68). Using the python-netCDF4 module allows me to pull from pretty much any netcdf file, and the inherent masking is sometimes very useful where the dataset is smaller and I can live with the lower performance of masks. I've looked under the covers and have seen that the ma masked implementation is all pure Python and so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here). I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap. For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give me better performance and is code-wise pretty simple, so for the moment it's good enough. Awesome! I am glad that this is working for you. Yes - appears to work great! |
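For reference, the in-memory feature mentioned here is the HDF5 CORE driver exposed in the development branch that became PyTables 3.0. A minimal sketch follows, assuming the new-style open_file() API and the driver keyword names used in the development documentation; the file name and array shape are placeholders.

import numpy as np
import tables

# Build the file entirely in RAM. With backing_store enabled, the in-memory
# image is written out to disk when the file is closed; with it disabled,
# the data never touches the disk at all.
h5file = tables.open_file('sst_inmem.h5', 'w',
                          driver='H5FD_CORE',
                          driver_core_backing_store=1)

sst = h5file.create_array('/', 'sst',
                          np.zeros((365, 180, 360), dtype='float32'))
sst[0, :, :] = np.random.random((180, 360)).astype('float32')

h5file.close()   # the in-memory image is flushed to sst_inmem.h5 here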
From: Anthony S. <sc...@gm...> - 2013-03-11 19:16:14
|
On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess <tim...@ma...> wrote: > > On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote: > > > Hey Tim, > > > > Awesome dataset! And neat image! > > > > As per your request, a couple of minor things I noticed were that you > probably don't need to do the sanity check each time (great for debugging, > but not needed always), you are using masked arrays which while sometimes > convenient are generally slower than creating an array, a mask and applying > the mask to the array, and you seem to be downcasting from float64 to > float32 for some reason that I am not entirely clear on (size, speed?). > > > > To the more major question of write performance, one thing that you > could try is compression. You might want to do some timing studies to find > the best compressor and level. Performance here can vary a lot based on how > similar your data is (and how close similar data is to each other). If you > have got a bunch of zeros and only a few real data points, even zlib 1 is > going to be blazing fast compared to writing all those zeros out explicitly. > > > > Another thing you could try doing is switching to EArray and using the > append() method. This might save PyTables, numpy, hdf5, etc from having to > check that the shape of "sst_node[qual_indices]" is actually the same as > the data you are giving it. Additionally dumping a block of memory to the > file directly (via append()) is generally faster than having to resolve > fancy indexes (which are notoriously the slow part of even numpy). > > > > Lastly, as a general comment, you seem to be doing a lot of stuff in the > inner most loop -- including writing to disk. I would look at how you > could restructure this to move as much as possible out of this loop. Your > data seems to be about 12 GB for a year, so this is probably too big to > build up the full sst array completely in memory prior to writing. That > is, unless you have a computer much bigger than my laptop ;). But issuing > one fat write command is probably going to be faster than making 365 of > them. > > > > Happy hacking! > > Be Well > > Anthony > > > > > Thanks Anthony for being so responsive and touching on a number of points. > > The netCDF library gives me a masked array so I have to explicitly > transform that into a regular numpy array. Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;) > I've looked under the covers and have seen that the ma masked > implementation is all pure Python and so there is a performance drawback. > I'm not up to speed yet on where the numpy.na masking implementation is > (started a new job here). > > I tried to do an implementation in memory (except for the final write) and > found that I have about 2GB of indices when I extract the quality indices. > Simply using those indexes, memory usage grows to over 64GB and I > eventually run out of memory and start churning away in swap. > > For the moment, I have pulled down the latest git master and am using the > new in-memory HDF feature. This seems to give be better performance and is > code-wise pretty simple so for the moment, it's good enough. > Awesome! I am glad that this is working for you. > Cheers and thanks again, Tim > > BTW I viewed your SciPy tutorial. Good stuff! > Thanks! 
|
From: Tim B. <tim...@ma...> - 2013-03-11 01:48:19
|
On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote: > Hey Tim, > > Awesome dataset! And neat image! > > As per your request, a couple of minor things I noticed were that you probably don't need to do the sanity check each time (great for debugging, but not needed always), you are using masked arrays which while sometimes convenient are generally slower than creating an array, a mask and applying the mask to the array, and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size, speed?). > > To the more major question of write performance, one thing that you could try is compression. You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have got a bunch of zeros and only a few real data points, even zlib 1 is going to be blazing fast compared to writing all those zeros out explicitly. > > Another thing you could try doing is switching to EArray and using the append() method. This might save PyTables, numpy, hdf5, etc from having to check that the shape of "sst_node[qual_indices]" is actually the same as the data you are giving it. Additionally dumping a block of memory to the file directly (via append()) is generally faster than having to resolve fancy indexes (which are notoriously the slow part of even numpy). > > Lastly, as a general comment, you seem to be doing a lot of stuff in the inner most loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so this is probably too big to build up the full sst array completely in memory prior to writing. That is, unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them. > > Happy hacking! > Be Well > Anthony > Thanks Anthony for being so responsive and touching on a number of points. The netCDF library gives me a masked array so I have to explicitly transform that into a regular numpy array. I've looked under the covers and have seen that the ma masked implementation is all pure Python and so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here). I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap. For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give be better performance and is code-wise pretty simple so for the moment, it's good enough. Cheers and thanks again, Tim BTW I viewed your SciPy tutorial. Good stuff! |
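To make Anthony's suggestions concrete, here is a rough sketch that fills the masked values up front and appends one day at a time to a compressed EArray. The array shape, fill value, compressor, and level are all illustrative assumptions rather than the poster's actual configuration, and the masked_all() call simply stands in for whatever the netCDF reader returns.

import numpy as np
import numpy.ma as ma
import tables

h5file = tables.openFile('sst_year.h5', 'w')
filters = tables.Filters(complevel=1, complib='zlib')   # cheap, fast compression

# EArray extendable along the first (time) axis; one append per day of data.
sst = h5file.createEArray('/', 'sst', tables.Float32Atom(),
                          shape=(0, 720, 1440), filters=filters,
                          expectedrows=365)

for day in range(365):
    masked = ma.masked_all((720, 1440), dtype='float32')   # stand-in for netCDF data
    # Convert the masked array to a plain ndarray once, up front; masked cells
    # become NaN, so no pure-Python mask handling happens during the write.
    plain = ma.filled(masked, np.nan).astype('float32')
    sst.append(plain[np.newaxis, ...])   # append one (1, 720, 1440) slab

h5file.close()

With real data, the expensive step then becomes a single contiguous append per day instead of a fancy-indexed assignment.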
From: Thadeus B. <tha...@th...> - 2013-03-08 01:14:28
|
Thank you for the information. I will run a few more tests over the next couple of days, one day with no compression, and one day with a chunksize similar to what will be appended each cycle; hopefully I will get a chance to report back. A ptrepack into a file with no compression is half the size of its append/compress/lots of unused space counterpart. The reason for using compression is to reduce the IO required from the network-backed storage, not necessarily to reduce disk space, although that is a plus. -- Thadeus On Thu, Mar 7, 2013 at 5:40 PM, Anthony Scopatz <sc...@gm...> wrote: > Hi Thadeus, > > HDF5 does not guarantee that the data is contiguous on disk between > blocks. That is, there may be empty space in your file. Furthermore, > compression really messes with HDF5's ability to predict how large blocks > will end up being. To avoid accidental data loss, HDF5 tends to over > predict the empty buffer space needed. > > Thus my guess is that by having this tight loop around open/append/close, > you keep accidentally triggering extraneous buffer space. You basically > have two options: > > 1. turn off compression. size prediction is exact without it. > 2. periodically run ptrepack. (every 10, 100, 1000 cycles? end of the > day?) > > Hope this helps > Be Well > Anthony > > > On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...> wrote: > >> I have a PyTables file that receives many appends to a Table throughout >> the day, the file is opened, a small bit of data is appended, and the file >> is closed. The open/append/close can happen many times in a minute. >> Anywhere from 1-500 rows are appended at any given time. By the end of the >> day, this file is expected to have roughly 66000 rows. Chunkshape is set to >> 1500 for no particular reason (doesn't seem to make a difference, and some >> other files can be 5 million/day). BLOSC with lvl 9 compression is used on >> the table. Data is never deleted from the table. There are roughly 12 >> columns on the Table. >> >> The problem is that at the end of the day this file is 1GB in size. I >> don't understand why the file is growing so big. The tbl.size_on_disk shows >> a meager 20MB. >> >> I have used ptrepack with --keep-source-filters and --chunkshape=keep. >> The new file is only 30MB in size which is reasonable. >> I have also used ptrepack with --chunkshape=auto and although it set the >> chunkshape to around 388, there was no significant change in filesize from >> chunkshape of 1500. >> >> Is pytables not re-using chunks on new appends. When 50 rows are >> appended, is it still writing a chunk sized for 1500 rows. When the next >> append comes along, it writes a brand new chunk instead of opening the old >> chunk and appending the data? >> >> Should my chunksize really be "expected rows to append each time" instead >> of "expected total rows"? >> >> -- >> Thadeus >> > > |
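Sketching the second experiment Thadeus describes: size the chunkshape to the typical append batch rather than the expected daily total. The column layout, batch size, and file name below are placeholders, not the real schema.

import tables

class Tick(tables.IsDescription):
    timestamp = tables.Int64Col(pos=1)
    price     = tables.Float64Col(pos=2)
    volume    = tables.UInt32Col(pos=3)

h5file = tables.openFile('ticks.h5', 'w')
filters = tables.Filters(complevel=9, complib='blosc')

# Chunks sized to the typical append batch (roughly 500 rows here) so that
# each open/append/close cycle touches only one or two chunks.
table = h5file.createTable('/', 'ticks', Tick,
                           filters=filters, chunkshape=(500,),
                           expectedrows=66000)

table.append([(1362700000000, 101.5, 200)] * 500)   # one batch of fake rows
table.flush()
h5file.close()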
From: Anthony S. <sc...@gm...> - 2013-03-07 23:40:48
|
Hi Thadeus, HDF5 does not guarantee that the data is contiguous on disk between blocks. That is, there may be empty space in your file. Furthermore, compression really messes with HDF5's ability to predict how large blocks will end up being. To avoid accidental data loss, HDF5 tends to over-predict the empty buffer space needed. Thus my guess is that by having this tight loop around open/append/close, you keep accidentally triggering extraneous buffer space. You basically have two options: 1. Turn off compression; size prediction is exact without it. 2. Periodically run ptrepack (every 10, 100, 1000 cycles? at the end of the day?). Hope this helps Be Well Anthony On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...> wrote: > I have a PyTables file that receives many appends to a Table throughout > the day, the file is opened, a small bit of data is appended, and the file > is closed. The open/append/close can happen many times in a minute. > Anywhere from 1-500 rows are appended at any given time. By the end of the > day, this file is expected to have roughly 66000 rows. Chunkshape is set to > 1500 for no particular reason (doesn't seem to make a difference, and some > other files can be 5 million/day). BLOSC with lvl 9 compression is used on > the table. Data is never deleted from the table. There are roughly 12 > columns on the Table. > > The problem is that at the end of the day this file is 1GB in size. I > don't understand why the file is growing so big. The tbl.size_on_disk shows > a meager 20MB. > > I have used ptrepack with --keep-source-filters and --chunkshape=keep. The > new file is only 30MB in size which is reasonable. > I have also used ptrepack with --chunkshape=auto and although it set the > chunkshape to around 388, there was no significant change in filesize from > chunkshape of 1500. > > Is pytables not re-using chunks on new appends. When 50 rows are appended, > is it still writing a chunk sized for 1500 rows. When the next append comes > along, it writes a brand new chunk instead of opening the old chunk and > appending the data? > > Should my chunksize really be "expected rows to append each time" instead > of "expected total rows"? > > -- > Thadeus > |
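And a sketch of option 2, driving ptrepack from Python with the same flags used earlier in this thread. The file names and the repack cadence are placeholders; adjust to taste.

import os
import subprocess

def repack(src, dst):
    # Rewrite the file contiguously, keeping the source filters and
    # chunkshape, then swap the packed copy into place.
    if os.path.exists(dst):
        os.remove(dst)
    subprocess.check_call(['ptrepack', '--keep-source-filters',
                           '--chunkshape=keep', src + ':/', dst + ':/'])
    os.rename(dst, src)

# Call this from the writer process every N append cycles, or once at the
# end of the day, for example:
#     repack('ticks.h5', 'ticks_packed.h5')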