From: Giovanni L. C. <glc...@gm...> - 2013-08-05 20:15:03
Hi Anthony,

what do you mean precisely? I tried del ca[:,:], but CArray does not support __delitem__. Looking in the documentation I could only find a method called remove_rows, but it's in Table, not CArray. Maybe I am missing something?

Thanks,
Giovanni

On Mon 05 Aug 2013 03:43:42 PM EDT, pyt...@li... wrote:
>> Hello Giovanni, I think you may need to del that slice and then possibly
>> repack. Hope this helps. Be Well Anthony

--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gci...@in...
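[A minimal sketch of the space-saving fix that emerged from this thread, not code from the original messages: because HDF5 allocates every chunk that is written to, enabling a compressor via the Filters class keeps an all-False CArray tiny. Names and sizes mirror Giovanni's example; the complevel value is an illustrative choice.]

import tables

# Same array as in Giovanni's example, but with zlib compression enabled.
# Chunks of identical bytes compress to almost nothing, so "clearing" a
# region by writing False no longer costs real disk space.
h5f = tables.open_file('test.h5', 'w',
                       filters=tables.Filters(complevel=1, complib='zlib'))
ca = h5f.create_carray(h5f.root, 'carray', tables.BoolAtom(),
                       shape=(1000, 1000), chunkshape=(1, 1000))
ca[:, :] = False
h5f.close()

[After rewriting or dropping large regions, running the ptrepack command-line utility over the file reclaims the freed space; that is the "repack" step Anthony refers to.]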
From: Oleksandr H. <guz...@gm...> - 2013-08-05 19:43:41
Thank you, Anthony and Jeff: I will try compression, and if that isn't enough, I'll try using pandas for working with dates.

Cheers
--
Sasha

2013/8/5 Jeff Reback <jr...@ya...>
> Here is a pandas solution for doing just this (which uses PyTables under
> the hood): [...]
From: Anthony S. <sc...@gm...> - 2013-08-05 19:14:40
Hello Giovanni,

I think you may need to del that slice and then possibly repack. Hope this helps.

Be Well
Anthony

On Mon, Aug 5, 2013 at 2:09 PM, Giovanni Luca Ciampaglia <glc...@gm...> wrote:
> Hello all,
>
> is there a way to clear out a chunk from a CArray? I noticed that setting
> the data to zero actually takes disk space, i.e.
>
> ***
> from tables import open_file, BoolAtom
>
> h5f = open_file('test.h5', 'w')
> ca = h5f.create_carray(h5f.root, 'carray', BoolAtom(), shape=(1000,1000),
>                        chunkshape=(1,1000))
> ca[:,:] = False
> h5f.close()
> ***
>
> The resulting file takes 249K ...
>
> Best,
>
> --
> Giovanni Luca Ciampaglia
From: Giovanni L. C. <glc...@gm...> - 2013-08-05 19:09:13
Hello all,

is there a way to clear out a chunk from a CArray? I noticed that setting the data to zero actually takes disk space, i.e.

***
from tables import open_file, BoolAtom

h5f = open_file('test.h5', 'w')
ca = h5f.create_carray(h5f.root, 'carray', BoolAtom(), shape=(1000, 1000),
                       chunkshape=(1, 1000))
ca[:, :] = False
h5f.close()
***

The resulting file takes 249K ...

Best,

--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gci...@in...
From: Jeff R. <jr...@ya...> - 2013-08-05 19:02:46
Here is a pandas solution for doing just this (which uses PyTables under the hood):

# create a frame
In [45]: df = DataFrame(randn(1000,2),index=date_range('20000101',periods=1000))

In [53]: df
Out[53]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2000-01-01 00:00:00 to 2002-09-26 00:00:00
Freq: D
Data columns (total 2 columns):
0    1000 non-null values
1    1000 non-null values
dtypes: float64(2)

# store it as a table
In [46]: store = pd.HDFStore('test.h5',mode='w')

In [47]: store.append('df',df)

# select out the index (a datetimeindex in this case)
In [48]: c = store.select_column('df','index')

# get the coordinates of the matching index
In [49]: coords = c[pd.DatetimeIndex(c).month==5]

# select those rows
In [51]: from pandas.io.pytables import Coordinates

In [50]: store.select('df',where=Coordinates(coords.index,None,None))
Out[50]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 93 entries, 2000-05-01 00:00:00 to 2002-05-31 00:00:00
Data columns (total 2 columns):
0    93 non-null values
1    93 non-null values
dtypes: float64(2)
From: Anthony S. <sc...@gm...> - 2013-08-05 18:54:57
On Mon, Aug 5, 2013 at 1:38 PM, Oleksandr Huziy <guz...@gm...> wrote:

> Hi Pytables users and developers:
>
> I have a few questions to which I could not find the answer in the
> documentation. Thank you in advance for any help.
>
> 1. If I store dates in Pytables, does it mean I could write queries like
> table.where('date.month == 5')? Is there a common way to pass from python's
> datetime to pytable's datetime and inversely?

Hello Sasha,

PyTables times are actually based on C time, not Python's datetimes, because they use the HDF5 time types. So unfortunately you can't write queries like the one above. (You'd need to talk to numexpr about getting that kind of query implemented ~_~.)

Instead I would suggest that you store your times as Float64Atoms and Float64Cols and then use arithmetic to figure out the query:

table.where("(x / 3600 / 24)%12 == 5")

This is not perfect...

> 2. I have several variables stored in the same file, in a separate table
> for each variable. And I use separate columns year, month, day, hour,
> minute, second to mark the time for a record (the records are not
> necessarily ordered in time), and this is for each variable. I was thinking
> to put all the variables in the same table and put missing values for the
> variables which do not have outputs for a given time step. Is it possible
> to put None as a default value into a table (so I could easily filter dummy
> rows)?

It is not possible to use "None" since that is a Python object of a different type than the other integers you are trying to stick in the column. I would suggest that you use values with no actual meaning. If you are using normal ints you can use -1 to represent missing values. If you are using unsigned ints you have to pick other values, like 13 for month on the Julian calendar.

> But then again the data comes in chunks; does this mean I would have to
> check if a row with the same date already exists for a different variable?

No you wouldn't; you can store the same data multiple times in different rows.

> I don't really like the ideas in 2, which are intended to save space, but
> maybe all I need is a good compression level? Can somebody advise me on
> this?

Compression would definitely help here since the date numbers are all fairly similar. Probably even a compression level of 1 would work. Keep in mind that sometimes using compression actually speeds things up (see the starving CPU problem). You might just need to experiment with a few different compression levels to see how things go. 0, 1, 5, 9 gives you a good spread.

Be Well
Anthony

> Cheers
> --
> Oleksandr (Sasha) Huziy
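[A sketch of one way to make Anthony's workaround robust, not code from the thread: the modulo expression above only approximates months, so an alternative is to precompute a small integer month column next to the float64 timestamp and query that directly. The table layout and names are illustrative.]

import datetime

import numpy as np
import tables

class Record(tables.IsDescription):
    timestamp = tables.Float64Col()  # seconds since the epoch
    month = tables.UInt8Col()        # 1-12, filled in at append time
    value = tables.Float64Col()

h5f = tables.open_file('dates.h5', 'w')
tbl = h5f.create_table(h5f.root, 'data', Record)

row = tbl.row
for day in range(365):
    dt = datetime.datetime(2013, 1, 1) + datetime.timedelta(days=day)
    row['timestamp'] = dt.timestamp()  # Python 3; time.mktime(dt.timetuple()) on 2.x
    row['month'] = dt.month
    row['value'] = np.random.random()
    row.append()
tbl.flush()

# The month query then needs no calendar arithmetic inside numexpr:
may_rows = tbl.read_where('month == 5')
h5f.close()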
From: Oleksandr H. <guz...@gm...> - 2013-08-05 18:38:42
Hi Pytables users and developers:

I have a few questions to which I could not find the answer in the documentation. Thank you in advance for any help.

1. If I store dates in Pytables, does it mean I could write queries like table.where('date.month == 5')? Is there a common way to pass from python's datetime to pytable's datetime and inversely?

2. I have several variables stored in the same file, in a separate table for each variable. And I use separate columns year, month, day, hour, minute, second to mark the time for a record (the records are not necessarily ordered in time), and this is for each variable. I was thinking to put all the variables in the same table and put missing values for the variables which do not have outputs for a given time step. Is it possible to put None as a default value into a table (so I could easily filter dummy rows)? But then again the data comes in chunks; does this mean I would have to check if a row with the same date already exists for a different variable?

I don't really like the ideas in 2, which are intended to save space, but maybe all I need is a good compression level? Can somebody advise me on this?

Cheers
--
Oleksandr (Sasha) Huziy
From: Anthony S. <sc...@gm...> - 2013-08-05 14:50:48
On Mon, Aug 5, 2013 at 4:11 AM, Nyirő Gergő <ger...@gm...> wrote:

> Besides signal loading, we have to search for signal names as fast as
> possible, and return the shortest unique device name part and the
> signal name.
>
> Do you have any advice how to search for group names in hdf5 with
> pytables in an efficient way?

Hi Gergo,

Searching through group names, like accessing all HDF5 metadata, is slow. For group names this is because, rather than searching through a list, you are traversing a B-tree, IIRC. So you have to use the couple of tricks that you used: 1) keep another Table / Array of all node names, and 2) read it in once into a native Python data structure (a dict here).

However, 4 sec to read in this table seems excessive for data of this size. You are probably not reading it in properly. You should be using:

raw_grps = f.root.grp_names[:]

or similar. Maybe other people have some other ideas.

Be Well
Anthony
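[A minimal sketch combining the two tricks above, not from the thread; the node name 'signal_names' and its columns are assumptions for illustration. One bulk slice reads the whole name table into memory, and a dict plus fnmatch provides the glob interface Gergo asks for below.]

import fnmatch
from collections import defaultdict

import tables

h5f = tables.open_file('signals.h5', 'r')

# One slice pulls the entire name table into a NumPy structured array,
# instead of iterating over it row by row.
names = h5f.root.signal_names[:]   # assumed columns: 'device', 'signal'

# In-memory index: signal name -> list of device names carrying it.
index = defaultdict(list)
for device, signal in zip(names['device'], names['signal']):
    index[signal.decode()].append(device.decode())

# Glob-style search over signal names, which PyTables queries can't do.
matches = fnmatch.filter(index, 'foo[0-9]ch*')
h5f.close()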
From: Nyirő G. <ger...@gm...> - 2013-08-05 09:11:38
Hello,

We develop a measurement evaluation tool, and we'd like to use pytables/hdf5 as a middle layer for signal accessing. We have to deal with the silly structure of the recorder device measurement format.

The signals can be accessed via two identifiers:

* device name: <source of the signal>-<channel of the message>-<another tag>-<yet another tag>
* signal name

The first identifier carries the source information of the signal, which can be quite long. Therefore I grouped the device name into two layers:

/<source of the signal>
    /<channel of the message>...
        /<signal name>

So if you have the same message from two channels, then you will get

/foo-device-name
    /channel-1
        /bar
        /baz
    /channel-2
        /bar
        /baz

Besides signal loading, we have to search for signal names as fast as possible, and return the shortest unique device name part and the signal name. Using the structure above, iterating over the group names is quite slow, so I built up a table from device and signal names. As far as I know, the pytables query does not support string searching (e.g. startswith, *foo[0-9]ch*, etc.), so fetching this table leads us to a pure python loop, which is slow again. Therefore I built up a python dictionary from the table, which provides fast iteration, but the init time increased from 100 ms to 3-4 sec (we have more than 40 000 signals).

Do you have any advice how to search for group names in hdf5 with pytables in an efficient way?

ps: I would be most happy with a glob interface.

thanks for your advice in advance,
gergo
From: Francesc A. <fa...@gm...> - 2013-07-29 02:19:13
On 7/28/13 9:58 PM, Anthony Scopatz wrote:
> On Sun, Jul 28, 2013 at 8:38 PM, David Reed <dav...@gm...> wrote:
>
>     I'm really trying to become more productive using PyTables, but am
>     struggling with what I should be using. What's the difference
>     between a table and an array?
>
> Hi David,
>
> The difference between Arrays and Tables, conceptually, is the same as
> the difference between numpy arrays and numpy structured arrays. The
> plain old [Aa]rray is a contiguous block of a single data type.
> Tables and structured arrays have a more complex data type that is
> composed of a contiguous sequence of other data types (i.e. the fields /
> columns). Which data structure you use really depends a lot on the
> type of problem you are trying to solve and what kinds of questions
> you want to answer with that data structure.
>
> That said, the implementation of Tables is far more similar to EArrays
> than Arrays. So a lot of the performance trade-offs that you see are
> similar.

Besides this, another interesting difference is that Tables allow queries to be performed in a similar way to relational databases (but using a more NumPy-esque syntax). Here are some examples:

http://pytables.github.io/cookbook/hints_for_sql_users.html?highlight=query#selecting-data

and you can index columns too:

http://pytables.github.io/cookbook/hints_for_sql_users.html?highlight=query#creating-an-index

so that you can accelerate queries involving indexed columns.

--
Francesc Alted
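[A short sketch of the query-and-index workflow Francesc points to; the table layout is illustrative, not from the thread.]

import tables

class Particle(tables.IsDescription):
    name = tables.StringCol(16)
    energy = tables.Float64Col()
    pressure = tables.Float64Col()

h5f = tables.open_file('query_demo.h5', 'w')
tbl = h5f.create_table(h5f.root, 'particles', Particle)

row = tbl.row
for i in range(10000):
    row['name'] = 'p%d' % i
    row['energy'] = i * 0.5
    row['pressure'] = i % 100
    row.append()
tbl.flush()

# SQL-like selection with NumPy-flavoured syntax, evaluated by numexpr:
hot = [r['name'] for r in tbl.where('(energy > 100) & (pressure < 10)')]

# Indexing a column accelerates subsequent queries that reference it:
tbl.cols.energy.create_index()
hot_again = tbl.read_where('energy > 100')
h5f.close()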
From: Anthony S. <sc...@gm...> - 2013-07-29 01:58:28
On Sun, Jul 28, 2013 at 8:38 PM, David Reed <dav...@gm...> wrote:

> I'm really trying to become more productive using PyTables, but am
> struggling with what I should be using. What's the difference between a
> table and an array?

Hi David,

The difference between Arrays and Tables, conceptually, is the same as the difference between numpy arrays and numpy structured arrays. The plain old [Aa]rray is a contiguous block of a single data type. Tables and structured arrays have a more complex data type that is composed of a contiguous sequence of other data types (i.e. the fields / columns). Which data structure you use really depends a lot on the type of problem you are trying to solve and what kinds of questions you want to answer with that data structure.

That said, the implementation of Tables is far more similar to EArrays than Arrays. So a lot of the performance trade-offs that you see are similar. You should watch my "HDF5 is for Lovers" talk for more generic advice [1]. I hope this helps!

Be Well
Anthony

1. http://www.youtube.com/watch?v=Nzx0HAd3FiI
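[A sketch of the conceptual difference in code; file and node names are illustrative, and the obj= keyword assumes PyTables >= 3.0.]

import numpy as np
import tables

h5f = tables.open_file('shapes.h5', 'w')

# An Array: one homogeneous, contiguous block, like a plain numpy array.
h5f.create_array(h5f.root, 'temps', np.arange(100, dtype='float64'))

# A Table: fixed-width records with named, typed fields, like a numpy
# structured array.
records = np.zeros(100, dtype=[('name', 'S16'), ('temp', 'f8'), ('count', 'i4')])
h5f.create_table(h5f.root, 'readings', obj=records)

h5f.close()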
From: David R. <dav...@gm...> - 2013-07-29 01:39:14
I'm really trying to become more productive using PyTables, but am struggling with what I should be using. What's the difference between a table and an array?
From: Francesc A. <fa...@gm...> - 2013-07-28 17:38:47
More input from Jeff Reback.

Hey Jeff, I see that you try to post here from time to time, but your messages bounce because the address that you use as sender is not subscribed. Please make sure that you post from a subscribed address. Thanks!

Francesc

On 7/28/13 11:35 AM, pyt...@li... wrote:
> The attached message has been automatically discarded.

Subject: Re: [Pytables-users] Compression and Indexing in PyTables
From: Jeff Reback <jr...@ya...>
Date: 7/28/13 11:35 AM
To: Discussion list for PyTables <pyt...@li...>

pandas stores using PyTables and embeds extra metadata in the attributes to enable deserialization to the original pandas structure.

On Jul 28, 2013, at 11:23 AM, Francesc Alted <fa...@gm...> wrote:
> On 7/28/13 10:21 AM, David Reed wrote:
>> maybe I wasn't aware of this, but has PANDAS completely wrapped
>> PyTables, or is PyTables something I should still be using for storing
>> and accessing scientific data, and PANDAS has an access point to it?
> Yeah, more the latter than the former. PyTables is a standalone
> library, but Pandas uses it as another storage backend.

--
Francesc Alted
From: Francesc A. <fa...@gm...> - 2013-07-28 15:23:50
On 7/28/13 10:21 AM, David Reed wrote:
> maybe I wasn't aware of this, but has PANDAS completely wrapped
> PyTables, or is PyTables something I should still be using for storing
> and accessing scientific data, and PANDAS has an access point to it?

Yeah, more the latter than the former. PyTables is a standalone library, but Pandas uses it as another storage backend.

--
Francesc Alted
From: Francesc A. <fa...@gm...> - 2013-07-28 15:21:35
On 7/28/13 9:24 AM, David Reed wrote:
> Hi there, I was wondering if there are any nice tutorials that show the
> different compression options such as zlib, lzo, etc. and how to
> actually use them with my tables.
>
> There seems to be a lot of good information describing the performance
> increase under the Optimization Tips section, but I don't see any
> clear way of actually doing this.
>
> Maybe I'm missing something.

Well, the compression options are part of the more general Filters helper class:

http://pytables.github.io/usersguide/libref/helper_classes.html#the-filters-class

This stems from the fact that in HDF5 a compressor is just another data filter.

--
Francesc Alted
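[A minimal sketch of using the Filters class directly; parameter values are illustrative.]

import tables

# The compressor is selected through the Filters helper class; the same
# object works for Tables, CArrays, and EArrays.
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)

h5f = tables.open_file('compressed.h5', 'w')
ca = h5f.create_carray(h5f.root, 'data', tables.Float64Atom(),
                       shape=(1000, 1000), filters=filters)
h5f.close()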
From: David R. <dav...@gm...> - 2013-07-28 14:22:20
Maybe I wasn't aware of this, but has PANDAS completely wrapped PyTables, or is PyTables something I should still be using for storing and accessing scientific data, with PANDAS as an access point to it?

On Sun, Jul 28, 2013 at 9:48 AM, Andreas Hilboll <li...@hi...> wrote:

> Maybe you're missing this:
>
> http://pandas.pydata.org/pandas-docs/stable/io.html#compression
>
> The HDFStore constructor has a "complib" kwarg which you can use to set
> the compression library. Also look at "complevel" to set the compression
> efficiency.
>
> -- Andreas.
From: Andreas H. <li...@hi...> - 2013-07-28 13:48:22
Hi David,

Am 28.07.2013 15:24, schrieb David Reed:
> Hi there, I was wondering if there are any nice tutorials that show the
> different compression options such as zlib, lzo, etc. and how to
> actually use them with my tables.
>
> There seems to be a lot of good information describing the performance
> increase under the Optimization Tips section, but I don't see any clear
> way of actually doing this.
>
> Maybe I'm missing something.

Maybe you're missing this:

http://pandas.pydata.org/pandas-docs/stable/io.html#compression

The HDFStore constructor has a "complib" kwarg which you can use to set the compression library. Also look at "complevel" to set the compression efficiency.

-- Andreas.
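[A short sketch of what Andreas describes; the blosc/9 combination is an illustrative choice.]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2))

# complib picks the compression library, complevel its effort (0-9).
store = pd.HDFStore('compressed.h5', mode='w',
                    complib='blosc', complevel=9)
store.append('df', df)
store.close()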
From: David R. <dav...@gm...> - 2013-07-28 13:24:31
Hi there, I was wondering if there are any nice tutorials that show the different compression options such as zlib, lzo, etc. and how to actually use them with my tables.

There seems to be a lot of good information describing the performance increase under the Optimization Tips section, but I don't see any clear way of actually doing this.

Maybe I'm missing something. Thanks for the help.

-Dave
From: Pushkar R. P. <top...@gm...> - 2013-07-19 00:27:47
Thanks. I will try it out and post any findings.

Pushkar

On Thu, Jul 18, 2013 at 12:36 AM, Andreas Hilboll <li...@hi...> wrote:

> You could use pandas_ and the read_table function. There, you have nrows
> and skiprows parameters with which you can easily do your own 'streaming'.
>
> .. _pandas: http://pandas.pydata.org/

On Thu, Jul 18, 2013 at 1:00 AM, Antonio Valentino <ant...@ti...> wrote:

> Hi Pushkar,
>
> OK, probably fromfile [1] can help you to cook something that works
> without loading the entire file into memory (and without too many
> iterations over the file).
>
> Anyway, I strongly recommend you not to perform read/write cycles on
> single lines; rather, define a reasonable data block size (number of
> rows) and process the file in chunks.
>
> [1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile
From: Valeriy S. <sok...@gm...> - 2013-07-18 09:06:11
Thank you, Anthony, I will try VLArray as you suggested =)

On Thu, Jul 18, 2013 at 3:39 AM, Anthony Scopatz <sc...@gm...> wrote:

> Hello Valeriy,
>
> For better or worse, this is exactly the performance I would expect. The
> thing that you are running up against is that every HDF5 data set has 64 KB
> of header space for meta information. There is no way of changing this
> without invalidating the HDF5 spec.
>
> I would suggest that you use a VLArray [1] of length-1 string atoms.
> You'll lose the filenode interface, but you'll also lose the 3200%
> overhead =).
>
> 1. http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-vlarray-class

--
Best regards,
Valeriy Sokolov.
From: Antonio V. <ant...@ti...> - 2013-07-18 08:27:51
Hi Pushkar,

Il 18/07/2013 08:45, Pushkar Raj Pande ha scritto:
> Both loadtxt and genfromtxt read the entire data into memory, which is not
> desirable. Is there a way to achieve streaming writes?

OK, probably fromfile [1] can help you to cook something that works without loading the entire file into memory (and without too many iterations over the file).

Anyway, I strongly recommend you not to perform read/write cycles on single lines; rather, define a reasonable data block size (number of rows) and process the file in chunks.

If you find a reasonably simple solution it would be nice to include it in our documentation as an example or a "recipe" [2].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html#numpy.fromfile
[2] http://pytables.github.io/latest/cookbook/index.html

best regards

antonio
From: Andreas H. <li...@hi...> - 2013-07-18 07:37:44
On 18.07.2013 08:45, Pushkar Raj Pande wrote:
> Both loadtxt and genfromtxt read the entire data into memory, which is
> not desirable. Is there a way to achieve streaming writes?

You could use pandas_ and the read_table function. There, you have nrows and skiprows parameters with which you can easily do your own 'streaming'.

.. _pandas: http://pandas.pydata.org/

-- Andreas
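[A sketch of the chunked-load pattern both replies point at; file names and the chunk size are illustrative. Each block is parsed and type-checked by pandas and appended to the same on-disk table, so memory use stays bounded.]

import pandas as pd

store = pd.HDFStore('bulk.h5', mode='w')

# Read the CSV in 50,000-row blocks and append each block to the store.
for chunk in pd.read_csv('data.csv', chunksize=50000):
    store.append('records', chunk)

store.close()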
From: Pushkar R. P. <top...@gm...> - 2013-07-18 06:46:15
Both loadtxt and genfromtxt read the entire data into memory, which is not desirable. Is there a way to achieve streaming writes?

Thanks,
Pushkar

On Wed, Jul 17, 2013 at 7:04 PM, Pushkar Raj Pande <top...@gm...> wrote:

> Thanks Antonio and Anthony. I will give this a try.
>
> -Pushkar
From: Pushkar R. P. <top...@gm...> - 2013-07-18 02:05:15
Thanks Antonio and Anthony. I will give this a try.

-Pushkar

On Wed, Jul 17, 2013 at 2:59 PM, <pyt...@li...> wrote:

> From: Anthony Scopatz <sc...@gm...>
> Subject: Re: [Pytables-users] Pytables bulk loading data
>
> Hi Pushkar,
>
> I agree with Antonio. You should load your data with NumPy functions and
> then write back out to PyTables. This is the fastest way to do things.
>
> Be Well
> Anthony
>
> On Wed, Jul 17, 2013 at 2:12 PM, Antonio Valentino <ant...@ti...> wrote:
>
>> Hi Pushkar,
>>
>> Il 17/07/2013 19:28, Pushkar Raj Pande ha scritto:
>>> I am trying to figure out the best way to bulk load data into pytables.
>>> The source data is in the form of csv, which may require parsing, type
>>> checking, and setting default values if it doesn't conform to the type
>>> of the column. There are over 100 columns in a record. Doing this in a
>>> loop in python for each row of the record is very slow compared to just
>>> fetching the rows from one pytable file and writing them to another.
>>> The difference is almost a factor of ~50.
>>>
>>> I believe if I load the data using a C procedure that does the parsing
>>> and builds the records to write in pytables, I can get close to the
>>> speed of just copying and writing the rows from one pytable to another.
>>> But maybe there is something simple and better that already exists.
>>
>> numpy has some tools for loading data from csv files like loadtxt [1],
>> genfromtxt [2] and other variants.
>>
>> None of them is OK for you?
>>
>> [1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt
>> [2] http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt
>>
>> cheers
>>
>> --
>> Antonio Valentino
From: Anthony S. <sc...@gm...> - 2013-07-17 23:39:42
Hello Valeriy,

For better or worse, this is exactly the performance I would expect. The thing that you are running up against is that every HDF5 data set has 64 KB of header space for meta information. There is no way of changing this without invalidating the HDF5 spec. The fact that you are seeing an average of 70 KB per data set is consistent, since data sets don't need to be contiguously stored.

I would suggest that you use a VLArray [1] of length-1 string atoms. You'll lose the filenode interface, but you'll also lose the 3200% overhead =).

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-vlarray-class

On Wed, Jul 17, 2013 at 3:14 PM, Valeriy Sokolov <sok...@gm...> wrote:

> Not sure if the quoted message was delivered to the list (maybe because I
> was not registered on this list), so reposting it this way...
>
> On Fri, Jul 12, 2013 at 5:40 PM, Valeriy Sokolov <sok...@gm...> wrote:
>
>> Hi,
>>
>> I am trying to store lots of small (~2KB) files in the filenode-s of
>> pytables, and I ran into trouble with size overhead.
>>
>> 200 such files, which consume ~2MB in total on the filesystem, take 14MB
>> in the .h5 file produced by pytables. My experiments show that if I
>> create 200 file nodes and store 1 byte in each, I get a .h5 of 14MB.
>> From a size of about 200KB per file node I see a linear increase: 400KB
>> per node leads to 89MB, and 800KB per node leads to 164MB.
>>
>> But I would like to store ~2KB there, and the current overhead (about
>> 70KB per file node) is pretty huge.
>>
>> Could you please help me with a work-around for this issue?
>>
>> Thank you in advance.
>>
>> --
>> Best regards,
>> Valeriy Sokolov.
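[A minimal sketch of Anthony's suggestion, not from the thread: one VLArray holds all the small files, one variable-length row each, so the per-dataset header cost is paid once instead of once per file. A VLStringAtom is used here; the length-1 StringAtom variant Anthony mentions works similarly.]

import tables

h5f = tables.open_file('small_files.h5', 'w')

# Each append() stores one file's bytes as a single variable-length row.
files = h5f.create_vlarray(h5f.root, 'files', tables.VLStringAtom())

for i in range(200):
    payload = b'x' * 2048          # stand-in for a ~2KB file
    files.append(payload)

# Read file number 42 back as a byte string.
data = files[42]
h5f.close()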