From: Francesc A. <fa...@py...> - 2012-03-27 14:54:25
|
On 3/27/12 1:35 AM, Tobias Erhardt wrote:
> Hi everybody
>
> I've been trying to install PyTables on my OS X Lion machine for the past few days without any success.
>
> In my opinion the problem is that the extension tries to build as universal (i386 and x86_64) while the HDF5 libraries are only available as x86_64 on my system.
>
> Here is my setup:
> Mac OS X Lion patched up to the latest version
> Xcode 4.3 with the latest command line utilities
> Python 2.7 as distributed with OS X Lion
> The Scipy Superpack (https://github.com/fonnesbeck/ScipySuperpack)
>
> numexpr==2.0.1
> numpy==2.0.0.dev-4c0576f-20120208
> Cython==0.15.1
>
> hdf5 was installed using homebrew
>
> The build log can be found here: https://gist.github.com/2213302
> Line 471 hints at the problem: the library is not compiled as universal.
>
> BTW: I do have the same problems with the h5py build.

The log seems fine to me:

"""
changing mode of /usr/local/bin/nctoh5 to 755
changing mode of /usr/local/bin/ptdump to 755
changing mode of /usr/local/bin/ptrepack to 755
Successfully installed tables
Cleaning up...
"""

Why are you saying that the installation did not work?

Anyway, the only suspicious thing that I have found in your log file is that the HDF5 headers and libraries are in different directories:

* Found HDF5 headers at ``/usr/include``, library at ``/usr/local/lib``.

This is not serious, but do you have an explanation for it?

-- Francesc Alted
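For anyone hitting the same i386/x86_64 mismatch, a quick Python-side check of which architecture the interpreter runs as, and which libhdf5 the dynamic linker would pick up, can help narrow things down before rebuilding. This is only a diagnostic sketch; the output is system-dependent and it does not fix the build by itself:

import platform
import ctypes.util

# Rough diagnostic for an architecture mismatch between Python and HDF5.
# Output differs per system; it only shows what the interpreter and the
# dynamic linker see.
print "python build:", platform.architecture()        # e.g. ('64bit', '')
print "machine:", platform.machine()                   # e.g. x86_64
print "libhdf5 found at:", ctypes.util.find_library("hdf5")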
From: Alvaro T. C. <al...@mi...> - 2012-03-27 07:21:11
|
>> (but how to grow it in columns without deleting& recreating?) > > You can't (at least on cheap way). Maybe you may want to create > additional tables and grouping them in terms of the columns you are > going to need for your queries. Sorry, it is not clear to me: create new tables and (grouping = grouping in HDF5 Groups) them in terms of the columns? As far as I understood, only columns in the same table (regardless the group of the table) can be queried together with the in-kernel engine? >> Are there alternatives? > > Yes. The alternative would be to have column-wise tables, that would > allow you to add and remove columns at a a cost of almost zero. This > idea of column-wise tables is quite flexible, and would let you have > even variable-length columns, as well as computed columns (that is, data > that is generated from other columns based on other columns). These > will have a lot of applications, IMO. I would like to add this proposal > to our next round of applications for projects to improve PyTables. > Let's see how it goes. This sounds definitely interesting. But I see the interest that PyTables can query columns in different tables in-kernel, because it removes one big constraint for data layout (and this is in turn important because attr dictionaries can only be attached to whole tables AFAIK). The solution I suggest would be that whenever other columns are involved, the in-kernel engine loops over the zip of the columns. It could do a pre-check on column length before starting. This would be a quite useful enhancement for me. > Francesc >> >> -á. >> >> >> >> On Mon, Mar 26, 2012 at 18:29, Alvaro Tejero Cantero<al...@mi...> wrote: >>> Hi there, >>> >>> I am following advice by Anthony and giving a go at representing >>> different sensors in my dataset as columns in a Table, or in several >>> Tables. This is about in-kernel queries. >>> >>> The documentation of condvars in Table.where [1] says "condvars should >>> consist of identifier-like strings pointing to Column (see The Column >>> class) instances of this table, or to other values (which will be >>> converted to arrays)". >>> >>> Conversion to arrays will likely exhaust the memory and be slow. >>> Furthermore, when I tried with a toy example (naively extrapolating >>> the behaviour of indexing in numpy), I obtained >>> >>> In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18)& >>> (a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})] >>> >>> (... elided output) >>> ValueError: variable ``b`` refers to a column which is not part of >>> table ``/tetrode1 >>> >>> I am interested in the scenario where an in-kernel query is applied to >>> a table based in columns *from other tables* that still are aligned >>> with the current table (same number of elements). These conditions may >>> be sophisticated and mix columns from the local table as well. >>> >>> One obvious solution would be to put all aligned columns on the same >>> table. But adding columns to a table is cumbersome, and I cannot think >>> beforehand of the many precomputed columns that I would like to use as >>> query conditions. >>> >>> What do you recommend in this scenario? >>> >>> -á. 
>>> >>> [1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > -- > Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
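The zip-over-aligned-columns behaviour proposed above is not built into PyTables, but it can be approximated in user code by reading the aligned columns in chunks and evaluating the condition with numexpr. A rough sketch, reusing the tetrode1/tetrode2 and V01/V02 names from the example in this thread (the file name and chunk size are arbitrary):

import numexpr as ne
import tables

# User-space approximation of a cross-table query over two row-aligned
# tables: evaluate the condition chunk by chunk with numexpr.  This is a
# sketch, not a PyTables feature; names follow the example in the thread.
f = tables.openFile('data.h5', 'r')
tet1, tet2 = f.root.tetrode1, f.root.tetrode2
assert tet1.nrows == tet2.nrows        # pre-check on column length

chunk = 100000
valuesext = []
for start in xrange(0, tet1.nrows, chunk):
    stop = min(start + chunk, tet1.nrows)
    a = tet1.cols.V01[start:stop]
    b = tet2.cols.V02[start:stop]
    mask = ne.evaluate('(b > 18) & (a < 4)')
    valuesext.extend(a[mask])
f.close()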
From: Tobias E. <tob...@gm...> - 2012-03-27 06:35:53
|
Hi everybody,

I've been trying to install PyTables on my OS X Lion machine for the past few days without any success.

In my opinion the problem is that the extension tries to build as universal (i386 and x86_64) while the HDF5 libraries are only available as x86_64 on my system.

Here is my setup:

Mac OS X Lion patched up to the latest version
Xcode 4.3 with the latest command line utilities
Python 2.7 as distributed with OS X Lion
The Scipy Superpack (https://github.com/fonnesbeck/ScipySuperpack)

numexpr==2.0.1
numpy==2.0.0.dev-4c0576f-20120208
Cython==0.15.1

hdf5 was installed using homebrew.

The build log can be found here: https://gist.github.com/2213302
Line 471 hints at the problem: the library is not compiled as universal.

BTW: I do have the same problems with the h5py build.

Any help is appreciated.

Thanks
Tobias
From: Francesc A. <fa...@py...> - 2012-03-26 22:57:33
|
Hi Alvaro,

On 3/26/12 12:43 PM, Alvaro Tejero Cantero wrote:
> Would it be an option to have
>
> * raw data on one table
> * all imaginable columns used for query conditions in another table

Yes, that sounds like a good solution to me.

> (but how to grow it in columns without deleting & recreating?)

You can't (at least not in a cheap way). You may want to create additional tables and group them in terms of the columns you are going to need for your queries.

> and fetch indexes for the first based on .whereList(condition) of the second?

Exactly.

> Are there alternatives?

Yes. The alternative would be to have column-wise tables, which would allow you to add and remove columns at a cost of almost zero. This idea of column-wise tables is quite flexible, and would let you have even variable-length columns, as well as computed columns (that is, data that is generated from other columns). These will have a lot of applications, IMO. I would like to add this proposal to our next round of applications for projects to improve PyTables. Let's see how it goes.

Francesc

> -á.
>
> On Mon, Mar 26, 2012 at 18:29, Alvaro Tejero Cantero <al...@mi...> wrote:
>> Hi there,
>>
>> I am following advice by Anthony and giving a go at representing
>> different sensors in my dataset as columns in a Table, or in several
>> Tables. This is about in-kernel queries.
>>
>> The documentation of condvars in Table.where [1] says "condvars should
>> consist of identifier-like strings pointing to Column (see The Column
>> class) instances of this table, or to other values (which will be
>> converted to arrays)".
>>
>> Conversion to arrays will likely exhaust the memory and be slow.
>> Furthermore, when I tried with a toy example (naively extrapolating
>> the behaviour of indexing in numpy), I obtained
>>
>> In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18) &
>> (a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})]
>>
>> (... elided output)
>> ValueError: variable ``b`` refers to a column which is not part of
>> table ``/tetrode1
>>
>> I am interested in the scenario where an in-kernel query is applied to
>> a table based on columns *from other tables* that are still aligned
>> with the current table (same number of elements). These conditions may
>> be sophisticated and mix columns from the local table as well.
>>
>> One obvious solution would be to put all aligned columns in the same
>> table. But adding columns to a table is cumbersome, and I cannot think
>> beforehand of the many precomputed columns that I would like to use as
>> query conditions.
>>
>> What do you recommend in this scenario?
>>
>> -á.
>>
>> [1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where

-- Francesc Alted
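The two-table pattern confirmed here can be wired up with Table.getWhereList() (the method referred to above as .whereList) plus Table.readCoordinates(): run the condition against the table holding the query columns, then use the resulting row numbers on the row-aligned raw-data table. A minimal sketch with made-up node and column names:

import tables

# Sketch of the two-table pattern: query one table, fetch the aligned
# rows from the other.  Node and column names are illustrative; the two
# tables must have the same number of rows.
f = tables.openFile('data.h5', 'r')
raw = f.root.raw          # table with the raw data
qry = f.root.querycols    # row-aligned table with precomputed query columns

coords = qry.getWhereList('(score > 18) & (phase < 4)')   # in-kernel query
rows = raw.readCoordinates(coords)                        # matching raw rows
f.close()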
From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:43:53
|
Would it be an option to have * raw data on one table * all imaginable columns used for query conditions in another table (but how to grow it in columns without deleting & recreating?) and fetch indexes for the first based on .whereList(condition) of the second? Are there alternatives? -á. On Mon, Mar 26, 2012 at 18:29, Alvaro Tejero Cantero <al...@mi...> wrote: > Hi there, > > I am following advice by Anthony and giving a go at representing > different sensors in my dataset as columns in a Table, or in several > Tables. This is about in-kernel queries. > > The documentation of condvars in Table.where [1] says "condvars should > consist of identifier-like strings pointing to Column (see The Column > class) instances of this table, or to other values (which will be > converted to arrays)". > > Conversion to arrays will likely exhaust the memory and be slow. > Furthermore, when I tried with a toy example (naively extrapolating > the behaviour of indexing in numpy), I obtained > > In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18) & > (a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})] > > (... elided output) > ValueError: variable ``b`` refers to a column which is not part of > table ``/tetrode1 > > I am interested in the scenario where an in-kernel query is applied to > a table based in columns *from other tables* that still are aligned > with the current table (same number of elements). These conditions may > be sophisticated and mix columns from the local table as well. > > One obvious solution would be to put all aligned columns on the same > table. But adding columns to a table is cumbersome, and I cannot think > beforehand of the many precomputed columns that I would like to use as > query conditions. > > What do you recommend in this scenario? > > -á. > > [1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where |
From: Alvaro T. C. <al...@mi...> - 2012-03-26 17:29:47
|
Hi there,

I am following advice by Anthony and giving a go at representing different sensors in my dataset as columns in a Table, or in several Tables. This is about in-kernel queries.

The documentation of condvars in Table.where [1] says "condvars should consist of identifier-like strings pointing to Column (see The Column class) instances of this table, or to other values (which will be converted to arrays)".

Conversion to arrays will likely exhaust the memory and be slow. Furthermore, when I tried with a toy example (naively extrapolating the behaviour of indexing in numpy), I obtained

In [109]: valuesext = [x['V01'] for x in tet1.where("""(b>18) &
(a<4)""", condvars={'a':tet1.cols.V01,'b':tet2.cols.V02})]

(... elided output)
ValueError: variable ``b`` refers to a column which is not part of
table ``/tetrode1

I am interested in the scenario where an in-kernel query is applied to a table based on columns *from other tables* that are still aligned with the current table (same number of elements). These conditions may be sophisticated and mix columns from the local table as well.

One obvious solution would be to put all aligned columns in the same table. But adding columns to a table is cumbersome, and I cannot think beforehand of the many precomputed columns that I would like to use as query conditions.

What do you recommend in this scenario?

-á.

[1] http://pytables.github.com/usersguide/libref.html?highlight=vlstring#tables.Table.where
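For contrast with the failing example above, the supported form of condvars maps every column variable to a column of the table being queried (or to an ordinary value converted to an array). A small sketch with illustrative column names:

# Supported condvars usage: every column variable belongs to the table
# being queried (tet1), so the condition runs in-kernel.  Column names
# are illustrative; pointing a variable at another table's column is
# what raises the ValueError shown above.
hits = [row['V01'] for row in
        tet1.where('(a < 4) & (b > 18)',
                   condvars={'a': tet1.cols.V01, 'b': tet1.cols.V02})]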
From: Francesc A. <fa...@gm...> - 2012-03-22 22:46:00
|
On 3/22/12 1:59 PM, Francesc Alted wrote:
> On 3/22/12 12:48 PM, sreeaurovindh viswanathan wrote:
>> But.. Can I sort one column descending and the other ascending?
>> Say if I have two columns, and first I would like to sort the one in
>> ascending order and then sort the second column based on the result of
>> the first.
>>
>> I mean, if I have
>>
>> 1 5
>> 2 6
>> 1 8
>> 2 9
>>
>> could I get an output such as
>>
>> 1 5
>> 1 8
>> 2 6
>> 2 9
> No, this is not supported by PyTables.
>
> But hey, you can always make use of the sorted iterator, and do the
> additional sorting yourself. In your example, let's suppose that
> column 0 is named 'f0' and column 1 is named 'f1'. Then, the next loop:
>
> prevval = None
> gf1 = []
> for r in t.itersorted('f0'):
>     if r['f0'] != prevval:
>         if gf1:
>             gf1.sort()
>             print prevval, gf1[::-1]  # reverse sorted
>         prevval = r['f0']
>         gf1 = []
>     gf1.append(r['f1'])
> if gf1:
>     gf1.sort()
>     print prevval, gf1[::-1]  # reverse sorted

Hmm, I just realized that there is another, equivalent code that solves the same problem:

import itertools

def field_selector(row):
    return row['f0']

for field, rows_grouped_by_field in itertools.groupby(t.itersorted('f0'), field_selector):
    group = [r['f1'] for r in rows_grouped_by_field]
    group.sort()
    print '%s -> %s' % (field, group[::-1])

The performance of both is similar, so use whatever you find more useful.

For the record, I'm attaching a couple of self-contained examples that exercise the different approaches.

-- Francesc Alted
From: Alvaro T. C. <al...@mi...> - 2012-03-22 19:10:05
|
It seems that refs were proposed in the past, even with an implementation. Maybe this could be a starting point: http://www.mail-archive.com/pyt...@li.../msg01374.html -á. On Thu, Mar 15, 2012 at 12:56, Alvaro Tejero Cantero <al...@mi...> wrote: > Does PyTables support object region references[1]? > > When using soft links to other files, is a performance penalty > incurred? I like the idea of having the raw data, that never changes, > referenced from another file that is read-only. How do you guys > normally deal with this scenario? > > Álvaro. > > [1] I learnt about them here: http://h5py.alfven.org/docs-2.0/topics/refs.html |
From: Francesc A. <fa...@gm...> - 2012-03-22 18:59:20
|
On 3/22/12 12:48 PM, sreeaurovindh viswanathan wrote:
> But.. Can I sort one column descending and the other ascending?
> Say if I have two columns, and first I would like to sort the one in
> ascending order and then sort the second column based on the result of
> the first.
>
> I mean, if I have
>
> 1 5
> 2 6
> 1 8
> 2 9
>
> could I get an output such as
>
> 1 5
> 1 8
> 2 6
> 2 9

No, this is not supported by PyTables.

But hey, you can always make use of the sorted iterator, and do the additional sorting yourself. In your example, let's suppose that column 0 is named 'f0' and column 1 is named 'f1'. Then, the next loop:

prevval = None
gf1 = []
for r in t.itersorted('f0'):
    if r['f0'] != prevval:
        if gf1:
            gf1.sort()
            print prevval, gf1[::-1]  # reverse sorted
        prevval = r['f0']
        gf1 = []
    gf1.append(r['f1'])
if gf1:
    gf1.sort()
    print prevval, gf1[::-1]  # reverse sorted

will print the next values:

f0-val0 [decreasing list of f1 values]
f0-val1 [decreasing list of f1 values]
...
f0-valN [decreasing list of f1 values]

Hope this helps,

-- Francesc Alted
From: sreeaurovindh v. <sre...@gm...> - 2012-03-22 17:48:58
|
But.. Can I sort one column descending and the other ascending? Say if I have two columns, and first I would like to sort the one in ascending order and then sort the second column based on the result of the first.

I mean, if I have

1 5
2 6
1 8
2 9

could I get an output such as

1 5
1 8
2 6
2 9

Sorry to restart the thread.

Thanks
Sree aurovindh V

On Thu, Mar 22, 2012 at 10:09 PM, Francesc Alted <fa...@gm...> wrote:
> On 3/22/12 11:02 AM, sreeaurovindh viswanathan wrote:
>> Hi,
>>
>> If I have three columns in a table and I wish to sort based on one
>> field and then on the other, what would be the recommended method? I
>> would be sorting at least 7,500,000 records at a time.
>>
>> i.e. I would like to use something equivalent to the following SQL query:
>>
>> Select * from sample.table order by query desc, keyword asc.
> Provided that you have created a CSI index on this field, there are a
> couple of ways for doing this:
>
> 1) Table.itersorted(): retrieves the table records in sorted order
>
> 2) Table.readSorted(): retrieves the complete sorted table as a
> monolithic structured array
>
> Both methods follow ascending order by default. Use step=-1 for
> descending order.
>
> -- Francesc Alted
From: Ümit S. <uem...@gm...> - 2012-03-22 16:45:33
|
I completely forgot about the CSI index. That's of course much easier than what I suggested ;-) Am 22.03.2012 17:39 schrieb "Francesc Alted" <fa...@gm...>: > On 3/22/12 11:02 AM, sreeaurovindh viswanathan wrote: > > Hi, > > > > If I have three columns in a table and if i wish to sort based on one > > field and then on the other what would be the recommended method.I > > would be sorting atleast 75,00,000 records at a time. > > > > ie I would like to use something equivalent the following sql query. > > > > Select * from sample.table order by query desc,keyword asc. > Provided that you have created a CSI index on this field, there are a > couple of ways for doing this: > > 1) Table.itersorted(): retrieves the table records in sorted order > > 2) Table.readSorted(): retrieves the complete sorted table as a > monolithic structured array > > Both methods follow ascending order by default. Choose a step=-1 for > choosing a descending order. > > -- Francesc Alted > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: sreeaurovindh v. <sre...@gm...> - 2012-03-22 16:45:24
|
Thanks Francesc Alted for your advice. Regards Sree aurovindh V On Thu, Mar 22, 2012 at 10:09 PM, Francesc Alted <fa...@gm...> wrote: > On 3/22/12 11:02 AM, sreeaurovindh viswanathan wrote: >> Hi, >> >> If I have three columns in a table and if i wish to sort based on one >> field and then on the other what would be the recommended method.I >> would be sorting atleast 75,00,000 records at a time. >> >> ie I would like to use something equivalent the following sql query. >> >> Select * from sample.table order by query desc,keyword asc. > Provided that you have created a CSI index on this field, there are a > couple of ways for doing this: > > 1) Table.itersorted(): retrieves the table records in sorted order > > 2) Table.readSorted(): retrieves the complete sorted table as a > monolithic structured array > > Both methods follow ascending order by default. Choose a step=-1 for > choosing a descending order. > > -- Francesc Alted > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Francesc A. <fa...@gm...> - 2012-03-22 16:39:29
|
On 3/22/12 11:02 AM, sreeaurovindh viswanathan wrote:
> Hi,
>
> If I have three columns in a table and I wish to sort based on one
> field and then on the other, what would be the recommended method? I
> would be sorting at least 7,500,000 records at a time.
>
> i.e. I would like to use something equivalent to the following SQL query:
>
> Select * from sample.table order by query desc, keyword asc.

Provided that you have created a CSI index on this field, there are a couple of ways for doing this:

1) Table.itersorted(): retrieves the table records in sorted order

2) Table.readSorted(): retrieves the complete sorted table as a monolithic structured array

Both methods follow ascending order by default. Use step=-1 for descending order.

-- Francesc Alted
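A minimal sketch of that recipe, assuming a PyTables version with index support (2.3 or later) and reusing the query/keyword column names from the SQL example; the file and table names are made up:

import tables

# CSI + sorted reads, as described above.  Requires PyTables >= 2.3;
# file, table and column names are illustrative.
f = tables.openFile('data.h5', 'a')
t = f.root.sample

t.cols.query.createCSIndex()                 # completely sorted index (CSI)

for row in t.itersorted('query', step=-1):   # descending on 'query'
    print row['query'], row['keyword']

arr = t.readSorted('query')   # whole table, ascending, as a structured array
f.close()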
From: Ümit S. <uem...@gm...> - 2012-03-22 16:15:07
|
AFAIK there is no sort functionality built into PyTables. I think there are 4 ways to do it:

1) Load all 7.5 million records and sort them in memory (if they fit into memory).

2) Implement your own external sorting algorithm (http://en.wikipedia.org/wiki/External_sorting) using the PyTables iterator or by slicing through your table in chunks.

3) Create a vector of indices pre-sorted by your criteria and store it in your HDF5 structure. Then use this vector to retrieve the values with the correct sorting.

4) If you know that you always want to access the data with this sorting, then you can also store the values with the appropriate sorting in the table.

cheers
Ümit

On Thu, Mar 22, 2012 at 5:02 PM, sreeaurovindh viswanathan <sre...@gm...> wrote:
> Hi,
>
> If I have three columns in a table and I wish to sort based on one
> field and then on the other, what would be the recommended method? I
> would be sorting at least 7,500,000 records at a time.
>
> i.e. I would like to use something equivalent to the following SQL query:
>
> Select * from sample.table order by query desc, keyword asc.
>
> How should I do it?
>
> Thanks
> Sree aurovindh
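Option 3 can be done with numpy.argsort and a small array stored next to the table. A sketch, assuming the sort-key column fits in memory (node and column names are illustrative):

import numpy as np
import tables

# Option 3 above: store a pre-sorted index vector in the file, then use
# it to read rows in sorted order.  Assumes the sort-key column fits in
# memory; node and column names are illustrative.
f = tables.openFile('data.h5', 'a')
t = f.root.sample

order = np.argsort(t.cols.query[:])        # row numbers, ascending by 'query'
f.createArray(f.root, 'query_order', order,
              'rows of /sample sorted by the query column')

# Later: read in that order (chunk this for very large tables).
sorted_rows = t.readCoordinates(f.root.query_order[:])
f.close()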
From: sreeaurovindh v. <sre...@gm...> - 2012-03-22 16:02:48
|
Hi,

If I have three columns in a table and I wish to sort based on one field and then on the other, what would be the recommended method? I would be sorting at least 7,500,000 records at a time.

i.e. I would like to use something equivalent to the following SQL query:

Select * from sample.table order by query desc, keyword asc.

How should I do it?

Thanks
Sree aurovindh
From: Anthony S. <sc...@gm...> - 2012-03-21 14:48:48
|
On the other hand Tom, If you know that you will be doing < N insertions in the future, you can always pre-allocate a Table / Array that is of size N and pre-loaded with null values. You can then 'insert' by over-writing the nth row. Furthermore you can always append size N chunks whenever. For most of my problems, this has worked just fine, especially if you are dealing with sparse data. Be Well Anthony On Wed, Mar 21, 2012 at 9:18 AM, Francesc Alted <fa...@gm...> wrote: > On Mar 21, 2012, at 7:08 AM, Tom Diethe wrote: > > >>> I'm writing a wrapper for sparse matrices (CSR format) and therefore > >>> need to store three vectors and 3 scalars: > >>> > >>> - data (float64 vector) > >>> - indices (int32 vector) > >>> - indptr (int32 vector) > >>> > >>> - nrows (int32 scalar) > >>> - ncols (int32 scalar) > >>> - nnz (int32 scalar) > >>> > >>> data and indices will always be the same length as each other (=nnz) > >>> but the indptr vector is much shorter. > >>> > >>> I've written routines that allow you to insert/update/delete rows or > >>> columns of the matrix by acting on these vectors only. However I'm > >>> struggling to work out the best pytables structure to store these, > >>> that allows me to append/insert/delete elements easily, and is > >>> efficient. > >>> > >>> I was using a Group with an EArray for each vector. This works ok but > >>> it seems like you are unable to delete items - is that correct? > >>> > >>> I also tried using a Group with a separate Table for each of the > >>> vectors (I could possibly just have two - one for data and indices and > >>> the other for indptr), but this seems to add a lot of overhead in > >>> manipulating the arrays. > >>> > >>> Is there something simple I'm missing? > > Inserting on PyTables objects is not supported. The reason is that they > are implemented on top of HDF5 datasets, that does not support this either. > HDF5 is meant for dealing large datasets, and implementing insertions (or > deletions) is not an efficient operation (requires a complete rewrite of > the dataset). So, if you are going to need a lot of insertions or > deletions, then PyTables / HDF5 is probably not what you want. > > HTH, > > -- Francesc Alted > > > > > > > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
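A minimal sketch of that pre-allocation idea, with an illustrative row description and N; the "insert" is an in-place overwrite via Table.modifyRows():

import tables

# Pre-allocate N placeholder rows, then "insert" by overwriting row n in
# place with Table.modifyRows().  The description and N are illustrative.
class Point(tables.IsDescription):
    x = tables.Float64Col()
    y = tables.Float64Col()

N = 1000
f = tables.openFile('prealloc.h5', 'w')
t = f.createTable(f.root, 'points', Point)
t.append([(0.0, 0.0)] * N)            # N null rows up front

n = 42
t.modifyRows(start=n, stop=n + 1, rows=[(1.5, -2.0)])   # overwrite row n
f.close()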
From: Francesc A. <fa...@gm...> - 2012-03-21 14:19:02
|
On Mar 21, 2012, at 7:08 AM, Tom Diethe wrote: >>> I'm writing a wrapper for sparse matrices (CSR format) and therefore >>> need to store three vectors and 3 scalars: >>> >>> - data (float64 vector) >>> - indices (int32 vector) >>> - indptr (int32 vector) >>> >>> - nrows (int32 scalar) >>> - ncols (int32 scalar) >>> - nnz (int32 scalar) >>> >>> data and indices will always be the same length as each other (=nnz) >>> but the indptr vector is much shorter. >>> >>> I've written routines that allow you to insert/update/delete rows or >>> columns of the matrix by acting on these vectors only. However I'm >>> struggling to work out the best pytables structure to store these, >>> that allows me to append/insert/delete elements easily, and is >>> efficient. >>> >>> I was using a Group with an EArray for each vector. This works ok but >>> it seems like you are unable to delete items - is that correct? >>> >>> I also tried using a Group with a separate Table for each of the >>> vectors (I could possibly just have two - one for data and indices and >>> the other for indptr), but this seems to add a lot of overhead in >>> manipulating the arrays. >>> >>> Is there something simple I'm missing? Inserting on PyTables objects is not supported. The reason is that they are implemented on top of HDF5 datasets, that does not support this either. HDF5 is meant for dealing large datasets, and implementing insertions (or deletions) is not an efficient operation (requires a complete rewrite of the dataset). So, if you are going to need a lot of insertions or deletions, then PyTables / HDF5 is probably not what you want. HTH, -- Francesc Alted |
From: Tom D. <td...@gm...> - 2012-03-21 12:08:57
|
>> I'm writing a wrapper for sparse matrices (CSR format) and therefore >> need to store three vectors and 3 scalars: >> >> - data (float64 vector) >> - indices (int32 vector) >> - indptr (int32 vector) >> >> - nrows (int32 scalar) >> - ncols (int32 scalar) >> - nnz (int32 scalar) >> >> data and indices will always be the same length as each other (=nnz) >> but the indptr vector is much shorter. >> >> I've written routines that allow you to insert/update/delete rows or >> columns of the matrix by acting on these vectors only. However I'm >> struggling to work out the best pytables structure to store these, >> that allows me to append/insert/delete elements easily, and is >> efficient. >> >> I was using a Group with an EArray for each vector. This works ok but >> it seems like you are unable to delete items - is that correct? >> >> I also tried using a Group with a separate Table for each of the >> vectors (I could possibly just have two - one for data and indices and >> the other for indptr), but this seems to add a lot of overhead in >> manipulating the arrays. >> >> Is there something simple I'm missing? >> > > Hi Tom, > > It seems to me that what you want to be using is a Table, which supports a > lot of the functionality you desire, including removing rows (which EArrays > don't easily support) > http://pytables.github.com/usersguide/libref.html?highlight=remove#tables.Table.removeRows > . > > Rather than having a Group of EArrays, have a table whose columns represent > what you were storing in the EArrays. You should take a look at the Tables > tutorials to see how to do this: > http://pytables.github.com/usersguide/tutorials.html > > Be Well > Anthony > > I think I've read those tutorials a dozen times over now .... As far as I can tell there is no way of inserting a row (as opposed to appending it to the end) - am I correct? I'm looking for the equivalent of numpy.insert Tom |
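As noted earlier in the thread, HDF5 has no in-place row insertion, so the closest thing to numpy.insert is rebuilding the dataset. A sketch for a table small enough to read into memory (node names are illustrative):

import numpy as np
import tables

# Emulating an insert by rewriting the table; only reasonable when the
# table fits in memory.  Node names are illustrative.
f = tables.openFile('sparse.h5', 'a')
old = f.root.indptr_table

data = old.read()                                    # all rows, structured array
newrow = np.array([(7,)], dtype=data.dtype)
data = np.concatenate((data[:3], newrow, data[3:]))  # "insert" before row 3

new = f.createTable(f.root, 'indptr_tmp', data.dtype)
new.append(data)
f.removeNode(old)
f.renameNode(new, 'indptr_table')
f.close()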
From: Anthony S. <sc...@gm...> - 2012-03-20 19:18:32
|
On Tue, Mar 20, 2012 at 11:46 AM, Tom Diethe <td...@gm...> wrote: > I'm writing a wrapper for sparse matrices (CSR format) and therefore > need to store three vectors and 3 scalars: > > - data (float64 vector) > - indices (int32 vector) > - indptr (int32 vector) > > - nrows (int32 scalar) > - ncols (int32 scalar) > - nnz (int32 scalar) > > data and indices will always be the same length as each other (=nnz) > but the indptr vector is much shorter. > > I've written routines that allow you to insert/update/delete rows or > columns of the matrix by acting on these vectors only. However I'm > struggling to work out the best pytables structure to store these, > that allows me to append/insert/delete elements easily, and is > efficient. > > I was using a Group with an EArray for each vector. This works ok but > it seems like you are unable to delete items - is that correct? > > I also tried using a Group with a separate Table for each of the > vectors (I could possibly just have two - one for data and indices and > the other for indptr), but this seems to add a lot of overhead in > manipulating the arrays. > > Is there something simple I'm missing? > Hi Tom, It seems to me that what you want to be using is a Table, which supports a lot of the functionality you desire, including removing rows (which EArrays don't easily support) http://pytables.github.com/usersguide/libref.html?highlight=remove#tables.Table.removeRows . Rather than having a Group of EArrays, have a table whose columns represent what you were storing in the EArrays. You should take a look at the Tables tutorials to see how to do this: http://pytables.github.com/usersguide/tutorials.html Be Well Anthony > > Tom > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
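A sketch of that layout for the CSR case: the (data, indices) pairs go into one Table (so rows can be removed with removeRows), indptr into an extendable array, and the scalars into attributes. The schema below is illustrative, not a fixed recommendation:

import numpy as np
import tables

# Table-based CSR storage: (data, indices) pairs in a Table so that rows
# can be removed, indptr in an EArray, scalars as group attributes.
# Node names and schema are illustrative.
class CSREntry(tables.IsDescription):
    data = tables.Float64Col()
    indices = tables.Int32Col()

f = tables.openFile('csr.h5', 'w')
g = f.createGroup(f.root, 'matrix')
entries = f.createTable(g, 'entries', CSREntry)
indptr = f.createEArray(g, 'indptr', tables.Int32Atom(), shape=(0,))

entries.append([(1.0, 0), (2.0, 2), (3.0, 1)])
indptr.append(np.array([0, 2, 3], dtype=np.int32))
g._v_attrs.nrows = 2
g._v_attrs.ncols = 3
g._v_attrs.nnz = 3

entries.removeRows(1, 2)   # drop one entry (indptr/nnz must be updated too)
f.close()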
From: Tom D. <td...@gm...> - 2012-03-20 16:46:14
|
I'm writing a wrapper for sparse matrices (CSR format) and therefore need to store three vectors and 3 scalars:

- data (float64 vector)
- indices (int32 vector)
- indptr (int32 vector)

- nrows (int32 scalar)
- ncols (int32 scalar)
- nnz (int32 scalar)

data and indices will always be the same length as each other (=nnz) but the indptr vector is much shorter.

I've written routines that allow you to insert/update/delete rows or columns of the matrix by acting on these vectors only. However I'm struggling to work out the best pytables structure to store these, that allows me to append/insert/delete elements easily, and is efficient.

I was using a Group with an EArray for each vector. This works ok but it seems like you are unable to delete items - is that correct?

I also tried using a Group with a separate Table for each of the vectors (I could possibly just have two - one for data and indices and the other for indptr), but this seems to add a lot of overhead in manipulating the arrays.

Is there something simple I'm missing?

Tom
From: sreeaurovindh v. <sre...@gm...> - 2012-03-19 18:40:08
|
Thanks for your clarification and immense help Regards Sree aurovindh V On Mon, Mar 19, 2012 at 11:58 PM, Anthony Scopatz <sc...@gm...> wrote: > On Mon, Mar 19, 2012 at 12:08 PM, sreeaurovindh viswanathan < > sre...@gm...> wrote: > > [snip] > > >> 2) Can you please point out to an example where i can do block hdf5 file >> write using pytables (sorry for this naive question) >> > > The Table.append() method ( > http://pytables.github.com/usersguide/libref.html#tables.Table.append) > allows you to write multiple rows at the same time. Arrays have similar > methods (http://pytables.github.com/usersguide/libref.html#earray-methods) > if they are extensible. Please note that these methods accept any > sequence which can be converted to a numpy record array! > > Be Well > Anthony > > >> >> Thanks >> Sree aurovindh V >> >> On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <sc...@gm...>wrote: >> >>> What Francesc said ;) >>> >>> >>> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fa...@gm...>wrote: >>> >>>> My advice regarding parallelization is: do not worry about this *at >>>> all* unless you already spent long time profiling your problem and you are >>>> sure that parallelizing could be of help. 99% of the time is much more >>>> productive focusing on improving serial speed. >>>> >>>> Please, try to follow Anthony's suggestion and split your queries in >>>> blocks, and pass these blocks to PyTables. That would represent a huge >>>> win. For example, use: >>>> >>>> SELECT * FROM `your_table` LIMIT 0, 10000 >>>> >>>> for the first block, and send the results to `Table.append`. Then go >>>> for the second block as: >>>> >>>> SELECT * FROM `your_table` LIMIT 10000, 20000 >>>> >>>> and pass this to `Table.append`. And so on and so forth until you >>>> exhaust all the data in your tables. >>>> >>>> Hope this helps, >>>> >>>> Francesc >>>> >>>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote: >>>> >>>> > Hi, >>>> > >>>> > Thanks for your reply.In that case how will be my querying >>>> efficiency? will i be able to query parrellely?(i.e) will i be able to run >>>> multiple queries on a single file.Also if i do it in 6 chunks will i be >>>> able to parrelize it? >>>> > >>>> > >>>> > Thanks >>>> > Sree aurovindh Viswanathan >>>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <sc...@gm...> >>>> wrote: >>>> > Is there any way that you can query and write in much larger chunks >>>> that 6? I don't know much about postgresql in specific, but in general >>>> HDF5 does much better if you can take larger chunks. Perhaps you could at >>>> least do the postgresql in parallel. >>>> > >>>> > Be Well >>>> > Anthony >>>> > >>>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan < >>>> sre...@gm...> wrote: >>>> > The problem is with respect to the writing speed of my computer and >>>> the postgresql query performance.I will explain the scenario in detail. >>>> > >>>> > I have data about 80 Gb (along with approprite database indexes in >>>> place). I am trying to read it from Postgresql database and writing it into >>>> HDF5 using Pytables.I have 1 table and 5 variable arrays in one hdf5 >>>> file.The implementation of Hdf5 is not multithreaded or enabled for >>>> symmetric multi processing. 
>>>> > >>>> > As for as the postgresql table is concerned the overall record size >>>> is 140 million and I have 5 primary- foreign key referring tables.I am not >>>> using joins as it is not scalable >>>> > >>>> > So for a single lookup i do 6 lookup without joins and write them >>>> into hdf5 format. For each lookup i do 6 inserts into each of the table and >>>> its corresponding arrays. >>>> > >>>> > The queries are really simple >>>> > >>>> > >>>> > select * from x.train where tr_id=1 (primary key & indexed) >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > select q_t from x.qt where q_id=2 (non-primary key but indexed) >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > (similarly four queries) >>>> > >>>> > Each computer writes two hdf5 files and hence the total count comes >>>> around 20 files. >>>> > >>>> > Some Calculations and statistics: >>>> > >>>> > >>>> > Total number of records : 14,37,00,000 >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > Total number >>>> > of records per file : 143700000/20 =71,85,000 >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > The total number >>>> > of records in each file : 71,85,000 * 5 = 3,59,25,000 >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > Current Postgresql database config : >>>> > >>>> > My current Machine : 8GB RAM with i7 2nd generation Processor. >>>> > >>>> > I made changes to the following to postgresql configuration file : >>>> shared_buffers : 2 GB effective_cache_size : 4 GB >>>> > >>>> > Note on current performance: >>>> > >>>> > I have run it for about ten hours and the performance is as follows: >>>> The total number of records written for a single file is about 25,00,000 * >>>> 5 =1,25,00,000 only. It has written 2 such files .considering the size it >>>> would take me atleast 20 hrs for 2 files.I have about 10 files and hence >>>> the total hours would be 200 hrs= 9 days. I have to start my experiments as >>>> early as possible and 10 days is too much. Can you please help me to >>>> enhance the performance. >>>> > >>>> > Questions: 1. Should i use Symmetric multi processing on my >>>> computer.In that case what is suggested or prefereable? 2. Should i use >>>> multi threading.. In that case any links or pointers would be of great help. >>>> > >>>> > >>>> > >>>> > Thanks >>>> > >>>> > Sree aurovindh V >>>> > >>>> > >>>> > >>>> ------------------------------------------------------------------------------ >>>> > This SF email is sponsosred by: >>>> > Try Windows Azure free for 90 days Click Here >>>> > http://p.sf.net/sfu/sfd2d-msazure >>>> > _______________________________________________ >>>> > Pytables-users mailing list >>>> > Pyt...@li... >>>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> > >>>> > >>>> > >>>> > >>>> ------------------------------------------------------------------------------ >>>> > This SF email is sponsosred by: >>>> > Try Windows Azure free for 90 days Click Here >>>> > http://p.sf.net/sfu/sfd2d-msazure >>>> > _______________________________________________ >>>> > Pytables-users mailing list >>>> > Pyt...@li... >>>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> > >>>> > >>>> > >>>> ------------------------------------------------------------------------------ >>>> > This SF email is sponsosred by: >>>> > Try Windows Azure free for 90 days Click Here >>>> > >>>> http://p.sf.net/sfu/sfd2d-msazure_______________________________________________ >>>> > Pytables-users mailing list >>>> > Pyt...@li... 
>>>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> >>>> -- Francesc Alted >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> This SF email is sponsosred by: >>>> Try Windows Azure free for 90 days Click Here >>>> http://p.sf.net/sfu/sfd2d-msazure >>>> _______________________________________________ >>>> Pytables-users mailing list >>>> Pyt...@li... >>>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> This SF email is sponsosred by: >>> Try Windows Azure free for 90 days Click Here >>> http://p.sf.net/sfu/sfd2d-msazure >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2012-03-19 18:29:19
|
On Mon, Mar 19, 2012 at 12:08 PM, sreeaurovindh viswanathan < sre...@gm...> wrote: [snip] > 2) Can you please point out to an example where i can do block hdf5 file > write using pytables (sorry for this naive question) > The Table.append() method ( http://pytables.github.com/usersguide/libref.html#tables.Table.append) allows you to write multiple rows at the same time. Arrays have similar methods (http://pytables.github.com/usersguide/libref.html#earray-methods) if they are extensible. Please note that these methods accept any sequence which can be converted to a numpy record array! Be Well Anthony > > Thanks > Sree aurovindh V > > On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <sc...@gm...>wrote: > >> What Francesc said ;) >> >> >> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fa...@gm...>wrote: >> >>> My advice regarding parallelization is: do not worry about this *at all* >>> unless you already spent long time profiling your problem and you are sure >>> that parallelizing could be of help. 99% of the time is much more >>> productive focusing on improving serial speed. >>> >>> Please, try to follow Anthony's suggestion and split your queries in >>> blocks, and pass these blocks to PyTables. That would represent a huge >>> win. For example, use: >>> >>> SELECT * FROM `your_table` LIMIT 0, 10000 >>> >>> for the first block, and send the results to `Table.append`. Then go >>> for the second block as: >>> >>> SELECT * FROM `your_table` LIMIT 10000, 20000 >>> >>> and pass this to `Table.append`. And so on and so forth until you >>> exhaust all the data in your tables. >>> >>> Hope this helps, >>> >>> Francesc >>> >>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote: >>> >>> > Hi, >>> > >>> > Thanks for your reply.In that case how will be my querying efficiency? >>> will i be able to query parrellely?(i.e) will i be able to run multiple >>> queries on a single file.Also if i do it in 6 chunks will i be able to >>> parrelize it? >>> > >>> > >>> > Thanks >>> > Sree aurovindh Viswanathan >>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <sc...@gm...> >>> wrote: >>> > Is there any way that you can query and write in much larger chunks >>> that 6? I don't know much about postgresql in specific, but in general >>> HDF5 does much better if you can take larger chunks. Perhaps you could at >>> least do the postgresql in parallel. >>> > >>> > Be Well >>> > Anthony >>> > >>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan < >>> sre...@gm...> wrote: >>> > The problem is with respect to the writing speed of my computer and >>> the postgresql query performance.I will explain the scenario in detail. >>> > >>> > I have data about 80 Gb (along with approprite database indexes in >>> place). I am trying to read it from Postgresql database and writing it into >>> HDF5 using Pytables.I have 1 table and 5 variable arrays in one hdf5 >>> file.The implementation of Hdf5 is not multithreaded or enabled for >>> symmetric multi processing. >>> > >>> > As for as the postgresql table is concerned the overall record size is >>> 140 million and I have 5 primary- foreign key referring tables.I am not >>> using joins as it is not scalable >>> > >>> > So for a single lookup i do 6 lookup without joins and write them into >>> hdf5 format. For each lookup i do 6 inserts into each of the table and its >>> corresponding arrays. 
>>> > >>> > The queries are really simple >>> > >>> > >>> > select * from x.train where tr_id=1 (primary key & indexed) >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > select q_t from x.qt where q_id=2 (non-primary key but indexed) >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > (similarly four queries) >>> > >>> > Each computer writes two hdf5 files and hence the total count comes >>> around 20 files. >>> > >>> > Some Calculations and statistics: >>> > >>> > >>> > Total number of records : 14,37,00,000 >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > Total number >>> > of records per file : 143700000/20 =71,85,000 >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > The total number >>> > of records in each file : 71,85,000 * 5 = 3,59,25,000 >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > Current Postgresql database config : >>> > >>> > My current Machine : 8GB RAM with i7 2nd generation Processor. >>> > >>> > I made changes to the following to postgresql configuration file : >>> shared_buffers : 2 GB effective_cache_size : 4 GB >>> > >>> > Note on current performance: >>> > >>> > I have run it for about ten hours and the performance is as follows: >>> The total number of records written for a single file is about 25,00,000 * >>> 5 =1,25,00,000 only. It has written 2 such files .considering the size it >>> would take me atleast 20 hrs for 2 files.I have about 10 files and hence >>> the total hours would be 200 hrs= 9 days. I have to start my experiments as >>> early as possible and 10 days is too much. Can you please help me to >>> enhance the performance. >>> > >>> > Questions: 1. Should i use Symmetric multi processing on my >>> computer.In that case what is suggested or prefereable? 2. Should i use >>> multi threading.. In that case any links or pointers would be of great help. >>> > >>> > >>> > >>> > Thanks >>> > >>> > Sree aurovindh V >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > This SF email is sponsosred by: >>> > Try Windows Azure free for 90 days Click Here >>> > http://p.sf.net/sfu/sfd2d-msazure >>> > _______________________________________________ >>> > Pytables-users mailing list >>> > Pyt...@li... >>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>> > >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > This SF email is sponsosred by: >>> > Try Windows Azure free for 90 days Click Here >>> > http://p.sf.net/sfu/sfd2d-msazure >>> > _______________________________________________ >>> > Pytables-users mailing list >>> > Pyt...@li... >>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>> > >>> > >>> > >>> ------------------------------------------------------------------------------ >>> > This SF email is sponsosred by: >>> > Try Windows Azure free for 90 days Click Here >>> > >>> http://p.sf.net/sfu/sfd2d-msazure_______________________________________________ >>> > Pytables-users mailing list >>> > Pyt...@li... >>> > https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> -- Francesc Alted >>> >>> >>> >>> >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> This SF email is sponsosred by: >>> Try Windows Azure free for 90 days Click Here >>> http://p.sf.net/sfu/sfd2d-msazure >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... 
>>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >> >> >> >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
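Putting the advice from this thread together (block SELECTs on the PostgreSQL side, Table.append() on the PyTables side), a transfer loop could look roughly like the sketch below. psycopg2, the SELECTed columns and the row description are assumptions for illustration; only the tr_id/q_id names come from the queries quoted above:

import psycopg2
import tables

# Block-wise PostgreSQL -> PyTables transfer: fetch batches of rows with
# the DB cursor and hand each batch to Table.append().  The driver, the
# query and the row description are illustrative assumptions.
class TrainRow(tables.IsDescription):
    tr_id = tables.Int32Col()
    q_id = tables.Int32Col()
    score = tables.Float64Col()

conn = psycopg2.connect('dbname=x')
cur = conn.cursor()
cur.execute('SELECT tr_id, q_id, score FROM x.train ORDER BY tr_id')

h5 = tables.openFile('train.h5', 'w')
out = h5.createTable(h5.root, 'train', TrainRow,
                     expectedrows=7185000)   # per-file row count from the thread

while True:
    batch = cur.fetchmany(10000)             # one block of rows
    if not batch:
        break
    out.append(batch)                        # one multi-row HDF5 write
out.flush()
h5.close()
conn.close()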
From: sreeaurovindh v. <sre...@gm...> - 2012-03-19 17:45:18
|
Thanks Jarrod Roberson for your suggestions.I could understand the problem .Will incorporate them in my solution.. Thanks On Mon, Mar 19, 2012 at 10:45 PM, Jarrod Roberson <ja...@ve...> wrote: > On Mon, Mar 19, 2012 at 1:08 PM, sreeaurovindh viswanathan > <sre...@gm...> wrote: >> >> Sorry to misphrase my question.But by querying speed i meant the speed of >> "pytable querying and not the postgresql querying.To rephrase, >> 1) Will i be able to query(using kernel queries) a single HDF5 file using >> pytables parallely with five different programs? How will the efficiency in >> that case.. >> Secondly as per the suggestions, >> >> I will break it into 6 chunks as per your advise and try to incorporate in >> the code.Also i will try to break my query into chunks and write it into >> hdf5 tables as chunks as advised by frensec. But.. > > What you are describing is I/O bound; this means that you are only > going to get as much throughput as your disk sub-system can handle. > Writing in larger batches exploits the caching and block write nature > of fixed disk mechanisms. > Reading in batches does the same thing to exploit builtin caching and > block reads. > > Profile your disks, if you are getting max throughput, buy faster hardware. > > If you are I/O bound multiple threads of execution will almost > guarantee a reduction in throughput and reduction in overall > performance of your application. > This is the laws of physics at work, there is no multi-threaded royal > road to better I/O performance with fixed disks. > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Jarrod R. <ja...@ve...> - 2012-03-19 17:39:32
|
On Mon, Mar 19, 2012 at 1:08 PM, sreeaurovindh viswanathan <sre...@gm...> wrote: > > Sorry to misphrase my question.But by querying speed i meant the speed of > "pytable querying and not the postgresql querying.To rephrase, > 1) Will i be able to query(using kernel queries) a single HDF5 file using > pytables parallely with five different programs? How will the efficiency in > that case.. > Secondly as per the suggestions, > > I will break it into 6 chunks as per your advise and try to incorporate in > the code.Also i will try to break my query into chunks and write it into > hdf5 tables as chunks as advised by frensec. But.. What you are describing is I/O bound; this means that you are only going to get as much throughput as your disk sub-system can handle. Writing in larger batches exploits the caching and block write nature of fixed disk mechanisms. Reading in batches does the same thing to exploit builtin caching and block reads. Profile your disks, if you are getting max throughput, buy faster hardware. If you are I/O bound multiple threads of execution will almost guarantee a reduction in throughput and reduction in overall performance of your application. This is the laws of physics at work, there is no multi-threaded royal road to better I/O performance with fixed disks. |
From: sreeaurovindh v. <sre...@gm...> - 2012-03-19 17:09:20
|
hi, Thanks for your suggestions. Sorry to misphrase my question.But by querying speed i meant the speed of "pytable querying and not the postgresql querying.To rephrase, 1) Will i be able to query(using kernel queries) a single HDF5 file using pytables parallely with five different programs? How will the efficiency in that case.. Secondly as per the suggestions, I will break it into 6 chunks as per your advise and try to incorporate in the code.Also i will try to break my query into chunks and write it into hdf5 tables as chunks as advised by frensec. But.. 2) Can you please point out to an example where i can do block hdf5 file write using pytables (sorry for this naive question) Thanks Sree aurovindh V On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <sc...@gm...> wrote: > What Francesc said ;) > > > On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fa...@gm...> wrote: > >> My advice regarding parallelization is: do not worry about this *at all* >> unless you already spent long time profiling your problem and you are sure >> that parallelizing could be of help. 99% of the time is much more >> productive focusing on improving serial speed. >> >> Please, try to follow Anthony's suggestion and split your queries in >> blocks, and pass these blocks to PyTables. That would represent a huge >> win. For example, use: >> >> SELECT * FROM `your_table` LIMIT 0, 10000 >> >> for the first block, and send the results to `Table.append`. Then go for >> the second block as: >> >> SELECT * FROM `your_table` LIMIT 10000, 20000 >> >> and pass this to `Table.append`. And so on and so forth until you >> exhaust all the data in your tables. >> >> Hope this helps, >> >> Francesc >> >> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote: >> >> > Hi, >> > >> > Thanks for your reply.In that case how will be my querying efficiency? >> will i be able to query parrellely?(i.e) will i be able to run multiple >> queries on a single file.Also if i do it in 6 chunks will i be able to >> parrelize it? >> > >> > >> > Thanks >> > Sree aurovindh Viswanathan >> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <sc...@gm...> >> wrote: >> > Is there any way that you can query and write in much larger chunks >> that 6? I don't know much about postgresql in specific, but in general >> HDF5 does much better if you can take larger chunks. Perhaps you could at >> least do the postgresql in parallel. >> > >> > Be Well >> > Anthony >> > >> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan < >> sre...@gm...> wrote: >> > The problem is with respect to the writing speed of my computer and the >> postgresql query performance.I will explain the scenario in detail. >> > >> > I have data about 80 Gb (along with approprite database indexes in >> place). I am trying to read it from Postgresql database and writing it into >> HDF5 using Pytables.I have 1 table and 5 variable arrays in one hdf5 >> file.The implementation of Hdf5 is not multithreaded or enabled for >> symmetric multi processing. >> > >> > As for as the postgresql table is concerned the overall record size is >> 140 million and I have 5 primary- foreign key referring tables.I am not >> using joins as it is not scalable >> > >> > So for a single lookup i do 6 lookup without joins and write them into >> hdf5 format. For each lookup i do 6 inserts into each of the table and its >> corresponding arrays. 
>> > >> > The queries are really simple >> > >> > >> > select * from x.train where tr_id=1 (primary key & indexed) >> > >> > >> > >> > >> > >> > >> > >> > >> > select q_t from x.qt where q_id=2 (non-primary key but indexed) >> > >> > >> > >> > >> > >> > >> > >> > >> > (similarly four queries) >> > >> > Each computer writes two hdf5 files and hence the total count comes >> around 20 files. >> > >> > Some Calculations and statistics: >> > >> > >> > Total number of records : 14,37,00,000 >> > >> > >> > >> > >> > >> > >> > >> > Total number >> > of records per file : 143700000/20 =71,85,000 >> > >> > >> > >> > >> > >> > >> > >> > The total number >> > of records in each file : 71,85,000 * 5 = 3,59,25,000 >> > >> > >> > >> > >> > >> > >> > >> > >> > Current Postgresql database config : >> > >> > My current Machine : 8GB RAM with i7 2nd generation Processor. >> > >> > I made changes to the following to postgresql configuration file : >> shared_buffers : 2 GB effective_cache_size : 4 GB >> > >> > Note on current performance: >> > >> > I have run it for about ten hours and the performance is as follows: >> The total number of records written for a single file is about 25,00,000 * >> 5 =1,25,00,000 only. It has written 2 such files .considering the size it >> would take me atleast 20 hrs for 2 files.I have about 10 files and hence >> the total hours would be 200 hrs= 9 days. I have to start my experiments as >> early as possible and 10 days is too much. Can you please help me to >> enhance the performance. >> > >> > Questions: 1. Should i use Symmetric multi processing on my >> computer.In that case what is suggested or prefereable? 2. Should i use >> multi threading.. In that case any links or pointers would be of great help. >> > >> > >> > >> > Thanks >> > >> > Sree aurovindh V >> > >> > >> > >> ------------------------------------------------------------------------------ >> > This SF email is sponsosred by: >> > Try Windows Azure free for 90 days Click Here >> > http://p.sf.net/sfu/sfd2d-msazure >> > _______________________________________________ >> > Pytables-users mailing list >> > Pyt...@li... >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> > >> > >> > >> ------------------------------------------------------------------------------ >> > This SF email is sponsosred by: >> > Try Windows Azure free for 90 days Click Here >> > http://p.sf.net/sfu/sfd2d-msazure >> > _______________________________________________ >> > Pytables-users mailing list >> > Pyt...@li... >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> > >> > >> ------------------------------------------------------------------------------ >> > This SF email is sponsosred by: >> > Try Windows Azure free for 90 days Click Here >> > >> http://p.sf.net/sfu/sfd2d-msazure_______________________________________________ >> > Pytables-users mailing list >> > Pyt...@li... >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> -- Francesc Alted >> >> >> >> >> >> >> >> >> ------------------------------------------------------------------------------ >> This SF email is sponsosred by: >> Try Windows Azure free for 90 days Click Here >> http://p.sf.net/sfu/sfd2d-msazure >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... 
>> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > > > > ------------------------------------------------------------------------------ > This SF email is sponsosred by: > Try Windows Azure free for 90 days Click Here > http://p.sf.net/sfu/sfd2d-msazure > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |