|
From: Ernesto <e.p...@un...> - 2010-04-28 14:17:46
|
Hi all,
I have a table containing a lot of data (millions of rows). The structure is like the following example:
(22777420, 'G', 18, '-')
(22777421, 'G', 36, '-')
(22777422, 'C', 29, '-')
(22777423, 'C', 17, '-')
(22777424, 'A', 31, '-')
(22777425, 'A', 42, '-')
(22777426, 'C', 49, '-')
(22777305, 'T', 0, '-')
(22777306, 'C', 18, '-')
(22777307, 'C', 29, '-')
(22777308, 'T', 26, '-')
(22777309, 'T', 10, '-')
(22777310, 'G', 15, '-')
(22777311, 'G', 33, '-')
The first column contains an integer. Now I'd like to sort my table according to the numbers in the first column. Is there a way to perform this action?
A second question concerns iteration over a huge amount of data. For example, given the above table, I would like to work on a subset of rows using an iterator in order to avoid memory errors. Is there a simple procedure for this as well?
Thank you very much in advance for any suggestions,
Ernesto
|
|
From: Ernesto <e.p...@un...> - 2010-04-28 18:36:35
|
Hi Francesc, thank you for your reply.
> The first column contains an integer. Now I'd like to sort my table
> according to numbers of the first column. Is there a way to perform this
> action?
> Yes. The simplest way is by setting the `sortby` parameter to true in the
> `Table.copy()` method. This triggers an on-disk sorting operation, so you
> don't have to be afraid of your available memory. You will need the Pro
> version for getting this capability.
Does this mean that I can sort a table only with the Pro version? Are no other solutions available?
Working with my simple table, I found a strange behaviour that is probably due to my limited experience with PyTables.
The description of the table is:
/newgroup/table (Table(729036,), shuffle, zlib(1)) 'A table'
description := {
"position": Int32Col(shape=(), dflt=0, pos=0),
"read": StringCol(itemsize=1, shape=(), dflt='', pos=1),
"qual": Int32Col(shape=(), dflt=0, pos=2),
"strand": StringCol(itemsize=1, shape=(), dflt='', pos=3)}
byteorder := 'little'
chunkshape := (819,)
Next I iterate over a subset using the command:
for row in table.where('(qual > 1) & (qual < 10)'):
    print row
It works correctly and I get expected results.
Then I use the following command:
results=[row for row in table.where('(qual > 1) & (qual < 10)')]
It works, but I get a list containing the same value repeated.
I report the first five rows obtained using both procedures:
- procedure 1:
(167809, 'C', 8, '-')
(167810, 'G', 9, '-')
(167812, 'C', 5, '-')
(167823, 'T', 9, '-')
(1015856, 'G', 5, '-')
- procedure 2:
[(1041269, 'G', 39, '-'), (1041269, 'G', 39, '-'), (1041269, 'G', 39, '-'), (1041269, 'G', 39, '-'), (1041269, 'G', 39, '-')]
Where is the error?
Thanks a lot,
Ernesto
|
|
From: Francesc A. <fa...@py...> - 2010-04-28 19:12:04
|
On Wednesday 28 April 2010 20:22:52 Ernesto wrote:
> Hi Francesc, thank you for your reply.
>
> > The first column contains an integer. Now I'd like to sort my table
> > according to numbers of the first column. Is there a way to perform this
> > action?
> >
> > Yes. The simplest way is by setting the `sortby` parameter to true in
> > the `Table.copy()` method. This triggers an on-disk sorting operation,
> > so you don't have to be afraid of your available memory. You will need
> > the Pro version for getting this capability.
>
> It means that I can sort a table using the Pro version only. No other
> solutions are available?
I haven't said that no other solutions are available, only that the Pro route
is the simplest (and certainly a powerful one ;-). If you want to go Pro, you
may want to use plain NumPy for doing this.
> Working with my simple table I found a strange
> behaviour that probably is due to my limited experience with pytables. The
> description of the table is:
> /newgroup/table (Table(729036,), shuffle, zlib(1)) 'A table'
> description := {
> "position": Int32Col(shape=(), dflt=0, pos=0),
> "read": StringCol(itemsize=1, shape=(), dflt='', pos=1),
> "qual": Int32Col(shape=(), dflt=0, pos=2),
> "strand": StringCol(itemsize=1, shape=(), dflt='', pos=3)}
> byteorder := 'little'
> chunkshape := (819,)
> Next I iterate over a subset using the command:
> for row in table.where('(qual > 1) & (qual < 10)'):
>     print row
> It works correctly and I get expected results.
> Then I use the following command:
> results=[row for row in table.where('(qual > 1) & (qual < 10)')]
> it works but I get a list containing the same value.
> I report the first five rows obtained using both procedures:
> - procedure 1:
> (167809, 'C', 8, '-')
> (167810, 'G', 9, '-')
> (167812, 'C', 5, '-')
> (167823, 'T', 9, '-')
> (1015856, 'G', 5, '-')
> - procedure 2:
> [(1041269, 'G', 39, '-'), (1041269, 'G', 39, '-'), (1041269, 'G', 39, '-'),
> (1041269, 'G', 39, '-'), (1041269, 'G', 39, '-')]
>
> Where is the error?
It is not an error, but it is a common pitfall for beginners. See:
http://sourceforge.net/mailarchive/forum.php?thread_name=200905141759.16731.faltet%40pytables.org&forum_name=pytables-users
Anyway, I've decided to avoid this behaviour in the 2.2 series. For details, see:
http://pytables.org/trac/ticket/252
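The pitfall can be reproduced without PyTables at all. The sketch below is only an analogy: `ReusedRow` and `iter_rows` are hypothetical stand-ins, written on the assumption that the iterator hands back one and the same row object on every step and merely refills it. It shows why storing `row` gives a list of identical entries, while copying the values out with `row[:]` keeps each record.

```python
class ReusedRow:
    """Hypothetical stand-in for a row object that is reused between iterations."""

    def __init__(self):
        self._values = None

    def load(self, values):
        # Overwrite the contents in place, as a buffered row object would.
        self._values = values

    def __getitem__(self, key):
        return self._values[key]

    def __repr__(self):
        return repr(self._values)


def iter_rows(records):
    """Yield the *same* ReusedRow object for every record, refilled each time."""
    row = ReusedRow()
    for record in records:
        row.load(record)
        yield row


records = [(167809, "C", 8, "-"), (167810, "G", 9, "-"), (167812, "C", 5, "-")]

# Pitfall: every list element is a reference to the one shared row object,
# so all of them show whatever the row held last.
aliased = [row for row in iter_rows(records)]

# Fix: copy the values out of the row while it still holds them.
copied = [row[:] for row in iter_rows(records)]

print(aliased)
print(copied)
```

Printing inside the loop works because the row is read while it still holds the current record; the list only goes wrong because it is inspected after the iteration has moved on.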
Ciao,
--
Francesc Alted
|
|
From: Francesc A. <fa...@py...> - 2010-04-28 19:14:53
|
On Wednesday 28 April 2010 21:11:56 Francesc Alted wrote:
> If you want to go Pro, you may want to use plain NumPy for doing this.
I wanted to say "don't want to go Pro", of course :-)
--
Francesc Alted
|
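For tables that fit in memory, the plain-NumPy route could look like the following minimal sketch. It is not taken from the thread: the field names mirror the table description quoted earlier, the sample rows are copied from the first message, and it assumes the data has already been loaded into a NumPy structured array (for a PyTables table, `table.read()` returns one).

```python
import numpy as np

# Structured dtype mirroring the table description quoted in the thread.
dtype = np.dtype(
    [("position", "i4"), ("read", "S1"), ("qual", "i4"), ("strand", "S1")]
)

data = np.array(
    [
        (22777420, b"G", 18, b"-"),
        (22777421, b"G", 36, b"-"),
        (22777305, b"T", 0, b"-"),
        (22777306, b"C", 18, b"-"),
    ],
    dtype=dtype,
)

# argsort on the key column gives the row order; fancy indexing applies it
# to whole records, so the other columns follow their positions.
order = np.argsort(data["position"], kind="stable")
sorted_data = data[order]

print(sorted_data["position"].tolist())
```

Note that `data[order]` builds a second full copy of the array, so this approach needs roughly twice the table size in memory.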
|
From: Ernesto <e.p...@un...> - 2010-04-28 19:36:14
|
Thanks a lot,
Ernesto
> This is a common bug: You are storing the iterator rather than what it points to. Try doing
>
> results=[row[:] for row in table.where('(qual > 1) & (qual < 10)')]
> This will copy out all of the columns of the row.
>
> (At least, I believe that's the correct syntax). I'm sure you'll get a correction soon enough.
>
> --
> Nicholas Dunn
> (207) 651-9839
|
|
From: Ernesto <e.p...@un...> - 2010-04-29 07:05:22
|
On Wednesday 28 April 2010 21:11:56 Francesc Alted wrote:
>> If you want to go Pro, you may want to use plain NumPy for doing this.
> I wanted to say "don't want to go Pro", of course :-)
OK. Could you please provide me with a simple example of how to use NumPy for table sorting?
If I decide to use the Pro version of PyTables, is there an academic license available?
Thank you again,
Ernesto
|
|
From: Ernesto <e.p...@un...> - 2010-04-29 07:20:22
|
I'm sorry, I have one last question that may be closely related to what I'm doing. In practice I need to extract a subset of rows from a table and then sort them according to specific criteria. Of course, the best approach would be to pre-sort the table before the extraction in order to get a sorted list. Alternatively, I could extract my rows and sort them in a later step. However, I don't know a priori whether my extracted list will fit in the available memory. So my question (or rather my curiosity) concerns the possibility of extracting a very big subset of rows even when it does not fit in the available memory.
Thanks a lot, and sorry if I'm taking up your time with my basic (or stupid) questions,
Ernesto
|
|
From: Francesc A. <fa...@py...> - 2010-04-29 09:21:29
|
On Thursday 29 April 2010 09:20:16 Ernesto wrote:
> I'm sorry, I have a last question that it may be close related to what I'm
> doing. In practice I need to extract a subset of rows from a tables and
> then sort them according to specific criteria. Of course the best thing
> should be a pre-sorting of the table before the extraction in order to get
> a sorted list. Anyway I could extract my rows and sort them in a next
> step. However, I don't know a priori if my extracted list can fit the
> available memory. So the question or better my curiosity concerns the
> possibility to extract a very big subset of rows also when I don't have
> all available memory.
Ah, that clarifies a lot what you want to do. If you don't know a priori whether your extracted list can fit in memory, then the only solution that I can think of is for you to use the `Table.where()` iterator and extract the info you are interested in from each element that the iterator returns. You can see examples of how to use the `Table.where()` iterator in chapter 3 of the User's Manual and, for an express guide, also here:
http://pytables.org/moin/HowToUse#Selectingvalues
[Please note that you can even make use of generator expressions (`()`) instead of list comprehensions (`[]`) so as to avoid loading the results in memory.]
You also have a video explaining the basics of querying tables:
http://showmedo.com/videotutorials/video?name=1780010&fromSeriesID=178
But still, you need a way to sort your big tables first, and Pro is the only solution for doing this that *I* know of for Python. There are also memmapped NumPy structured arrays, but they have the limitation that the table size cannot exceed your available *virtual* memory. PyTables Pro does not have this limitation (the only practical limit is your available disk space).
> Thank a lot and sorry if I'm taking your time with my basic (or stupid)
> questions,
Not at all. It is always a bit difficult to figure out all the parameters that people have when trying to solve their problems, but here you have my opinion. You may want to ask the NumPy list for yourself, though.
--
Francesc Alted
|
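The memmapped structured-array alternative mentioned above might be sketched as follows. This is an illustration under assumptions, not a tested recipe for very large data: the scratch file name `reads.bin` is hypothetical, the four rows stand in for a much larger table, and the in-place sort still depends on the operating system paging the mapped file, so the virtual-memory limit described in the message applies.

```python
import os
import tempfile

import numpy as np

# Structured dtype mirroring the table description quoted in the thread.
dtype = np.dtype(
    [("position", "i4"), ("read", "S1"), ("qual", "i4"), ("strand", "S1")]
)
rows = [
    (22777420, b"G", 18, b"-"),
    (22777305, b"T", 0, b"-"),
    (22777422, b"C", 29, b"-"),
    (22777306, b"C", 18, b"-"),
]

# Hypothetical scratch file that backs the array on disk.
path = os.path.join(tempfile.mkdtemp(), "reads.bin")

# Create the file-backed array and fill it with the records.
mm = np.memmap(path, dtype=dtype, mode="w+", shape=(len(rows),))
mm[:] = np.array(rows, dtype=dtype)

# Sort the records in place by the 'position' field, then push changes to disk.
mm.sort(order="position")
mm.flush()
del mm

# Reopen read-only; the number of records is inferred from the file size.
reopened = np.memmap(path, dtype=dtype, mode="r")
positions = reopened["position"].tolist()
print(positions)
```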
|
From: Francesc A. <fa...@py...> - 2010-04-29 09:23:48
|
On Thursday 29 April 2010 09:05:15 Ernesto wrote:
> In the case I decide to use the Pro version of Pytables is
> there an academic license available? Thank you again,
Not for the personal license (95 EUR is low enough already). But if you are interested in a site license, we can talk.
--
Francesc Alted
|
|
From: Tony T. <to...@lo...> - 2010-04-29 10:32:06
|
On 29 April 2010 19:23, Francesc Alted <fa...@py...> wrote:
> On Thursday 29 April 2010 09:05:15 Ernesto wrote:
>> In the case I decide to use the Pro version of Pytables is
>> there an academic license available? Thank you again,
>
> Not for personal license (95 EUR is low enough already). But if you are
> interested on a site license, we can talk.
When did that happen? I thought it was around 400 EUR? That is low enough, more so if you go ahead with the MKL.
Cheers,
Tony
|
|
From: Francesc A. <fa...@py...> - 2010-04-29 11:41:39
|
On Thursday 29 April 2010 12:31:58 you wrote:
> On 29 April 2010 19:23, Francesc Alted <fa...@py...> wrote:
> > On Thursday 29 April 2010 09:05:15 Ernesto wrote:
> >> In the case I decide to use the Pro version of Pytables is
> >> there an academic license available? Thank you again,
> >
> > Not for personal license (95 EUR is low enough already). But if you are
> > interested on a site license, we can talk.
>
> When did that happen? I thought it was around 400 EUR? That is low
> enough, more so if you go ahead with the MKL.
Well, that happened almost two years ago:
http://blog.gmane.org/gmane.comp.python.pytables.announce/page=1
[heck, time really flies]
Regarding MKL, I don't think I'll include it in the current Pro offerings. If I did go ahead with it, I'd include it only in a possible new Windows Special Edition (WSE), probably 64-bit enabled. At any rate, my plan is that the next PyTables and PyTables Pro 2.2 can be linked easily with an existing MKL library, so that anybody can have access to this PyTables/numexpr/MKL combination if they so wish.
Cheers,
--
Francesc Alted
|
|
From: Francesc A. <fa...@py...> - 2010-04-28 14:56:39
|
Ciao Ernesto,
On Wednesday 28 April 2010 16:04:06 Ernesto wrote:
> Hi all,
>
> I have a table containing a lot of data (millions of rows).
> The structure is like the following example:
>
> (22777420, 'G', 18, '-')
> (22777421, 'G', 36, '-')
> (22777422, 'C', 29, '-')
> (22777423, 'C', 17, '-')
> (22777424, 'A', 31, '-')
> (22777425, 'A', 42, '-')
> (22777426, 'C', 49, '-')
> (22777305, 'T', 0, '-')
> (22777306, 'C', 18, '-')
> (22777307, 'C', 29, '-')
> (22777308, 'T', 26, '-')
> (22777309, 'T', 10, '-')
> (22777310, 'G', 15, '-')
> (22777311, 'G', 33, '-')
>
> The first column contains an integer. Now I'd like to sort my table
> according to numbers of the first column. Is there a way to perform this
> action?
Yes. The simplest way is by setting the `sortby` parameter to true in the `Table.copy()` method. This triggers an on-disk sorting operation, so you don't have to be afraid of your available memory. You will need the Pro version for getting this capability.
> A second question concerns the iteration over a huge amount of
> data. For example, given the above table, I would to work on a subset of
> rows using an iterator in order to avoid memory errors. Is there also here
> a simple procedure?
I think what you are looking for is the `Table.where()` iterator. See:
http://www.pytables.org/docs/manual/ch04.html#TableMethods_querying
Also, the Pro version has the ability to index your tables, making your queries via `Table.where()` very fast (especially over completely sorted tables). For some figures on the improvements you can expect, see:
http://www.pytables.org/docs/manual/ch05.html#searchOptim
and, in particular:
http://www.pytables.org/docs/manual/ch05.html#Sorting-indexes
Cheers,
--
Francesc Alted
|
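Both suggestions in this reply can be sketched together. The example below is illustrative only and rests on several assumptions: it uses the snake_case names of PyTables 3.x (`open_file`, `create_table`), whereas the 2010-era releases discussed in this thread used camelCase names; it assumes a release in which indexing and `sortby` are available; the node names `reads` and `reads_sorted` and the helper `query_and_sort` are hypothetical; and it passes the name of an indexed column to `sortby`, which is how that parameter works in the releases I am aware of, rather than a boolean.

```python
import os
import tempfile

import tables

# Column layout mirroring the table description quoted in the thread.
DESCRIPTION = {
    "position": tables.Int32Col(pos=0),
    "read": tables.StringCol(1, pos=1),
    "qual": tables.Int32Col(pos=2),
    "strand": tables.StringCol(1, pos=3),
}

ROWS = [
    (22777420, b"G", 18, b"-"),
    (22777305, b"T", 0, b"-"),
    (22777422, b"C", 5, b"-"),
    (22777306, b"C", 8, b"-"),
]


def query_and_sort(path):
    """Build a small table, query it lazily, then write a sorted copy."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "reads", DESCRIPTION)
        table.append(ROWS)
        table.flush()

        # Iterate lazily over matching rows and copy out only the needed field.
        hits = [
            int(row["position"])
            for row in table.where("(qual > 1) & (qual < 10)")
        ]

        # A completely sorted index on the key column lets copy() sort on disk.
        table.cols.position.create_csindex()
        sorted_table = table.copy(newname="reads_sorted", sortby="position")
        positions = sorted_table.col("position").tolist()
    return hits, positions


with tempfile.TemporaryDirectory() as tmp:
    hits, positions = query_and_sort(os.path.join(tmp, "reads.h5"))

print(hits)
print(positions)
```

The list of hits here is tiny; for a result that may not fit in memory, the same `where()` loop can feed a running computation or write to another table instead of building a list.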