From: Anthony S. <sc...@gm...> - 2012-03-19 18:29:19
On Mon, Mar 19, 2012 at 12:08 PM, sreeaurovindh viswanathan <sre...@gm...> wrote:

[snip]

> 2) Can you please point me to an example where I can do a block HDF5
> file write using PyTables (sorry for this naive question)

The Table.append() method
(http://pytables.github.com/usersguide/libref.html#tables.Table.append)
allows you to write multiple rows at the same time. Arrays have similar
methods (http://pytables.github.com/usersguide/libref.html#earray-methods)
if they are extensible. Please note that these methods accept any
sequence that can be converted to a NumPy record array!
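For example, a minimal sketch of such a block append might look like this
(the file name, node name, and column layout are just placeholders, the
table is assumed to already exist with a matching description, and the
PyTables 2.x spelling openFile is used):

import numpy as np
import tables

h5 = tables.openFile("data.h5", mode="a")   # placeholder file name
train = h5.root.train                       # placeholder table node

# Build one block of rows as a NumPy record array whose dtype matches the
# table description, then write the whole block with a single call.
block = np.zeros(10000, dtype=[("tr_id", "i8"), ("value", "f8")])
block["tr_id"] = np.arange(10000)
block["value"] = np.random.random(10000)
train.append(block)   # a list of tuples matching the description works too

train.flush()
h5.close()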
Be Well
Anthony

> Thanks
> Sree aurovindh V
>
> On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <sc...@gm...> wrote:
>
>> What Francesc said ;)
>>
>> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fa...@gm...> wrote:
>>
>>> My advice regarding parallelization is: do not worry about this *at all*
>>> unless you have already spent a long time profiling your problem and you
>>> are sure that parallelizing would help. 99% of the time it is much more
>>> productive to focus on improving serial speed.
>>>
>>> Please try to follow Anthony's suggestion and split your queries into
>>> blocks, then pass these blocks to PyTables. That would represent a huge
>>> win. For example, use:
>>>
>>> SELECT * FROM your_table LIMIT 10000 OFFSET 0
>>>
>>> for the first block, and send the results to `Table.append`. Then go
>>> for the second block as:
>>>
>>> SELECT * FROM your_table LIMIT 10000 OFFSET 10000
>>>
>>> and pass this to `Table.append`. And so on and so forth until you
>>> exhaust all the data in your tables.
>>>
>>> Hope this helps,
>>>
>>> Francesc
>>>
>>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote:
>>>
>>> > Hi,
>>> >
>>> > Thanks for your reply. In that case, what will my querying efficiency
>>> > be? Will I be able to query in parallel, i.e. run multiple queries on
>>> > a single file? Also, if I do it in 6 chunks, will I be able to
>>> > parallelize it?
>>> >
>>> > Thanks
>>> > Sree aurovindh Viswanathan
>>> >
>>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <sc...@gm...> wrote:
>>> >
>>> > Is there any way that you can query and write in much larger chunks
>>> > than 6? I don't know much about PostgreSQL specifically, but in
>>> > general HDF5 does much better if you can take larger chunks. Perhaps
>>> > you could at least do the PostgreSQL part in parallel.
>>> >
>>> > Be Well
>>> > Anthony
>>> >
>>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan
>>> > <sre...@gm...> wrote:
>>> >
>>> > The problem is with respect to the writing speed of my computer and
>>> > the PostgreSQL query performance. I will explain the scenario in
>>> > detail.
>>> >
>>> > I have about 80 GB of data (along with appropriate database indexes
>>> > in place). I am trying to read it from a PostgreSQL database and
>>> > write it into HDF5 using PyTables. I have 1 table and 5 variable
>>> > arrays in one HDF5 file. The implementation is not multithreaded or
>>> > enabled for symmetric multiprocessing.
>>> >
>>> > As far as the PostgreSQL side is concerned, the overall record count
>>> > is 140 million, and there are 5 primary/foreign key referring tables.
>>> > I am not using joins, as they do not scale.
>>> >
>>> > So for a single record I do 6 lookups without joins and write the
>>> > results into HDF5 format. For each record I do 6 inserts, one into
>>> > the table and one into each of its corresponding arrays.
>>> >
>>> > The queries are really simple:
>>> >
>>> > select * from x.train where tr_id=1    (primary key & indexed)
>>> >
>>> > select q_t from x.qt where q_id=2      (non-primary key but indexed)
>>> >
>>> > (similarly four more queries)
>>> >
>>> > Each computer writes two HDF5 files, and hence the total count comes
>>> > to around 20 files.
>>> >
>>> > Some calculations and statistics:
>>> >
>>> > Total number of records: 143,700,000
>>> > Total number of records per file: 143,700,000 / 20 = 7,185,000
>>> > Total number of rows in each file: 7,185,000 * 5 = 35,925,000
>>> >
>>> > Current PostgreSQL database config:
>>> >
>>> > My current machine: 8 GB RAM with an i7 2nd-generation processor.
>>> > I made the following changes to the PostgreSQL configuration file:
>>> > shared_buffers = 2 GB, effective_cache_size = 4 GB
>>> >
>>> > Note on current performance:
>>> >
>>> > I have run it for about ten hours and the performance is as follows:
>>> > the total number of records written for a single file is only about
>>> > 2,500,000 * 5 = 12,500,000. It has written 2 such files; considering
>>> > the size, it would take me at least 20 hrs for 2 files. I have about
>>> > 10 files, and hence the total would be 200 hrs, i.e. about 9 days. I
>>> > have to start my experiments as early as possible, and 10 days is too
>>> > much. Can you please help me enhance the performance?
>>> >
>>> > Questions:
>>> > 1. Should I use symmetric multiprocessing on my computer? In that
>>> >    case, what is suggested or preferable?
>>> > 2. Should I use multithreading? In that case, any links or pointers
>>> >    would be of great help.
>>> >
>>> > Thanks
>>> >
>>> > Sree aurovindh V
>>>
>>> -- Francesc Alted
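Putting Francesc's block-wise suggestion together with Table.append, a
rough end-to-end sketch could look like the following. It assumes psycopg2
on the PostgreSQL side; the connection string, the column list (tr_id,
value), and the table description are placeholders rather than the actual
x.train schema, and LIMIT/OFFSET paging is used as suggested above (a
server-side cursor would be an alternative):

import psycopg2
import tables

CHUNK = 10000  # rows fetched per query and appended per call

class Train(tables.IsDescription):
    # Placeholder description; replace with the real x.train columns.
    tr_id = tables.Int64Col(pos=0)
    value = tables.Float64Col(pos=1)

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()

h5 = tables.openFile("train.h5", mode="w")
train = h5.createTable("/", "train", Train, expectedrows=7185000)

offset = 0
while True:
    # One block per round trip; ORDER BY keeps the paging stable.
    cur.execute(
        "SELECT tr_id, value FROM x.train ORDER BY tr_id "
        "LIMIT %s OFFSET %s", (CHUNK, offset))
    rows = cur.fetchall()
    if not rows:
        break
    train.append(rows)   # list of tuples -> one block write
    offset += CHUNK

train.flush()
h5.close()
cur.close()
conn.close()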