From: Anthony S. <sc...@gm...> - 2012-03-19 18:29:19
On Mon, Mar 19, 2012 at 12:08 PM, sreeaurovindh viswanathan <sre...@gm...> wrote:

[snip]

> 2) Can you please point me to an example where I can do a block HDF5
> file write using PyTables (sorry for this naive question)

The Table.append() method
(http://pytables.github.com/usersguide/libref.html#tables.Table.append)
allows you to write multiple rows at the same time. Arrays have similar
methods (http://pytables.github.com/usersguide/libref.html#earray-methods)
if they are extensible. Please note that these methods accept any
sequence that can be converted to a NumPy record array!
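For example, a minimal sketch of such a block append might look like this
(the file name, node name, and column layout are just placeholders, the
table is assumed to already exist with a matching description, and the
PyTables 2.x spelling openFile is used):

import numpy as np
import tables

h5 = tables.openFile("data.h5", mode="a")   # placeholder file name
train = h5.root.train                       # placeholder table node

# Build one block of rows as a NumPy record array whose dtype matches the
# table description, then write the whole block with a single call.
block = np.zeros(10000, dtype=[("tr_id", "i8"), ("value", "f8")])
block["tr_id"] = np.arange(10000)
block["value"] = np.random.random(10000)
train.append(block)   # a list of tuples matching the description works too

train.flush()
h5.close()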
Be Well
Anthony

> Thanks
> Sree aurovindh V
>
> On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <sc...@gm...> wrote:
>
>> What Francesc said ;)
>>
>> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fa...@gm...> wrote:
>>
>>> My advice regarding parallelization is: do not worry about this *at all*
>>> unless you have already spent a long time profiling your problem and you
>>> are sure that parallelizing would help. 99% of the time it is much more
>>> productive to focus on improving serial speed.
>>>
>>> Please try to follow Anthony's suggestion and split your queries into
>>> blocks, then pass these blocks to PyTables. That would represent a huge
>>> win. For example, use:
>>>
>>> SELECT * FROM your_table LIMIT 10000 OFFSET 0
>>>
>>> for the first block, and send the results to `Table.append`. Then go
>>> for the second block as:
>>>
>>> SELECT * FROM your_table LIMIT 10000 OFFSET 10000
>>>
>>> and pass this to `Table.append`. And so on and so forth until you
>>> exhaust all the data in your tables.
>>>
>>> Hope this helps,
>>>
>>> Francesc
>>>
>>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote:
>>>
>>> > Hi,
>>> >
>>> > Thanks for your reply. In that case, what will my querying efficiency
>>> > be? Will I be able to query in parallel, i.e. run multiple queries on
>>> > a single file? Also, if I do it in 6 chunks, will I be able to
>>> > parallelize it?
>>> >
>>> > Thanks
>>> > Sree aurovindh Viswanathan
>>> >
>>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <sc...@gm...> wrote:
>>> >
>>> > Is there any way that you can query and write in much larger chunks
>>> > than 6? I don't know much about PostgreSQL specifically, but in
>>> > general HDF5 does much better if you can take larger chunks. Perhaps
>>> > you could at least do the PostgreSQL part in parallel.
>>> >
>>> > Be Well
>>> > Anthony
>>> >
>>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan
>>> > <sre...@gm...> wrote:
>>> >
>>> > The problem is with respect to the writing speed of my computer and
>>> > the PostgreSQL query performance. I will explain the scenario in
>>> > detail.
>>> >
>>> > I have about 80 GB of data (along with appropriate database indexes
>>> > in place). I am trying to read it from a PostgreSQL database and
>>> > write it into HDF5 using PyTables. I have 1 table and 5 variable
>>> > arrays in one HDF5 file. The implementation is not multithreaded or
>>> > enabled for symmetric multiprocessing.
>>> >
>>> > As far as the PostgreSQL side is concerned, the overall record count
>>> > is 140 million, and there are 5 primary/foreign key referring tables.
>>> > I am not using joins, as they do not scale.
>>> >
>>> > So for a single record I do 6 lookups without joins and write the
>>> > results into HDF5 format. For each record I do 6 inserts, one into
>>> > the table and one into each of its corresponding arrays.
>>> >
>>> > The queries are really simple:
>>> >
>>> > select * from x.train where tr_id=1    (primary key & indexed)
>>> >
>>> > select q_t from x.qt where q_id=2      (non-primary key but indexed)
>>> >
>>> > (similarly four more queries)
>>> >
>>> > Each computer writes two HDF5 files, and hence the total count comes
>>> > to around 20 files.
>>> >
>>> > Some calculations and statistics:
>>> >
>>> > Total number of records: 143,700,000
>>> > Total number of records per file: 143,700,000 / 20 = 7,185,000
>>> > Total number of rows in each file: 7,185,000 * 5 = 35,925,000
>>> >
>>> > Current PostgreSQL database config:
>>> >
>>> > My current machine: 8 GB RAM with an i7 2nd-generation processor.
>>> > I made the following changes to the PostgreSQL configuration file:
>>> > shared_buffers = 2 GB, effective_cache_size = 4 GB
>>> >
>>> > Note on current performance:
>>> >
>>> > I have run it for about ten hours and the performance is as follows:
>>> > the total number of records written for a single file is only about
>>> > 2,500,000 * 5 = 12,500,000. It has written 2 such files; considering
>>> > the size, it would take me at least 20 hrs for 2 files. I have about
>>> > 10 files, and hence the total would be 200 hrs, i.e. about 9 days. I
>>> > have to start my experiments as early as possible, and 10 days is too
>>> > much. Can you please help me enhance the performance?
>>> >
>>> > Questions:
>>> > 1. Should I use symmetric multiprocessing on my computer? In that
>>> >    case, what is suggested or preferable?
>>> > 2. Should I use multithreading? In that case, any links or pointers
>>> >    would be of great help.
>>> >
>>> > Thanks
>>> >
>>> > Sree aurovindh V
>>>
>>> -- Francesc Alted
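Putting Francesc's block-wise suggestion together with Table.append, a
rough end-to-end sketch could look like the following. It assumes psycopg2
on the PostgreSQL side; the connection string, the column list (tr_id,
value), and the table description are placeholders rather than the actual
x.train schema, and LIMIT/OFFSET paging is used as suggested above (a
server-side cursor would be an alternative):

import psycopg2
import tables

CHUNK = 10000  # rows fetched per query and appended per call

class Train(tables.IsDescription):
    # Placeholder description; replace with the real x.train columns.
    tr_id = tables.Int64Col(pos=0)
    value = tables.Float64Col(pos=1)

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()

h5 = tables.openFile("train.h5", mode="w")
train = h5.createTable("/", "train", Train, expectedrows=7185000)

offset = 0
while True:
    # One block per round trip; ORDER BY keeps the paging stable.
    cur.execute(
        "SELECT tr_id, value FROM x.train ORDER BY tr_id "
        "LIMIT %s OFFSET %s", (CHUNK, offset))
    rows = cur.fetchall()
    if not rows:
        break
    train.append(rows)   # list of tuples -> one block write
    offset += CHUNK

train.flush()
h5.close()
cur.close()
conn.close()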