From: Damien K. <dam...@gm...> - 2011-09-14 01:25:10
Thanks Anthony - yes, the measures are distinct. The two most common
problems are:

a) calculating moving averages, medians, standard deviations and
correlations across a time series for a single company, and

b) joining companies and calculating (say) the sum of (Measure1 *
Measure2) grouped by date, or (say) the median, average and standard
deviation of Measure1 for a group of companies on a particular date.

I suppose the question comes down to whether pytables is more efficient
with:

- more data by a factor of say 50-100x (plus assume a small overhead to
  create the extra data), but then fewer/simpler joins and largely
  vectorised calculations, or
- less data, but more joins and calculations (maybe a few more iterative
  calculations needed)?

Related to that is your suggestion of a table for every company. That
would require a join (sometimes) across say 1,000 tables, which (trapped
in my SQL mindset) I would never have considered - is that possible /
efficient in pytables?
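To make the question concrete, I imagine the "join" ends up as a plain
Python loop over the tables. A rough sketch, assuming one Table per
company under a /companies group, each with 'date' and 'measure1'
columns (the file name, group layout and column names here are just
placeholders, not a settled schema):

    import numpy as np
    import tables

    # Placeholder layout: /companies/company_0001 ... company_1000, each a
    # Table with a 'date' column (int, e.g. 20110912) and measure columns.
    h5f = tables.openFile('market.h5', mode='r')

    target = 20110912
    values = []
    for table in h5f.iterNodes(h5f.root.companies, classname='Table'):
        # In-kernel query: the table is scanned without being read whole.
        rows = table.readWhere('date == %d' % target)
        if len(rows) > 0:
            values.append(rows['measure1'][0])

    values = np.array(values)
    mean, med, std = values.mean(), np.median(values), values.std()

    h5f.close()

(So not a relational join at all - readWhere() does each per-table
selection in-kernel, and the cross-company statistics are then just numpy
on the collected values.)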
Thanks again for all your help

Damien

---------- Forwarded message ----------
From: <pyt...@li...>
Date: 14 September 2011 02:20
Subject: Pytables-users Digest, Vol 64, Issue 4
To: pyt...@li...

Date: Tue, 13 Sep 2011 11:20:09 -0500
From: Anthony Scopatz <sc...@gm...>
Subject: Re: [Pytables-users] Best Practice data design
To: Discussion list for PyTables <pyt...@li...>

On Tue, Sep 13, 2011 at 7:56 AM, Damien Klassen <dam...@gm...> wrote:

> Hopefully an easy one for someone out there!
>
> I am trying to transfer some of my existing calculations from a
> relational DB into pytables to speed up a range of calculations. One of
> the problems is that the data has irregular dates and I'm not sure where
> to do the processing. For example:
>
> Date        CompanyID  Measure1  Measure2  ...  Measure20
> '20110912'        535       5.0        15
> '20110911'        535       5.1        12
> '20110910'        535       4.9        11
> '20110909'        535       3.6        20
> '20110912'         12       6.0        34
> '20110911'         12       6.4        34
> '20110910'         12       4.6        36
> '20110909'         12       2.6        35
>
> Date        CompanyID  Measure21  ...  Measure30
> '20110901'        535        6.6
> '20110812'        535        6.1
> '20110901'         12        6.0
> '20110814'         12        6.3
>
> Date        CompanyID  Measure31  ...  Measure100
> '20110911'        535        7.6
> '20110818'        535        7.1
> '20110903'         12        7.0
> '20110818'         12        7.3
>
> Usually I will be using the latest value for each measure until a newer
> one presents itself (using scikits.timeseries to do the joins). While
> it's not strictly true, for the purposes of this exercise assume the
> first table has daily data with 10-20 measures, the second has monthly
> data with 5-10 measures, and the third has annual data with 50-70
> measures.
>
> The calculations include a lot of aggregations over different slices of
> companies, plus a lot of statistical measures on timeseries.
>
> I am thinking my options are:
>
> 1. Store one big table with the minimum data, i.e. CompanyID for the
> first column and then an array for each subsequent column. E.g. column 2
> might hold Measures 1 and 2, so for company 535 it would be
> [['20110912',5.0,15], ['20110911',5.1,12], ['20110910',4.9,11],
> ['20110909',3.6,20]], filled with daily data. Column 3 might be a much
> smaller array filled with (approximately) monthly data, and column 4
> smaller again, filled with irregular (say annual) data.
>
> 2. Store one huge table, but process the data as it goes into the
> pytable so that rows are unique on company and date, filling in all the
> empty spaces on the way in. My rough estimate is that this will make my
> database about 50-100 times larger than the first option (maybe 1Tb vs
> 10-20Gb). A final thought, though: since the annual data is the same for
> 250 weekdays a year, it should compress pretty well... I haven't looked
> at compression yet...
>
> 3. Store multiple small tables and then try to join them (I am assuming
> this one is wrong because it's how I would do it in SQL!)
>
> 4. ?????

Hi Damien,

Are these measures always distinct between tables? (i.e. you will never
have 'Measure42' in more than one table.)

You are right in that option 2 will have a lot of wasted space, which you
probably want to avoid. You could also store a table for each company and
then have a MeasureID, e.g.:

Company 5 Table

Time  MeasureID  Value
   1         16   64.9
   2         17    3.6

How you store the data with this kind of data set (and where you do the
joins) is really a function of what your most important use case is. You
get the greatest advantage out of HDF5 when your data is structured and
well formed. However, this is not always the case, and I have been known
to use some ideas like option 3 ;).

Be Well
Anthony

> Thanks for any help - am hoping to get the data design right!
>
> Damien
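PS: for my own notes, I take it the per-company MeasureID table would be
declared roughly like this - a sketch only, with placeholder names
('CompanyMeasure', 'company_5') and a guessed compression setting:

    import tables

    class CompanyMeasure(tables.IsDescription):
        # One row per (time, measure) observation; names are placeholders.
        time      = tables.Int64Col()   # e.g. 20110912, or days since epoch
        measureid = tables.Int32Col()   # which of the ~100 measures this is
        value     = tables.Float64Col()

    h5f = tables.openFile('market.h5', mode='w')
    companies = h5f.createGroup('/', 'companies')
    tbl = h5f.createTable(companies, 'company_5', CompanyMeasure,
                          filters=tables.Filters(complevel=5,
                                                 complib='zlib'))

    # Anthony's two example rows for company 5.
    row = tbl.row
    for t, mid, val in [(1, 16, 64.9), (2, 17, 3.6)]:
        row['time'], row['measureid'], row['value'] = t, mid, val
        row.append()
    tbl.flush()
    h5f.close()

With ~1,000 tables like this, the per-company timeseries statistics in
(a) stay local to a single table, and the cross-company aggregations in
(b) become loops like the sketch earlier in this message.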