Re: [Ground-user] Refreshing Materialized Views Performance
From: Calum F. <cal...@gm...> - 2011-10-11 15:43:30
Hi Kelvin,

It is great to hear that the ground report is being so heavily used. I was/am aware of the bottleneck in the uploading of the data and in how the materialised views refresh themselves.

The serial vs concurrent processing information is interesting; in my usage it had been the other way around. Also, I'm not 100% sure that the requests are truly concurrent, and it is on the todo list to investigate that further. The calls should be concurrent within the jython code, however they all go down a single DB connection/cursor and I was not sure whether that reverts them back to serial operations. It is also on the todo list to investigate creating a connection pool and having multiple cursors to handle the refreshes (I have put a rough sketch of what I mean further down).

Have a look at the jython function 'refreshMaterialisedViewsMT' in the file multithreadDatabaseFactory.py. It will show you the order in which the materialised views need to be run to produce the correct results from the raw data. The views within each tier are order independent, but the tiers must be processed serially because of dependencies between the materialised views regarding the input data for each materialised view.

I had been planning to look at rewriting the whole mechanism for creating the report data. Currently there are materialised views that are built off the raw data, and each time the materialised views are refreshed the raw data has to be reprocessed. The reason is that the materialised view mechanism that was written is a very simple one: creating a materialised view that updates based on changes to the underlying tables was a more complex and time consuming task than I had time for. There was no native materialised view functionality in postgres at the time the ground report was created, and I'm not sure there is native functionality now.

If you feel like hacking, here is a link to some code that you could use to change the implementation of the materialised views so that it would not create everything from scratch each time:

http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views#Very_Lazy_Materialized_Views

I use the snapshot view implementation of materialised views from the above wiki in the ground report.

Postgres 9 brings some interesting changes/additions to the table. My thoughts had been to rewrite the upload mechanism to use an unlogged or temporary table for the raw data, with stored procedures updating the report tables (analogous to the current materialised views), and then to drop the temp table once the data had been processed. This would keep the overall DB smaller (though you would need to be able to handle the uploaded data volume) and there would be no reprocessing of the historic data. It should also be faster than the materialised view approach, as the concurrency would be programmed into the stored procedures. The only disadvantage is that you would lose the historical record of the raw data within the db; whether that is a problem depends on whether you delete the data files from which the raw data originates. Would you miss having the raw data record within the reporting db?
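To make the connection pool / multiple cursors idea a bit more concrete, here is a very rough sketch of a per-tier refresh with one connection per worker thread. It is plain CPython with psycopg2 rather than the jython code that actually lives in multithreadDatabaseFactory.py, and the tier contents and connection details are made up, so treat it as a shape rather than a patch:

    import threading
    import psycopg2

    # The tier contents below are illustrative only - the real ordering lives
    # in refreshMaterialisedViewsMT in multithreadDatabaseFactory.py.
    TIERS = [
        ['element_summary_http_mv', 'page_summary_http_mv'],
        ['element_buckets_http_mv', 'element_percentile_http_mv'],
    ]

    DSN = "dbname=groundreport user=ground"  # assumed connection details

    def refresh_view(view_name):
        # One connection (and therefore one cursor) per thread, so the
        # refreshes cannot be serialised behind a single shared cursor.
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            cur.execute("SELECT refresh_matview(%s)", (view_name,))
            conn.commit()
        finally:
            conn.close()

    def refresh_all():
        for tier in TIERS:
            threads = [threading.Thread(target=refresh_view, args=(v,))
                       for v in tier]
            for t in threads:
                t.start()
            # Later tiers read the output of earlier ones, so wait for the
            # whole tier to finish before starting the next.
            for t in threads:
                t.join()

    if __name__ == '__main__':
        refresh_all()

A proper pool would avoid reconnecting for every view, but even this shape would show whether the single shared cursor is what is serialising things today.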
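On the "not building everything from scratch" point, the gist of what the wiki link is getting at could look something like the following, assuming the report tables can be keyed by run. The run_id column and the '<view>_v' backing-view naming are assumptions on my part:

    import psycopg2

    DSN = "dbname=groundreport user=ground"  # assumed connection details

    def refresh_view_for_run(view_name, run_id):
        # Incrementally refresh one report table for a single newly uploaded
        # run, instead of deleting every row and rebuilding from all of the
        # historic raw data.
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            # Throw away only the rows belonging to this run...
            cur.execute("DELETE FROM " + view_name + " WHERE run_id = %s",
                        (run_id,))
            # ...and recompute just those rows from the raw data, using the
            # same query the snapshot table is built from, restricted to the
            # run that has just been uploaded.
            cur.execute("INSERT INTO " + view_name +
                        " SELECT * FROM " + view_name + "_v WHERE run_id = %s",
                        (run_id,))
            conn.commit()
        finally:
            conn.close()

Any view that aggregates across runs rather than per run would still need either a full rebuild or the trigger-driven "very lazy" variant described on that wiki page.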
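And for the unlogged/temporary table idea, the upload could end up looking roughly like this; the table and function names are invented for the sketch:

    import psycopg2

    DSN = "dbname=groundreport user=ground"  # assumed connection details

    def upload_run(sample_file, run_id):
        # Load the raw samples for one run into a throwaway unlogged table,
        # fold them into the report tables with a server-side function, then
        # drop the raw data. 'raw_http_samples' and
        # 'fold_run_into_report_tables' are invented names.
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            cur.execute("CREATE UNLOGGED TABLE raw_upload "
                        "(LIKE raw_http_samples)")
            with open(sample_file) as f:
                # Bulk load; unlogged tables skip the WAL, so this is cheap.
                # Assumes a tab-separated sample file.
                cur.copy_from(f, 'raw_upload')
            cur.execute("SELECT fold_run_into_report_tables(%s)", (run_id,))
            cur.execute("DROP TABLE raw_upload")  # raw history is not kept
            conn.commit()
        finally:
            conn.close()

The server-side function would carry the per-run aggregation that the materialised views do today, and because the raw table is dropped at the end the historic raw data never accumulates in the db.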
It would be interesting to have some runtime stats from postgres while the materialised views are being refreshed by the ground report jython script; it would be good to see whether there is any locking going on or whether IO is the limiting factor. How much memory do you have in your box? Have a look at running this script: http://pgfoundry.org/projects/pgtune/ . It will output an optimised postgresql.conf based on information that you feed it. It has not been updated in a while, but it may make some suggestions that could improve your db performance. If you can throw more memory at the DB process it may be able to buffer the disk better and reduce your IO.

Cheers
Cal

On 11 Oct 2011, at 13:09, Kelvin Ward wrote:

> Hi
>
> I've been using ground report a fair bit and I've noticed a performance bottleneck in uploading report data. Specifically, after data from a load run has been put into the database, each database 'view' is refreshed using the postgres function refresh_matview. This is an I/O heavy function and, running each request serially, I see the following times to complete:
>
> SELECT refresh_matview('element_summary_http_mv'); 31s
> ...
> element_buckets_http_mv 269s !
> element_percentile_http_mv 98s
> total_element_per_second_stats_http_mv 57s
> concurrent_users_byrun_http_mv 54s
> thread_stop_buckets_http_mv 49s
> individual_element_per_second_stats_http_mv 43s
> element_distribution_http_mv 43s
> ...
> element_totalbandwidth_http_mv 2s
> page_distribution_http_mv 6s
> page_summary_http_mv 6s
> element_max_time_http_mv 2s
> element_min_time_http_mv 2s
> static_element_percentile_http_mv 7s
> dynamic_element_percentile_http_mv 7s
> page_percentile_http_mv < 1s
> page_buckets_http_mv 3s
> page_stderror_http_mv 3s
> element_stderror_http_mv < 1s
> thread_start_buckets_http_mv 1s
> total_page_per_second_stats_http_mv 1s
> individual_page_per_second_stats_http_mv 1s
> static_element_per_second_stats_http_mv 1s
> dynamic_element_per_second_stats_http_mv 1s
> element_95confidence_http_mv 1s
> mean_load_throughput_stats_http_mv 1s
> concurrent_users_vs_tps_http_mv 1s
> concurrent_users_vs_pps_http_mv 1s
> concurrent_users_vs_statictps_http_mv 1s
> concurrent_users_vs_dynamictps_http_mv 1s
> concurrent_users_vs_totalrtime_http_mv 1s
> concurrent_users_vs_pagertime_http_mv 1s
> concurrent_users_vs_staticrtime_http_mv 1s
> concurrent_users_vs_dynamicrtime_http_mv 1s
>
> Right now I have about 80 load runs in the database. I know the I/O performance of my hardware is not great, but even with 80 load runs there are about 5.8 million rows in the slowest view, 'element_buckets_http_mv'.
> I'm using postgres 9, which has automatic vacuuming of the database, and CPU or memory is not hit hard when running refresh_matview(); it's definitely a disk bottleneck. As ground report seems to do 10 concurrent calls to refresh_matview(), the disk is being heavily thrashed and running in parallel seems to be slower than serial in this case - it's taking 19 minutes to complete refreshing all the views.
>
> Trying to understand the ground report python code and postgres, which I'm new to, it seems that refresh_matview deletes all rows in a table like 'element_buckets_http_mv' before copying new data into it. I'm wondering if that can be improved?
>
> Cheers
> Kelvin.