From: Peter D. <pe...@re...> - 2007-05-26 02:43:23
Hi, CatTrack currently serves all the data reported from our testing infrastructure. The next step is to put in some nice pretty graphs with historic trends and analysis and possibly arbitrary data mining. As I have no professional experience in this area and limited time to complete this next bit (I allocated 3 days and am just about halfway through that time), I want to do what we are most likely to want/need in the short term.

My assumption is that we are going to have a window of time over which we track results. The types of things we are going to want to track (based on the existing summary) include:

* time for DaCapo tests
* aggregate.best.score for SPECjvm98
* score for SPECjbb2005
* new tests (tests that do not appear in previous results)
* missing tests (tests not present now but that appear in previous results)
* new failures (tests that failed but have not failed in previous results)
* new successes (tests that succeeded but have not succeeded in previous results)
* intermittent tests (tests that have both succeeded and failed in previous results)

These would be limited to a host/test-run combination. I am not sure how I will do this, but I suspect it will not be as flexible as may be desired. So if you have any other elements/statistics/etc. you want to track over time, speak up before next Wednesday or it is unlikely to be considered.

-- Cheers, Peter Donald
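[Editor's note: the new/missing/failure/success/intermittent comparisons described above are all set differences between the current run and the window of previous runs. A minimal Ruby sketch, assuming each run is available as a Hash of test name => pass/fail — the method name and data shapes here are invented for illustration, not CatTrack's actual code:]

```ruby
# Sketch of the per-window comparisons, assuming each test run is a
# Hash mapping test name => true (pass) / false (fail).
def compare_runs(current, previous_runs)
  prev_names  = previous_runs.flat_map(&:keys).uniq
  prev_failed = previous_runs.flat_map { |r| r.reject { |_, ok| ok }.keys }.uniq
  prev_passed = previous_runs.flat_map { |r| r.select { |_, ok| ok }.keys }.uniq
  {
    new_tests:     current.keys - prev_names,
    missing_tests: prev_names - current.keys,
    new_failures:  current.reject { |_, ok| ok }.keys - prev_failed,
    new_successes: current.select { |_, ok| ok }.keys - prev_passed,
    # intermittent: both passed and failed somewhere in the window
    intermittent:  prev_failed & prev_passed
  }
end
```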
From: Steve B. <Ste...@an...> - 2007-05-26 04:23:35
Hi Peter, This sounds good. We should have a machine to host this for you RSN.

> My assumption is that we are going to have a window of time over which
> we track results. The types of things we are going to want to track
> (based on existing summary) include;

I imagine we'd want to track most items over all time, not just over a window. The window is particularly useful for the email (where we have to jam as much info as we can into a very constrained space) and may be a useful view to have on the web page, but I think having a complete history is really useful. I know SourceForge is pretty sucky, but an example is their tracking of downloads etc., which you can view at the resolution of a week, a month, a year, or all time (I think).

To summarize:
1. A compact format is necessary for email.
2. Keeping (and being able to view) data over all time is really useful (and possible on the web).

When I worked at Intel they were big on the "dashboard" metaphor: one page which summarizes a lot of status for a project (you show this to your managers). This works best if it is implemented so that you can then drill down on anything specific, but it is great to abstract all that info into a concise page that gives you the big picture on a project.

In my dreams, we'd have the following:
1. one email a day which summarizes all regression data from all machines, including trends, and provides pointers to comprehensive tracking info
2. a comprehensive web page which allows us to view the regression data from various points of view
3. a mechanism (perhaps folded into 1) which *whacks* us as soon as things start going bad

The specific items you listed all seem fine to me. I'm always interested in the issues of data representation. One really elegant idea is Edward Tufte's sparklines [1]. (If you've not come across him before, Tufte is something of a phenomenon in the data representation world.) Sparklines are a very efficient and effective way of conveying lots of time series data. A quick sniff shows Ruby scripts for generating them [2]. You may like to take a quick look.

I'm delighted you're doing this. Whether or not what you end up with comes close to what I've sketched above, I am *sure* it will be a big improvement on what we have now. Thanks for your efforts!

--Steve

[1] http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR
[2] http://luke.francl.org/sparklines/
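[Editor's note: the sparkline libraries referenced above render PNGs, but the core idea — a word-sized, axis-free time-series graphic — can be sketched without any dependencies by emitting a tiny inline SVG polyline. Everything here (function name, dimensions) is an illustrative choice, not the linked libraries' API:]

```ruby
# Dependency-free sparkline sketch: scale a series into a small
# width x height box and draw it as an SVG polyline.
def sparkline_svg(values, width: 100, height: 20)
  min, max = values.minmax
  span = (max - min).zero? ? 1.0 : (max - min).to_f
  step = width.to_f / (values.size - 1)
  points = values.each_with_index.map do |v, i|
    x = (i * step).round(1)
    y = (height - (v - min) / span * height).round(1) # SVG y grows downward
    "#{x},#{y}"
  end.join(" ")
  %(<svg width="#{width}" height="#{height}" xmlns="http://www.w3.org/2000/svg">) +
    %(<polyline fill="none" stroke="black" points="#{points}"/></svg>)
end
```

The resulting string can be dropped straight into a web page; for the daily email, a PNG-based generator like [2] is the better fit since HTML mail support for inline SVG is unreliable.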
From: Peter D. <pe...@re...> - 2007-05-26 06:32:20
On 5/26/07, Steve Blackburn <Ste...@an...> wrote:
> To summarize:
> 1. A compact format is necessary for email
> 2. Keeping (and being able to view) data over all time is really
> useful (and possible on the web).

I think I should make myself clear. The complete result data is stored in the database BUT it is stored in a highly normalized format. Trying to perform any analysis on this database schema - especially given the quantity of data we collect - is going to quickly grind the server to a halt ;) Thus the most obvious answer is to add the data to another set of tables, but I have no experience when it comes to transforming from OLTP schemas to OLAP schemas - I have only really heard about it in theory or in limited contexts. I had a poke about the web and found [1], which gives a simple explanation of how to do it. In theory I guess it would be simple to create a simple star schema that has a "statistic fact" table and then dimensions: time, build_configuration, test, statistic_type, test_configuration, host, build_target, etc. Maybe it is not so hard ... not sure really. We would still be creating a few thousand facts per day (~5000 per "sanity" run, per host), but it should be reasonable to retrieve them if indexed appropriately. Any other dimensions you are likely to be concerned with should be able to be added over time.

> When I worked at Intel they were big on the "dashboard" metaphor: one
> page which summarizes a lot of status for a project (you show this to
> your managers), this works best if it is implemented so that you can
> then drill down on anything specific, but it is great to abstract all
> that info into a concise page that gives you the big picture on a
> project.

Yup - that's the idea. There is even a dashboard controller already in place ... just nothing displayed by it yet ;)

> 1. one email a day which summarized all regression data from all
> machines, including trends, and provided pointers to comprehensive
> tracking info

Should be possible ... not sure it will all be done prior to my time running out for a while.

> 2. a comprehensive web page which allows us to view the regression
> data from various points of view

We have one point of view atm ;) If the OLTP->OLAP process goes fine then we may be able to see the data and drill down through the various dimensions ... who knows.

> 3. a mechanism (perhaps folded into 1) which *whacks* us as soon as
> things start going bad

Doable in time ... not now.

> Sparklines are a very efficient and effective
> way of conveying lots of time series data. A quick sniff shows ruby
> scripts for generating them [2]. You may like to take a quick look.

Yep - I like. I would probably use [2] instead. His work is usually good.

[1] http://www.ciobriefings.com/whitepapers/StarSchema.asp
[2] http://nubyonrails.com/pages/sparklines

-- Cheers, Peter Donald
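[Editor's note: the star-schema idea above — one narrow fact table whose rows reference dimension tables (time, host, test, statistic_type, ...) by id — can be illustrated in a few lines of plain Ruby. This is a toy in-memory model to show the shape of the transform, not CatTrack's schema; the class and method names are invented:]

```ruby
# Toy star schema: each fact is a numeric value plus small integer ids
# into dimension tables; repeated strings (host names, test names, ...)
# are stored once in the dimensions, keeping the fact table narrow.
class Star
  def initialize
    @dims  = Hash.new { |h, dim| h[dim] = {} } # dim name => {value => id}
    @facts = []
  end

  # Intern a dimension value, returning a stable small integer id.
  def dim_id(dim, value)
    table = @dims[dim]
    table[value] ||= table.size + 1
  end

  # Record one fact: a numeric value keyed by its dimension coordinates.
  def add_fact(value, coords)
    ids = coords.map { |dim, v| [dim, dim_id(dim, v)] }.to_h
    @facts << ids.merge(value: value)
  end

  # A simple roll-up: average value grouped by one dimension.
  def average_by(dim)
    @facts.group_by { |f| f[dim] }
          .transform_values { |fs| fs.sum { |f| f[:value] } / fs.size.to_f }
  end
end
```

In the real database each dimension would be its own table and the roll-up would be a GROUP BY over the indexed fact table, which is what keeps queries cheap despite thousands of facts a day.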