From: Steve B. <Ste...@an...> - 2012-02-06 00:44:07
Just a quick reply to some of the points raised here.

What we've tried to do with DaCapo is have a steady-state iteration of the
benchmark run for between one second and one minute (i.e. 1 sec < time < 1 min),
where a) the time is as measured on a contemporary machine at the time of the
release of the suite, b) the specific running time is determined by the nature
of the benchmark (compare fop and trade), and c) this is achieved via a fairly
"natural" input set (i.e. no spinning in a loop), in strong contrast to jvm2008.

At the end of the day, my view is that having a meaningful (uncontrived) input
set trumps specific goals w.r.t. running time. If a given benchmark falls well
outside of this range, then perhaps it is unsuitable and should not be
included. The latter point provides a prompt for removing some benchmarks as
we renew the suite.

The rationale for the time range above was that we need to allow researchers
to run substantial test sets in the space of 24 hours (as opposed to months).
I have recent experience with massive experiments that take months to run even
at native speed, so I'm acutely aware of the need for modest running times
(compare SPEC CPU).

I wrote the following just a day or so ago in response to a question about
whether or not we were working on larger input sets for DaCapo:

On 04/02/2012, at 3:42 PM, Steve Blackburn wrote:
> I don't think so... In fact for most of them, it's not clear how to do
> that. It's more or less evolved to "default" & "testing" (where the latter
> is not necessarily especially meaningful, but is small). That's not what we
> say officially, but I think that's what the situation is in practice.
>
> I do plan to make a call for new benchmarks though.... On my todo list now
> (along with a call for help with OpenJDK).
>
> --Steve

As Eliot and Andreas have noted, and as I mention in the quoted email above,
the current naming situation is unsatisfactory. Right now "default" and
"testing" would be a more honest reflection of what we have. There was no
intention to mislead, of course; it's just that it turns out to be a fairly
hard problem.

---Steve

On 06/02/2012, at 10:01 AM, Eliot Moss wrote:

> On 2/5/2012 6:30 AM, Andreas Sewe wrote:
>
>> Even if the benchmark in question does not have a hard-coded timeout
>> hidden somewhere (like trade* and unfortunately also our own actors
>> benchmark), the overhead (time and/or space) caused by trace capture can
>> sometimes be so massive as to make it infeasible for some of the
>> benchmarks.
>>
>> This is a particular problem with some of the Scala benchmarks, which
>> exhibit extremely frequent method calls or allocations, which, if
>> traced, lead to trace files in the terabyte range (uncompressed).
>
> Right now I have some traces that, compressed, are in the 150-200 GB
> range. The compression factor that gzip gets on memory access traces
> from valgrind's lackey tool is impressive, too -- the typical record
> is 14 bytes long and gzip compresses it to around 6 bits, so these
> are traces with over a trillion references. So I certainly know what
> you mean. We're ordering another 6 to 10 TB of storage :-) ...
>
>> BTW, these problems are often not readily apparent from the
>> uninstrumented execution time; said benchmarks complete, in
>> wall-clock terms, just as fast as the others.
>
> Sure; I think these address traces are most likely reflective of
> wall-clock time, modulo their cache locality, but Merlin-like
> traces from Matthew Hertz's Elephant Tracks tool are also
> sizeable while including only call/return, allocation/death,
> and heap-update records.
>
>> Currently, you have almost no choice but to use the "default" input
>> sizes, as "small" for many benchmarks doesn't do much real work, so any
>> results you report based on a trace of a "small" input (because using
>> "default" proved infeasible) look a priori suspicious.
>
> Yes, I prefer small mostly for testing that things work, but ...
>
>> However, whether the suspicion is warranted depends very much on the
>> benchmark. For "fop", e.g., the "small" input is not only meaningful,
>> but also exercises quite different functionality. For other benchmarks,
>> "small" is just a scaled-down version of "default", and for others it
>> does little beyond benchmark setup.
>>
>> I thus think we need a more principled way of naming our input sizes; in
>> particular, it should be clear whether one input is just a scaled
>> version of another (a different number of essentially the same
>> iterations/transactions).
>
> I agree.
>
>> Any suggestions?
>>
>> (And no, the Scala benchmarks don't use such a naming scheme either;
>> they just use more ad-hoc names like "tiny" and "gargantuan". ;-)
>
> Well, this is off the top of my head, but we could use size terms
> for "normally behaving" inputs, and "test" or something like that
> for minimal inputs intended mostly to see whether a benchmark starts
> up, etc., or will tend to fail.
>
> We can also have numbers or names within a size group, such as small/1,
> small/2, etc., or small/xyz.
>
> Here is a thought about standardizing sizes, too. At present, they
> are relative, and only within those provided for a benchmark. For some
> purposes I would find it more helpful if they related to the absolute
> running time. Of course this varies with platform in peculiar ways,
> and I don't have a great answer for that, except either to pick one
> JVM's running time to use for this size "binning", or perhaps some
> average of the time achieved by some set of k leading JVMs, or of the
> top k times achieved over some set. Naturally this would have to be
> on some "standard" machine, etc. Tricky -- but then it doesn't have
> to be precise -- it's only to give a sense of what you're getting
> into if you're tracing or something.
>
> For those purposes we might want to think in terms of long/short time
> words rather than "size" words. Just a thought.
>
> Anything like this quickly gets tangly when dealing with benchmarks
> and measurement, eh?
>
> Regards, and happy terabytes to you -- Eliot
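
P.S. Just to make the "binning" idea above a little more concrete, here is a
rough sketch. Nothing like this exists in the current harness; the class name,
thresholds, and labels are all invented for illustration. It shows one way a
harness could map a steady-state iteration time, measured on some agreed
reference setup, to a coarse duration label and pair it with a size-group name
like small/2:

  // Hypothetical sketch only; not part of the DaCapo harness.
  public class SizeBinning {

      enum Duration { TEST, SHORT, MEDIUM, LONG }

      // Thresholds are invented for illustration; the suite's stated goal
      // above is roughly 1 s < steady-state iteration < 1 min on a
      // contemporary reference machine.
      static Duration bin(double referenceSeconds) {
          if (referenceSeconds < 1.0)  return Duration.TEST;   // sanity-check inputs only
          if (referenceSeconds < 10.0) return Duration.SHORT;
          if (referenceSeconds < 60.0) return Duration.MEDIUM;
          return Duration.LONG;                                // likely too big for the suite
      }

      public static void main(String[] args) {
          String sizeGroup = "small";   // size word for a "normally behaving" input
          int index = 2;                // index within the group, as in small/2
          double measuredSeconds = 4.2; // hypothetical reference-setup measurement
          System.out.println(sizeGroup + "/" + index + " -> " + bin(measuredSeconds));
      }
  }

The point would simply be that a label like small/2 (SHORT) tells someone up
front roughly how much time, and hence trace volume, to budget for, which seems
to be what Eliot is after.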