
Benchmarking Methodology

Robert Vesse

Methodology

Benchmarks are run against operation mixes. An operation mix contains one or more operations drawn from the set of operations supported by the API; the API also supports adding custom operations if the user desires.
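
The following is a minimal sketch, purely to illustrate the idea of an operation mix as a collection of operations. The type and method names are hypothetical and are not the classes actually exposed by the benchmarking API.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical types illustrating the operation mix concept only.
    interface Operation {
        String getName();
        void run(); // issues the operation against the system under test
    }

    class OperationMix {
        private final List<Operation> operations;

        OperationMix(Operation... operations) {
            this.operations = Arrays.asList(operations);
        }

        List<Operation> getOperations() {
            return operations;
        }
    }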

Prior to the actual timed runs, a number of warmup runs are performed (the default is 5) to exercise the system so that it can ramp up into a hot running state and show optimum performance. For some systems warmup may be irrelevant, or may require more runs, so the number of warmup runs is configurable.

The actual benchmark consists of some number of runs of the operation mix (the default is 25). By default the operations are run in a random order for each run to prevent the system under test (SUT) from learning the pattern of operations, aggressively caching, and thus gaming the benchmark; this randomisation may be turned off if desired. The default behaviour is that every operation in the mix is run once and only once for each run of the mix. Like most other aspects of the benchmark this is configurable: for example, it is possible to run only a subset of the operations for each run of the mix, to run some operations multiple times within a mix, or to configure how the mix is run entirely to your liking.
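
As a rough sketch of the randomised ordering described above (the class and method names here are illustrative, not the tool's actual code), each run can simply shuffle a copy of the operation list, falling back to the original order when randomisation is disabled:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Arrays;
    import java.util.List;

    class MixOrderSketch {
        // Returns the order in which operations are run for a single mix run.
        static List<String> orderForRun(List<String> operations, boolean randomise) {
            List<String> order = new ArrayList<>(operations);
            if (randomise) {
                Collections.shuffle(order); // different order each run
            }
            return order;
        }

        public static void main(String[] args) {
            List<String> mix = Arrays.asList("query1.rq", "query2.rq", "query3.rq");
            for (int run = 1; run <= 3; run++) {
                System.out.println("Run " + run + ": " + orderForRun(mix, true));
            }
        }
    }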

Each operation run records both the response time and the runtime for that run. Response time is considered to be the time from when the operation is issued to when it starts responding; different operations may define this differently, and some may not be able to differentiate response time from runtime. Runtime is calculated as the total time that the operation runs for, which, depending on the operation, includes the time to receive and parse the results in order to count them. This is in line with the methodology outlined in other SPARQL benchmarking reports such as http://www.revelytix.com/sites/default/files/TripleStorePerformanceTestingMethodolgy.pdf by Revelytix. Runtime is individual to an operation; operation mix runtimes are a summation of the runtimes of each individual operation rather than the time taken to run the entire mix, i.e. the mix runtime excludes, as far as possible, any overhead of the actual benchmarking process.
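
The sketch below illustrates the distinction just described; it is an assumption-based illustration of the methodology, not the tool's implementation:

    // Response time runs from issue to first response; runtime runs from issue
    // until all results have been received and counted. The mix runtime is the
    // sum of the individual operation runtimes, not the wall-clock time of the
    // whole mix, so benchmarking overhead is excluded as far as possible.
    class TimingSketch {
        static class OperationTiming {
            long responseTimeNanos; // issue -> first response
            long runtimeNanos;      // issue -> all results consumed and counted
        }

        static long mixRuntime(OperationTiming[] timings) {
            long total = 0;
            for (OperationTiming t : timings) {
                total += t.runtimeNanos; // summation of per-operation runtimes
            }
            return total;
        }
    }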

In some cases we have seen that the overhead of retrieving the operation results may significantly exceed the time taken for the SUT to actually execute the operation. We provide several options for configuring the result formats requested, so you may wish to choose whichever format is fastest for the SUT you are benchmarking; see the General Notes pages to learn about the supported result formats. You may also wish to skip the counting of results or to apply a fixed limit on the number of results for each query; see the list of options for the relevant options. Even when these options are applied they may not apply to all types of operation.

As the benchmarker runs it reports statistics for each operation mix run. Once all runs have completed it discards the best and worst N results as outliers (N defaults to 1 but is configurable) before calculating statistics both per operation and for the complete operation mix.
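
A minimal sketch of this trimming step, assumed from the description above rather than taken from the tool's source, might look like this:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class OutlierTrimmingSketch {
        // Drops the N fastest and N slowest mix runtimes before statistics
        // are calculated. Assumes there are more than 2*N runs available.
        static List<Long> trim(List<Long> mixRuntimes, int outliers) {
            List<Long> sorted = new ArrayList<>(mixRuntimes);
            Collections.sort(sorted);
            if (sorted.size() <= outliers * 2) {
                return sorted; // not enough runs to discard outliers
            }
            return sorted.subList(outliers, sorted.size() - outliers);
        }
    }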

The statistics currently calculated are as follows (a sketch of the core calculations follows the list):
- Total Runs
- Total Errors, including total errors by category
- Total Results
- Average Results
- Total Response Time
- Average Response Time (Arithmetic Mean)
- Total Runtime
- Average Runtime (Arithmetic Mean)
- Average Runtime (Geometric Mean)
- Minimum Runtime
- Maximum Runtime
- Runtime Variance (Population Variance)
- Runtime Standard Deviation (Population Standard Deviation)
- Operations per Second
- Operations per Hour
- Operation Mixes per Hour
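
The formulas behind these figures are standard; the sketch below shows one plausible way to compute the averages, the population variance and standard deviation, and the throughput figures from a list of per-run runtimes in nanoseconds. It is illustrative only and the tool's actual implementation may differ.

    import java.util.List;

    class StatsSketch {
        static double arithmeticMean(List<Long> runtimes) {
            double sum = 0;
            for (long r : runtimes) sum += r;
            return sum / runtimes.size();
        }

        static double geometricMean(List<Long> runtimes) {
            // computed via logarithms to avoid overflow on long runs
            double logSum = 0;
            for (long r : runtimes) logSum += Math.log(r);
            return Math.exp(logSum / runtimes.size());
        }

        static double populationVariance(List<Long> runtimes) {
            double mean = arithmeticMean(runtimes);
            double sumSq = 0;
            for (long r : runtimes) sumSq += (r - mean) * (r - mean);
            return sumSq / runtimes.size(); // population variance (divide by n)
        }

        static double populationStdDev(List<Long> runtimes) {
            return Math.sqrt(populationVariance(runtimes));
        }

        static double operationsPerSecond(double averageRuntimeNanos) {
            return 1_000_000_000.0 / averageRuntimeNanos;
        }

        static double operationsPerHour(double averageRuntimeNanos) {
            return operationsPerSecond(averageRuntimeNanos) * 3600;
        }
    }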

If performing a multi-threaded benchmark then additional statistics are included; see the subsequent section for details.

As well as printing information to stdout, the benchmark command (and the API that underpins it) can also generate a CSV and an XML file at the end of the benchmarking process. These files contain overall statistics as well as per-run and per-query statistics. Additionally they may also gather environmental and settings information for the runs.

Single vs Multi-Threaded Benchmarking

By default the benchmarker runs in single-threaded mode, so only a single operation will ever be running at one time. The benchmarker can be made to run in multi-threaded mode by configuring a desired number of parallel threads. When running in this mode the entire operation mix is run in parallel, so there may be up to N mixes, and thus N operations, running at one time, where N is the number of threads specified by the user.
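
The following is an illustrative sketch of this scheme using a fixed-size thread pool; it is not the tool's actual implementation, and the names are hypothetical:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ParallelBenchmarkSketch {
        // Runs totalMixRuns complete mix runs across up to `threads` threads,
        // so at any moment there may be up to `threads` operations in flight.
        static void runParallel(int threads, int totalMixRuns, Runnable mix) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < totalMixRuns; i++) {
                futures.add(pool.submit(mix)); // each task is one full mix run
            }
            for (Future<?> f : futures) {
                f.get(); // wait for all mix runs to finish
            }
            pool.shutdown();
        }
    }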

Please note that the random delay between operations may mean that there are fewer than N operations actually running, so you may wish to disable delays by using -d 0 or --delay 0 to ensure that there are always N operations running when performing a multi-threaded benchmark. You may also wish to disable the randomisation option so that mixes are more likely to be running the same operation simultaneously, though this will not necessarily simulate real-world usage of a system very well.

Multi-threaded benchmarking allows you to put additional strain on your system so you may see
different performance figures versus single-threaded benchmarking.

When performing multi-threaded benchmarking, additional statistics are generated; these are as follows:
- Actual Runtime
- Actual Average Runtime (Arithmetic Mean)
- Actual Operations per Second
- Actual Operations per Hour
- Actual Operation Mixes per Hour

These are equivalent to the single-threaded metrics except that they account for the parallelisation of operations. Typically these metrics will show better figures than the single-threaded ones, though this may depend on the system and the operation. Multi-threaded statistics include a tiny fraction of benchmarking overhead which single-threaded statistics do not, so in some cases the figures may be higher.
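
As a sketch of the distinction, assumed from the description above rather than taken from the tool's source, the actual figures can be derived from wall-clock time for the whole benchmark rather than from the summed per-operation runtimes:

    class ActualMetricsSketch {
        // Throughput based on wall-clock time, so it reflects the benefit of
        // running operations in parallel.
        static double actualOperationsPerSecond(long totalOperations, long wallClockNanos) {
            return totalOperations / (wallClockNanos / 1_000_000_000.0);
        }

        static double actualOperationsPerHour(long totalOperations, long wallClockNanos) {
            return actualOperationsPerSecond(totalOperations, wallClockNanos) * 3600;
        }
    }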

Publishing Results

It is strongly suggested that anyone using this tool to publish results follows best practices with regards to transparency. We recommend disclosing the technical specifications of the SUT being benchmarked and the hardware it was run on, where the benchmarker was run relative to the benchmarked system (typically we'd expect the two to be run on the same machine or on machines in the same LAN), and the full output of the benchmarker (ideally the CSV/XML output plus the console output).


Related

Wiki: CLI
Wiki: Introduction
