Performance

Philip Roth Crystal Jernigan

Overview of Performance Infrastructure

Xolotl is being developed with a built-in performance monitoring infrastructure. Performance is monitored using a modified General Purpose Timing Library (GPTL) in conjunction with the Performance Application Programming Interface (PAPI) library. More information about these libraries can be found on the GPTL and PAPI homepages.

The performance infrastructure of Xolotl was designed to support "always on" collection of performance data. To support this type of collection, a handler registry interface was created through which performance data is obtained and reported. Two handler registries, standard and dummy, implement this interface. The standard handler registry monitors the performance of Xolotl and produces the corresponding performance information, while the dummy handler registry does nothing and serves only as a placeholder when performance monitoring is unnecessary. The dummy implementation makes performance monitoring optional, with no code refactoring required.

The handler registry acquires performance data through implementations of three performance interfaces: timer, event counter, and hardware counter. Which timers, event counters, and hardware counters are created depends on the handler registry in use; e.g., a DummyEventCounter corresponds to the DummyHandlerRegistry.
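
The interface-plus-two-implementations design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Xolotl's actual code: the names IEventCounter, CountingEventCounter, NoOpEventCounter, IRegistry, CountingRegistry, NoOpRegistry, and makeRegistry are all hypothetical, and only event counters are shown (the real registries also hand out timers and hardware counters).

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical counter interface: callers only ever see this type.
class IEventCounter {
public:
    virtual ~IEventCounter() = default;
    virtual void increment() = 0;
    virtual unsigned long value() const = 0;
};

// "Standard" flavor: actually counts events.
class CountingEventCounter : public IEventCounter {
public:
    void increment() override { ++count_; }
    unsigned long value() const override { return count_; }
private:
    unsigned long count_ = 0;
};

// "Dummy" flavor: same interface, but every call is a no-op.
class NoOpEventCounter : public IEventCounter {
public:
    void increment() override {}
    unsigned long value() const override { return 0; }
};

// Hypothetical registry interface that hands out counters by name.
class IRegistry {
public:
    virtual ~IRegistry() = default;
    virtual std::shared_ptr<IEventCounter> getEventCounter(const std::string& name) = 0;
};

// Returns the same counting counter each time a name is requested.
class CountingRegistry : public IRegistry {
public:
    std::shared_ptr<IEventCounter> getEventCounter(const std::string& name) override {
        auto& slot = counters_[name];
        if (!slot) slot = std::make_shared<CountingEventCounter>();
        return slot;
    }
private:
    std::map<std::string, std::shared_ptr<IEventCounter>> counters_;
};

// Always returns a do-nothing counter.
class NoOpRegistry : public IRegistry {
public:
    std::shared_ptr<IEventCounter> getEventCounter(const std::string&) override {
        return std::make_shared<NoOpEventCounter>();
    }
};

// The only place that knows which flavor is in use.
std::shared_ptr<IRegistry> makeRegistry(bool monitoringEnabled) {
    if (monitoringEnabled) return std::make_shared<CountingRegistry>();
    return std::make_shared<NoOpRegistry>();
}
```

Because instrumented code depends only on the interfaces, switching between the two flavors requires changing only the argument to the factory, which is the sense in which monitoring is optional without refactoring.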

The timer interface is responsible for collecting all timing statistics including, but not limited to, execution time, call times, and time spent in a specific section of the code. Timing statistics can reveal performance bottlenecks, i.e., areas that contribute significantly to execution time and therefore affect the overall performance of the code.
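
To make the start/stop style of timing concrete, here is a minimal sketch of an accumulating wallclock timer built on std::chrono. The class name SimpleTimer and its methods are hypothetical and are not Xolotl's timer classes, which delegate to GPTL; the sketch only illustrates the kind of data (accumulated time and call count) a timer collects.

```cpp
#include <chrono>

// Hypothetical accumulating timer: sums the time between start() and stop().
class SimpleTimer {
public:
    void start() {
        begin_ = Clock::now();
        running_ = true;
    }

    void stop() {
        if (!running_) return;            // ignore a stop() with no matching start()
        total_ += Clock::now() - begin_;  // accumulate across repeated calls
        running_ = false;
        ++calls_;
    }

    // Total accumulated wallclock time in seconds.
    double seconds() const {
        return std::chrono::duration<double>(total_).count();
    }

    // Number of completed start()/stop() pairs.
    unsigned long calls() const { return calls_; }

private:
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    Clock::time_point begin_{};
    Clock::duration total_{0};
    bool running_ = false;
    unsigned long calls_ = 0;
};
```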

The event counter interface is responsible for collecting the frequency of specific events or function calls. Not only does this gather performance data on how many times each event occurred, but it also reveals procedures or functions that are never called.

Similarly, the hardware counter interface gathers information about specific hardware events, which are obtained using PAPI via the GPTL library. The hardware events of particular interest are the total number of cache misses at each level (L1, L2, L3) of the cache hierarchy, the number of branch mispredictions, the total number of instructions and cycles executed, and the number of floating point instructions and operations executed.

Using This Performance Infrastructure in Xolotl

This section describes how the performance infrastructure described in the previous section is used in Xolotl.

First, the type of handler registry to use is specified with a command line argument when running Xolotl, using either of the following:

--perfHandler std : (default) specifies the StandardHandlerRegistry
--perfHandler dummy : specifies the DummyHandlerRegistry

The --help option can be passed if additional information regarding the performance argument is needed.
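
As an illustration of how such an argument might be interpreted, the sketch below maps the two documented values to an enumeration and falls back to the standard handler when the option is absent. The function parsePerfHandler and the enum PerfHandlerKind are hypothetical names introduced here; Xolotl's actual option handling may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical enumeration of the two documented handler choices.
enum class PerfHandlerKind { Std, Dummy };

// Scans the arguments for "--perfHandler <value>".
// Returns Std when the option is absent, since "std" is the documented default.
PerfHandlerKind parsePerfHandler(const std::vector<std::string>& args) {
    PerfHandlerKind kind = PerfHandlerKind::Std;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] != "--perfHandler") continue;
        if (i + 1 >= args.size()) {
            throw std::invalid_argument("--perfHandler needs a value");
        }
        const std::string& value = args[++i];
        if (value == "std") {
            kind = PerfHandlerKind::Std;
        } else if (value == "dummy") {
            kind = PerfHandlerKind::Dummy;
        } else {
            throw std::invalid_argument("unknown perf handler: " + value);
        }
    }
    return kind;
}
```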

This argument determines which handler registry will be used in Xolotl's main program. A single instance of either StandardHandlerRegistry or DummyHandlerRegistry is created accordingly. Additionally, the hardware counters to be monitored (the choices are: L1_CACHE_MISS, L2_CACHE_MISS, L3_CACHE_MISS, BRANCH_MISPRED, TOTAL_CYCLES, TOTAL_INSTRUC, FLPT_INSTRUC, FP_OPS) need to be specified in the main program of Xolotl, which can be done as follows:

std::vector<xolotlPerf::HardwareQuantities> hwq;
hwq.push_back( xolotlPerf::FP_OPS );
hwq.push_back( xolotlPerf::L3_CACHE_MISS );
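
For orientation, the hardware quantity names above plausibly correspond to the PAPI preset events shown in the sketch below. The enum HardwareQuantity is a stand-in for xolotlPerf::HardwareQuantities, the function papiPresetName is hypothetical, and the mapping itself is an assumption inferred from the descriptions on this page rather than taken from Xolotl's source. Only the PAPI preset names are standard.

```cpp
#include <string>

// Stand-in for xolotlPerf::HardwareQuantities (assumed, for illustration only).
enum HardwareQuantity {
    L1_CACHE_MISS,
    L2_CACHE_MISS,
    L3_CACHE_MISS,
    BRANCH_MISPRED,
    TOTAL_CYCLES,
    TOTAL_INSTRUC,
    FLPT_INSTRUC,
    FP_OPS
};

// Assumed mapping from a hardware quantity to a PAPI preset event name.
std::string papiPresetName(HardwareQuantity q) {
    switch (q) {
        case L1_CACHE_MISS:  return "PAPI_L1_TCM";   // level 1 total cache misses
        case L2_CACHE_MISS:  return "PAPI_L2_TCM";   // level 2 total cache misses
        case L3_CACHE_MISS:  return "PAPI_L3_TCM";   // level 3 total cache misses
        case BRANCH_MISPRED: return "PAPI_BR_MSP";   // mispredicted branches
        case TOTAL_CYCLES:   return "PAPI_TOT_CYC";  // total cycles
        case TOTAL_INSTRUC:  return "PAPI_TOT_INS";  // instructions completed
        case FLPT_INSTRUC:   return "PAPI_FP_INS";   // floating point instructions
        case FP_OPS:         return "PAPI_FP_OPS";   // floating point operations
    }
    return "";  // unreachable for valid enumerators
}
```

Not every preset is available on every processor, so it is worth checking the output of the papi_avail utility on the target machine before choosing counters.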

The StandardHandlerRegistry uses the GPTL (and PAPI) libraries to access performance data. To gather this information, GPTL has a couple of initial requirements. First, the StandardHandlerRegistry indicates the hardware quantities to be monitored by calling GPTLsetoption( hardware quantity ) for each quantity. Once all GPTL options have been set, the StandardHandlerRegistry calls GPTLinitialize to initialize the library. Note that GPTL requires all options to be set before it is initialized, which is why the Xolotl performance data collection infrastructure must specify which hardware quantities to collect at the time it creates the HandlerRegistry.

Following the definition of the vector of HardwareQuantities in the main program, the handler registry of the type specified by the command line argument described previously needs to be created, after which the corresponding timers, event counters, and hardware counters are created. Note that, in order to avoid problems with overlapping Timer scopes, MPI should always be initialized before the handler registry is created, and finalized only at the end of the program. Below is an example of this process as it would appear next in Xolotl's main program:

MPI_Init( &argc, &argv );

auto handlerRegistry = xolotlPerf::getHandlerRegistry();
auto totalTimer = handlerRegistry->getTimer( "total" );
totalTimer->start();

// do some work

auto solverTimer = handlerRegistry->getTimer( "solve" );
solverTimer->start();
solver.solve();
solverTimer->stop();

// do some work

totalTimer->stop();

// output the performance data from the rank 0 process only
int rank = 0;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
if( rank == 0 ) {
    handlerRegistry->dump( std::cout );
}

MPI_Finalize();
// end of program
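
One way to keep timer scopes properly nested, as the example above does by hand, is a small scope guard that starts a timer on construction and stops it on destruction. The sketch below is an assumption-laden illustration: ScopedTimer, RecordingTimer, and runNestedExample are hypothetical names, and RecordingTimer merely logs calls in place of a timer obtained from the handler registry. The guard itself only requires that the wrapped pointer expose start() and stop().

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registry timer: records calls instead of timing.
struct RecordingTimer {
    std::string name;
    std::vector<std::string>* log;
    void start() { log->push_back("start " + name); }
    void stop() { log->push_back("stop " + name); }
};

// Starts the timer on construction and stops it when the scope ends,
// so inner timers always stop before the outer timers that contain them.
template <typename TimerPtr>
class ScopedTimer {
public:
    explicit ScopedTimer(TimerPtr timer) : timer_(std::move(timer)) {
        timer_->start();
    }
    ~ScopedTimer() { timer_->stop(); }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimerPtr timer_;
};

// Mirrors the "total" / "solve" nesting from the example above.
std::vector<std::string> runNestedExample() {
    std::vector<std::string> log;
    auto total = std::make_shared<RecordingTimer>(RecordingTimer{"total", &log});
    auto solve = std::make_shared<RecordingTimer>(RecordingTimer{"solve", &log});
    {
        ScopedTimer<std::shared_ptr<RecordingTimer>> totalScope(total);
        {
            ScopedTimer<std::shared_ptr<RecordingTimer>> solveScope(solve);
            // the solver would run here
        }
    }
    return log;
}
```

With this pattern an early return or an exception still stops the timer, at the cost of tying each timed region to a C++ block scope.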

Configure Requirements

  • If GPTL is configured with OPENMP=yes, Xolotl must also be built with OpenMP support, which can be done by using configure variables such as:

CXX=mpicxx cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS_DEBUG="-g -fopenmp" -DCMAKE_EXE_LINKER_FLAGS_DEBUG="-g -fopenmp" ../xolotl/

Note that if you use a different CMAKE_BUILD_TYPE (e.g., CMAKE_BUILD_TYPE=RelWithDebInfo), the names of the CMAKE_CXX_FLAGS_ and CMAKE_EXE_LINKER_FLAGS_ variables must be adjusted accordingly.

Limitations

The caveats for the current performance data collection infrastructure for Xolotl are:

  • Timer and hardware counter data is currently output by GPTL into files, one per MPI process, named ‘timing.n’ where n is the MPI rank number.
  • Xolotl outputs EventCounter data collected by the rank 0 process. In the future, the data will be aggregated from all MPI processes before output.
  • If your version of GPTL was built with PMPI support, GPTL will still produce ‘timing.n’ files with the MPI timings even if you have run Xolotl using the dummy handlers. We are investigating ways to modify GPTL to disable this feature, but a workaround is to build two versions of GPTL: one with PMPI support and one without, and to use the non-PMPI version when running with the dummy performance data collection infrastructure.
  • If you are NOT using GPTL 5.3, there are some combinations of PAPI hardware counters that cause GPTL to fail. Based on our testing, it appears that any combination of two or fewer hardware counters will succeed, and some combinations of more than two counters will succeed. For instance, (using the Xolotl counter names) GPTL can successfully monitor {L1_CACHE_MISS, FP_OPS} but cannot monitor {L1_CACHE_MISS, L3_CACHE_MISS, FP_OPS}. However, it can monitor {L1_CACHE_MISS, L2_CACHE_MISS, L3_CACHE_MISS}. We are investigating the cause of this problem.
  • If you are using GPTL 5.3, a call to GPTLsetoption(GPTLmultiplex, 1) must be used to enable PAPI multiplexing in order to monitor any combination of more than two PAPI hardware counters. It should be noted that multiplexing events reduces the accuracy of the reported results.
  • In the current Xolotl build configuration/implementation, the standard performance data collection classes are built if and only if both GPTL and PAPI are available. In the future, we plan to support systems with GPTL but not PAPI, with the limitation that no hardware counter data may be collected on such systems.
  • PAPI versions other than 5.3.0 have caused problems with Xolotl, either producing inaccurate performance data or breaking the build completely.

Performance Output Example

The performance statistics of Xolotl are output to files named either "timing.n" or "perfData.n" where n is the MPI rank number.

A preamble is provided at the beginning of the output file that displays some settings GPTL was built with (e.g. "ENABLE_PMPI was false") along with PAPI options that were set (e.g. if PAPI multiplexing was on or off and which hardware counters are to be monitored). Additionally, an explanation is given describing the statistics that are shown.

The performance statistics for each thread are shown, with the first column listing the specified timer names. An indented timer name indicates that the timer is contained in the preceding non-indented timer. For instance, in the example output (Ex_XolotlPerfOutput.png), "loadNetwork" is contained in "total."

The column labeled "Called" indicates the number of times the corresponding timer was called, and the "Recurse" column gives the number of recursive timer calls. The next column, "Wallclock," displays the total wallclock time for each timer; the "max" and "min" columns show the maximum and minimum amount of time spent in the recursive calls (in the example output there are no recursive calls, so the "max" and "min" columns are the same as "Wallclock"). The "%_of_total" column displays the percentage of time spent in each timer compared to the first timer (in this case, "total"). The last two columns list the PAPI hardware counter statistics.
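
The "%_of_total" relationship described above can be written out as a short calculation. The function percentOfTotal is a hypothetical helper, not part of GPTL or Xolotl, and it assumes only what the description states: each timer's wallclock time is divided by that of the first timer.

```cpp
#include <vector>

// Percentage of each timer's wallclock time relative to the first timer,
// which plays the role of "total" in the report described above.
// Returns an empty vector if there is no usable reference time.
std::vector<double> percentOfTotal(const std::vector<double>& wallclock) {
    std::vector<double> percent;
    if (wallclock.empty() || wallclock.front() <= 0.0) return percent;
    percent.reserve(wallclock.size());
    for (double t : wallclock) {
        percent.push_back(100.0 * t / wallclock.front());
    }
    return percent;
}
```

For example, with made-up wallclock times of 8.0 s for the first timer and 2.0 s for a timer nested inside it, the nested timer accounts for 25% of the total.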
