From: Josef W. <Jos...@gm...> - 2003-07-28 18:20:13
On Monday 28 July 2003 18:25, Jeremy Fitzhardinge wrote:
> [...]
> Basically people have completely given up on the idea of how many
> "cycles" a particular instruction takes and mostly given up on the idea
> of cycles for groups of instructions. The innards of the CPU are not
> well enough documented to really simulate, and even if you could, each
> CPU is different enough that it would take a lot of work for each one.
>
> If you're interested in running time, the only meaningful measurement
> you can make is run the code and see how long it takes to run. You can
> use the performance counters to glean information about why a particular
> sequence ran slower than expected by looking for "bad" events (cache
> misses, interlocks, stalls, etc), and try to work out what code caused
> them.
>
> The other difficulty with measurement is that the instructions which
> read the timer/counter registers are not necessarily synchronized with
> the code stream, or if they are, are so expensive that they upset the
> thing you're trying to measure. The best way to profile the code is to
> take a small kernel which you run many times to amortize the cost of
> doing measurement.

Isn't statistical sampling the best way to do profiling? The interrupt
then only perturbs the CPU every, say, 1 million cache misses. As you
say, events make the most sense related to the current instruction
stream, not to a single instruction, so the interrupt handler should
relate the event to the current stream. Unfortunately, counting events
against a stack trace (e.g. the last 4 callers) is not enough, as the
current function can contain conditional jumps and loops. The outcomes
of the last few conditional branches executed would have to be stored
(in hardware?) to allow reconstructing the instruction stream leading
up to a given position. But I suppose this would produce way too much
data?

> On the plus side, cache misses are so expensive they tend to dominate
> execution time. If you use cachegrind to reduce your miss rate, that
> may be enough to make things go faster.

That seems to be true. But can't a good compiler hide cache-miss
latencies by prefetching, or by using the result of a memory fetch only,
say, 20 instructions later? In that case, your code would already be as
fast as possible, but the cachegrind results would suggest that
improvements are still possible.

Josef