From: Jeremy F. <je...@go...> - 2003-07-28 16:25:19
On Mon, 2003-07-28 at 02:53, Joerg Beyer wrote:
> Vincent Penquerc'h wrote:
> >> OK, so an implementation to do cycle counting needs to have
> >> a table that lists, for every instruction, how many cycles it
> >> needs in the different cache-hit/miss situations.
> >>
> >> Is this information available (for all the processors)?
> >
> > Yes, but there is an awful lot of possible combinations. That
>
> What are the conditions? Is it this or more?
> * L1 hit, L2 hit, cache miss
> * branch prediction was true or false

I'm afraid it is much, much more complicated than this. There's a lot
more to it than "branch prediction", since modern CPUs will
speculatively execute way in advance. There's the TLB, and whether
you're getting TLB misses. There's the breakdown of an instruction
into uops, how those uops are handled by the various functional units,
and what conflicts they have.

Basically, people have completely given up on the idea of counting how
many "cycles" a particular instruction takes, and have mostly given up
on the idea of cycle counts for groups of instructions. The innards of
the CPU are not well enough documented to really simulate, and even if
you could, each CPU is different enough that it would take a lot of
work for each one.

If you're interested in running time, the only meaningful measurement
you can make is to run the code and see how long it takes. You can use
the performance counters to glean information about why a particular
sequence ran slower than expected, by looking for "bad" events (cache
misses, interlocks, stalls, etc.) and trying to work out what code
caused them.

The other difficulty with measurement is that the instructions which
read the timer/counter registers are not necessarily synchronized with
the instruction stream, or, if they are, are so expensive that they
upset the thing you're trying to measure. The best way to profile is
to take a small kernel and run it many times, to amortize the cost of
the measurement.
On the plus side, cache misses are so expensive that they tend to
dominate execution time. If you use cachegrind to reduce your miss
rate, that alone may be enough to make things go faster.

J
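P.S. For anyone new to it, a typical cachegrind run looks like this
(./myprog is a placeholder for your binary; the --tool= spelling is
for current Valgrind releases):

```shell
# simulate the I1/D1/L2 caches; hit/miss totals are printed at exit
valgrind --tool=cachegrind ./myprog

# annotate the sources with per-line miss counts; the output
# file name includes the pid of the run
cg_annotate cachegrind.out.<pid>
```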