|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 09:46:20
|
> On the contrary, it's extremely relevant from a practical
> point of view.

Yes, and it might be sensible to add it either to cachegrind or to a
variant of it, since cache misses/hits will be needed for this to be
accurate.

> It's almost impossible to know how long any one instruction will take.
> For example, on my machine, a memory read takes 1 cycle if it hits the
> L1 cache, 10 cycles in the worst case if it only hits the L2 cache,
> and 206 cycles in the worst case if it misses both caches. But out-of-order

as you mention here.

> execution and all the other fancy modern CPU mumbo-jumbo means that
> these delays are rarely as bad as the worst case. Then throw in branch
> mispredictions, and other pipeline stalls... it's a mess.

Branch prediction algorithms are known (well, they were for the
Pentium, when I last did asm stuff). Stalls have well-defined
conditions (AGIs, etc., are predictable). So it would be doable.
After all, Vtune does (did, at least) this.

--
Vincent Penquerc'h
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 09:48:00
|
> OK, so an implementation to do cycle counting needs to have
> a table that lists for every instruction how many cycles it
> needs for the different cache-hit/miss situations.
>
> Is this information available (for all the processors)?

Yes, but there is an awful lot of possible combinations. It would be
nifty to try out a piece of time-critical code on various architectures
to see how it fares, since one would usually optimize for one's dev
machine...

--
Vincent Penquerc'h
|
From: Joerg B. <jo...@we...> - 2003-07-28 09:54:12
|
Vincent Penquerc'h wrote:
>> OK, so an implementation to do cycle counting needs to have a table
>> that lists for every instruction how many cycles it needs for the
>> different cache-hit/miss situations.
>>
>> Is this information available (for all the processors)?
>
> Yes, but there is an awful lot of possible combinations.

What are the conditions? Is it this or more?
* L1 hit, L2 hit, cache miss
* branch prediction was true or false

and all this per instruction, per processor.

> It would be nifty to try out a piece of time-critical code on various
> architectures to see how it fares, since one would usually optimize
> for one's dev machine...

I thought the assumption was "all instructions take the same amount
of time", but you told me that this is too easy :-)

Joerg
|
From: Jeremy F. <je...@go...> - 2003-07-28 16:25:19
|
On Mon, 2003-07-28 at 02:53, Joerg Beyer wrote:
> Vincent Penquerc'h wrote:
> >> OK, so an implementation to do cycle counting needs to have a table
> >> that lists for every instruction how many cycles it needs for the
> >> different cache-hit/miss situations.
> >>
> >> Is this information available (for all the processors)?
> >
> > Yes, but there is an awful lot of possible combinations.
>
> What are the conditions? Is it this or more?
> * L1 hit, L2 hit, cache miss
> * branch prediction was true or false

I'm afraid it is much, much more complicated than this. There's a lot
more to it than "branch prediction", since modern CPUs will
speculatively execute way in advance. There's the TLB, and whether
you're getting TLB misses. There's the breakdown of an instruction
into uops, and how those uops are handled by the various functional
units, and what conflicts they have.

Basically people have completely given up on the idea of how many
"cycles" a particular instruction takes, and mostly given up on the
idea of cycles for groups of instructions. The innards of the CPU are
not well enough documented to really simulate, and even if you could,
each CPU is different enough that it would take a lot of work for
each one.

If you're interested in running time, the only meaningful measurement
you can make is to run the code and see how long it takes to run. You
can use the performance counters to glean information about why a
particular sequence ran slower than expected, by looking for "bad"
events (cache misses, interlocks, stalls, etc.) and trying to work out
what code caused them.

The other difficulty with measurement is that the instructions which
read the timer/counter registers are not necessarily synchronized with
the code stream, or, if they are, are so expensive that they upset the
thing you're trying to measure. The best way to profile the code is to
take a small kernel which you run many times, to amortize the cost of
doing measurement.

On the plus side, cache misses are so expensive they tend to dominate
execution time. If you use cachegrind to reduce your miss rate, that
may be enough to make things go faster.

J
|
From: Josef W. <Jos...@gm...> - 2003-07-28 18:20:13
|
On Monday 28 July 2003 18:25, Jeremy Fitzhardinge wrote:
> [...]
> Basically people have completely given up on the idea of how many
> "cycles" a particular instruction takes and mostly given up on the idea
> of cycles for groups of instructions. The innards of the CPU are not
> well enough documented to really simulate, and even if you could, each
> CPU is different enough that it would take a lot of work for each one.
>
> If you're interested in running time, the only meaningful measurement
> you can make is run the code and see how long it takes to run. You can
> use the performance counters to glean information about why a particular
> sequence ran slower than expected by looking for "bad" events (cache
> misses, interlocks, stalls, etc), and try to work out what code caused
> them.
>
> The other difficulty with measurement is that the instructions which
> read the timer/counter registers are not necessarily synchronized with
> the code stream, or if they are, are so expensive that they upset the
> thing you're trying to measure. The best way to profile the code is to
> take a small kernel which you run many times to amortize the cost of
> doing measurement.

Isn't the best way for profiling to do statistical sampling? The
interrupt only perturbs the CPU units every, say, 1 million cache
misses. As you say, events make most sense related to the current
instruction stream, not a single instruction. So the interrupt handler
should relate the event to the current stream. Unfortunately, counting
events related to a stack trace (e.g. the last 4 callers) is not
enough, as the current function can contain conditional jumps and
loops. The outcomes of the last few conditional branches executed would
have to be stored (in hardware?) to allow reconstructing the
instruction stream leading to a given position. But I suppose this
would produce way too much data?

> On the plus side, cache misses are so expensive they tend to dominate
> execution time. If you use cachegrind to reduce your miss rate, that
> may be enough to make things go faster.

That seems to be true. But can't a good compiler hide cache miss
latencies by prefetching, or by using the result of a memory fetch
e.g. 20 instructions later? In this case, your code already would be as
fast as possible, but cachegrind results would suggest that there still
are improvements possible.

Josef
|
From: Nicholas N. <nj...@ca...> - 2003-07-28 21:23:59
|
On Mon, 28 Jul 2003, Josef Weidendorfer wrote:
> That seems to be true. But can't a good compiler hide cache miss
> latencies by prefetching, or by using the result of a memory fetch
> e.g. 20 instructions later?

It's hard to hide a 200-cycle L2 miss...

N
On Mon, 28 July 2003 20:20, Josef Weidendorfer wrote:
> That seems to be true. But can't a good compiler hide cache miss
> latencies by prefetching, or by using the result of a memory fetch
> e.g. 20 instructions later?

It's not that easy.

1. Prefetching works only for regular accesses; if access addresses
depend on recently loaded data then there is no way to prefetch. As
various studies show, address prediction rates are very low -- 10-20%
hits (compare this with jump prediction rates as high as 95-96%). Even
proposed (but not yet implemented) techniques of speculatively
launching a pre-computing thread, which could mimic/predict actual
application behaviour, allow for accuracy no higher than 40%.

2. An L2 miss on a 3GHz P4 costs about 250 cycles, which is probably
worth about 300 to 500 instructions. No OoO logic is capable of
bypassing such a stall (while L1 misses, typically worth 19 cycles on
the P4, have a high potential of being mostly (>50%) hidden by the
scheduling logic). Even with a cache hit rate of 99.8%, the CPU could
still spend about 30% of its time doing nothing, waiting for memory
accesses to finish. And this is not that bad -- the Alpha 21164, an
in-order RISC CPU without any features to speculatively deal with
misses (the Itanium, or even Transmeta's Crusoe, while being in-order,
have mechanisms to issue reads speculatively), spent about 50% of its
time waiting on memory (but it suffered 100% on every miss, even an L1
miss, due to the lack of either OoO or speculation logic).

> In this case, your code already would be as fast as possible, but
> cachegrind results would suggest that there still are improvements
> possible.

The only known way to deal with the problem is to execute multiple
threads at the same time -- the odds are that when one thread stalls,
the others are still capable of executing, thus increasing CPU
utilisation and total computational throughput.

There even exists(ed) one supercomputer system which didn't suffer from
memory latency at all (while it lacked any kind of cache and jump
prediction logic). The solution was simple: the CPU kept the state of
128 threads, and each cycle an instruction from another thread was
executed. That way, consecutive instructions from one thread were
always separated by enough time to resolve any conditional jump and to
have all the data from memory in place. A nice design, but completely
infeasible for general-purpose computing (that machine was designed for
very highly parallelisable tasks, and dealt with them wonderfully).

But there is currently no way to bypass memory latency at the level of
a single thread (you can increase throughput, but you can't lower the
execution time of individual threads). The only aid comes from actually
reducing memory latency (but that's hard -- AMD recently did a good job
on their K8 family CPUs, a latency improvement of about ~30%, but that
still means that a cache miss will cost 200 to 350 instructions instead
of 300 to 500).

Oh, maybe someone will find a way to effectively implement a scheduler
buffer big enough to accommodate 1000-2000 instructions and pair it
with deep memory-access reordering logic -- then it would help.

rgds
Sebastian
|
From: Laurent D. <Lau...@es...> - 2003-07-28 09:58:29
|
Joerg Beyer wrote:
> OK, so an implementation to do cycle counting needs to have
> a table that lists for every instruction how many cycles it
> needs for the different cache-hit/miss situations.
Tables are not an option as P4 for instance is dynamically
scheduled (out of order) with dozens of instructions being
considered for launching...
> Is this information available (for all the processors)?
Not all information is available, at least not enough to
give an exact cycle count in every situation (look for some
posts in comp.arch).
Vincent Penquerc'h wrote:
> After all, Vtune does (did, at least) this.
Vtune relies on performance counters, doesn't it?
Laurent
|
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 10:07:05
|
> What are the conditions? Is it this or more?
> * L1 hit, L2 hit, cache miss
> * branch prediction was true or false
>
> and all this per instruction, per processor.

Cache sizes and architecture, BTB size. Some CPUs have different rules
for pairing, stalls, etc., even within the "Pentium" denomination. AMD,
Cyrix. With or without MMX/SSE/3dnow, or the FPU, if you're going back
to the 486. The main killer is that rules can change between processor
steppings, yielding a huge number of different rulesets. But the good
thing is that you don't need to implement them all to get a useful
tool. Make it data driven, and just plug in a new table for a new CPU.

--
Vincent Penquerc'h
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 10:23:43
|
> Tables are not an option as the P4 for instance is dynamically
> scheduled (out of order) with dozens of instructions being
> considered for launching...

Is this a known algorithm? My knowledge really stops with the Pentium,
so that's a bit outdated :)

> > After all, Vtune does (did, at least) this.
>
> Vtune relies on performance counters, doesn't it?

When I used it, it had two modes. One showed insn pairing, stalls,
etc.: a static analysis of the code. The second was a dynamic analysis
from running the code, which (probably) used the MSR counters. This I
never used, as it was too dog slow to be usable on my measly Pentium
100. Static analysis did not take into account anything like cache
behaviour, though, so this is where something like V would shine, by
combining the two approaches.

--
Vincent Penquerc'h
|
From: Sebastian K. <Seb...@so...> - 2003-07-28 11:10:59
|
From: "Vincent Penquerc'h" <Vin...@ar...>
> > Tables are not an option as P4 for instance is dynamically
> > scheduled (out of order) with dozens of instructions being
> > considered for launching...
>
> Is this a known algorithm ? My knowledge really stops with the
> Pentium, so that's a bit outdated :)
The problem is that the algorithm is determined by a set of constraints,
not by any step-by-step model. To get it right one would require a
cycle-exact CPU emulator.
Besides that, what is publicly available is only a rough idea, not the
exact algorithm. And there are many different x86 CPUs in the field
(not counting the 386 -- two core variants: Intel/AMD and Cyrix), and
all are different: 486 (two cache variants), P5, P5mmx, PPro (3 cache
variants), PII, PIII KNI, old mobile PIII (256KB L2, pre-Cu-mine), PIII
CuMine, PIII Tanner 512KB L2, PIII Tanner 256KB L2, Celeron no L2,
Celeron 'Mendocino', Celeron 'Cu-Mine-128', Celeron 256K-L2, PIV-like
Celeron (Willamette-128), finally the Northwood-based Celeron, PII Xeon
(2 cache variants), PPro Overdrive, old PIII Xeon (2 cache variants),
newer PIII Xeon (3 or 4 cache variants), Xeon DP (PIV-based, two cache
variants), Xeon MP (one or two cache variants); add to that the poor
but existent x86 layer of the Itanium and Itanium II (and now the
Itanium III, being mainly the I2 shrunk to a 0.13um process); then from
the AMD stable: 486 (both the same as Intel and with AMD's own cache
variant), 5x86, K5 (SSA-5) revision 1, K5 revision 2, K5 revision 3
(all the K5s very different), Nx586, K6, K6-2, K6-2 XT, K6-III, K6-2+,
K6-III+, Athlon K7, Athlon K75 0.25 & 0.18um versions (3 cache
variants), Athlon T-bird, Duron, Athlon XP & MP Palomino, Athlon XP &
MP T-bred, Athlon XP & MP Barton, Opteron; soon there will be the
Athlon 64, in 2 or 3 memory & cache variants from the beginning; then
from the Cyrix stable: 486SLC, 486DLC, 486 Blue-Lightning, Cx586,
Cx686, Cx686L, MII; then from the VIA-IDT family: Centaur C6, IDT
WinChip-II, IDT WinChip II 3D, VIA Cyrix III, VIA Cyrix III 'Nehemiah';
then Rise's CPUs; then UMC's own S5 (a 486 variant); then finally
Transmeta and their two versions of the Crusoe (and the upcoming model
-- quite different).
Now consider that there are different memory/chipset configurations (some
CPU's have more than 10 of them).
Now remember that some aspects of that performance depend on OS paging and
memory-allocation policy.
Now consider that newer PIV and Xeon variants are capable of executing 2
different threads at once (and those of course fight for CPU & bus
resources).
>
> > > After all, Vtune does (did, at least) this.
> >
> > Vtune relies on performance counters, doesn't it?
>
> When I used it, it had two modes, one which showed insn pairing,
> stalls, etc. Static analysis of the code. The second was a dynamic
> analysis from running the code, which (probably) used the MSR
> counters. This I never used, as it was too dog slow to be useable
> on my measly Pentium 100.
But using MSRs won't work under Valgrind -- because it will show info on
the instrumented code, which is of little use.
rgds
Sebastian Kaliszewski
|
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 10:38:14
|
> Hmm. Doable, maybe; but very, very difficult. Much harder than you
> might think at first.
>
> To count cycles you need to simulate pretty much everything: the whole
> pipeline, caches, TLBs, all that stuff. The SimpleScalar project
> (www.simplescalar.com) does that.

Yes, I realize there are an awful lot of things that need to be
described. And probably many more than I already have in mind :) Oh
well. I'll have a look at the URL above, sounds interesting, thanks.

--
Vincent Penquerc'h
|
From: Laurent D. <Lau...@es...> - 2003-07-28 10:39:53
|
Vincent Penquerc'h wrote:
> Is this a known algorithm ? My knowledge really stops with the
> Pentium, so that's a bit outdated :)
As I already wrote, not everything is documented. There
are even some parts of the dynamic scheduler that are random
(because a FIFO would have been too costly from a HW point
of view; read in comp.arch).
Regarding Vtune, my *guess* is that static pairing could
be done with the original Pentium as it was a statically
scheduled beast. Modern processors are a completely
different story...
To make things short, you won't be able to make a program
that can simulate a P4 cycle accurately (I would love to be
proven wrong :). All you can do is some educated guesses.
Laurent
|
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 10:41:46
|
> I don't think you're ever going to be able to get reliable cycle
> counts out of a Valgrind-based tool (unless that tool reads the
> performance counters :)

But it can't, since it emulates the code rather than executing it
under supervision, right?

--
Vincent Penquerc'h
|
From: Nicholas N. <nj...@ca...> - 2003-07-28 10:50:09
|
On Mon, 28 Jul 2003, Vincent Penquerc'h wrote:
> > I don't think you're ever going to be able to get reliable cycle
> > counts out of a Valgrind-based tool (unless that tool reads the
> > performance counters :)
>
> But it can't, since it emulates the code rather than executing it
> under supervision, right?

Oh, of course. Silly me.

N
|
From: Vincent Penquerc'h <Vin...@ar...> - 2003-07-28 11:00:47
|
> To make things short, you won't be able to make a program
> that can simulate a P4 cycle-accurately (I would love to be
> proven wrong :). All you can do is make some educated guesses.

All right, that's too bad, that'd have been quite handy. Thanks for
the explanation.

--
Vincent Penquerc'h
|
From: Nicholas N. <nj...@ca...> - 2003-07-28 10:23:34
|
On Mon, 28 Jul 2003, Vincent Penquerc'h wrote:
> Yes, and it might be sensible to add it either to cachegrind
> or to a variant of it, since cache misses/hits will be needed
> for this to be accurate.

Don't give too much credit to Cachegrind; what it says is only an
approximation. For example, it doesn't even try to take into account
virtual->physical address mappings. See
developer.kde.org/~sewardj/docs-1.9.5/cg_main.html#cg-top, section
4.12, for a full list of its known shortcomings (Nb: the one about
custom malloc() is not a problem since v1.9.6).

> Branch prediction algorithms are known (well, they were for the
> Pentium, when I last did asm stuff). Stalls have well-defined
> conditions (AGIs, etc., are predictable). So it would be doable.
> After all, Vtune does (did, at least) this.

Hmm. Doable, maybe; but very, very difficult. Much harder than you
might think at first.

To count cycles you need to simulate pretty much everything: the whole
pipeline, caches, TLBs, all that stuff. The SimpleScalar project
(www.simplescalar.com) does that.

N
|
From: Joerg B. <jo...@we...> - 2003-07-28 11:05:29
|
Nicholas Nethercote wrote:
> On Mon, 28 Jul 2003, Vincent Penquerc'h wrote:
>> Yes, and it might be sensible to add it either to cachegrind
>> or to a variant of it, since cache misses/hits will be needed
>> for this to be accurate.
>
> Don't give too much credit to Cachegrind, what it says is only an
> approximation.

But after all, the instruction count still is a reasonable value if
you want to verify that some source code optimization works as you
expected, right?

Joerg
|
From: Nicholas N. <nj...@ca...> - 2003-07-28 11:16:26
|
On Mon, 28 Jul 2003, Joerg Beyer wrote:
> But after all, the instruction count still is a reasonable value if
> you want to verify that some source code optimization works as you
> expected, right?

Most of the time, probably, but fewer instructions doesn't always mean
fewer cycles. I would use just 'time', or cycles (e.g. counted with
performance counters, via Rabbit or a similar tool), to evaluate an
optimisation. Although that only gives you results on your dev
machine. That's one of the biggest problems with all this complicated
hardware.

N
|
From: Bastien C. <ba...@ch...> - 2003-07-28 20:09:15
|
On Monday 28 July 2003 13:16, Nicholas Nethercote wrote:
> [...]
> optimisation. Although that only gives you results on your dev machine.
> That's one of the biggest problems with all this complicated hardware.
Why am I becoming nostalgic right now? When I think that I knew each and every
cycle that any instruction took on the C64 6502 ...
Ok, scrap that, going off topic :-)
Salut,
Bastien
--
-- The universe has its own cure for stupidity. --
-- Unfortunately, it doesn't always apply it. --
|
|
From: Nicholas N. <nj...@ca...> - 2003-07-28 21:29:35
|
On Mon, 28 Jul 2003, Bastien Chevreux wrote:
> Why am I becoming nostalgic right now? When I think that I knew each
> and every cycle that any instruction took on the C64 6502 ...

I was thinking the same thing about the 68HC11 :)

N