From: Sebastian K. <sk@z.pl> - 2003-07-28 21:29:08
On Monday, 28 July 2003 20:20, Josef Weidendorfer wrote:
> That seems to be true. But can't a good compiler hide cache miss
> latencies by prefetching or by using the result of a memory fetch
> e.g. 20 instructions later?

It's not that easy.

1. Prefetching works only for regular accesses. If access addresses depend on recently loaded data, then there is no way to prefetch. As various studies show, address prediction rates are very low -- 10-20% hits (compare this with jump prediction rates as high as 95-96%). Even proposed (but not yet implemented) techniques of speculatively launching a pre-computing thread, which can thus mimic/predict the actual application behaviour, allow for accuracy no higher than 40%.

2. An L2 miss on a 3 GHz P4 costs about 250 cycles, which is probably worth about 300 to 500 instructions. No out-of-order logic is capable of bypassing such a stall. (L1 misses, typically worth 19 cycles on the P4, have a high potential of being mostly (>50%) hidden by the scheduling logic.) Even with a cache hit rate of 99.8%, the CPU could still spend about 30% of its time doing nothing, waiting for memory accesses to finish. And this is not that bad -- the Alpha 21164, an in-order RISC CPU without any features to deal speculatively with misses (Itanium, or even Transmeta's Crusoe, while being in-order, do have mechanisms to issue reads speculatively), spent about 50% of its time waiting on memory -- but it suffered the full penalty on every miss, even an L1 miss, due to the lack of either OoO or speculation logic.

> In this case, your code already would be as fast as possible, but
> cachegrind results would suggest that there still are improvements
> possible.

The only known way to deal with the problem is to execute multiple threads at the same time -- the odds are that when one thread stalls, the others can still execute, thus increasing CPU utilisation and total computational throughput.
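To illustrate point 1, here is a minimal Python sketch (sizes and names are my own illustration, not from the thread) of the access pattern that defeats prefetching: a pointer chase, where each address is computed from the value just loaded, so the next address is unknown until the current load completes.

```python
import random

N = 1 << 16

# Build a single random cycle (Sattolo's algorithm): following
# next_slot[i] from 0 visits every slot exactly once before returning.
next_slot = list(range(N))
rng = random.Random(42)
for i in range(N - 1, 0, -1):
    j = rng.randrange(i)          # j in [0, i-1], never i itself
    next_slot[i], next_slot[j] = next_slot[j], next_slot[i]

def chase(table, start=0):
    """Dependent loads: step k+1's address comes from step k's value."""
    pos, steps = start, 0
    while True:
        pos = table[pos]          # only way to learn the next address
        steps += 1
        if pos == start:
            return steps

assert chase(next_slot) == N      # the cycle covers all N slots
```

On real hardware, chasing such a chain over a working set larger than L2 runs many times slower than summing the same array sequentially, precisely because the sequential addresses are predictable and the chased ones are not.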
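The 99.8% hit-rate claim in point 2 can be sanity-checked with back-of-the-envelope arithmetic; the per-instruction inputs below are my own illustrative assumptions, not measurements:

```python
# Back-of-the-envelope stall estimate with assumed (not measured) inputs.
miss_penalty = 250     # cycles per L2 miss (the 3 GHz P4 figure above)
hit_rate = 0.998       # cache hit rate
loads_per_instr = 0.3  # assumed fraction of instructions touching memory
base_cpi = 0.5         # assumed cycles per instruction with no stalls

stall_cpi = loads_per_instr * (1 - hit_rate) * miss_penalty
total_cpi = base_cpi + stall_cpi
stall_fraction = stall_cpi / total_cpi

print(f"cycles lost to misses: {stall_fraction:.0%}")
```

With these inputs the result is about 23% of all cycles lost to misses, the same ballpark as the ~30% figure above; small changes to the assumed load fraction or base CPI move it either way.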
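The multithreading argument can be sketched as a toy latency-hiding model (the work/stall numbers are purely illustrative): each thread alternates a burst of computation with a memory wait, and with enough threads another thread's work fills each stall.

```python
# Toy model of latency hiding via multithreading (illustrative only).
# Each thread repeats: `work` cycles of computation, then `stall`
# cycles waiting on memory. One thread leaves the core idle during
# its stalls; extra threads can execute during that idle time.

def core_utilisation(threads, work=10, stall=40):
    """Fraction of cycles doing useful work under ideal round-robin."""
    # Each work+stall period offers `work` useful cycles per thread;
    # the core can never exceed 100% of the period.
    period = work + stall
    return min(threads * work, period) / period

assert core_utilisation(1) == 0.2    # one thread: 80% of cycles stalled
assert core_utilisation(5) == 1.0    # five threads fully hide the stall
```

This is throughput, not single-thread speed: each individual thread still waits the full stall on every miss, which is exactly the limitation discussed next.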
There even exists(ed) one supercomputer system which didn't suffer from memory latency at all (while lacking any kind of cache or jump prediction logic). The solution was simple: the CPU kept the state of 128 threads, and each cycle an instruction from another thread was executed. That way, consecutive instructions from one thread were always separated by enough time to resolve any conditional jump and to have all the data from memory in place. A nice design, but completely infeasible for general-purpose computing (that machine was designed for very highly parallelisable tasks, and dealt with them wonderfully).

But there is currently no way to bypass memory latency at the level of a single thread (you can increase throughput, but you can't lower the execution time of individual threads). The only aid comes from actually reducing memory latency, and that's hard -- AMD recently did a good job on their K8 family CPUs, a latency improvement of about ~30% -- but that still means a cache miss will cost 200 to 350 instructions (instead of 300 to 500).

Oh, maybe someone will find a way to effectively implement a scheduler buffer big enough to accommodate 1000-2000 instructions and pair it with deep memory access reordering logic -- then it would help.

rgds
Sebastian