From: Sebastian K. <sk@z.pl> - 2003-07-28 21:29:08
On Monday, 28 July 2003 20:20, Josef Weidendorfer wrote:
> That seems to be true. But can't a good compiler hide cache miss
> latencies by prefetching or by using the result of a memory fetch
> e.g. 20 instructions later?

It's not that easy.

1. Prefetching works only for regular accesses. If access addresses depend on recently loaded data, then there is no way to prefetch. As various studies show, address prediction rates are very low -- 10-20% hits (compare this with jump prediction rates as high as 95-96%). Even proposed (but not yet implemented) techniques of speculatively launching a pre-computing thread, which can thus mimic/predict the actual application behaviour, allow for accuracy no higher than 40%.

2. An L2 miss on a 3 GHz P4 costs about 250 cycles, which is probably worth about 300 to 500 instructions. No out-of-order logic is capable of bypassing such a stall. (L1 misses, typically worth 19 cycles on the P4, have a high potential of being mostly (>50%) hidden by the scheduling logic.) Even with a cache hit rate of 99.8%, the CPU could still spend about 30% of its time doing nothing, waiting for memory accesses to finish. And this is not that bad -- the Alpha 21164, an in-order RISC CPU without any features to deal speculatively with misses (Itanium, or even Transmeta's Crusoe, while being in-order, do have mechanisms to issue reads speculatively), spent about 50% of its time waiting on memory -- but it suffered the full penalty on every miss, even an L1 miss, due to the lack of either OoO or speculation logic.

> In this case, your code already would be as fast as possible, but
> cachegrind results would suggest that there still are improvements
> possible.

The only known way to deal with the problem is to execute multiple threads at the same time -- the odds are that when one thread stalls, the others can still execute, thus increasing CPU utilisation and total computational throughput.
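To illustrate point 1, here is a minimal Python sketch (sizes and names are my own illustration, not from the thread) of the access pattern that defeats prefetching: a pointer chase, where each address is computed from the value just loaded, so the next address is unknown until the current load completes.

```python
import random

N = 1 << 16

# Build a single random cycle (Sattolo's algorithm): following
# next_slot[i] from 0 visits every slot exactly once before returning.
next_slot = list(range(N))
rng = random.Random(42)
for i in range(N - 1, 0, -1):
    j = rng.randrange(i)          # j in [0, i-1], never i itself
    next_slot[i], next_slot[j] = next_slot[j], next_slot[i]

def chase(table, start=0):
    """Dependent loads: step k+1's address comes from step k's value."""
    pos, steps = start, 0
    while True:
        pos = table[pos]          # only way to learn the next address
        steps += 1
        if pos == start:
            return steps

assert chase(next_slot) == N      # the cycle covers all N slots
```

On real hardware, chasing such a chain over a working set larger than L2 runs many times slower than summing the same array sequentially, precisely because the sequential addresses are predictable and the chased ones are not.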
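The 99.8% hit-rate claim in point 2 can be sanity-checked with back-of-the-envelope arithmetic; the per-instruction inputs below are my own illustrative assumptions, not measurements:

```python
# Back-of-the-envelope stall estimate with assumed (not measured) inputs.
miss_penalty = 250     # cycles per L2 miss (the 3 GHz P4 figure above)
hit_rate = 0.998       # cache hit rate
loads_per_instr = 0.3  # assumed fraction of instructions touching memory
base_cpi = 0.5         # assumed cycles per instruction with no stalls

stall_cpi = loads_per_instr * (1 - hit_rate) * miss_penalty
total_cpi = base_cpi + stall_cpi
stall_fraction = stall_cpi / total_cpi

print(f"cycles lost to misses: {stall_fraction:.0%}")
```

With these inputs the result is about 23% of all cycles lost to misses, the same ballpark as the ~30% figure above; small changes to the assumed load fraction or base CPI move it either way.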
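The multithreading argument can be sketched as a toy latency-hiding model (the work/stall numbers are purely illustrative): each thread alternates a burst of computation with a memory wait, and with enough threads another thread's work fills each stall.

```python
# Toy model of latency hiding via multithreading (illustrative only).
# Each thread repeats: `work` cycles of computation, then `stall`
# cycles waiting on memory. One thread leaves the core idle during
# its stalls; extra threads can execute during that idle time.

def core_utilisation(threads, work=10, stall=40):
    """Fraction of cycles doing useful work under ideal round-robin."""
    # Each work+stall period offers `work` useful cycles per thread;
    # the core can never exceed 100% of the period.
    period = work + stall
    return min(threads * work, period) / period

assert core_utilisation(1) == 0.2    # one thread: 80% of cycles stalled
assert core_utilisation(5) == 1.0    # five threads fully hide the stall
```

This is throughput, not single-thread speed: each individual thread still waits the full stall on every miss, which is exactly the limitation discussed next.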
There even exists(ed) one supercomputer system which didn't suffer from memory latency at all (while lacking any kind of cache or jump prediction logic). The solution was simple: the CPU kept the state of 128 threads, and each cycle an instruction from another thread was executed. That way, consecutive instructions from one thread were always separated by enough time to resolve any conditional jump and to have all the data from memory in place. A nice design, but completely infeasible for general-purpose computing (that machine was designed for very highly parallelisable tasks, and dealt with them wonderfully).

But there is currently no way to bypass memory latency at the level of a single thread (you can increase throughput, but you can't lower the execution time of individual threads). The only aid comes from actually reducing memory latency, and that's hard -- AMD recently did a good job on their K8 family CPUs, a latency improvement of about ~30% -- but that still means a cache miss will cost 200 to 350 instructions (instead of 300 to 500).

Oh, maybe someone will find a way to effectively implement a scheduler buffer big enough to accommodate 1000-2000 instructions and pair it with deep memory access reordering logic -- then it would help.

rgds
Sebastian