From: Nicholas N. <nj...@ca...> - 2004-06-02 13:56:44
On Wed, 2 Jun 2004, Josef Weidendorfer wrote:

> > 0x810120D7: rep stosl
> >
> > 17: CALLM_So
> > 18: MOVL $0x0, t12
> > 19: PUSHL t12
> > 20: CALLMo $0xC6 (-rD)
> > 21: POPL t12
> > 22: CALLM_Eo
> > 23: SHLL $0x2, t12
> > 24: GETL %ECX, t14
> > <insert I-cache access here>
> > 25: JIFZL t14, $0x810120D9
> > 26: DECL t14
> > 27: PUTL t14, %ECX
> > 28: GETL %EAX, t16
> > 29: GETL %EDI, t18
> > 30: STL t16, (t18)
> > 31: ADDL t12, t18
> > 32: PUTL t18, %EDI
> > <insert D-cache access here>
> > 33: JMPo $0x810120D7
> >
> >
> > I thought that putting the I-cache access before the JIFZ meant it would
> > only be done once, whereas the D-cache access would be done N times. I
> > now realise that is wrong; both will be done N times (an "N*I+N*D"
> > model). I can't see how the 1*I+N*D model can be done without making big
> > changes to the structure of basic blocks in the presence of REP prefixes.
>
> Isn't it actually "(N+1)*I+N*D" currently, i.e. always 1 instruction fetch
> more than data fetches?

Hmm, yes.

> To correct this, a way would be to have 2 basic blocks for 1 instruction: One
> with the instruction fetch, and 1 in the conditional loop with the data
> fetch. Am I correct here?
> As any instruction with a REP prefix has a size >1 byte, could we artificially
> introduce 2 basic blocks? In the example above this would be one instrumented
> block for 0x810120d7 (with the call to the instruction fetch), and one for
> 0x810120d8 (with the data fetch). The problem here is of course that one can
> not switch to the real processor at this point.
>
> So another idea: A flag to store if the instruction fetch was already done.
> Also quite difficult and error-prone: when to reset the flag?
>
> Another idea: Correct the error afterwards: subtract the number of data
> accesses from the number of instruction fetches...

I think fixing it by changing Valgrind's core (i.e. modifying BB layout) is
a bad idea -- I don't want to introduce nasty special cases into the core
just so Cachegrind can handle REP prefixes slightly more cleanly. Doing it
in Cachegrind is much preferable, but I still can't see how to do that
without it being a big pain.

> > In which case, maybe the N*I/N*D model is ok.
>
> Yes, perhaps that's the easiest: I don't think this JIFZ special case makes
> any big differences in the result anyway. Have you done any experiments
> regarding REP prefixes and the results from real hardware counters for
> "instructions retired"?

No.

> > Then there's one extra complication -- because the JIFZ can exit the basic
> > block, putting the instrumentation at the end means that the last
> > execution may not be simulated (this is also the case with the current
> > method). A more precise approach would be to put the instrumentation
> > before the JIFZ, although this would take effort. (A similar thing is
> > true for the jecxz instruction, which is translated using JIFZ.)
>
> I don't see this problem. When CX==0, there is nothing to do (jumping out of
> the basic block).

Oh yeah, you're right; in this case no data accesses are occurring.

I was wrong: the current JIFZ handling gives (N+1)*I + N*D. Removing the
special case would give N*I + N*D. So they are different. Does anyone
care?

(I found another inaccuracy in the simulation today -- CMPS does two data
accesses but Cachegrind only treats it as one. Again, only minor...)

Thanks.

N
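
To make the two counting models above concrete, here is a minimal Python sketch (not Valgrind code) that walks the UCode loop for `rep stosl` with ECX = N and counts the simulated accesses. The function name `simulate_rep_stos` and the flag `icache_before_jifz` are hypothetical. The second placement, with the I-cache access at the end of the block next to the D-cache access, is only an assumption about what "removing the special case" would amount to.

```python
def simulate_rep_stos(n, icache_before_jifz=True):
    """Count simulated (I-cache, D-cache) accesses for `rep stosl` with ECX = n.

    icache_before_jifz=True  : I-cache access placed before the JIFZ
                               (the placement shown in the UCode listing).
    icache_before_jifz=False : assumed alternative, I-cache access placed at
                               the end of the block with the D-cache access.
    """
    i_accesses = 0
    d_accesses = 0
    ecx = n
    while True:
        # Block entry at 0x810120D7; the JMPo at 33 comes back here.
        if icache_before_jifz:
            i_accesses += 1          # <insert I-cache access here>
        if ecx == 0:                 # 25: JIFZL exits the basic block
            break
        ecx -= 1                     # 26-27: DECL t14 / PUTL t14, %ECX
        d_accesses += 1              # 28-32 store; <insert D-cache access here>
        if not icache_before_jifz:
            i_accesses += 1          # assumed placement at the block end
    return i_accesses, d_accesses


if __name__ == "__main__":
    for n in (0, 1, 3):
        print(n, simulate_rep_stos(n, True), simulate_rep_stos(n, False))
```

For N = 3 the first placement yields 4 instruction fetches and 3 data accesses, i.e. (N+1)*I + N*D, while the assumed end-of-block placement yields 3 and 3, i.e. N*I + N*D. In both cases the final pass with ECX == 0 performs no data access.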