From: Martin C. M. <mm...@it...> - 2008-03-11 14:16:25
|
Hi,

I'd like to use the x86 instruction RDTSC for timing some SBCL code. The instruction returns a 64-bit value in registers EDX:EAX that represents the count of ticks since processor reset. Right now I call a C++ version through the FFI, but I'd like to inline it. It should be a simple VOP, but I can't figure out the syntax. Here's the C version:

    #include <stdint.h>

    __inline__ uint64_t rdtsc()
    {
        uint32_t lo, hi;
        /* We cannot use "=A", since this would use %rax on x86_64 */
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
    }

If someone can provide the code, I'll be happy to add it to the Wikipedia entry, which already has C++, D, Pascal and FreeBASIC versions.

Thanks,
Martin |
From: Christophe R. <cs...@ca...> - 2008-03-11 15:00:22
|
"Martin C. Martin" <mm...@it...> writes: > I'd like to use the x86 instruction RDTSC for timing some sbcl code. > The instruction returns a 64-bit value in registers EDX:EAX that > represents the count of ticks from processor reset. Right now I call a > C++ version through the FFI, but I'd like to inline it. It should be a > simple VOP, but I can't figure out the syntax. Here's the C version: As it happens, Paul Khuong pasted a sketch to the IRC pastebot a few days ago, at <http://paste.lisp.org/display/56885>. That may give you some ideas. Best, Christophe |
From: Vitaly M. <v.m...@gm...> - 2008-03-11 16:17:11
|
Christophe Rhodes <cs...@ca...> writes:

> "Martin C. Martin" <mm...@it...> writes:
>
>> I'd like to use the x86 instruction RDTSC for timing some sbcl code.
>> The instruction returns a 64-bit value in registers EDX:EAX that
>> represents the count of ticks from processor reset. Right now I call a
>> C++ version through the FFI, but I'd like to inline it. It should be a
>> simple VOP, but I can't figure out the syntax. Here's the C version:
>
> As it happens, Paul Khuong pasted a sketch to the IRC pastebot a few
> days ago, at <http://paste.lisp.org/display/56885>. That may give you
> some ideas.

What about x86-64 and rdx:rax?

--
wbr, Vitaly |
From: Martin C. M. <mm...@it...> - 2008-03-11 18:24:45
|
Christophe Rhodes wrote:
> "Martin C. Martin" <mm...@it...> writes:
>
>> I'd like to use the x86 instruction RDTSC for timing some sbcl code.
>> The instruction returns a 64-bit value in registers EDX:EAX that
>> represents the count of ticks from processor reset. Right now I call a
>> C++ version through the FFI, but I'd like to inline it. It should be a
>> simple VOP, but I can't figure out the syntax. Here's the C version:
>
> As it happens, Paul Khuong pasted a sketch to the IRC pastebot a few
> days ago, at <http://paste.lisp.org/display/56885>. That may give you
> some ideas.

Great, thanks! I'm worried about the use of unsigned-reg and unsigned-num, as in:

    (define-vop (sch::rdtscx)
      (:policy :fast-safe)
      (:translate sch::rdtscx)
      (:temporary (:sc unsigned-reg :offset eax-offset :target lo) eax)
      (:temporary (:sc unsigned-reg :offset edx-offset :target hi) edx)
      (:results (hi :scs (unsigned-reg))
                (lo :scs (unsigned-reg)))
      (:result-types unsigned-num unsigned-num)
      (:generator 5
        (inst sch::rdtscx)
        (move hi edx)
        (move lo eax)))

Will this do the right thing on x86-64? RDTSC only uses the lower 32 bits of each register, i.e. EDX and EAX, not RDX and RAX. Is the code above assuming a 32-bit machine? In particular, is the :results section correct for 64-bit?

Best,
Martin

> Best,
>
> Christophe |
From: Paul K. <pk...@gm...> - 2008-03-11 19:10:19
|
On 3/11/08, Vitaly Mayatskikh <v.m...@gm...> wrote: > Christophe Rhodes <cs...@ca...> writes: > > "Martin C. Martin" <mm...@it...> writes: > >> I'd like to use the x86 instruction RDTSC for timing some sbcl code. > >> The instruction returns a 64-bit value in registers EDX:EAX that > >> represents the count of ticks from processor reset. [...] > > As it happens, Paul Khuong pasted a sketch to the IRC pastebot a few > > days ago, at <http://paste.lisp.org/display/56885>. That may give you > > some ideas. > Does about x86-64 and rdx:rax? That code was only tested on x86-64. Remember to use some sort of barrier instruction like CPUID before and after measuring. Paul Khuong |
From: Paul K. <pk...@gm...> - 2008-03-12 17:02:50
|
On Tue, Mar 11, 2008 at 2:24 PM, Martin C. Martin <mm...@it...> wrote:
> Christophe Rhodes wrote:
> > "Martin C. Martin" <mm...@it...> writes:
> >
> >> I'd like to use the x86 instruction RDTSC for timing some sbcl code.
> [...]
> >> It should be a
> >> simple VOP, but I can't figure out the syntax. Here's the C version:
> >
> > As it happens, Paul Khuong pasted a sketch to the IRC pastebot a few
> > days ago, at <http://paste.lisp.org/display/56885>. That may give you
> > some ideas.
>
> Great, thanks! I'm worried about the use of unsigned-reg and
> unsigned-num, as in:
>
> (define-vop (sch::rdtscx)
>   (:policy :fast-safe)
>   (:translate sch::rdtscx)
>   (:temporary (:sc unsigned-reg :offset eax-offset
>                :target lo) eax)
>   (:temporary (:sc unsigned-reg :offset edx-offset
>                :target hi) edx)
>   (:results (hi :scs (unsigned-reg))
>             (lo :scs (unsigned-reg)))
>   (:result-types unsigned-num unsigned-num)
>   (:generator 5
>     (inst sch::rdtscx)
>     (move hi edx)
>     (move lo eax)))
>
> Will this do the right thing on x86-64? RDTSC only uses the lower 32
> bits of each register, i.e. EDX and EAX, not RDX and RAX. Is the code
> above assuming a 32 bit machine? In particular, is the :results section
> correct for 64 bit?

Quoting Intel's Instruction Set Reference: "In 64-bit mode, RDTSC behavior is unchanged from 32-bit mode. The upper 32 bits of RAX and RDX are cleared."

Remember that, in SBCL, EAX names RAX, to simplify code reuse. As I wrote earlier, I only ever tested that code on x86-64.

Paul Khuong |
From: Nikodemus S. <nik...@ra...> - 2008-03-12 20:43:55
|
The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is meant for exporting from SB-SYS, maybe, if it seems that we can implement something similar for other platforms as well. (I would assume so, but I haven't checked.)

The patch is a marriage of sorts between Paul's code and the CMUCL code. Or maybe I should say Paul's code and CMUCL comments. :)

    (defmacro with-cycle-counter (&body body)
      "Returns the primary value of BODY as the primary value, and the
    number of CPU cycles elapsed as secondary value."
      (with-unique-names (hi0 hi1 lo0 lo1)
        `(multiple-value-bind (,hi0 ,lo0) (%read-cycle-counter)
           (values (locally ,@body)
                   (multiple-value-bind (,hi1 ,lo1) (%read-cycle-counter)
                     (+ (ash (- ,hi1 ,hi0) 32)
                        (- ,lo1 ,lo0)))))))

Unless there are objections I'll merge it in this unexported state for now.

Cheers,

 -- Nikodemus |
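To make the arithmetic in the macro body concrete, here is a minimal sketch in portable Common Lisp. CYCLE-DELTA is a hypothetical helper, not part of the patch; it only reproduces the (+ (ash ...) (- ...)) expression so it can be tried without the VOP. It shows that a negative low-word difference is harmless, because the shifted high-word difference supplies the borrow.

```lisp
;; Hypothetical helper, not part of the patch: combines two (hi, lo)
;; counter readings the same way the macro body does.
(defun cycle-delta (hi1 lo1 hi0 lo0)
  (declare (type (unsigned-byte 32) hi1 lo1 hi0 lo0))
  ;; (- lo1 lo0) may be negative; the shifted high-word difference
  ;; makes up for it, so the sum is the true 64-bit difference.
  (+ (ash (- hi1 hi0) 32)
     (- lo1 lo0)))

;; Reading 0 is #x00000000FFFFFFFF, reading 1 is #x0000000100000005.
(cycle-delta 1 5 0 #xFFFFFFFF) ; => 6
```

The low words alone differ by a large negative number here, yet the combined result is the expected 6 ticks.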
From: Martin C. M. <mm...@it...> - 2008-03-12 21:29:52
|
Should @body be wrapped in an unwind-protect? Best, Martin Nikodemus Siivola wrote: > The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and > SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is > ment for exporting from SB-SYS, maybe, if it seems that we can > implement something similar for other platforms as well. (I would > assume so, but I haven't checked.) > > The patch is a marriage of sorts between Paul's code and the CMUCL > code. Or maybe I should say Paul's code and CMUCL comments. :) > > (defmacro with-cycle-counter (&body body) > "Returns the primary value of BODY as the primary value, and the > number of CPU cycles elapsed as secondary value." > (with-unique-names (hi0 hi1 lo0 lo1) > `(multiple-value-bind (,hi0 ,lo0) (%read-cycle-counter) > (values (locally ,@body) > (multiple-value-bind (,hi1 ,lo1) (%read-cycle-counter) > (+ (ash (- ,hi1 ,hi0) 32) > (- ,lo1 ,lo0))))))) > > Unless there are objections I'll merge it in this unexported state for now. > > Cheers, > > -- Nikodemus |
From: Martin C. M. <mm...@it...> - 2008-03-12 22:14:13
|
Juho Snellman wrote:
> "Martin C. Martin" <mm...@it...> writes:
>> Should @body be wrapped in an unwind-protect?
>
> Can't see why, given an interface like this where the cycle count is
> reported as a return value. If there was a non-local exit, we
> obviously can't return the value, because we need to finish the nlx.

Why do we need to finish the nlx? I think the main use of this code is while performance tuning, and if the non-local exit is occasionally the way it's supposed to exit, we'd want to add the times for those too. Basically, debugging should happen before performance tuning, and if we get an nlx during performance tuning, odds are it's expected.

> While if body does a local exit, the uwp would've gained nothing over
> this version, but execution would be slower, and the implementation
> clunkier.
>
> The uwp could be useful if the definition was something like:
>
> (defmacro with-cycle-counter ((fun) &body body)
>   "Calls FUN with the number of CPU cycles elapsed during the execution
> of body."
>   ...)
>
> But I'm not convinced that interface is much better, and people who
> care about reading the cycle counter directly probably also care about
> the relatively large uwp overhead.

Very true, and your with-cycle-counter is similar to the interface we use.

>> Nikodemus Siivola wrote:
>>> The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and
>>> SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is
>>> ment for exporting from SB-SYS, maybe, if it seems that we can
>>> implement something similar for other platforms as well. (I would
>>> assume so, but I haven't checked.)
>>>
>>> The patch is a marriage of sorts between Paul's code and the CMUCL
>>> code. Or maybe I should say Paul's code and CMUCL comments. :)
>>>
>>> (defmacro with-cycle-counter (&body body)
>>>   "Returns the primary value of BODY as the primary value, and the
>>> number of CPU cycles elapsed as secondary value."
>>>   (with-unique-names (hi0 hi1 lo0 lo1)
>>>     `(multiple-value-bind (,hi0 ,lo0) (%read-cycle-counter)
>>>        (values (locally ,@body)
>>>                (multiple-value-bind (,hi1 ,lo1) (%read-cycle-counter)
>>>                  (+ (ash (- ,hi1 ,hi0) 32)
>>>                     (- ,lo1 ,lo0)))))))
>>>
>>> Unless there are objections I'll merge it in this unexported state for now.
>>>
>>> Cheers,
>>>
>>> -- Nikodemus |
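The callback-style interface quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: WITH-TICK-CALLBACK, READ-TICKS, *SAMPLES* and PROBE are hypothetical names, and GET-INTERNAL-REAL-TIME stands in for the hardware cycle counter so that the code runs on any Common Lisp. The point is that UNWIND-PROTECT lets the elapsed count be reported even when the body exits non-locally, while the exit itself still completes.

```lisp
;; Stand-in for the hardware counter (an assumption for portability);
;; on SBCL/x86 this is where the RDTSC-based reader would go.
(defun read-ticks ()
  (get-internal-real-time))

;; Hypothetical macro following the ((fun) &body body) lambda list
;; suggested in the thread.
(defmacro with-tick-callback ((fun) &body body)
  "Calls FUN with the ticks elapsed during BODY, even if BODY exits non-locally."
  (let ((start (gensym "START")))
    `(let ((,start (read-ticks)))
       (unwind-protect
            (progn ,@body)
         ;; The cleanup form runs on normal return and on nlx alike.
         (funcall ,fun (- (read-ticks) ,start))))))

(defvar *samples* '())

(defun probe (exit-early)
  (with-tick-callback ((lambda (ticks) (push ticks *samples*)))
    (when exit-early
      (return-from probe :early))   ; non-local exit out of the body
    :normal))
```

Calling (probe t) and (probe nil) each push one sample onto *SAMPLES*, so the early exit is accounted for. The cost is the UNWIND-PROTECT overhead mentioned in the thread, which is itself part of what gets measured.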
From: Juho S. <js...@ik...> - 2008-03-12 22:41:41
|
"Martin C. Martin" <mm...@it...> writes:

> Juho Snellman wrote:
> > "Martin C. Martin" <mm...@it...> writes:
> >> Should @body be wrapped in an unwind-protect?
> > Can't see why, given an interface like this where the cycle count is
> > reported as a return value. If there was a non-local exit, we
> > obviously can't return the value, because we need to finish the nlx.
>
> Why do we need to finish the nlx? I think the main use of this code
> is while performance tuning, and if the non-local exit is occasionally
> the way its supposed to exit, we'd want to add the times for those too.

Right, it's expected. So we have to finish it, or the program will be doing something different than before the WITH-CYCLE-COUNTER was added.

    (defun foo ()
      (loop
        (incf *total*
              (nth-value 1 (with-cycle-counter
                             (if something
                                 (return-from foo)
                                 (do-something-else)))))))

So "given an interface like this where the cycle count is reported as a return value" there are two options. Either we prevent the nlx in the uwp of WITH-CYCLE-COUNTER, in which case this program will loop infinitely. Or we finish it by returning from FOO, in which case we obviously will not be returning a value from WITH-CYCLE-COUNTER, and that iteration of the loop will be unaccounted, and the uwp was useless.

--
Juho Snellman |
From: Nikodemus S. <nik...@ra...> - 2008-03-13 10:03:44
|
On 3/13/08, Paul Khuong <pk...@gm...> wrote: > Paul Khuong, not Paul K. Huong (the middle initial is a V if you > insist ;). Oops, sorry! > On a more technical note, I am not sure that inserting a > CPUID only before reading the cycle counter is enough. In theory, the > first RDTSC instruction could be moved further and further back in the > pipeline until its results are used. Right. Updated patch attached. > As for other architectures, the %tick register is user-readable on > Solaris 8+/USparc; I'm not sure about its status on linux/USparc, but > CMUCL sources should know: > 2003-02-20 > Experimental support for hardware cycle counters has been added for > x86 and UltraSPARC platforms. This is based on the RDTSC instruction > on Pentium and better processors, and on reading the %TICK register on > UltraSPARC. > > Alpha has the RPCC instruction, for two 32 bit values packed in a > single 64 bit GPR (the higher value is OS dependent, the lower one cpu > dependent). > > Finally, it seems like PPC has mftb (djb's pages seem useful here > <http://cr.yp.to/hardware/ppc.html>). That's good news. > FFTW's "cycle.h" (<www.fftw.org/cycle.h>) has support for such > instructions on all platforms SBCL supports; unfortunately, it's GPL. > We'd have to ask a lawyer about the interactions ;) GPL would taint SBCL, but cycle.h seems to be under 1-clause MIT, so cribbing from there is not problematic. Cheers, -- Nikodemus |
From: Martin C. M. <mm...@it...> - 2008-03-15 13:50:58
|
One small suggestion: while the rdtsc values can be big (e.g. if the computer hasn't been reset in years), the difference should be small, small enough to fit in a fixnum. So the part that reads:

    (ash (- hi1 hi0) 32)

could be changed to:

    (the fixnum (ash (- hi1 hi0) 32))

Similarly after the difference of the lo's is added.

Or not: this isn't inside what you're trying to measure, so it's not the performance-critical part.

Best,
Martin

Nikodemus Siivola wrote:
> On Thu, Mar 13, 2008 at 12:03 PM, Nikodemus Siivola
> <nik...@ra...> wrote:
>
>> Right. Updated patch attached.
>
> I've committed this version as 1.0.15.33.
>
> Cheers,
>
> -- Nikodemus |
From: Paul K. <pk...@gm...> - 2008-03-15 16:17:07
|
On 3/15/08, Martin C. Martin <mm...@it...> wrote: > One small suggestion: while the rdtsc values can be big (e.g. if the > computer hasn't been reset in years), the difference should be small, > small enough to fit in a fixnum. So the part that reads: > > (ash (- hi1 hi0) 32) > > Could be changed to: > > (the fixnum (ash (- hi1 hi0) 32)) > > Similarly after the difference of the lo's is added. > > Or not, this isn't inside what you're trying to measure, so its not the > performance critical part. No need to lie to the compiler. The hi/lo values are declared (unsigned-byte 32), so the subtractions are already done inline. What should be declared is that the result of the hi/hi subtraction is (unsigned-byte 32) too. After that, everything will be done inline (with tests), and the slow bignum-consing path will never be taken. That's on a 64 bit architecture, obviously. In any case, if that matters, it's very probably a case of You're Doing It Wrong: the logging overhead will introduce as much or even more noise anyway. Paul Khuong |
From: Paul K. <pk...@gm...> - 2008-03-16 16:01:23
|
On 3/15/08, Paul Khuong <pk...@gm...> wrote:
> On 3/15/08, Martin C. Martin <mm...@it...> wrote:
> > One small suggestion: while the rdtsc values can be big (e.g. if the
> > computer hasn't been reset in years), the difference should be small,
> > small enough to fit in a fixnum. So the part that reads:
> [...]
> No need to lie to the compiler. The hi/lo values are declared
> (unsigned-byte 32), so the subtractions are already done inline. What
> should be declared is that the result of the hi/hi subtraction is
> (unsigned-byte 32) too.

Except that's actually not true. If you switch to another processor between measurements, the difference could be arbitrarily negative (admittedly, it's unlikely to be lower than -1).

I believe most OSes have a processor affinity interface. Linux has sched_getaffinity/sched_setaffinity. OTOH, OS X seems to only offer the ability to disable cores via CHUD (that framework used to have an undocumented affinity function). But in the end, it's just another source of noise, and doing multiple runs will smooth that all out.

As for number consing, if it ends up being a real problem, masking the higher bits of the high words is always an option.

Paul Khuong |
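One possible reading of "masking the higher bits of the high words" is sketched below in portable Common Lisp. This is an assumption about what was meant, not code from the patch: MASKED-CYCLE-DELTA and +HI-BITS+ are hypothetical names, and the width of 24 bits is an arbitrary choice that keeps the result well inside the fixnum range of a 64-bit SBCL.

```lisp
;; Hypothetical: number of low bits kept from each high word.
(defconstant +hi-bits+ 24)

(defun masked-cycle-delta (hi1 lo1 hi0 lo0)
  (declare (type (unsigned-byte 32) hi1 lo1 hi0 lo0))
  ;; Keep only the low +HI-BITS+ bits of each high word, so the
  ;; high-word difference lies strictly between -2^24 and 2^24.
  (let ((h1 (ldb (byte +hi-bits+ 0) hi1))
        (h0 (ldb (byte +hi-bits+ 0) hi0)))
    ;; Same combination as in the macro; the magnitude of the result
    ;; stays below 2^56 + 2^32, so no bignum is ever needed.
    (+ (ash (- h1 h0) 32)
       (- lo1 lo0))))
```

The trade-off is that the result is only meaningful while the true difference fits in the kept bits. A counter that steps backwards, as in the processor-switch case described above, still produces a negative value, and a wrap of the masked high bits between readings produces a bogus one.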
From: Martin C. M. <mm...@it...> - 2008-03-16 19:55:11
|
Paul Khuong wrote: > On 3/15/08, Paul Khuong <pk...@gm...> wrote: >> On 3/15/08, Martin C. Martin <mm...@it...> wrote: >> > One small suggestion: while the rdtsc values can be big (e.g. if the >> > computer hasn't been reset in years), the difference should be small, >> > small enough to fit in a fixnum. So the part that reads: > [...] >> No need to lie to the compiler. The hi/lo values are declared >> (unsigned-byte 32), so the subtractions are already done inline. What >> should be declared is that the result of the hi/hi subtraction is >> (unsigned-byte 32) too. > > Except that's actually not true. If you switch to another processor > between measurements, the difference could be arbitrarily negative > (admittedly, it's unlikely to be lower than -1). Not with modern CPUs, RDTSC returns the number of FSB cycles times the nominal CPU clock multiplier. So you get reliable values even if the process has been moved between CPUs. This is true of the Woodcrest CPUs our company bought last year. I don't know about other CPUs though, so what you say may still be true on CPUs that are common in the field. Or maybe not. > I believe most OSes > have a processor affinity interface. Linux has > sched_getaffinity/sched_setaffinity. OTOH, OS X seems to only offer > the ability to disable cores via CHUD (that framework used to have an > undocumentated affinity function). But in the end, it's just another > source of noise, and doing multiple runs will smooth that all out. > > As for number consing, if it ends up being a real problem, masking the > higher bits of the high words is always an option. > > Paul Khuong Best, Martin |
From: Eric M. <eri...@fr...> - 2008-03-11 23:12:57
|
>>>>> "mcm" == Martin C Martin <mm...@it...> writes: mcm> I'd like to use the x86 instruction RDTSC for timing some sbcl code. mcm> The instruction returns a 64-bit value in registers EDX:EAX that mcm> represents the count of ticks from processor reset. CMUCL has had support for this for a while (on x86); see http://trac.common-lisp.net/cmucl/browser/trunk/src/compiler/x86/system.lisp http://trac.common-lisp.net/cmucl/browser/trunk/src/compiler/x86/insts.lisp -- Eric Marsden |
From: Juho S. <js...@ik...> - 2008-03-12 21:57:42
|
"Martin C. Martin" <mm...@it...> writes:

> Should @body be wrapped in an unwind-protect?

Can't see why, given an interface like this where the cycle count is reported as a return value. If there was a non-local exit, we obviously can't return the value, because we need to finish the nlx. While if body does a local exit, the uwp would've gained nothing over this version, but execution would be slower, and the implementation clunkier.

The uwp could be useful if the definition was something like:

    (defmacro with-cycle-counter ((fun) &body body)
      "Calls FUN with the number of CPU cycles elapsed during the execution
    of body."
      ...)

But I'm not convinced that interface is much better, and people who care about reading the cycle counter directly probably also care about the relatively large uwp overhead.

> Nikodemus Siivola wrote:
> > The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and
> > SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is
> > ment for exporting from SB-SYS, maybe, if it seems that we can
> > implement something similar for other platforms as well. (I would
> > assume so, but I haven't checked.)
> >
> > The patch is a marriage of sorts between Paul's code and the CMUCL
> > code. Or maybe I should say Paul's code and CMUCL comments. :)
> >
> > (defmacro with-cycle-counter (&body body)
> >   "Returns the primary value of BODY as the primary value, and the
> > number of CPU cycles elapsed as secondary value."
> >   (with-unique-names (hi0 hi1 lo0 lo1)
> >     `(multiple-value-bind (,hi0 ,lo0) (%read-cycle-counter)
> >        (values (locally ,@body)
> >                (multiple-value-bind (,hi1 ,lo1) (%read-cycle-counter)
> >                  (+ (ash (- ,hi1 ,hi0) 32)
> >                     (- ,lo1 ,lo0)))))))
> >
> > Unless there are objections I'll merge it in this unexported state for now.
> >
> > Cheers,
> >
> > -- Nikodemus

--
Juho Snellman |
From: Paul K. <pk...@gm...> - 2008-03-13 04:00:35
|
On 3/12/08, Nikodemus Siivola <nik...@ra...> wrote: > The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and > SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is > ment for exporting from SB-SYS, maybe, if it seems that we can > implement something similar for other platforms as well. (I would > assume so, but I haven't checked.) > > The patch is a marriage of sorts between Paul's code and the CMUCL > code. Or maybe I should say Paul's code and CMUCL comments. :) Paul Khuong, not Paul K. Huong (the middle initial is a V if you insist ;). On a more technical note, I am not sure that inserting a CPUID only before reading the cycle counter is enough. In theory, the first RDTSC instruction could be moved further and further back in the pipeline until its results are used. As for other architectures, the %tick register is user-readable on Solaris 8+/USparc; I'm not sure about its status on linux/USparc, but CMUCL sources should know: 2003-02-20 Experimental support for hardware cycle counters has been added for x86 and UltraSPARC platforms. This is based on the RDTSC instruction on Pentium and better processors, and on reading the %TICK register on UltraSPARC. Alpha has the RPCC instruction, for two 32 bit values packed in a single 64 bit GPR (the higher value is OS dependent, the lower one cpu dependent). Finally, it seems like PPC has mftb (djb's pages seem useful here <http://cr.yp.to/hardware/ppc.html>). FFTW's "cycle.h" (<www.fftw.org/cycle.h>) has support for such instructions on all platforms SBCL supports; unfortunately, it's GPL. We'd have to ask a lawyer about the interactions ;) Paul Khuong |
From: Raymond T. (RT/EUS) <ray...@er...> - 2008-03-13 12:35:08
|
Paul Khuong wrote: > On 3/12/08, Nikodemus Siivola <nik...@ra...> wrote: >> The attached patch implements SB-VM::WITH-CYCLE-COUNTER (and >> SB-VM::%READ-CYCLE-COUNTER) for x86 and x86-64. WITH-CYCLE-COUNTER is >> ment for exporting from SB-SYS, maybe, if it seems that we can >> implement something similar for other platforms as well. (I would >> assume so, but I haven't checked.) >> >> The patch is a marriage of sorts between Paul's code and the CMUCL >> code. Or maybe I should say Paul's code and CMUCL comments. :) > > Paul Khuong, not Paul K. Huong (the middle initial is a V if you > insist ;). On a more technical note, I am not sure that inserting a > CPUID only before reading the cycle counter is enough. In theory, the > first RDTSC instruction could be moved further and further back in the > pipeline until its results are used. > > As for other architectures, the %tick register is user-readable on > Solaris 8+/USparc; I'm not sure about its status on linux/USparc, but > CMUCL sources should know: > 2003-02-20 > Experimental support for hardware cycle counters has been added for > x86 and UltraSPARC platforms. This is based on the RDTSC instruction > on Pentium and better processors, and on reading the %TICK register on > UltraSPARC. Yes, CMUCL reads the %tick register on Ultrasparcs. It also reads the timebase register on ppcs. (But the relationship between the timebase register and the actual CPU clock is "undefined". You have to do some OS stuff to figure that out.) Ray |
From: Martin C. M. <mm...@it...> - 2008-03-13 13:17:28
|
Paul Khuong wrote:
> On 3/12/08, Nikodemus Siivola <nik...@ra...> wrote:
> On a more technical note, I am not sure that inserting a
> CPUID only before reading the cycle counter is enough. In theory, the
> first RDTSC instruction could be moved further and further back in the
> pipeline until its results are used.

I think that RDTSC is such a fast instruction that it would never be delayed, but I'm just guessing. However, every description I can find, including this one from Intel

http://cs.smu.ca/~jamuir/rdtscpm1.pdf

only puts the cpuid instruction *before* the rdtsc.

Best,
Martin |
From: Nikodemus S. <nik...@ra...> - 2008-03-14 20:11:26
|
On Thu, Mar 13, 2008 at 12:03 PM, Nikodemus Siivola <nik...@ra...> wrote: > Right. Updated patch attached. I've committed this version as 1.0.15.33. Cheers, -- Nikodemus |