|
From: <sv...@va...> - 2005-12-03 14:27:47
|
Author: sewardj
Date: 2005-12-03 14:27:41 +0000 (Sat, 03 Dec 2005)
New Revision: 5275
Log:
Avoid potential partial-flags stall on P4.
Modified:
trunk/coregrind/m_dispatch/dispatch-x86-linux.S
Modified: trunk/coregrind/m_dispatch/dispatch-x86-linux.S
===================================================================
--- trunk/coregrind/m_dispatch/dispatch-x86-linux.S 2005-12-02 23:09:49 UTC (rev 5274)
+++ trunk/coregrind/m_dispatch/dispatch-x86-linux.S 2005-12-03 14:27:41 UTC (rev 5275)
@@ -101,7 +101,7 @@
jnz fast_lookup_failed
/* increment bb profile counter */
movl VG_(tt_fastN)(,%ebx,4), %edx
- incl (%edx)
+ addl $1, (%edx)

/* Found a match. Call tce[1], which is 8 bytes along, since
each tce element is a 64-bit int. */
|
|
From: Nicholas N. <nj...@cs...> - 2005-12-04 19:31:13
|
On Sat, 3 Dec 2005, sv...@va... wrote:

> Log:
> Avoid potential partial-flags stall on P4.
>
> - incl (%edx)
> + addl $1, (%edx)

I recall hearing that 'incl' is faster on P3s, but 'addl' is faster on P4s. (Actually, I couldn't remember which was which, so I'm guessing from what you've said.)

This seems like a good choice, since the survey found that around 40% of users clearly identified themselves as using P4s, as opposed to about 5% for P3s.

Nick
|
From: Julian S. <js...@ac...> - 2005-12-04 19:40:57
|
> This seems like a good choice since the survey found that around 40% of
> users clearly identified themselves as using P4s, as opposed to about 5%
> for P3s.

incl/decl are recommended don't-uses on P4s. Not that it made any measurable difference at all.

I did write a small program to measure the branch-mispredict cost on P4, and found it to be 21 cycles. I also established that P4 can only predict one branch target address for an indirect jump (alternating between 2 different ones is worse, and cycling through 4 gives you the full 21-cycle hit).

What this means is that each bb dispatch stalls for 21 cycles, which at an IPC of 0.8 is worth 16-ish insns. Bad.

J
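For anyone who wants to try a similar experiment, here is a rough sketch in C of the kind of test described above. It is not the program used for the measurement. It drives a single indirect call site through 1 to 4 different targets, and it repeats the stall-cost arithmetic (21 cycles at an IPC of 0.8). The names bb0..bb3, run_dispatch, seconds_per_call and stall_cost_insns are all made up for illustration. Timing uses clock(), so it reports seconds rather than cycles, and newer CPUs with better indirect predictors may show little or no difference between the cases.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for translated basic blocks. Each adds a
   distinct amount so the result depends on which targets were called. */
static void bb0(uint64_t *acc) { *acc += 1; }
static void bb1(uint64_t *acc) { *acc += 2; }
static void bb2(uint64_t *acc) { *acc += 3; }
static void bb3(uint64_t *acc) { *acc += 4; }

typedef void (*bb_fn)(uint64_t *);

/* Call through one indirect call site, cycling over the first
   `ntargets` (1..4) functions. Returns the accumulated counter. */
uint64_t run_dispatch(unsigned ntargets, uint64_t iters)
{
    static const bb_fn table[4] = { bb0, bb1, bb2, bb3 };
    /* volatile forces the target to be reloaded on every iteration,
       so the compiler cannot resolve the call statically. */
    bb_fn volatile target;
    uint64_t acc = 0;
    uint64_t i;

    assert(ntargets >= 1 && ntargets <= 4);
    for (i = 0; i < iters; i++) {
        target = table[i % ntargets];
        target(&acc);
    }
    return acc;
}

/* Average CPU time per dispatched call, in seconds. This includes loop
   overhead, so only differences between ntargets values are meaningful. */
double seconds_per_call(unsigned ntargets, uint64_t iters)
{
    clock_t t0, t1;
    volatile uint64_t sink;

    if (iters == 0)
        return 0.0;
    t0 = clock();
    sink = run_dispatch(ntargets, iters);
    t1 = clock();
    (void)sink;
    return ((double)(t1 - t0) / CLOCKS_PER_SEC) / (double)iters;
}

/* Instructions' worth of work lost to a stall: cycles times IPC. */
double stall_cost_insns(double stall_cycles, double ipc)
{
    return stall_cycles * ipc;
}
```

Comparing seconds_per_call(1, n) against seconds_per_call(2, n) and seconds_per_call(4, n) for some large n gives the relative cost of the extra targets. stall_cost_insns(21.0, 0.8) works out to 16.8, which is the "16-ish insns" figure.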