From: Heiko S. <hsc...@ft...> - 2003-01-09 16:24:46
Hi Jonathan,

On Thu, 9 Jan 2003, Jonathan Brown wrote:
> I found the same mistake in both sse_memcpy and mmx2_memcpy. They both
> presume that prefetchnta prefetches 64 bytes. In actual fact, the p3
> prefetches 32 bytes and the p4 prefetches 128 bytes. The patch optimizes
> it correctly for the p3. If you want to optimize for the p4 you should
> really use movdqa/movdqu.
>
> Please apply to the tree.
[...]

thanks for looking into this and sending a patch - i just tried it, but
couldn't really reproduce the improvements you measured. my cpuinfo says

processor      : 0
vendor_id      : GenuineIntel
cpu family     : 6
model          : 8
model name     : Pentium III (Coppermine)
stepping       : 10
cpu MHz        : 1005.050
cache size     : 256 KB
fdiv_bug       : no
hlt_bug        : no
f00f_bug       : no
coma_bug       : no
fpu            : yes
fpu_exception  : yes
cpuid level    : 2
wp             : yes
flags          : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips       : 2005.40

...and like i said - for me, the jitter between subsequently measuring the
same code a few times is bigger than the differences i see between the
unpatched and the patched memcpy routines.

maybe you or someone else can shed some more (experimental or facts-based)
light on this matter :)

cheers, thanks again,

Heiko
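
One way to put numbers on that jitter, as a rough sketch only: time each routine in several batches, keep the min/median/max per routine, and only believe a difference when the two ranges do not overlap. The names `time_copy`, `clearly_faster` and `struct timing` are made up for this illustration and are not part of the tree. It also assumes the routines under test (e.g. sse_memcpy, mmx2_memcpy) can be called through a plain `memcpy`-style function pointer, and it uses `clock()`, which reports CPU time at a fairly coarse resolution.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be plugged in here (assumption). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Summary of repeated batch timings for one routine, in seconds per batch. */
struct timing {
    double min;
    double median;
    double max;
};

/* Keeps the compiler from discarding the copies as dead stores. */
static volatile unsigned char sink;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Time `runs` batches of `iters` copies of `bytes` bytes each.
 * Returns 0 on success, -1 if the buffers could not be allocated. */
static int time_copy(copy_fn fn, size_t bytes, int iters, int runs,
                     struct timing *out)
{
    assert(bytes > 0 && iters > 0 && runs > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    double *t = malloc((size_t)runs * sizeof *t);
    if (!src || !dst || !t) {
        free(src);
        free(dst);
        free(t);
        return -1;
    }
    memset(src, 0xA5, bytes);
    fn(dst, src, bytes);                  /* warm-up, not timed */

    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        for (int i = 0; i < iters; i++) {
            src[0] = (unsigned char)i;    /* make each copy differ */
            fn(dst, src, bytes);
            sink = dst[bytes - 1];
        }
        t[r] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    qsort(t, (size_t)runs, sizeof *t, cmp_double);
    out->min = t[0];
    out->median = t[runs / 2];
    out->max = t[runs - 1];

    free(src);
    free(dst);
    free(t);
    return 0;
}

/* Conservative rule of thumb, not a statistical test: `a` only counts as
 * faster than `b` if its slowest batch still beats b's fastest batch. */
static int clearly_faster(const struct timing *a, const struct timing *b)
{
    return a->max < b->min;
}
```

Calling `time_copy()` once for the unpatched and once for the patched routine with the same `bytes`, `iters` and `runs`, then passing both results to `clearly_faster()`, gives a quick yes/no. If it says no in both directions, the difference is within the noise of this setup, which would match what i am seeing here.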