On Wednesday, 29 January 2003 00:05, Ian Romanick wrote:
> Felix Kühling wrote:
> > On Tue, 28 Jan 2003 13:10:41 -0800
> > Ian Romanick <idr@...> wrote:
> >>Felix Kühling wrote:
> >>>The patch moves the load operations back to the front of the loop as in
> >>>the G3TN_norm_w_lengths case.
> >>Good catch. It looks like this went into the Mesa tree back in October
> >>of 2001...over a year ago! It looks like Andres Lewycky gave Brian some
> >>bad patches. :(
> > Yeah, but until November 2002 (DRI trunk) there was a comment in 3dnow.c
> > that the 3dnow-normal code is broken and it was not used.
> >>I realize that AMD recommends reading memory backwards, but would a
> >>quick-fix be to just use the 3Dnow! prefetch instructions?
"Block Prefetch", page 18, see below.
> > The prefetch instructions used are and must be 3DNow instructions. On
> > Intel, prefetch was introduced with the SSE extension on the Pentium III.
> > The SSE prefetch instructions are not available on older Athlons and K6s.
It all depends on steppings...
Some output from MPlayer, the best-optimized OSS app I know:
CPU: Advanced Micro Devices Athlon 4 PM Palomino/Athlon MP
Multiprocessor/Athlon XP eXtreme Performance (Family: 6, Stepping: 2)
Detected cache-line size is 64 bytes
CPUflags: MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 1 SSE2: 0
Compiled for x86 CPU with the following extensions: MMX MMX2 3DNow 3DNowE=
> > Anyway, all that
> > prefetching looks odd to me. In the first transform loop in
> > _mesa_3dnow_transform_normalize_normals memory is prefetched which is
> > never read but only written. This is obviously useless. Then in the
> > normalize loop the memory which was written before is prefetched again.
> > I think this is not necessary. The array is small enough to be still in
> > the cache.
> I believe that prefetchw tells the processor to warm up the cache line
> because it's going to be written soon. I think the prefetching in the
> first loop is probably correct. The prefetchw of (%eax) might need to
> be before the add. I'd have to benchmark it. I'm not sure if I have a
> 3dnow capable box around anymore. If I do, it will be an old K6-III. :)
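To make the read/write distinction concrete, here is a minimal C sketch. It is not the Mesa assembly. It uses the GCC/Clang builtin `__builtin_prefetch`, whose second argument selects a prefetch for reading (0) or for writing (1); with 3DNow! enabled a compiler may turn the write hint into prefetchw, but that depends on compiler and flags. The function name, the flat array layout and the look-ahead distance of 16 normals are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Mesa function): multiply n normals,
 * stored as 3*n consecutive floats, by a common scale factor. */
void rescale_normals_prefetch(const float *in, float *out, size_t n, float scale)
{
    /* Assumed look-ahead: 16 normals * 12 bytes = 192 bytes,
     * i.e. three 64-byte cache lines. Tune by benchmarking. */
    enum { AHEAD = 16 };

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            /* Source data: hint that it will be read soon. */
            __builtin_prefetch(&in[3 * (i + AHEAD)], 0, 3);
            /* Destination data: hint that it will be written soon. */
            __builtin_prefetch(&out[3 * (i + AHEAD)], 1, 3);
        }
        out[3 * i + 0] = in[3 * i + 0] * scale;
        out[3 * i + 1] = in[3 * i + 1] * scale;
        out[3 * i + 2] = in[3 * i + 2] * scale;
    }
}
```

This issues a hint on every iteration, which is more often than needed (one per cache line would do); whether any of it pays off for arrays that already sit in the cache is exactly what would have to be benchmarked.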
> > I'll see if I can clean this up a bit. On the mesa-4-0-4 branch this
> > code is disabled anyway, so there is not really a hurry to apply my
> > stupid little patch. About this reading backward thing, where is that
> > documented? I have an AMD Athlon optimization guide from February 2002
> > which doesn't mention it.
> I've seen a reference posted to dri-devel a couple times.
All from me;-)
> Here are a couple of references that Dieter posted on 09-Jan-2003:
And here are some numbers:
clear_page by 'normal_clear_page' took 12757 cycles (489.9 MB/s)
clear_page by 'slow_zero_page' took 12478 cycles (500.9 MB/s)
clear_page by 'fast_clear_page' took 9684 cycles (645.4 MB/s)
clear_page by 'faster_clear_page' took 4257 cycles (1468.0 MB/s)
copy_page by 'normal_copy_page' took 9063 cycles (689.6 MB/s)
copy_page by 'slow_copy_page' took 9051 cycles (690.5 MB/s)
copy_page by 'fast_copy_page' took 8125 cycles (769.3 MB/s)
copy_page by 'faster_copy' took 5468 cycles (1143.0 MB/s)
copy_page by 'even_faster' took 5538 cycles (1128.5 MB/s)
copy_page by 'no_prefetch' took 4462 cycles (1400.7 MB/s)
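For anyone who wants to check how the cycle counts relate to the MB/s column, the conversion can be sketched as below. The page size of 4096 bytes and the clock of roughly 1.6 GHz are assumptions inferred from the figures themselves, not values printed by the benchmark; 1 MB is taken as 2^20 bytes. The function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: throughput in MB/s (1 MB = 2^20 bytes) for
 * moving `bytes` bytes in `cycles` CPU cycles at `clock_hz` Hz. */
double throughput_mb_per_s(size_t bytes, double cycles, double clock_hz)
{
    assert(cycles > 0.0 && clock_hz > 0.0);

    double seconds = cycles / clock_hz;            /* elapsed time */
    return (double)bytes / seconds / (1024.0 * 1024.0);
}
```

With those assumptions, `throughput_mb_per_s(4096, 12757.0, 1.6e9)` comes out close to the 489.9 MB/s listed for 'normal_clear_page'.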
> I'm not sure if this applies to the K6 family or just to Athlons. I
> suspect it may only apply to Athlons, but we may have to test it.
According to AMD (see the gdc2002.htm Presentation) it applies to _all_ modern
x86 CPUs out there.
> >>Since these functions are globally exported, it might be worth it to
> >>write a quick test that calls the various _transform_normalize_normals
> >>functions to make sure that they all produce the same (or close enough)
> > And:
> > _transform_normalize_normals_no_rot
> > _transform_rescale_normals_no_rot
> > _transform_rescale_normals
> > _transform_normals_no_rot
> > _transform_normals
> > _normalize_normals
> > _rescale_normals
> > These should be tested too, while we're at it.
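A rough idea of what such a test could look like, as a sketch only: run a plain reference routine and an optimized variant on the same input and count the components that differ by more than a tolerance. The function-pointer signature, the two rescale routines and MAX_NORMALS are hypothetical and do not match the real Mesa entry points, which take different arguments.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_NORMALS = 256 };   /* assumed upper bound for the scratch buffers */

/* Hypothetical signature: n normals as 3*n floats, divided by len. */
typedef void (*rescale_func)(const float *in, float *out, size_t n, float len);

/* Straightforward reference: one division per component. */
void rescale_reference(const float *in, float *out, size_t n, float len)
{
    for (size_t i = 0; i < 3 * n; i++)
        out[i] = in[i] / len;
}

/* Stand-in for an optimized variant: multiplies by a precomputed
 * reciprocal and walks the array backwards, so results may differ
 * from the reference in the last bits. */
void rescale_variant(const float *in, float *out, size_t n, float len)
{
    const float inv = 1.0f / len;
    for (size_t i = 3 * n; i-- > 0; )
        out[i] = in[i] * inv;
}

/* Returns the number of components whose results differ by more than tol. */
size_t count_mismatches(rescale_func a, rescale_func b,
                        const float *in, size_t n, float len, float tol)
{
    static float out_a[3 * MAX_NORMALS], out_b[3 * MAX_NORMALS];
    size_t bad = 0;

    assert(n <= MAX_NORMALS);
    a(in, out_a, n, len);
    b(in, out_b, n, len);

    for (size_t i = 0; i < 3 * n; i++) {
        float d = out_a[i] - out_b[i];
        if (d < 0.0f)
            d = -d;               /* absolute difference without libm */
        if (d > tol)
            bad++;
    }
    return bad;
}
```

A result of 0 from `count_mismatches(rescale_reference, rescale_variant, in, n, len, 1e-5f)` would mean the two routines agree within the tolerance. The same pattern could be applied to each function in the list above, with the C implementation as the reference.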