Thread: Re: [Algorithms] P3 Prefetching.
From: Juan C. A. B. <jc...@ro...> - 2001-01-29 18:08:00
At 04:55 PM 1/29/2001 +0000, you wrote:

>what I've got is an algorithm that reads short strips of memory (about 20
>bytes each) from 6 separate locations - performs a calculation using the
>data and writes a row of result values to a 7th address.
>
>BUT - I know the address of the strips I'll be wanting after the current
>one. So I figured I ought to be able to prefetch the 6 addresses whilst
>calculating on the current lines...

Have you considered the fact that prefetching will read a 32-byte-long, 32-byte-aligned piece of memory? If your 20-byte strip crosses a 32-byte alignment boundary, then you'll need two prefetches to get it. Prefetching the first and last byte should fix that.

>I've put the prefetch calls into the loop but they only give a slight
>speed improvement. and when I look in VTune I see that I'm still stalling
>waiting for the data as I was before... and I have no explanation as to
>why the prefetches are not taking place...
>the loop takes about 200 cycles to use a strip of data so there ought to
>be plenty time for the prefetch to have completed...

Do you touch other memory during those 200 cycles? Maybe you're prefetching that data and it gets evicted before it's used. Also, take into account that the cache is only 2-way, so that you can only (IIRC) have up to two pieces of data from the same portion of two different 4K pages.

>I'm most puzzled.

Yes, cache optimizations have a way of surprising you like no other. :)

Salutaciones,
JCAB

---------------------------------------------------------------------
Juan Carlos "JCAB" Arevalo Baeza | http://www.roningames.com
Senior Technology programmer     | mailto:jc...@ro...
Ronin Entertainment              | ICQ: 10913692
                     (my opinions are only mine)
JCAB's Rumblings: http://www.metro.net/jcab/Rumblings/html/index.html
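The "prefetch the first and last byte" suggestion can be sketched as below. This is a minimal illustration, not code from the thread: the helper names `lines_spanned` and `prefetch_strip` are hypothetical, the 32-byte line size is the figure quoted above, and `__builtin_prefetch` (a GCC/Clang builtin that is only a hint to the CPU) stands in for whatever prefetch instruction or intrinsic the original loop used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u /* line size quoted in the thread; an assumption elsewhere */

/* Number of cache lines touched by the byte range [p, p + len). */
size_t lines_spanned(const void *p, size_t len)
{
    assert(len > 0);
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)p + len - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}

/* Prefetch the first byte of a strip, and the last byte too when the
 * strip straddles a line boundary. Enough for strips no longer than
 * one cache line, such as the ~20-byte strips discussed here. */
void prefetch_strip(const void *p, size_t len)
{
    const char *bytes = (const char *)p;
    assert(len > 0 && len <= CACHE_LINE);
    __builtin_prefetch(bytes);
    if (lines_spanned(p, len) > 1)
        __builtin_prefetch(bytes + len - 1);
}
```

A 20-byte strip starting at offset 12 within a line fits in that line (bytes 12..31), while one starting at offset 16 spills into the next line (bytes 16..35) and needs the second prefetch.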
From: Jason R. <jas...@vo...> - 2001-01-30 15:05:28
Thanks for the ideas, Juan.

I think I have to make some more rigorous investigations into this stuff. I've made individual replies to your suggestions.

> -----Original Message-----
> From: Juan Carlos Arevalo Baeza [mailto:jc...@ro...]
> Sent: 29 January 2001 18:15
> To: gda...@li...
> Subject: Re: [Algorithms] P3 Prefetching.
>
> At 04:55 PM 1/29/2001 +0000, you wrote:
>
> >what I've got is an algorithm that reads short strips of memory (about 20
> >bytes each) from 6 separate locations - performs a calculation using the
> >data and writes a row of result values to a 7th address.
> >
> >BUT - I know the address of the strips I'll be wanting after the current
> >one. So I figured I ought to be able to prefetch the 6 addresses whilst
> >calculating on the current lines...
>
> Have you considered the fact that prefetching will read a 32-byte-long,
> 32-byte-aligned piece of memory? If your 20-byte strip crosses a
> 32-byte alignment boundary, then you'll need two prefetches to get it.
> Prefetching the first and last byte should fix that.

Yeah, I thought of that - and it is possible - but it wouldn't account for how often I get the misses. As far as I can see from VTune, the memory reads are causing L2 line allocations, so the data has been prefetched to neither L1 nor L2...

I think the problem is probably that I've not allowed enough time for the data to arrive after it's been prefetched.

> >I've put the prefetch calls into the loop but they only give a slight
> >speed improvement. and when I look in VTune I see that I'm still stalling
> >waiting for the data as I was before... and I have no explanation as to
> >why the prefetches are not taking place...
> >the loop takes about 200 cycles to use a strip of data so there ought to
> >be plenty time for the prefetch to have completed...
>
> Do you touch other memory during those 200 cycles? Maybe you're
> prefetching that data and it gets evicted before it's used.
> Also, take into account that the cache is only 2-way, so that you can
> only (IIRC) have up to two pieces of data from the same portion of two
> different 4K pages.

No, I don't read or write to other memory... well, except to read variables off the stack and things like that (but they certainly ought to be kept in L1 all the time).

And as for the P3 being 2-way associative - is this right? I thought it was 4-way associative. In fact I'm pretty certain it's 4-way.

I got a reply from someone at Intel who said it was maybe because I was simply trying to prefetch more than was allowed by the processor. Apparently it has a limit of 8 (perhaps 6) concurrent prefetches (Coppermine chips)... I'm not sure if prefetches after those 8 are ignored or if they evict older prefetch requests. I'm not sure which I'd prefer it to do!

It's a shame it doesn't have a queue of them, really... and it's also a shame that future memory requests don't just wait for the prefetch to complete rather than starting a new transfer from scratch.

Ho hum. I will report back when (if!) I make progress...

Thanks for your help,
Jason
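Two of the ideas above, giving a prefetch more time to complete and not issuing all of them in one burst, might be combined as in the following sketch. Everything specific in it is an assumption made for illustration: the contiguous layout of each source stream, the byte-wise sum used as the "calculation", the names `process_strips`, `NUM_SRC`, `STRIP_BYTES` and `PREFETCH_DIST`, and the use of the GCC/Clang `__builtin_prefetch` hint. Whether staggering helps on a given CPU would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_SRC = 6, STRIP_BYTES = 20 };
enum { PREFETCH_DIST = 2 }; /* strips to run ahead; a guess, tune by profiling */

/* Hypothetical kernel: for each strip, sum the six source strips
 * byte-wise into dst. Prefetches for the strip PREFETCH_DIST iterations
 * ahead are staggered, one stream per output byte, so that at most one
 * new prefetch is issued at a time instead of six back to back. */
void process_strips(const uint8_t *src[NUM_SRC], uint8_t *dst, size_t num_strips)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < num_strips; ++i) {
        size_t ahead = i + PREFETCH_DIST;
        for (size_t b = 0; b < STRIP_BYTES; ++b) {
            /* Only the first byte is prefetched here; a strip that
             * straddles a cache line would need its last byte too. */
            if (b < (size_t)NUM_SRC && ahead < num_strips)
                __builtin_prefetch(src[b] + ahead * STRIP_BYTES);

            unsigned acc = 0;
            for (int s = 0; s < NUM_SRC; ++s)
                acc += src[s][i * STRIP_BYTES + b];
            dst[i * STRIP_BYTES + b] = (uint8_t)acc;
        }
    }
}
```

Raising `PREFETCH_DIST` gives each prefetch longer to complete, at the cost of more lines in flight and a greater chance of eviction before use.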
From: Christopher H. <chu...@ca...> - 2001-01-30 16:24:50
> no, I don't read or write to other memory..
> well except to read variables off the stack and things like that (but they
> certainly ought to be at least kept in L1 all the time.)
>
> and as for the P3 being 2 way associative- is this right? - I thought it was
> 4 way associative..
>
> in fact I'm pretty certain it's 4-way.

According to Intel's docs

http://developer.intel.com/design/pentium4/manuals/245472.htm

The L1 Data Cache for the P6 family (which includes the Pentium III) has 16KBytes, 4-way set associative, 32-byte line size; or 8KBytes, 2-way set associative for earlier P6 processors.

For L2 Unified Cache... 128KB, 256KB, 512KB, 1MB, or 2MB, 4-way set associative, 32-byte cache line.

Looks like you were both right.
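To see how the quoted L1 figures relate to the earlier remark about "the same portion of two different 4K pages", here is a small sketch of conventional set-index arithmetic. It assumes the usual scheme in which the set is chosen by the address bits just above the line offset; the `cache_geom` type and function names are hypothetical, and the thread itself does not state how the P6 indexes its cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache geometry as quoted: total size, associativity, line size. */
typedef struct {
    size_t size_bytes;
    size_t ways;
    size_t line_bytes;
} cache_geom;

/* Number of sets = size / (ways * line size). */
size_t cache_num_sets(cache_geom g)
{
    assert(g.ways > 0 && g.line_bytes > 0);
    return g.size_bytes / (g.ways * g.line_bytes);
}

/* Set an address maps to, assuming simple modulo indexing. */
size_t cache_set_index(cache_geom g, uintptr_t addr)
{
    return (size_t)((addr / g.line_bytes) % cache_num_sets(g));
}
```

Under that assumption both quoted L1 configurations come out at 128 sets of 32-byte lines, so addresses 4096 bytes apart land in the same set. The 16KB, 4-way cache can hold four such lines at once and the 8KB, 2-way cache only two, which is consistent with both posters' recollections.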