From: Wilkens, Tim <wilkens@ui...>  2001-10-22 16:34:43

Hi Everyone,

I'm trying to build an optimized single-precision matrix * matrix (AxB) kernel for the Athlon, but I'm having problems getting high throughput, and I thought maybe someone here could help me out. I'm using SSE.

The kernel of my code multiplies a 64x64 submatrix of A by a 64x64 submatrix of B. The submatrices are prefetched into cache, and this kernel should fly at the speed of light: both submatrix A and submatrix B are in L1. My efforts to date are just for testing purposes, so the blocking factor of 64 is likely to change; for those interested, I have also tested blocking factors of NB = 36 and NB = 48.

I multiply 4 rows of submatrix A at a time by a column of submatrix B, then move to the next 4 rows of submatrix A, and so on. The entire multiplication of submatrix A by a single column of B is completely unrolled; then I loop over the columns of B. It's pivotal that I get stellar performance in the dot product of 4 rows of submatrix A with the 64 floats in the column of the B submatrix.

The data is arranged as such:

** register "edi" points to the first element of submatrix A
** register "esi" points to the column of submatrix B

Notes:
======
I bias the edi and esi registers by 128 bytes so I can sweep through the entire 64 floats (256 bytes) of each row of A using signed 8-bit displacements. In this format:

[edi-128] == address of the first element of the first row of submatrix A
[edi+112] == address of the last 4 floats of the first row of submatrix A

SSE uses xmm registers, each holding 4 floats (16 bytes), so I load 16 bytes at a time into the xmm registers.

OK, the code goes something like this:
=========================================================================
.
..
...
add edi,128
add esi,128
mov eax,256        ; size in bytes of a single row of submatrix A
mov ebx,768        ; size in bytes of 3 rows of submatrix A

xorps xmm4,xmm4
xorps xmm5,xmm5
xorps xmm6,xmm6
xorps xmm7,xmm7

movaps xmm0,XMMWORD PTR [edi-128]       ; first 4 floats of row 1 of A
movaps xmm1,XMMWORD PTR [edi+eax-128]   ; first 4 floats of row 2 of A
movaps xmm2,XMMWORD PTR [edi+eax*2-128] ; first 4 floats of row 3 of A
movaps xmm3,XMMWORD PTR [edi+ebx-128]   ; first 4 floats of row 4 of A
mulps xmm0,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 1 with col
mulps xmm1,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 2 with col
mulps xmm2,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 3 with col
mulps xmm3,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 4 with col
addps xmm4,xmm0    ; accumulate dot product of row 1 with col
addps xmm5,xmm1    ; accumulate dot product of row 2 with col
addps xmm6,xmm2    ; accumulate dot product of row 3 with col
addps xmm7,xmm3    ; accumulate dot product of row 4 with col

; WE HAVE HANDLED 4 FLOATS now, so we must load the xmm registers
; with data 16 bytes in front of our previous accesses:
movaps xmm0,XMMWORD PTR [edi-112]
movaps xmm1,XMMWORD PTR [edi+eax-112]
movaps xmm2,XMMWORD PTR [edi+eax*2-112]
movaps xmm3,XMMWORD PTR [edi+ebx-112]
mulps xmm0,XMMWORD PTR [esi-112]
mulps xmm1,XMMWORD PTR [esi-112]
mulps xmm2,XMMWORD PTR [esi-112]
mulps xmm3,XMMWORD PTR [esi-112]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

movaps xmm0,XMMWORD PTR [edi-96]
movaps xmm1,XMMWORD PTR [edi+eax-96]
movaps xmm2,XMMWORD PTR [edi+eax*2-96]
movaps xmm3,XMMWORD PTR [edi+ebx-96]
mulps xmm0,XMMWORD PTR [esi-96]
mulps xmm1,XMMWORD PTR [esi-96]
mulps xmm2,XMMWORD PTR [esi-96]
mulps xmm3,XMMWORD PTR [esi-96]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

movaps xmm0,XMMWORD PTR [edi-80]
movaps xmm1,XMMWORD PTR [edi+eax-80]
movaps xmm2,XMMWORD PTR [edi+eax*2-80]
movaps xmm3,XMMWORD PTR [edi+ebx-80]
mulps xmm0,XMMWORD PTR [esi-80]
mulps xmm1,XMMWORD PTR [esi-80]
mulps xmm2,XMMWORD PTR [esi-80]
mulps xmm3,XMMWORD PTR [esi-80]
addps xmm4,xmm0
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
.
..
...
=========================================================================

I'm not getting stellar performance from each 12-SSE-instruction package above. Each package performs 32 floating-point operations, and I believe it takes 13 cycles to execute each package, so the maximum throughput would be 32/13 = 2.46 FLOPS/cycle. This is much too low. Does anybody see anything wrong with how I've set up these instructions?

I do realize that the first move instruction in each package is 4 bytes and the other 3 are 5 bytes, which means I cannot decode more than 1 in any given clock cycle. Is this a problem, or can the Athlon only decode AND EXECUTE 1 movaps instruction per clock cycle?

Any and all help is greatly appreciated. I do not have a P4 and am not familiar with its capabilities,
though I wonder how many of the following instructions (movaps, mulps, addps) the P4 can execute in a given clock cycle.

Thanks for any assistance...
tim wilkens

BTW, this message has also been posted on comp.lang.asm.x86 here:
http://groups.google.com/groups?hl=en&group=comp.lang.asm.x86&selm=7b1e74d1.0110211952.146ca68a%40posting.google.com
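[Editor's note: for readers following the blocking scheme, here is a scalar C model of the access pattern the asm implements: four rows of the A block against one column of the B block, with four independent accumulators (the fully unrolled inner loop corresponds to the `k` loop below). Row-major storage for the A block and a contiguous column for B are inferred from the post's addressing (row stride 256 bytes = 64 floats); this sketch shows the structure only, not the SSE scheduling the post is asking about.]

```c
#include <stddef.h>

#define NB 64  /* blocking factor, as in the post */

/* Multiply one NB x NB block of A by one column of the B block,
 * four rows of A at a time, accumulating into 4 independent sums.
 * In the asm each sum lives in one xmm register (4 partial lanes). */
static void block_times_column(const float *A,    /* NB x NB, row-major */
                               const float *Bcol, /* NB contiguous floats */
                               float *Ccol)       /* NB output floats */
{
    for (size_t i = 0; i < NB; i += 4) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        for (size_t k = 0; k < NB; k++) { /* fully unrolled in the asm */
            acc0 += A[(i + 0) * NB + k] * Bcol[k];
            acc1 += A[(i + 1) * NB + k] * Bcol[k];
            acc2 += A[(i + 2) * NB + k] * Bcol[k];
            acc3 += A[(i + 3) * NB + k] * Bcol[k];
        }
        Ccol[i + 0] += acc0;
        Ccol[i + 1] += acc1;
        Ccol[i + 2] += acc2;
        Ccol[i + 3] += acc3;
    }
}
```

The four accumulators are what give the SSE version its independent dependency chains: each addps feeds only its own register, so the adds from different rows can overlap in the pipeline.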
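[Editor's note: the excerpt ends with four xmm accumulators, each holding 4 partial sums of one row.column dot product. After the 64 floats of the column are consumed, each register still has to be reduced horizontally to a single float before it can be stored into C; that step is not shown in the post. A sketch in C using only SSE1 intrinsics (shufps + addps, the same instruction family the asm uses) might look like this. `hsum_ps` is a hypothetical helper name, not from the post.]

```c
#include <xmmintrin.h>  /* SSE1 intrinsics: __m128, _mm_shuffle_ps, ... */

/* Horizontally sum the 4 lanes of an SSE register using only SSE1
 * instructions, i.e. nothing newer than what a 2001 Athlon supports. */
static float hsum_ps(__m128 v)
{
    /* [a b c d] + [b a d c]  ->  [a+b, a+b, c+d, c+d] */
    __m128 t = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    /* add the two pair-sums: every lane becomes a+b+c+d */
    t = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);  /* extract lane 0 */
}
```

Because this reduction runs only once per 64-float column (a handful of instructions against ~192 in the unrolled loop), it has little effect on the throughput question the post raises.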