## math-atlas-results — List specifically for timing results


Showing 3 results of 3

### [Math-atlas-results] Trying to build an SSE optimized sgemm kernel for Athlon...
From: Wilkens, Tim - 2001-10-22 16:34:43

```
Hi Everyone,

I'm trying to build an optimized single precision matrix * matrix (AxB)
kernel for the Athlon, but I'm having problems getting high throughput,
and I thought maybe someone here could help me out. I'm using SSE.

The kernel of my code involves multiplying a 64x64 submatrix of A by a
64x64 submatrix of B. The submatrices are prefetched into cache, so this
kernel should fly at the speed of light: both submatrix A and submatrix B
are in L1. My efforts to date are just for testing purposes, so the
blocking factor of 64 is likely to change; for those interested, I have
also tested blocking factors of NB = 36 and NB = 48.

I multiply 4 rows of submatrix A at a time by a column of submatrix B,
then move on to the next 4 rows of submatrix A, and so on. The entire
multiplication of submatrix A by a single column of B is completely
unrolled, and I loop over the columns of B. It's pivotal that I get
stellar performance in the dot product of 4 rows of submatrix A with the
64 floats in the column of the B submatrix.

The data is arranged as such:

  ** register "edi" points to the first element of submatrix A
  ** register "esi" points to the column of submatrix B

Notes:
======
I bias the edi and esi registers by 128 bytes so I can sweep through the
entire 64 floats (256 bytes) of each row of A. In this format:

  [edi-128] == address of first element of first row of submatrix A
  [edi+112] == address of last element of first row of submatrix A

SSE uses xmm registers, each of which contains 4 floats, or 16 bytes, so
I load 16 bytes at a time into the xmm registers.

OK, the code goes something like this:
=========================================================================
  ...
  add    edi,128
  add    esi,128
  mov    eax,256                          ; size in bytes of a single row of submatrix A
  mov    ebx,768                          ; size in bytes of 3 rows of submatrix A

  xorps  xmm5,xmm5
  xorps  xmm6,xmm6
  xorps  xmm7,xmm7
  xorps  xmm8,xmm8

  movaps xmm1,XMMWORD PTR [edi-128]       ; first 4 floats of row 1 of A
  movaps xmm2,XMMWORD PTR [edi+eax-128]   ; first 4 floats of row 2 of A
  movaps xmm3,XMMWORD PTR [edi+eax*2-128] ; first 4 floats of row 3 of A
  movaps xmm4,XMMWORD PTR [edi+ebx-128]   ; first 4 floats of row 4 of A
  mulps  xmm1,XMMWORD PTR [esi-128]       ; multiply 4 #'s of row 1 with col
  mulps  xmm2,XMMWORD PTR [esi-128]       ; multiply 4 #'s of row 2 with col
  mulps  xmm3,XMMWORD PTR [esi-128]       ; multiply 4 #'s of row 3 with col
  mulps  xmm4,XMMWORD PTR [esi-128]       ; multiply 4 #'s of row 4 with col
  addps  xmm5,xmm1                        ; accumulate dot product of row 1 with col
  addps  xmm6,xmm2                        ; accumulate dot product of row 2 with col
  addps  xmm7,xmm3                        ; accumulate dot product of row 3 with col
  addps  xmm8,xmm4                        ; accumulate dot product of row 4 with col

  ; We have handled 4 floats now, so we must load the xmm registers with
  ; data 16 bytes in front of our previous accesses.
  movaps xmm1,XMMWORD PTR [edi-112]
  movaps xmm2,XMMWORD PTR [edi+eax-112]
  movaps xmm3,XMMWORD PTR [edi+eax*2-112]
  movaps xmm4,XMMWORD PTR [edi+ebx-112]
  mulps  xmm1,XMMWORD PTR [esi-112]
  mulps  xmm2,XMMWORD PTR [esi-112]
  mulps  xmm3,XMMWORD PTR [esi-112]
  mulps  xmm4,XMMWORD PTR [esi-112]
  addps  xmm5,xmm1
  addps  xmm6,xmm2
  addps  xmm7,xmm3
  addps  xmm8,xmm4

  movaps xmm1,XMMWORD PTR [edi-96]
  movaps xmm2,XMMWORD PTR [edi+eax-96]
  movaps xmm3,XMMWORD PTR [edi+eax*2-96]
  movaps xmm4,XMMWORD PTR [edi+ebx-96]
  mulps  xmm1,XMMWORD PTR [esi-96]
  mulps  xmm2,XMMWORD PTR [esi-96]
  mulps  xmm3,XMMWORD PTR [esi-96]
  mulps  xmm4,XMMWORD PTR [esi-96]
  addps  xmm5,xmm1
  addps  xmm6,xmm2
  addps  xmm7,xmm3
  addps  xmm8,xmm4

  movaps xmm1,XMMWORD PTR [edi-80]
  movaps xmm2,XMMWORD PTR [edi+eax-80]
  movaps xmm3,XMMWORD PTR [edi+eax*2-80]
  movaps xmm4,XMMWORD PTR [edi+ebx-80]
  mulps  xmm1,XMMWORD PTR [esi-80]
  mulps  xmm2,XMMWORD PTR [esi-80]
  mulps  xmm3,XMMWORD PTR [esi-80]
  mulps  xmm4,XMMWORD PTR [esi-80]
  addps  xmm5,xmm1
  addps  xmm6,xmm2
  addps  xmm7,xmm3
  addps  xmm8,xmm4
  ...
=========================================================================

I'm not getting stellar performance out of each 12-SSE-instruction
package above. Each package contains 32 floating point operations, and
it's taking, I believe, 13 cycles to execute each package. Consequently,
the maximum throughput would be 32/13 = 2.46 flops/cycle. This is much
too low. Does anybody see anything wrong with how I've set up these
instructions?

I do realize that the first move instruction in each package is 4 bytes
and the other 3 are 5 bytes, which means I cannot decode more than 1 in
any given clock cycle. Is this a problem, or can the Athlon only decode
AND EXECUTE 1 movaps instruction per clock cycle?

Any and all help is greatly appreciated. I do not have a P4 and am not
familiar with its capabilities, though I wonder how many of the
following instructions the P4 can execute in a given clock cycle:
movaps, mulps, addps.

Thanks for any assistance...

tim wilkens

BTW, this message has also been posted on comp.lang.asm here:
http://groups.google.com/groups?hl=en&group=comp.lang.asm.x86&selm=7b1e74d1.0110211952.146ca68a%40posting.google.com
```
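For readers who don't follow x86 assembly, the loop structure Tim describes — four rows of the A submatrix dotted against one column of B at a time, with the inner 64-float dot product fully unrolled in his real kernel — can be sketched in plain C. This is a hypothetical illustration of the blocking scheme only (the function name and the use of a scalar loop in place of the unrolled SSE packages are mine, not from the post):

```c
#include <assert.h>

#define NB 64  /* blocking factor, as in the post */

/* Multiply an NB x NB row-major submatrix of A by one column of B,
 * producing one column of C.  Four rows of A are processed per pass,
 * each accumulating its own dot product -- the four independent
 * accumulators correspond to the post's four xmm accumulator registers.
 * The inner k-loop stands in for the fully unrolled SSE packages. */
static void kernel_4row_col(const float *A, const float *Bcol, float *Ccol)
{
    for (int i = 0; i < NB; i += 4) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        for (int k = 0; k < NB; ++k) {          /* 64 floats per row */
            acc0 += A[(i + 0) * NB + k] * Bcol[k];
            acc1 += A[(i + 1) * NB + k] * Bcol[k];
            acc2 += A[(i + 2) * NB + k] * Bcol[k];
            acc3 += A[(i + 3) * NB + k] * Bcol[k];
        }
        Ccol[i + 0] = acc0;
        Ccol[i + 1] = acc1;
        Ccol[i + 2] = acc2;
        Ccol[i + 3] = acc3;
    }
}
```

The full kernel would then loop this over all NB columns of the B submatrix, exactly as the post describes.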
### [Math-atlas-results] Athlon results
From: R Clint Whaley - 2001-10-19 02:36:20

```
Guys,

The wheels are still turning to get out the 3.3.8 release. I'm sending
some pre-release timings, just to give hope to the Athlon users out
there. Julian Ruhe has submitted an assembly-language kernel that
improves ATLAS's double precision Athlon performance by over 25%.

Just to whet your appetite, I include some timings using his new kernel
below. I'm comparing my development tree using his kernel (mislabeled as
3.3.8) against an old release I had sitting around on the machine,
3.3.2. 3.3.2 will have the same DGEMM performance as the present
release, 3.3.7. My development tree adds no performance wins over 3.3.7,
so the whole difference you see is Julian's kernel.

The kernels are written in nasm assembly, and will be available in
source form for the curious in the next release. This is why we can't
just give you the kernel to add to your 3.3.7 stuff: I had to add
additional kernel support for non-C contributions (our other assembly
routines used the gnu assembler, and thus could be handled by gcc).

The numbers are for a 1.2Ghz Athlon (pre-Athlon4) with DDR memory. The
kernel performs similarly for older systems (you get about the same % of
peak on my 600Mhz Athlon classic, roughly 920Mflop).

And before someone asks, yes, this is getting the right answer as
well :->

Cheers,
Clint

3.3.2: Old ATLAS release on same machine
3.3.8: My development tree + Julian's Athlon kernel

1.2Ghz Athlon (2.4Gflop peak):

             100    200    300    400    500    600    700    800    900   1000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM 1136.4 1271.8 1388.6 1280.0 1315.8 1393.5 1372.0 1383.8 1429.4 1418.4
3.3.8 dMM 1315.8 1377.8 1567.7 1600.0 1666.7 1728.0 1759.0 1735.6 1778.0 1785.7
3.3.2 dLU  676.0  841.3  914.1  982.7  970.8 1027.3 1054.3 1065.7 1116.3 1110.3
3.3.8 dLU  698.6  917.8 1052.5 1064.7 1147.7 1232.7 1202.2 1263.0 1312.4 1359.5

            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.8 dMM 1818.9 1799.3 1824.5 1842.7 1820.3 1823.3 1855.6 1832.7 1840.8 1848.7
3.3.8 dLU 1387.1 1406.4 1451.8 1472.1 1532.0 1536.0 1455.5 1446.2 1437.2 1473.8
```
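As a sanity check on the figures above: the quoted 2.4 Gflop peak comes from the Athlon retiring up to 2 double precision flops per cycle at 1.2 GHz, so percent-of-peak is easy to recompute. A small sketch (the helper name is mine, purely illustrative):

```c
#include <assert.h>

/* Percent of theoretical peak for an Athlon: up to 2 double precision
 * flops per cycle, so peak Mflops = 2 * clock in MHz. */
static double pct_of_peak(double mflops, double mhz)
{
    return 100.0 * mflops / (2.0 * mhz);
}
```

Plugging in the best dMM number, `pct_of_peak(1855.6, 1200.0)` gives about 77% of peak, and `pct_of_peak(920.0, 600.0)` gives about the same for the 600Mhz Athlon classic — consistent with Clint's remark that the kernel reaches roughly the same fraction of peak on older parts.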
### [Math-atlas-results] SSE warnings, Band matrix request feature
From: Camm Maguire - 2001-10-12 01:30:53

```
Greetings! Two items:

1) In trying to clean up the warnings on the l2 SSE kernels, I'm finding
that many of them only appear when using the 2.96 (broken) gcc version
on torc. 2.95.x and 3.0.2 don't appear to show these warnings, which
refer to macro redefinitions, but I have only tested 3.0.2 on non-i386
machines. In any case, my code includes the same header multiple times,
between each of which a few key macros are changed. Certain of the
macros in the header file thus multiply included give the redefinition
warning with 2.96, while others adjacently defined do not -- no apparent
rhyme or reason. I can certainly work around this with undefs, or some
moderate rewriting, but I'd like to get a minimal fix in first, so I'm
wondering whether 2.96 is faulty in this respect and should be ignored.
As long as I've used these macros, redefining the same macro to the same
value has never produced a warning, but maybe I've been relying on
non-standard cpp all this time.

2) I've gotten interested in band matrices recently, and am wondering
how atlas handles these. Take the extreme case of a diagonal matrix,
'band packed' so that the diagonal elements are contiguous in memory.
For s{tsg}bmv, there seems to be no way the basic atlas code can hand
this off to a kernel without moving the memory around. But this would be
an easily vectorizable operation. Should we have a fourth l2 kernel to
deal with band matrices?

Take care,

--
Camm Maguire                                                     camm@...
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah
```
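To make the band-matrix point concrete: in the extreme diagonal case Camm describes, band-packed storage keeps A(i,i) contiguous, so the whole matrix-vector product collapses to an elementwise loop that any vectorizer handles without moving memory around. A minimal sketch of that special case (the function name and signature are illustrative, not an ATLAS kernel interface):

```c
#include <stddef.h>

/* y = alpha*A*x + beta*y where A is diagonal and band-packed:
 * ad[i] holds A(i,i) contiguously in memory.  The loop is a pure
 * elementwise operation over three contiguous arrays, hence trivially
 * vectorizable -- the case the post argues a dedicated kernel could
 * exploit directly. */
static void diag_sbmv(size_t n, float alpha, const float *ad,
                      const float *x, float beta, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * ad[i] * x[i] + beta * y[i];
}
```

The general banded case adds the off-diagonal rows of the packed storage, but each stays a contiguous streaming access in the same way.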
