[Math-atlas-results] 3.5.2: I don't need your damned pity!
From: R C. W. <rw...@cs...> - 2003-05-04 15:10:02
Guys,

In order to allow for dynamically linked libs, I've been trying to get performance similar to Julian's athlon code with code written in gas assembler. After a whole lot of work, I got a kernel that is not quite as efficient in-cache as Julian's kernel, but seemed to tie or beat it out-of-cache for all precisions except double. Even for double, however, my new kernels could be used for cleanup. This took an entire week of evenings and two weekends of intensive effort, capped by the posting of 3.5.2 to sourceforge.

I then ran the timing to see what kind of speedup I got. The short answer: not really any :-{

I ran problem sizes from 100 to 1000 on my 600MHz Athlon classic, for all four precisions, using matrix multiply and LU factorization.

For double, the addition of cleanup did help slightly.

I was most excited about single, where use of a larger blocking factor (60 vs. 30) allowed me to obtain a slightly higher peak matmul. However, the increase in NB caused LU to run *slower*, so maybe I'll back out this change for 3.5.3.

For single precision complex, again there was a larger NB for increased matmul peak. The LU times appear to run within clock speed of each other.

For double precision complex, there appears to have been a microscopic speedup.
Yay,
Clint

Timings for a 600MHz Athlon classic, with 512K cache, but timers flushing 2MB:

VERS   OP     100   200   300   400   500   600   700   800   900  1000
=====  ===  ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
3.4.1  sMM    800   897   913   931   943   982   953   966   959   966
3.5.2  sMM    732   897   928   931   962   982   980   957   978   976

3.4.1  sLU    373   546   636   687   724   757   761   784   790   784
3.5.2  sLU    370   492   587   626   666   705   729   741   771   765

3.4.1  dMM    714   822   887   883   877   939   903   906   935   935
3.5.2  dMM    723   789   914   898   909   900   915   923   947   930

3.4.1  dLU    321   462   550   591   589   636   659   696   719   716
3.5.2  dLU    332   465   565   586   622   685   678   696   719   709

3.4.1  cMM    857   900   939   914   917   944   939   944   951   946
3.5.2  cMM    822   886   939   931   943   960   959   954   969   960

3.4.1  cLU    473   664   727   758   775   783   795   827   834   844
3.5.2  cLU    473   637   735   726   757   817   795   822   830   846

3.4.1  zMM    723   800   864   883   870   882   880   881   891   884
3.5.2  zMM    833   823   864   883   870   900   891   892   903   893

3.4.1  zLU    420   510   573   632   687   681   703   711   736   755
3.5.2  zLU    417   547   616   620   640   689   703   726   739   764
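
For anyone who wants to eyeball the 3.5.2 vs. 3.4.1 comparison without squinting at the columns, here is a rough Python sketch that computes the per-size ratio (3.5.2 / 3.4.1) and its mean for each operation. The numbers are copied from the table above. The names `TIMINGS`, `ratios` and `mean_ratio` are made up for this illustration, and the script assumes the entries are rates where higher is better, which the table itself does not state.

```python
# Timings copied from the table above; columns are problem sizes 100..1000.
# Assumption: entries are rates (higher is better); the table gives no units.
SIZES = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

TIMINGS = {
    "sMM": {"3.4.1": [800, 897, 913, 931, 943, 982, 953, 966, 959, 966],
            "3.5.2": [732, 897, 928, 931, 962, 982, 980, 957, 978, 976]},
    "sLU": {"3.4.1": [373, 546, 636, 687, 724, 757, 761, 784, 790, 784],
            "3.5.2": [370, 492, 587, 626, 666, 705, 729, 741, 771, 765]},
    "dMM": {"3.4.1": [714, 822, 887, 883, 877, 939, 903, 906, 935, 935],
            "3.5.2": [723, 789, 914, 898, 909, 900, 915, 923, 947, 930]},
    "dLU": {"3.4.1": [321, 462, 550, 591, 589, 636, 659, 696, 719, 716],
            "3.5.2": [332, 465, 565, 586, 622, 685, 678, 696, 719, 709]},
    "cMM": {"3.4.1": [857, 900, 939, 914, 917, 944, 939, 944, 951, 946],
            "3.5.2": [822, 886, 939, 931, 943, 960, 959, 954, 969, 960]},
    "cLU": {"3.4.1": [473, 664, 727, 758, 775, 783, 795, 827, 834, 844],
            "3.5.2": [473, 637, 735, 726, 757, 817, 795, 822, 830, 846]},
    "zMM": {"3.4.1": [723, 800, 864, 883, 870, 882, 880, 881, 891, 884],
            "3.5.2": [833, 823, 864, 883, 870, 900, 891, 892, 903, 893]},
    "zLU": {"3.4.1": [420, 510, 573, 632, 687, 681, 703, 711, 736, 755],
            "3.5.2": [417, 547, 616, 620, 640, 689, 703, 726, 739, 764]},
}


def ratios(op):
    """Per-size ratio of 3.5.2 over 3.4.1 for one operation."""
    old = TIMINGS[op]["3.4.1"]
    new = TIMINGS[op]["3.5.2"]
    return [n / o for n, o in zip(new, old)]


def mean_ratio(op):
    """Arithmetic mean of the per-size ratios for one operation."""
    r = ratios(op)
    return sum(r) / len(r)


if __name__ == "__main__":
    for op in TIMINGS:
        r = ratios(op)
        # A ratio above 1.0 means 3.5.2 posted the higher number at that size.
        print(f"{op}: mean {mean_ratio(op):.3f}, "
              f"min {min(r):.3f}, max {max(r):.3f}")
```

A plain arithmetic mean over the ten sizes is a crude summary; it weights N=100 the same as N=1000, so the per-size ratios are worth a look as well.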