#849 Poor DGEMV performance on ARM

Stable_(v3.10.x)
closed-works-for-me
5
2014-07-09
2012-08-16
No

Dear ATLAS community,

I'm developing an application that must perform matrix-vector products continuously on an ARM processor. I'm testing it on the "Snowball" embedded computer: http://www.calao-systems.com/articles.php?lng=en&pg=6186.

I wrote a simple program using the GSL library to test ATLAS performance on the ARM architecture, and I was surprised that the run time with ATLAS (2m45.839s) was very similar to the run time with the non-optimized GSL BLAS library (2m47.631s). For comparison, I did the same on a desktop Intel i3 computer, where the timing was 0m7.199s with ATLAS against 0m12.745s without ATLAS.
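
The test program (main.c) is not included in this report. Purely as an illustration, a benchmark of repeated matrix-vector products could be structured like the sketch below. It uses a plain C reference loop for y = alpha*A*x + beta*y (column-major, no transpose, unit strides) rather than a GSL or ATLAS call, so that it builds without any BLAS installed. The names ref_dgemv and time_gemv are hypothetical, and the fill values are arbitrary.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Reference y = alpha*A*x + beta*y for a column-major m x n matrix A
 * with leading dimension lda (no transpose, unit strides).
 * Hypothetical helper, not part of GSL or ATLAS. */
void ref_dgemv(int m, int n, double alpha, const double *A, int lda,
               const double *x, double beta, double *y)
{
    int i, j;
    assert(lda >= m);
    /* Scale y first; beta == 0 overwrites y instead of multiplying. */
    for (i = 0; i < m; i++)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];
    /* Accumulate one column of A at a time (contiguous access). */
    for (j = 0; j < n; j++) {
        const double axj = alpha * x[j];
        const double *col = A + (size_t)j * lda;
        for (i = 0; i < m; i++)
            y[i] += col[i] * axj;
    }
}

/* Run `reps` products on an n x n matrix and return the CPU seconds
 * used, or -1.0 if allocation fails. */
double time_gemv(int n, int reps)
{
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc((size_t)n * sizeof *x);
    double *y = malloc((size_t)n * sizeof *y);
    double secs = -1.0;
    size_t k;
    int r;
    if (A && x && y) {
        clock_t t0;
        for (k = 0; k < (size_t)n * n; k++)
            A[k] = 1.0 / (1.0 + (double)(k % 7));
        for (k = 0; k < (size_t)n; k++) {
            x[k] = 1.0;
            y[k] = 0.0;
        }
        t0 = clock();
        for (r = 0; r < reps; r++)
            ref_dgemv(n, n, 1.0, A, n, x, 0.0, y);
        secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    free(A);
    free(x);
    free(y);
    return secs;
}
```

To time a real BLAS instead, the call to ref_dgemv would be replaced by cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1) from the CBLAS interface. Note that clock() reports CPU time, which for a threaded library sums the time spent on all cores, so wall-clock measurements (as with the shell's time command) are the fairer comparison in that case.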

Is this expected? Are you seeing similar performance? Or am I missing something, probably in the configuration step? For the ARM processor I followed the configuration flags recommended by http://www.vesperix.com/arm/atlas-arm/faq/index.html for an Ubuntu 12.04 install.

Below I post the ATLAS installation steps, the test program's compilation and timing, and the system/CPU information.

Thanks in advance.

Olavo Luppi

==== ARM ARCHITECTURE ====

1) ATLAS Installation
---------------------
$ cd ~/libs/ATLAS-3.10.0/build/
$ ../source/configure -D c -DATL_ARM_HARDFP=1 -Si archdef 0 -Fa alg -mfloat-abi=hard
$ make build
$ make check
$ make ptcheck

2a) Running test file without ATLAS
------------------------------------
$ make
gcc -Wall -ggdb -ansi -c -o main.o main.c
gcc -Wall -ggdb -ansi main.o -o teste -lm -lgsl -lgslcblas
$ time ./teste
real 2m47.631s
user 2m47.020s
sys 0m0.180s

2b) Running test file with ATLAS
---------------------------------
$ make
gcc -Wall -ggdb -ansi -c -o main.o main.c
gcc -Wall -ggdb -ansi main.o -o teste -lgsl -lc -L/home/linaro/libs/current_ATLAS/build/lib/ -lptcblas -latlas -lpthread -lm
$ time ./teste
real 2m45.839s
user 4m59.100s
sys 0m5.350s

3) System/CPU Info
-------------------
$ uname -a
Linux linaro-ubuntu-desktop 3.3.1-37-linaro-lt-ux500 #37~lt~ci~20120524012446+1337874072~4fbea4f1-Ubuntu SMP PREEMPT armv7l armv7l armv7l GNU/Linux

$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 1 (v7l)
processor : 0
BogoMIPS : 4.80

processor : 1
BogoMIPS : 4.80

Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 1

Hardware : ST-Ericsson Snowball platform
Revision : 0000
Serial : 0000000000000000

=============================
==== INTEL ARCHITECTURE ====
=============================

1) ATLAS Installation
---------------------
$ cd ~/libs/ATLAS-3.10.0/build/
$ ../source/configure -D c -DPentiumCPS=3100.000
$ make build
$ make check
$ make ptcheck

2a) Running test file without ATLAS
------------------------------------
$ make
gcc -Wall -ggdb -ansi -c -o main.o main.c
gcc -Wall -ggdb -ansi main.o -o teste -lm -lgsl -lgslcblas
$ time ./teste
real 0m12.745s
user 0m12.725s
sys 0m0.020s

2b) Running test file with ATLAS
---------------------------------
$ make
gcc -Wall -ggdb -ansi -c -o main.o main.c
gcc -Wall -ggdb -ansi main.o -o teste -lgsl -lc -L/home/olavo/libs/current_ATLAS/build/lib/ -lptcblas -latlas -lpthread -lm
$ time ./teste
real 0m7.199s
user 0m21.253s
sys 0m2.124s

3) System/CPU Info
-------------------
$ uname -a
Linux drummond 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

$ tail -n 27 /proc/cpuinfo
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
stepping : 7
microcode : 0x25
cpu MHz : 3100.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 6185.90
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
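
The timings posted above can be reduced to two ratios per machine: the wall-clock speedup (real time without ATLAS divided by real time with ATLAS) and the average number of busy cores in the ATLAS run ((user + sys) / real). The sketch below computes both from time-style strings; the helper names are hypothetical. With the posted figures it gives a speedup of about 1.01 on ARM and about 1.77 on Intel, and about 1.84 busy cores on ARM against about 3.25 on Intel. One possible reading, not confirmed anywhere in this ticket, is that the threaded library (-lptcblas) keeps both ARM cores busy without reducing wall-clock time.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a `time`-style string such as "2m45.839s" to seconds.
 * Returns -1.0 if the string does not match. Hypothetical helper. */
double time_to_seconds(const char *s)
{
    int minutes;
    double seconds;
    if (sscanf(s, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return 60.0 * minutes + seconds;
}

/* Wall-clock speedup: real time of the baseline run divided by
 * real time of the ATLAS run. */
double speedup(const char *real_baseline, const char *real_atlas)
{
    double base = time_to_seconds(real_baseline);
    double atlas = time_to_seconds(real_atlas);
    assert(base > 0.0 && atlas > 0.0);
    return base / atlas;
}

/* Average number of busy cores during a run: (user + sys) / real. */
double busy_cores(const char *real, const char *user, const char *sys)
{
    double r = time_to_seconds(real);
    double u = time_to_seconds(user);
    double s = time_to_seconds(sys);
    assert(r > 0.0 && u >= 0.0 && s >= 0.0);
    return (u + s) / r;
}
```

For example, speedup("2m47.631s", "2m45.839s") covers the ARM runs and busy_cores("0m7.199s", "0m21.253s", "0m2.124s") covers the Intel run with ATLAS.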

Discussion

  • On my ARM system, I get almost double the speed of the Fortran BLAS I have access to. Here is the output of:
    ./xdl2blastst -F 20

    ------------------------------- GEMV --------------------------------
    TST# TR    M    N ALPHA  LDA INCX  BETA INCY   TIME MFLOP  SpUp  TEST
    ==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
       0  N  100  100   1.0 1000    1   1.0    1   0.00  39.4  1.00 -----
       0  N  100  100   1.0 1000    1   1.0    1   0.00  43.6  1.11  PASS
       1  N  200  200   1.0 1000    1   1.0    1   0.00  39.9  1.00 -----
       1  N  200  200   1.0 1000    1   1.0    1   0.00  74.0  1.85  PASS
       2  N  300  300   1.0 1000    1   1.0    1   0.00  40.2  1.00 -----
       2  N  300  300   1.0 1000    1   1.0    1   0.00  76.1  1.89  PASS
       3  N  400  400   1.0 1000    1   1.0    1   0.01  40.3  1.00 -----
       3  N  400  400   1.0 1000    1   1.0    1   0.00  80.0  1.98  PASS
       4  N  500  500   1.0 1000    1   1.0    1   0.01  40.3  1.00 -----
       4  N  500  500   1.0 1000    1   1.0    1   0.01  76.3  1.89  PASS
       5  N  600  600   1.0 1000    1   1.0    1   0.02  40.4  1.00 -----
       5  N  600  600   1.0 1000    1   1.0    1   0.01  77.8  1.93  PASS
       6  N  700  700   1.0 1000    1   1.0    1   0.02  40.3  1.00 -----
       6  N  700  700   1.0 1000    1   1.0    1   0.01  78.2  1.94  PASS
       7  N  800  800   1.0 1000    1   1.0    1   0.03  40.2  1.00 -----
       7  N  800  800   1.0 1000    1   1.0    1   0.02  79.9  1.99  PASS
       8  N  900  900   1.0 1000    1   1.0    1   0.04  40.1  1.00 -----
       8  N  900  900   1.0 1000    1   1.0    1   0.02  80.0  1.99  PASS
       9  N 1000 1000   1.0 1000    1   1.0    1   0.05  40.1  1.00 -----
       9  N 1000 1000   1.0 1000    1   1.0    1   0.03  78.6  1.96  PASS

    10 tests run, 10 passed

    Run from OBJdir/bin. What do you get for that?

    I am using the normal ATLAS install with this trick:
    http://math-atlas.sourceforge.net/errata.html#armhardfp

    It is true that the L2BLAS don't get nearly as much improvement as the L3BLAS in general, but I usually see some improvement unless the problems are very small.

    Let me know,
    Clint
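
One common way to see why the Level 2 BLAS gain less from tuning than the Level 3 BLAS is to count floating-point operations per data element touched. The sketch below uses simplified counts for square problems (about 2n² flops over n² + 2n elements for GEMV, about 2n³ flops over 3n² elements for GEMM) and ignores the alpha/beta scaling. It is an illustrative model with hypothetical function names, not something taken from ATLAS.

```c
#include <assert.h>

/* Flops per element touched for y = A*x + y with an n x n matrix A:
 * 2*n*n flops over n*n + 2*n elements (A, x and y). */
double gemv_flops_per_element(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n) / (n * n + 2.0 * n);
}

/* Same ratio for C = A*B + C with n x n matrices:
 * 2*n*n*n flops over 3*n*n elements (A, B and C). */
double gemm_flops_per_element(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

For n = 1000 the GEMV ratio is just under 2 while the GEMM ratio is about 667. Under this model GEMV has little data reuse to exploit and tends to be limited by memory bandwidth, whereas GEMM leaves much more room for cache blocking.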

     
    • assigned_to: nobody --> rwhaley
    • status: open --> open-works-for-me
     
    • status: open-works-for-me --> closed-works-for-me