portability patch & fixed FPU asm bug w/ gcc

2006-11-12
2013-03-26
  • Peter Cordes
    2006-11-12

    GCC/GNU binutils AT&T-style assembly syntax deliberately swaps some x87 mnemonics, to stay bug-compatible with early 386 Unix assemblers: for the non-commutative instructions (fsub/fsubr, fdiv/fdivr) with an %st(i) destination, the mnemonics are reversed.  So in those cases you need to write fsub where you mean fsubr.  Not even objdump -d -M intel fixes this, so use a different disassembler to check.  See http://bugs.debian.org/372528 for more info.

    My suggestion is to put the assembly code in a separate file, which you assemble with NASM.  Then you'd have to write whole functions, so you'd need wrappers to call your C++ member functions, etc, but you'd only have to maintain a single ASM version for Windows/gcc platforms/whatever.  And you can use normal Intel syntax!  (You can use Intel syntax with the GNU assembler, with the .intel_syntax directive, but that doesn't fix the FPU mnemonics, so I'd go with NASM if I were you.)
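
    A minimal sketch of the wrapper idea, assuming a hypothetical Renderer class and a hypothetical function name (neither is from FFFF).  Code assembled separately can only call plain C symbols, so a small extern "C" function forwards to the member function:

```cpp
// Hypothetical class standing in for whatever C++ object the asm code needs to reach.
class Renderer {
public:
    int lines_done = 0;
    int last_line = -1;
    void line_done(int y) { ++lines_done; last_line = y; }
};

// Unmangled entry point that a separately assembled routine could declare as external.
// On platforms that prefix C symbols with an underscore, the asm side has to add it.
extern "C" void renderer_line_done(void *self, int y) {
    static_cast<Renderer *>(self)->line_done(y);
}
```

    The asm routine would get the object pointer as an ordinary argument and pass it back as the first parameter.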

    BTW, I got FFFF to build on a machine in a cluster I admin at work.  It's a dual-socket dual-core Opteron, running Solaris 10 amd64 :)  I had to fix the Makefile, and I also cleaned up the code some.  (I installed freeglut with pkg-get, from blastwave.org.  That's why my Makefile refers to /opt/csw.)  diff posted at http://cordes.ca/~peter/FFFF/ffff323-cleanup.diff.bz2

    notes:
    * I renamed Makefile.linux to Makefile.  The irix makefile stuff could be merged in there and detected with
    ifeq ($(shell uname),whatever)
    ... modify vars
    endif
    The Makefile detects SunOS (Solaris) and OS X (Darwin), and tweaks stuff for them.

    * I had to fix extensions.cpp (#define GLX_GLXEXT_PROTOTYPES, not just GL_...) to get it to compile w/ the library versions I have (Debian unstable and Ubuntu Dapper).  A lot of tweaking in that department was needed to compile on OS X.

    * I had to use g++, not gcc, otherwise it doesn't link on Solaris.

    * -fno-inline-functions is needed to avoid having the same inline asm twice in the output, where the label names will conflict.  I think there's a way to write local labels, or something, but just use NASM and put the ASM functions in a separate file.  Since you're doing a line at a time, the call overhead is negligible.  Actually, you'd probably be ok if you made them static inline, so there wouldn't be a non-inline copy at all.
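
    For what it's worth, a sketch of the local-label approach, assuming GNU-style inline asm on x86 (the function names and the loop are made up for illustration, not taken from FFFF).  Numeric labels such as 1: may be defined any number of times; 1b and 1f refer to the nearest definition backward and forward, so a second inlined copy of the block doesn't clash:

```cpp
// Counts n down to zero in an asm loop and returns how many times the loop body ran.
static inline unsigned count_down(unsigned n) {
    unsigned iters = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "testl %1, %1\n\t"
        "jz 2f\n"            // n == 0: jump forward past the loop
        "1:\n\t"             // numeric local label, safe to duplicate
        "incl %0\n\t"
        "decl %1\n\t"
        "jnz 1b\n"           // jump back to the nearest preceding "1:"
        "2:\n"
        : "+r"(iters), "+r"(n)
        :
        : "cc");
#else
    // Portable fallback so the sketch still builds on non-x86 targets.
    while (n != 0) { ++iters; --n; }
#endif
    return iters;
}

// Two copies of the asm block end up in one function here if the compiler inlines both calls.
static inline unsigned count_twice(unsigned a, unsigned b) {
    return count_down(a) + count_down(b);
}
```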

    * It's possible to be a _lot_ more portable than #ifdef __linux__.  I hacked up the source so if it's not on Apple or sgi, it defines ffff_unix and checks that for most things in FFFF3.cpp.  In other files, I tried to re-arrange #if clauses so it ends with a #else and does the Linux/Solaris/whatever unix thing there.  I used sysconf() to get the number of online CPUs, since this is POSIX standard, and works on Linux+probably many others.  The glibc cpu count function was the only non-portable thing in the code, at least between GNU/Linux and Solaris+gcc.  Currently, it compiles and works correctly on Linux (make -f Makefile CFG=release) and on my Solaris system.  I don't have a win32 dev environment.  For OS X, I have an account on a dual G5 at work, but not any admin access or a desktop.  I had to tweak some things, but it now compiles with make -f Makefile.unix.  It doesn't run over remote X11, though, so I can't do more than:
    FFFF v3.2.3
    (C)1994-2006 Daniele Paccaloni (daniele.paccaloni@dylogic.com)
    Initalizing...
    Number of CPUs: 2
    SMP support available, creating 1 slave threads.
    AltiVec instructions supported. Switching to AltiVec quadpoints computation.
    SSE2 instructions NOT supported.
    3DNow! instructions NOT available.
    kCGErrorRangeCheck : Window Server communications from outside of session allowed for root and console user only
    Thread 1 says: "I'm a slave, I'm alive."
    INIT_Processeses(), could not establish the default connection to the WindowServer.Abort trap
    The Makefile section that detects Darwin substitutes /System/Library/Frameworks/GLUT.framework/GLUT for -lglut, since AFAICT, that's the only GLUT shared library.  And yes, it's not called anything .so, just GLUT!?!

    So yeah, __linux__ doesn't appear in the source anymore :)
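
    The sysconf() CPU count from this item can be sketched roughly as follows.  The parameter name _SC_NPROCESSORS_ONLN is my assumption about which query is meant (it is available on Linux, Solaris and OS X), and online_cpu_count is a made-up helper name:

```cpp
#include <unistd.h>

// Returns the number of CPUs currently online, or 1 if the system can't tell us.
static long online_cpu_count() {
#if defined(_SC_NPROCESSORS_ONLN)
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n > 0) return n;   // sysconf() returns -1 on error or if unsupported
#endif
    return 1;              // conservative fallback: run single-threaded
}
```

    Falling back to 1 keeps the program usable on a system that lacks the query, at the cost of not creating any slave threads.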

    * I changed some of the #ifs to #if defined(__GNUC__) && defined(__i386__) before using gcc-style inline ASM...   __GNUC__ and __ppc__ or __POWERPC__ for ppc asm would probably be the right thing to do, so it will work on linux/bsd/whatever on ppc.  Again, this mostly goes away if you use NASM.  Then you mostly just need to do actually OS-dependent tests, where it is appropriate to test for __APPLE__, with their weird OpenGL, etc.
    BTW, __i386__ isn't defined on amd64 linux gcc, so maybe test on __i386__ or __amd64__, since the same asm should work for both.  Or just leave the program as 32bit if you don't want to take advantage of the 8 extra xmm registers...  (although that forces people to have 32bit libs for glut and all that.)
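
    A sketch of what such compiler/CPU tests could look like.  FFFF_X86_GCC_ASM, FFFF_PPC_GCC_ASM and asm_flavour are hypothetical names, not what the patch uses; the predefined macros themselves are the ones gcc provides:

```cpp
// Select the inline-asm flavour from compiler and CPU macros, not from the OS.
#if defined(__GNUC__) && (defined(__i386__) || defined(__amd64__) || defined(__x86_64__))
#  define FFFF_X86_GCC_ASM 1
#elif defined(__GNUC__) && (defined(__ppc__) || defined(__POWERPC__) || defined(__powerpc__))
#  define FFFF_PPC_GCC_ASM 1
#endif

// Reports which code path the macros above selected.
static const char *asm_flavour() {
#if defined(FFFF_X86_GCC_ASM)
    return "x86 gcc-style inline asm";
#elif defined(FFFF_PPC_GCC_ASM)
    return "ppc gcc-style inline asm";
#else
    return "plain C++ fallback";
#endif
}
```

    Ending with a plain C++ branch means an unknown compiler or CPU still builds, just without the hand-written loops.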

    * I had to tell gcc that one of the inline asm statements clobbered memory, or else the stack frame would get corrupted, causing a segfault at the next function return.  Not sure what's up with that, or whether a more specific constraint would have been sufficient.
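
    To illustrate the two options (this is not the statement from FFFF, just a made-up store through a pointer, assuming GNU-style inline asm on x86): a blanket "memory" clobber versus an output constraint that names exactly the object being written.

```cpp
// Blanket version: "memory" tells gcc the asm may read or write any memory,
// so it won't keep values cached in registers across the statement.
static inline void store_clobber(unsigned *dst, unsigned value) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile("movl %1, (%0)" : : "r"(dst), "r"(value) : "memory");
#else
    *dst = value;  // portable fallback for non-x86 targets
#endif
}

// More specific version: "=m"(*dst) declares that this one object is written.
static inline void store_constrained(unsigned *dst, unsigned value) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("movl %1, %0" : "=m"(*dst) : "r"(value));
#else
    *dst = value;  // portable fallback for non-x86 targets
#endif
}
```

    Whether the specific form would have been enough for the statement in question depends on what that asm actually touches; the blanket clobber is the safe choice when that isn't obvious.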

    * After all that hacking, FFFF compiles out of the box for me with make -f Makefile.unix CFG=release on:
    - Debian unstable on x86
    - Solaris 10 amd64 (with 32bit gcc, though)
    - MacOS X (10.3?) don't know if it really runs properly, though.
    I turn on -msse2 and -m3dnow (on x86), so it can use the inline asm.

    benchmark results, on a Sun Fire v40z (quad 2.0GHz Opteron) running Solaris 10 (on the console with very slow Trident graphics, w/X.org;  shader program results included for comedy value.  It's actually so slow even for 2D that it's most usable inside a vncserver.)  compiled in 32bit mode with g++ 3.4.3, -O3 -march=k8 -ffast-math -funroll-loops -fomit-frame-pointer -Wall -fno-inline-functions

    double buffering, so no redraws while benching.  system idle.  window not covered or anything.
    $ release/ffff
    FFFF v3.2.3
    (C)1994-2006 Daniele Paccaloni (daniele.paccaloni@dylogic.com)
    Initalizing...
    Number of CPUs: 4
    SMP support available, creating 3 slave threads.
    Thread 1 says: "I'm a slave, I'm alive."
    Thread 2 says: "I'm a slave, I'm alive."
    SSE instructions supported. Switching to SSE quadpoints computation.
    SSE2 instructions supported.
    3DNow! instructions NOT available.
    Thread 3 says: "I'm a slave, I'm alive."
    OpenGL v1.5 Mesa 6.2.1
      Renderer: Mesa X11
      Vendor: Brian Paul

    FFFF v3.2.3 BENCHMARK (Using 1 CPU, no render)
    size:  500*500
    maxiters:      9999
    rangex:        -2.00 to 1.00
    rangey:        -1.50 to 1.50
      [4f] SSE benchmark:
        1.168 sec
        360.033 MegaIters/sec
      [2d] SSE2 benchmark:
        2.245 sec
        187.306 MegaIters/sec
      [2f] 3DNow! benchmark:
        Not supported.
      [1d] FPU ASM benchmark:
        3.092 sec
        136.012 MegaIters/sec
      [1d] FPU C benchmark:
        3.175 sec
        132.421 MegaIters/sec
      [4?] GPU VertexProgram benchmark (beta! maxiters=10) on Mesa X11:
        184.881 sec
        2.704 MegaIters/sec
      [4?] GPU VertexProgram benchmark (beta! maxiters=10) on Mesa X11:
      Maximum number of FP ALU instructions: 48
      Maximum number of FP native params: 64
      FP is hardware native (63 ALU instructions).
        380.192 sec
        2.630 MegaIters/sec

    g++'s FPU loop is _almost_ as fast as your ASM.  I don't think I tried to get it to compile with Sun's Studio compiler, but if I do I'll try with -fast -xarch=native64 -xvector=simd and see if it can use some mulpd instructions.  It doesn't with XaoS, but XaoS isn't just naively going over adjacent pixels.  Speaking of which, I'd love to see XaoS harness the brute force of your optimized SSE and SSE2 loops!  Or if there was an MPI version of this, I could harness all 80 CPUs in the cluster when they're otherwise idle :)  Then 10000 max iterations would be the biggest limit on zooming :)

    On my desktop at home, running Debian GNU/Linux on an athlon64 2.2GHz (3200+: socket754 newcastle core) in 32bit mode, with g++ 4.1, I get:
    release/ffff
    ...
    OpenGL v2.0.2 NVIDIA 87.76
      Renderer: GeForce 6200/AGP/SSE2/3DNOW!
      Vendor: NVIDIA Corporation
    ...
    FFFF v3.2.3 BENCHMARK (Using 1 CPU, no render)
    size:  500*500
    maxiters:      9999
    rangex:        -2.00 to 1.00
    rangey:        -1.50 to 1.50
      [4f] SSE benchmark:
        1.059 sec
        397.082 MegaIters/sec
      [2d] SSE2 benchmark:
        2.059 sec
        204.197 MegaIters/sec
      [2f] 3DNow! benchmark:
        Not supported.
      [1d] FPU ASM benchmark:
        3.010 sec
        139.714 MegaIters/sec
      [1d] FPU C benchmark:
        3.102 sec
        135.571 MegaIters/sec
      [4?] GPU VertexProgram benchmark (beta! maxiters=10) on GeForce 6200/AGP/SSE2/3DNOW!:
        2.405 sec
        207.920 MegaIters/sec
      [4?] GPU FragmentProgram benchmark (beta! maxiters=20) on GeForce 6200/AGP/SSE2/3DNOW!:
      Maximum number of FP ALU instructions: 4096
      Maximum number of FP native params: 1024
      FP is hardware native (63 ALU instructions).
        2.433 sec
        411.043 MegaIters/sec

    The fragment and vertex programs work correctly on GNU/Linux, even after my hacking, BTW.

    happy hacking,
    Peter Cordes
    peter@cordes.ca

     
    • Peter Cordes
      2007-03-20

      I updated my patch:
      http://cordes.ca/~peter/FFFF/ffff323-cleanup.diff.gz

      It works on AMD64 with gcc now.  All I had to do was use mallopt() to tell malloc to always allocate from the heap, even for big requests.  That keeps the PixelBuffer in the low 32 bits of the address space, so the same inline asm blocks work unchanged.  (Otherwise they segfault on the truncated addresses that result from storing a pointer in %esi.)  C, FPU ASM, SSE, and SSE2 all work on my Core 2 Duo running AMD64 Ubuntu Edgy :)  It should work on Intel Macs, too, because the feature tests check defined(__SSE2__) instead of assuming __APPLE__ means PPC, or anything.
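
      A rough sketch of the mallopt() trick, assuming glibc.  The specific parameter shown, M_MMAP_MAX set to 0 (which stops malloc from serving large requests with mmap), is my guess at a setting that has this effect; force_heap_allocations is a made-up name.  Whether the heap itself sits below 4GB depends on how the executable is linked and loaded, so treat this as a workaround, not a guarantee:

```cpp
#include <cstdlib>
#if defined(__GLIBC__)
#include <malloc.h>   // mallopt(), M_MMAP_MAX (glibc-specific)
#endif

// Asks malloc to stop using mmap for big requests, so they come from the heap.
// Returns true if the setting was applied, false where mallopt isn't available.
static bool force_heap_allocations() {
#if defined(__GLIBC__)
    return mallopt(M_MMAP_MAX, 0) == 1;   // mallopt() returns 1 on success
#else
    return false;
#endif
}
```

      It has to be called before the big buffer is allocated, e.g. at the top of main().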

      I've tested on 32bit GNU/Linux, 64bit GNU/Linux, and 32bit gcc on x86 Solaris.

      And this time I made sure to include the Makefile in the patch.  It was omitted last time, but this time I used diff --unidirectional-new-file.

       
    • Luc Simard
      2007-05-08

      Thanks Peter for all this information. It's really appreciated.