#1 PPC optimizations


Would any of the documents help in coding a faster PPC core
(specifically the FFT optimizations using AltiVec)? Or maybe for
future factoring support?



  • Guillermo Ballester Valor

    Logged In: YES

    I've skimmed the document about FFTs. In summary, on page 21
    they show the timings for real FFTs using (I think) their
    best algorithm. For a 512K FFT they need 99.250 ms on a
    500 MHz G4.

    In a recent test on a 500 MHz G4, a tester reported a timing of
    about 345 ms per Lucas-Lehmer iteration. An L-L iteration has
    two FFTs (one forward and one inverse), a convolution, a
    carry-propagation phase, a normalization, a denormalization,
    and a few other small steps. In addition, I still don't know
    whether their timings are for double or single precision. I
    think this would not offer us a big improvement, but it would
    mean a lot of work redesigning the FFTs. Anyway, I'll look at
    it more closely. We can always find an interesting idea.
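
    For reference, the iteration being timed can be sketched as
    follows. This is a minimal illustration using Python's built-in
    big integers, not the FFT-based code discussed in this thread;
    the function name lucas_lehmer is only illustrative. In a real
    client the squaring is the step carried out with the forward
    FFT, the pointwise squaring, the inverse FFT and the carry
    propagation.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number under test
    s = 4                     # standard starting value
    for _ in range(p - 2):
        # One Lucas-Lehmer iteration: square, subtract 2, reduce mod m.
        # The squaring dominates the cost for large p, which is why
        # production code replaces it with an FFT-based multiplication.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```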

    OTOH, AltiVec instructions are not usable for us because they
    don't support double precision, and we need that precision.


    Guillermo Ballester Valor.

  • Klaus Kastens

    Klaus Kastens - 2003-02-26
    • priority: 5 --> 3
    • assigned_to: nobody --> gbvalor
  • Nobody/Anonymous

    Logged In: NO

    Greetings, it's Paulie from mersenneforums.org and TeamPrimeRib (I posted this original topic).

    If AltiVec isn't good enough because we need at least double precision, how about going oct-precision, or would that be too slow?


    Thanks so much Guillermo

  • Klaus Kastens

    Klaus Kastens - 2003-03-03

    Logged In: YES

    Some months ago I did some FPU throughput measurements on
    various PowerPC processors.
    One of these was an MPC7410, which has the same instruction
    timings as the MPC7400 used in the oct3a.pdf paper.

    Here is a comparison of the octuple precision emulation and
    native double precision on a 500 MHz MPC7400/10:

                 "octuple"      double
    add          9.4 MOcts      375 MFLOPS
    sub          8.8 MOcts      375 MFLOPS
    mul          4.5 MOcts      375 MFLOPS
    mantissa     224 bits       53 bits
    wordsize     ?              19/20 bits

    For the scalar FPU the throughput is limited to 3/4
    instructions per cycle, because each time the FPU pipeline
    is completely filled there is a single-cycle stall
    (=> 375 MFLOPS at 500 MHz).
    The wordsize is the number of bits used per vector element;
    it is chosen to make sure the accumulated roundoff error
    doesn't exceed a certain limit.
    For a rough estimate let's assume the wordsize in the
    octuple precision version is ten times the size of the
    double precision version. (IIRC it's actually less than half
    the size of the mantissa.)
    This results in a ten times smaller FFT size, and with
    O(N*log(N)) uses 1/33 of the instructions necessary for the
    double precision version.
    OTOH these instructions are some 40-80 times slower, so even
    with the very optimistic wordsize I estimate the AltiVec
    version is slower.
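
    The arithmetic behind this estimate can be reproduced with a
    short script. It is only a sketch under stated assumptions: the
    cost model is taken to be exactly N*log2(N), and the double
    precision FFT length is assumed to be 512K elements (the size
    mentioned earlier in the thread, not a figure from this post).
    The variable names are illustrative. Under these assumptions
    the instruction-count ratio comes out near 12 rather than the
    33 quoted above, so the exact figure depends on the model;
    either way the emulated version ends up slower.

```python
import math

CLOCK_MHZ = 500
# 3 instructions every 4 cycles because of the single-cycle stall.
fpu_mflops = CLOCK_MHZ * 3 / 4

# Measured octuple-precision throughput from the table (MOcts).
oct_rates = {"add": 9.4, "sub": 8.8, "mul": 4.5}
slowdown = {op: fpu_mflops / rate for op, rate in oct_rates.items()}


def nlogn(n: float) -> float:
    """Assumed FFT cost model: N * log2(N)."""
    return n * math.log2(n)


n_double = 512 * 1024          # assumption: 512K-element double FFT
n_oct = n_double / 10          # ten times larger wordsize => N / 10
instr_ratio = nlogn(n_double) / nlogn(n_oct)

# Net slowdown of the emulated version: per-op slowdown divided by
# the saving in instruction count.
net = {op: s / instr_ratio for op, s in slowdown.items()}

if __name__ == "__main__":
    print("scalar FPU peak:", fpu_mflops, "MFLOPS")
    print("per-op slowdown:", slowdown)
    print("instruction-count ratio:", instr_ratio)
    print("net slowdown:", net)
```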

    You mentioned factoring in your initial question.
    Ernst Mayer (Mlucas) is working on a trial factorer
    (currently limited to 65 bits, 96 bits planned). This makes
    heavy use of 64x64=128-bit integer products, which aren't
    available on 32-bit PowerPC and therefore have to be
    emulated from 32x32=64-bit products.
    I replaced some of the core routines with asm to improve the
    performance on my G3 and was looking into using AltiVec on
    the G4.
    Unfortunately the AltiVec unit only supports 16x16=32-bit
    integer multiplies, which means using 4 times as many
    sub-products on the AltiVec unit as on the scalar
    integer unit.
    Nevertheless I still hope to see some performance increase,
    especially on the first generation G4 (7400/7410), since
    there the scalar integer multiply instructions aren't pipelined.
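
    The factor of 4 in sub-products can be illustrated with a small
    sketch. It is not the Mlucas code or the asm mentioned above;
    mul_by_limbs is a hypothetical helper that builds a 64x64=128-bit
    product from limb-sized partial products and counts them. Python's
    big integers absorb the carries here, whereas fixed-width code has
    to propagate them explicitly.

```python
def mul_by_limbs(a: int, b: int, limb_bits: int, operand_bits: int = 64):
    """Multiply two unsigned operand_bits-wide integers using only
    limb_bits x limb_bits partial products.

    Returns (product, number_of_partial_products).
    """
    mask = (1 << limb_bits) - 1
    n_limbs = operand_bits // limb_bits
    a_limbs = [(a >> (i * limb_bits)) & mask for i in range(n_limbs)]
    b_limbs = [(b >> (i * limb_bits)) & mask for i in range(n_limbs)]

    product = 0
    count = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each partial product fits in 2 * limb_bits bits and is
            # shifted to its position in the 128-bit result.
            product += (ai * bj) << ((i + j) * limb_bits)
            count += 1
    return product, count


if __name__ == "__main__":
    a = 0xFFFFFFFFFFFFFFFF
    b = 0x123456789ABCDEF0
    p32, n32 = mul_by_limbs(a, b, 32)   # scalar unit: 32x32=64
    p16, n16 = mul_by_limbs(a, b, 16)   # AltiVec: 16x16=32
    print(n32, n16, p32 == p16 == a * b)
```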

  • Klaus Kastens

    Klaus Kastens - 2003-03-03
    • labels: --> FFT improvements
    • priority: 3 --> 5
  • Phil Carmody

    Phil Carmody - 2005-09-15

    Logged In: YES

    AltiVec does not have 64-bit float support. I believe that
    makes it impractical. The only way it might be useful would
    be in 'big limb' style transforms such as Nussbaumer.

