Would any of the documents help in coding a faster PPC core
(specifically the FFT optimizations using Altivec)? Or maybe for
future factoring support?
Logged In: YES
I've skimmed the document about FFTs. In summary, on page 21
they show the timings for real FFTs for (I think) their
best algorithm. For a 512K FFT they spend 99'250
ms on a 500 MHz G4.
In a recent test on a G4 at 500 MHz, a tester reported a
timing of about 345 ms per Lucas-Lehmer iteration. An L-L
iteration has two FFTs (one forward and one inverse), a
convolution, a carry phase, normalization and
denormalization steps, and a few other small operations. In
addition, I still don't know whether their timings are for
double or single precision. I think this would not offer us a
big improvement, but it would mean a lot of work redesigning
the FFTs. Anyway, I'll look at it more closely.
We can always find an interesting idea.
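
For readers unfamiliar with the test, here is a minimal sketch of the
Lucas-Lehmer iteration itself. It uses Python's built-in big integers
for the squaring, as a stand-in for the FFT-based multiplication
discussed in this thread; the function name is only illustrative. The
point is that each iteration is one big squaring modulo 2^p - 1, which
is where the two FFTs per iteration come from.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 iterations in total
        # One L-L iteration: square, subtract 2, reduce mod 2**p - 1.
        # A production program does the squaring with a forward FFT,
        # a pointwise squaring and an inverse FFT instead of `s * s`.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```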
OTOH, AltiVec instructions are not usable for us because they
aren't double-precision capable, and we need that precision.
Guillermo Ballester Valor.
Logged In: NO
Greetings, it's Paulie from mersenneforums.org and TeamPrimeRib (I posted this original topic).
If AltiVec isn't good enough because we need at least double precision, how about going oct-precision, or would that be too slow?
Thanks so much, Guillermo.
Logged In: YES
Some months ago I did some FPU throughput measurements on
various PowerPC processors.
One of these was an MPC7410, which has the same instruction
timings as the MPC7400 used in the oct3a.pdf paper.
Here is a comparison of the octuple precision emulation and
native double precision on a 500 MHz MPC7400/10:
            octuple (emulated)   double (native)
add         9.4 MOcts            375 MFLOPS
sub         8.8 MOcts            375 MFLOPS
mul         4.5 MOcts            375 MFLOPS
mantissa    224 bits             53 bits
wordsize    ?                    19/20 bits
For the scalar FPU the throughput is limited to 3/4
instructions per cycle, because each time the FPU pipeline
is completely filled there is a single-cycle stall. => 375 MFLOPS
The wordsize is the number of bits used per vector element
and is chosen to make sure the accumulated roundoff error
doesn't exceed a certain limit.
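
To make the role of the wordsize concrete, here is a small sketch of how
it relates to the number of vector elements. It assumes the p bits of
the number are spread evenly over the elements, which is a
simplification (real programs pick from a set of supported FFT lengths
and may mix word sizes). The exponent of 10,000,000 is a hypothetical
example, not a figure from this thread.

```python
def elements_needed(exponent_bits, wordsize_bits):
    """Minimum number of vector elements to hold `exponent_bits` bits
    when each element carries `wordsize_bits` bits (ceiling division)."""
    return -(-exponent_bits // wordsize_bits)


if __name__ == "__main__":
    p = 10_000_000  # hypothetical exponent, for illustration only
    for bits in (19, 20):
        print(bits, "bits per element ->", elements_needed(p, bits), "elements")
```

A larger wordsize means fewer elements and a shorter FFT, but more
roundoff error per element, which is why it cannot be raised freely.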
For a rough estimate let's assume the wordsize in the
octuple precision version is ten times the size of the
double precision version. (IIRC it's actually less than half
the size of the mantissa.)
This results in a ten times smaller FFT size, and with
O(N*log(N)) it uses 1/33 of the instructions necessary for the
double precision version.
OTOH these instructions are some 40-80 times slower, so even
with the very optimistic wordsize I estimate the Altivec
version is slower.
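
The estimate above can be reproduced with a few lines of arithmetic.
This sketch only plugs in the figures quoted in this post (3/4
instruction per cycle at 500 MHz, the MOcts rates, and the 1/33
instruction fraction); the variable and function names are mine, and it
treats every instruction as if it were of one kind, so it is a rough
check rather than a benchmark.

```python
CLOCK_MHZ = 500.0
# Scalar FPU: 3/4 instruction per cycle, as stated in the post.
DOUBLE_MFLOPS = 0.75 * CLOCK_MHZ

# Octuple-precision emulation rates from the post, in millions of ops/s.
OCT_RATES = {"add": 9.4, "sub": 8.8, "mul": 4.5}

# Fraction of instructions the octuple version needs (figure from the post).
INSTRUCTION_FRACTION = 1.0 / 33.0


def per_op_slowdown(oct_rate):
    """How many times slower one octuple op is than one double op."""
    return DOUBLE_MFLOPS / oct_rate


def net_slowdown(oct_rate):
    """Overall slowdown once the reduced instruction count is applied."""
    return per_op_slowdown(oct_rate) * INSTRUCTION_FRACTION


if __name__ == "__main__":
    for op, rate in OCT_RATES.items():
        print(f"{op}: {per_op_slowdown(rate):.1f}x per op, "
              f"{net_slowdown(rate):.2f}x net")
```

A net factor above 1 for every operation means the octuple version
still comes out slower, even with the optimistic wordsize.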
You mentioned factoring in your initial question.
Ernst Mayer (Mlucas) is working on a trial factorer
(currently limited to 65 bits, 96 bits planned). This makes
heavy use of 64x64=128-bit integer products, which aren't
available on 32-bit PowerPC and therefore have to be
emulated from 32x32=64-bit products.
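
As an illustration of that emulation, here is a sketch of a 64x64=128-bit
product built from four 32x32=64-bit products. It is written in Python
with explicit masks to mimic fixed-width registers; it is not taken from
Mlucas, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32x32(a, b):
    """Stand-in for a hardware 32x32=64-bit multiply."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b


def mul64x64(a, b):
    """Return (hi, lo) 64-bit halves of the 128-bit product a * b."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four partial products, each fitting in 64 bits.
    ll = mul32x32(a_lo, b_lo)
    lh = mul32x32(a_lo, b_hi)
    hl = mul32x32(a_hi, b_lo)
    hh = mul32x32(a_hi, b_hi)

    # Sum the pieces that land in bits 32..63; anything above 32 bits
    # here is the carry into the high half.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)

    lo = (ll & MASK32) | ((mid & MASK32) << 32)
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo


if __name__ == "__main__":
    a, b = 0xFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0
    hi, lo = mul64x64(a, b)
    print(hex(hi), hex(lo), (hi << 64 | lo) == a * b)
```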
I replaced some of the core routines with asm to improve the
performance on my G3 and was looking into using AltiVec on
Unfortunately the AltiVec unit only supports 16x16=32-bit
integer multiplies, which means using 4 times as many
sub-products on the AltiVec unit as compared to the scalar unit.
Nevertheless I still hope to see some performance increase,
especially on the first-generation G4 (7400/7410), since
there the scalar integer multiply instructions aren't pipelined.
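
The factor of 4 follows from schoolbook multiplication: halving the limb
width doubles the number of limbs in each operand, so the number of
partial products quadruples. A minimal sketch that counts them for a
64x64-bit product (the function names are illustrative, and this models
only the product count, not AltiVec's ability to do several multiplies
per instruction):

```python
def split_limbs(x, limb_bits, total_bits):
    """Split x into little-endian limbs of `limb_bits` bits each."""
    mask = (1 << limb_bits) - 1
    return [(x >> shift) & mask for shift in range(0, total_bits, limb_bits)]


def schoolbook_mul(a, b, limb_bits, total_bits=64):
    """Multiply a and b limb by limb; return (product, partial_product_count)."""
    a_limbs = split_limbs(a, limb_bits, total_bits)
    b_limbs = split_limbs(b, limb_bits, total_bits)
    product = 0
    count = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each ai * bj is one limb-sized hardware multiply.
            product += (ai * bj) << (limb_bits * (i + j))
            count += 1
    return product, count


if __name__ == "__main__":
    a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
    for limb_bits in (32, 16):
        product, count = schoolbook_mul(a, b, limb_bits)
        print(limb_bits, "bit limbs:", count, "partial products,",
              "correct" if product == a * b else "WRONG")
```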
Logged In: YES
Altivec does not have 64-bit float support. I believe that
makes it impractical. The only way it might be useful would
be in 'big limb' style transforms such as Nussbaumer.