Logged In: YES
user_id=139650
I've skimmed the document about FFTs. To summarize: on
page 21 they show timings for real FFTs for their best
algorithm (I think). For a 512 K FFT they spend 99'250 ms
on a 500 MHz G4.
In a recent timing on a 500 MHz G4, a tester reported
about 345 ms per Lucas-Lehmer iteration. An L-L iteration
consists of two FFTs (one forward and one inverse), a
convolution, a carry phase, normalization, denormalization,
and a few other small steps. In addition, I still don't know
whether their timings are for double or single precision. I
think this would not give us a big improvement, but it would
mean a lot of work re-designing the FFTs; anyway, I'll look
into it in depth. We can always find an interesting idea.
OTOH, AltiVec instructions are not usable for us because
they aren't double-precision capable, and we need that
precision.
Sincerely,
Guillermo Ballester Valor.
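For readers following along, the iteration described above is s -> s^2 - 2 (mod 2^p - 1). A minimal sketch in C using plain 64-bit arithmetic for small exponents (the real programs replace the squaring with the FFT convolution plus the carry and normalization steps listed above):

```c
#include <stdint.h>

/* Minimal sketch of the Lucas-Lehmer test for M_p = 2^p - 1, for odd
 * prime p: start with s = 4 and apply s -> s^2 - 2 (mod M_p) exactly
 * p - 2 times; M_p is prime iff the final s is 0. Plain 64-bit
 * arithmetic, so p must stay below 32 here; large-exponent programs
 * replace the squaring with an FFT-based convolution. */
static int lucas_lehmer(unsigned p)
{
    uint64_t m = (1ULL << p) - 1;      /* M_p */
    uint64_t s = 4;
    for (unsigned i = 0; i < p - 2; i++)
        s = (s * s + m - 2) % m;       /* one L-L iteration */
    return s == 0;
}
```

For example, lucas_lehmer(7) reports M_7 = 127 as prime, while lucas_lehmer(11) reports M_11 = 2047 = 23 * 89 as composite.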
Logged In: NO
Greetings, it's Paulie from mersenneforums.org and TeamPrimeRib (I posted this original topic).
If AltiVec isn't good enough because we need at least double precision, how about going to octuple precision, or would that be too slow?
http://developer.apple.com/hardware/ve/pdf/oct3a.pdf
Thanks so much Guillermo
Logged In: YES
user_id=278634
Some months ago I did some FPU throughput measurements on
various PowerPC processors.
One of these was an MPC7410, which has the same instruction
timings as the MPC7400 used in the oct3a.pdf paper.
Here is a comparison of the octuple precision emulation and
native double precision on a 500 MHz MPC7400/10:
           "octuple"     double
-------------------------------
add        9.4 MOcts     375 MFLOPS
sub        8.8 MOcts     375 MFLOPS
mul        4.5 MOcts     375 MFLOPS
mantissa   224 bits      53 bits
wordsize   ?             19/20 bits
For the scalar FPU, throughput is limited to 3/4 of an
instruction per cycle, because each time the FPU pipeline
is completely filled there is a single-cycle stall:
500 MHz * 3/4 => 375 MFLOPS.
The wordsize is the number of bits used per vector element,
chosen to make sure the accumulated roundoff error doesn't
exceed a certain limit.
For a rough estimate, let's assume the wordsize in the
octuple-precision version is ten times that of the
double-precision version. (IIRC it's actually less than
half the size of the mantissa.)
This results in a ten times smaller FFT size and, with
O(N*log(N)) scaling, about 1/33 of the instructions needed
by the double-precision version.
OTOH, these instructions are some 40-80 times slower, so
even with this very optimistic wordsize I estimate the
AltiVec version would be slower.
You mentioned factoring in your initial question.
Ernst Mayer (Mlucas) is working on a trial factorer
(currently limited to 65 bits; 96 bits planned). It makes
heavy use of 64x64=128-bit integer products, which aren't
available on 32-bit PowerPC and therefore have to be
emulated from 32x32=64-bit multiplies.
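The emulation builds the 128-bit product out of four 32x32 partial products (on 32-bit PowerPC each pair would come from mullw/mulhwu). A generic C sketch of the idea, not Mlucas's actual routine:

```c
#include <stdint.h>

/* 64x64 -> 128-bit unsigned multiply built from four 32x32 -> 64-bit
 * partial products, the way a 32-bit machine must do it. */
static void mul64x64_128(uint64_t a, uint64_t b,
                         uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t p00 = (uint64_t)a0 * b0;    /* low  x low  */
    uint64_t p01 = (uint64_t)a0 * b1;    /* low  x high */
    uint64_t p10 = (uint64_t)a1 * b0;    /* high x low  */
    uint64_t p11 = (uint64_t)a1 * b1;    /* high x high */

    /* Sum the two middle products into bits 32..95, keeping the carry. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```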
I replaced some of the core routines with asm to improve the
performance on my G3 and was looking into using AltiVec on
the G4.
Unfortunately, the AltiVec unit only supports 16x16=32-bit
integer multiplies, which means using four times as many
partial products on the AltiVec unit as on the scalar
integer unit.
Nevertheless, I still hope to see some performance increase,
especially on the first-generation G4 (7400/7410), since
there the scalar integer multiply instructions aren't pipelined.
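To make the four-fold blow-up concrete: with only 16x16=32-bit multiplies, one 32x32=64-bit product already takes four partial products, so a full 64x64=128-bit product takes sixteen rather than four. A scalar C sketch; real AltiVec code would spread the halfword products across vector lanes with vmuleuh/vmulouh.

```c
#include <stdint.h>

/* 32x32 -> 64-bit multiply built from 16x16 -> 32-bit products only,
 * as the AltiVec integer unit would have to do it. Four partial
 * products replace the single 32x32 multiply of the scalar unit. */
static uint64_t mul32x32_64_from16(uint32_t a, uint32_t b)
{
    uint16_t a0 = (uint16_t)a, a1 = (uint16_t)(a >> 16);
    uint16_t b0 = (uint16_t)b, b1 = (uint16_t)(b >> 16);

    uint64_t p00 = (uint32_t)a0 * b0;
    uint64_t p01 = (uint32_t)a0 * b1;
    uint64_t p10 = (uint32_t)a1 * b0;
    uint64_t p11 = (uint32_t)a1 * b1;

    /* The full product fits in 64 bits, so the shifted sum is exact. */
    return p00 + (p01 << 16) + (p10 << 16) + (p11 << 32);
}
```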
Logged In: YES
user_id=975397
AltiVec does not have 64-bit float support. I believe that
makes it impractical. The only way it might be useful would
be in 'big limb' style transforms such as Nussbaumer's.