flint-devel Mailing List for FLINT: Fast Library for Number Theory (Page 29)

Status: Pre-Alpha

Brought to you by: dmharvey, wbhart

flint-devel — Development list for the FLINT project

Re: [Fastlibnt-devel] ssfft progress report

From: David H. <dmh...@ma...> - 2007-04-17 18:19:51

On Apr 17, 2007, at 1:57 PM, Development list for FLINT wrote:

> I just had a look at your code and one big performance
> hit is the lack of static inline functions.

I have some "inline" functions which are not "static inline". What is 
the difference between "static inline" and just "inline"? I thought it 
was only a linkage issue. What does plain "inline" do?

Of the remaining functions, I think the only candidates for inlining, 
which are not already marked inline, are the ones starting with 
"coeff". Some of these look a bit long to be inlined (code bloat), but 
I agree I should try inlining some of the shorter ones.

But this doesn't answer the basic question about the slowness of the 
code on the G5.

Perhaps I need to explain what's going on with the new code, so you can 
see why I am perplexed.

The ssfft code has basically three layers. The bottom layer is the 
functions starting with "basic". These are very low-level coefficient 
operations on raw blocks of memory, like rotations and bitshifts with 
carry handling. The middle layer consists of the functions starting 
with "coeff". These are allowed to do things like swap buffers, and 
they make decisions about how to decompose large rotations into 
bitshifts and limbshifts, etc. Finally, the top layer consists of 
functions that call the coefficient operations in some appropriate 
order to carry out FFTs.
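
As a rough illustration of the layering described above, here is a minimal C sketch. Every name in it (`basic_shift_left`, `basic_shift_limbs`, `coeff_mul_2exp`, `toy_twiddle_pass`), the limb type, and the arithmetic are hypothetical and chosen only for illustration; they are not the actual ssfft routines. In particular, bits shifted off the top are simply dropped here, whereas the real code would have to handle wraparound and carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;          /* hypothetical limb type */
#define LIMB_BITS 64

/* Bottom layer ("basic"): raw operations on a block of limbs.
   Shift x[0..n-1] left by 0 < bits < LIMB_BITS; return the bits shifted out. */
static inline limb_t basic_shift_left(limb_t *x, size_t n, unsigned bits)
{
    assert(bits > 0 && bits < LIMB_BITS);
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t next = x[i] >> (LIMB_BITS - bits);
        x[i] = (x[i] << bits) | carry;
        carry = next;
    }
    return carry;
}

/* Bottom layer: move every limb up by `limbs` positions, zero-filling below. */
static inline void basic_shift_limbs(limb_t *x, size_t n, size_t limbs)
{
    for (size_t i = n; i-- > 0; )
        x[i] = (i >= limbs) ? x[i - limbs] : 0;
}

/* Middle layer ("coeff"): decides how to decompose a shift by k bits
   into a limbshift plus a bitshift, then calls the bottom layer. */
static inline void coeff_mul_2exp(limb_t *x, size_t n, size_t k)
{
    size_t limbs = k / LIMB_BITS;
    unsigned bits = (unsigned)(k % LIMB_BITS);
    if (limbs > 0)
        basic_shift_limbs(x, n, limbs);
    if (bits > 0)
        (void)basic_shift_left(x, n, bits);   /* overflow dropped in this toy */
}

/* Top layer: applies coefficient operations in some order.
   Here coefficient j (n limbs each) is multiplied by 2^(j * step). */
void toy_twiddle_pass(limb_t *coeffs, size_t count, size_t n, size_t step)
{
    for (size_t j = 0; j < count; j++)
        coeff_mul_2exp(coeffs + j * n, n, j * step);
}
```

Because the two lower layers are `static inline` and live in the same file as the top-layer loop, the compiler is free to inline them into it.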

Now the bottom and middle layers have NOT changed between the trunk 
version and my new version. I am only fiddling with the top layer for 
this new code. (There is one minor change I want to make to some middle 
layer code at some point, but I haven't got to that yet.)

In particular the old code has just as much inlining going on as the 
new code. In fact I would argue the new code is *better* inlined, for 
the following reason. The old code used a table lookup to decide which 
of the 16 variants of the radix-4 transform to call on each block. So 
it was using function pointers all over the place. Surely function 
pointers are the arch-nemesis of inlining.
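
To make the inlining point concrete, here is a minimal sketch of the two dispatch styles. The names and the two toy "variants" are hypothetical and have nothing to do with the real 16 radix-4 variants. A call through a table of function pointers generally cannot be inlined unless the compiler can prove which entry is used, while direct calls to `static inline` functions leave that option open.

```c
#include <assert.h>
#include <stddef.h>

/* Two toy variants standing in for the transform variants. */
static inline long variant_add(long a, long b) { return a + b; }
static inline long variant_sub(long a, long b) { return a - b; }

typedef long (*variant_fn)(long, long);

/* Table-lookup style: the callee is only known at run time. */
static const variant_fn variant_table[2] = { variant_add, variant_sub };

long apply_via_table(size_t which, long a, long b)
{
    assert(which < 2);
    return variant_table[which](a, b);   /* indirect call */
}

/* Direct style: each call site names its callee, so it can be inlined. */
long apply_direct(size_t which, long a, long b)
{
    assert(which < 2);
    switch (which) {
    case 0:  return variant_add(a, b);
    default: return variant_sub(a, b);
    }
}
```

Both functions compute the same result; only the way the callee is reached differs.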

The new code being profiled is basically all in one function: 
ssfft_fft_iterative(). It doesn't call any other FFT functions, it 
calls directly into the middle and bottom layers to do everything. 
There should be much less function call overhead than before.

david

Re: [Fastlibnt-devel] ssfft progress report

From: William H. <ha...@ya...> - 2007-04-17 18:04:50

Don't worry about it for now then. We eventually need
to do something similar to what Victor does, with
different versions for processors which are susceptible
to certain kinds of problems. But for now, if it is
better on most processors, put it into the trunk so we
can start using it.

Bill.

--- David Harvey <dmh...@ma...> wrote:

> On Apr 17, 2007, at 1:47 PM, Development list for FLINT wrote:
>
> > Apparently the G5 also likes you to access data as
> > soon as possible before it is "used" (whatever that
>
> ....
>
> Damn all that stuff you mention sounds really annoying. If this kind of
> thing is really the problem, I don't think I have the time now to be
> spending on reorganising loops for a specific processor, *especially*
> one that is probably going to disappear in less than a few years. These
> are mostly the kinds of things a compiler should be able to work out
> itself, if it knows the processor well enough. The code is running much
> better on every other chip.
>
> david
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Fastlibnt-devel mailing list
> Fas...@li...
> https://lists.sourceforge.net/lists/listinfo/fastlibnt-devel


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: [Fastlibnt-devel] ssfft progress report

From: David H. <dmh...@ma...> - 2007-04-17 18:01:47

On Apr 17, 2007, at 1:47 PM, Development list for FLINT wrote:

> Apparently the G5 also likes you to access data as
> soon as possible before it is "used" (whatever that

....

Damn all that stuff you mention sounds really annoying. If this kind of 
thing is really the problem, I don't think I have the time now to be 
spending on reorganising loops for a specific processor, *especially* 
one that is probably going to disappear in less than a few years. These 
are mostly the kinds of things a compiler should be able to work out 
itself, if it knows the processor well enough. The code is running much 
better on every other chip.

david

Re: [Fastlibnt-devel] another test

From: William H. <ha...@ya...> - 2007-04-17 18:00:16

Yet another test. Please ignore.


Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:57:08

I just had a look at your code and one big performance
hit is the lack of static inline functions.

Every function which does not need to be accessed
outside the module should be static. Probably the
compiler does this automatically where possible, but
in your case your test code probably wants to access
it, making static impossible. The difference is the
overhead in function calls. Functions which can be
made static have less overhead.

The way to get around this is to introduce wrapper
functions for your static functions. The wrapper
functions can be called by your test code and they in
turn will call your static functions. Obviously other
functions in the same module will not call the wrapper
functions, but call the functions directly, since they
are in the same file.
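
A minimal sketch of the wrapper approach described above. The function names (`add_with_carry`, `add3`, `test_add_with_carry`) are hypothetical and not taken from the ssfft code. The helper is `static inline`, code in the same file calls it directly, and only the thin wrapper has external linkage for the test code to use.

```c
#include <assert.h>
#include <stddef.h>

/* Internal helper: static, so it is visible only in this file and the
   compiler is free to inline it into every caller below. */
static inline unsigned long add_with_carry(unsigned long a, unsigned long b,
                                           unsigned long *carry_out)
{
    unsigned long sum = a + b;      /* unsigned arithmetic wraps around */
    *carry_out = (sum < a);         /* 1 if the addition overflowed */
    return sum;
}

/* Code in the same file calls the static helper directly. */
unsigned long add3(unsigned long a, unsigned long b, unsigned long c,
                   unsigned long *carries_out)
{
    unsigned long c1, c2;
    unsigned long s = add_with_carry(a, b, &c1);
    s = add_with_carry(s, c, &c2);
    *carries_out = c1 + c2;
    return s;
}

/* Wrapper with external linkage, intended only for the test code. */
unsigned long test_add_with_carry(unsigned long a, unsigned long b,
                                  unsigned long *carry_out)
{
    assert(carry_out != NULL);
    return add_with_carry(a, b, carry_out);
}
```

The wrapper costs one extra call in the tests only; the library's own call sites never go through it.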

Everything that is short (say less than 10 lines)
should be static inlined. This is apparently
particularly important on the G5.

Bill.


Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:47:53

--- Development list for FLINT
<fas...@li...> wrote:

> 
> On Apr 17, 2007, at 1:26 PM, Development list for FLINT wrote:
>
> > Are you compiling with all the G5 compiler options:
> >
> > -mcpu=970 -mtune=970 -mpowerpc64
> 
> No. I've been using
> 
> -m64 -funroll-loops -fexpensive-optimizations -O3

You should still use -funroll-loops and -O3

> 
> I will try your suggestion tonight at home.
> 
> > Also, what version of gcc do you have on your G5?
> 
> Can't recall; I'll find out tonight.
> 
> > Apparently at the Apple developer site you can
> > download a version specially tuned for the G5. Dunno
> > if it is better than just the latest gcc from the web
> > though. Never used a MAC.
>
> I believe I'm using the gcc that came installed, so it should already
> be the apple version.

Apparently it is not. According to the Apple developer
website, the one that comes with it is not the
specially tuned one.

> 
> Keep in mind, I'm using the same compiler settings for profiling the
> old ssfft code too.

Apparently the G5 also likes you to access data as
soon as possible before it is "used" (whatever that
means). It likes data to be accessed sequentially in
order and data that can be accessed outside a loop
instead of inside the loop will speed things up. I
think what this means is to have a variable which you
load with the data from memory, then do the loop
acting on the data from the variable rather than
loading it from memory every time the loop executes.
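
One way to read the suggestion above, as a minimal sketch with hypothetical function names: load the value into a local variable once, then have the loop work from the local. In the first version the compiler may have to reload `*scale` on every iteration, because it cannot rule out that `dst` overlaps `scale`. Whether the hoisted form is actually faster on a given compiler and chip is something to measure, not assume.

```c
#include <assert.h>
#include <stddef.h>

/* Reads *scale inside the loop. Since dst might alias scale, the
   compiler may not be able to keep the value in a register. */
void scale_reload(unsigned long *dst, size_t n, const unsigned long *scale)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= *scale;
}

/* Loads the value once, before the loop, and works from the local. */
void scale_hoisted(unsigned long *dst, size_t n, const unsigned long *scale)
{
    assert(dst != NULL && scale != NULL);
    const unsigned long s = *scale;   /* single load, outside the loop */
    for (size_t i = 0; i < n; i++)
        dst[i] *= s;
}
```

The two versions agree whenever `scale` does not point into `dst`; if it does, they can differ, which is exactly why the compiler cannot always do this transformation itself.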

You should not use type conversions unless you
absolutely need to, nor global variables (though I am
sure you don't have any of those). Also the G5 is
particularly susceptible to slowdowns from branch
mispredictions. It is much better to do:

do B;
if (cond) 
{
  undo B;
  do A;
}

than to do:

if (cond)
{
   do A;
} else 
{
   do B;
}

if B should be done most of the time.
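
A minimal C rendering of the two shapes above, where A is a toy "multiply by 3" and B is a toy "add 1" (both hypothetical stand-ins). This is only a sketch of the pattern: whether the second form is really faster depends on the compiler and processor, and a compiler may turn either form into branch-free code on its own.

```c
#include <assert.h>

/* Plain if/else: one of two paths, chosen by a branch. */
long step_if_else(long x, int rare)
{
    if (rare)
        x = x * 3;      /* do A */
    else
        x = x + 1;      /* do B */
    return x;
}

/* Do B unconditionally, then fix up in the rare case. */
long step_do_then_fix(long x, int rare)
{
    x = x + 1;          /* do B */
    if (rare) {
        x = x - 1;      /* undo B */
        x = x * 3;      /* do A */
    }
    return x;
}

/* Sanity check that both shapes compute the same thing for a given x. */
void check_equivalent(long x)
{
    assert(step_if_else(x, 0) == step_do_then_fix(x, 0));
    assert(step_if_else(x, 1) == step_do_then_fix(x, 1));
}
```

The pattern only applies when B is cheap to undo and has no side effects that other code can observe in between.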

Apart from these things, I can't see any sensible
guidelines for developing on the G5. 

The Apple website says that Apple has written special
code for doing FFTs on the G5, because many developers
have handwritten code for the G5 and been sorely
disappointed that it runs way slower there. I think
there are quite a lot of assembly optimizations for
the G5 not employed by gcc, according to the
documentation.

Bill.


Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:35:16

On Apr 17, 2007, at 1:26 PM, Development list for FLINT wrote:

> Are you compiling with all the G5 compiler options:
>
> -mcpu=970 -mtune=970 -mpowerpc64

No. I've been using

-m64 -funroll-loops -fexpensive-optimizations -O3

I will try your suggestion tonight at home.

> Also, what version of gcc do you have on your G5?

Can't recall; I'll find out tonight.

> Apparently at the Apple developer site you can
> download a version specially tuned for the G5. Dunno
> if it is better than just the latest gcc from the web
> though. Never used a MAC.

I believe I'm using the gcc that came installed, so it should already 
be the apple version.

Keep in mind, I'm using the same compiler settings for profiling the 
old ssfft code too.

david

Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:33:11

Also try:

-falign-loops=16 
-falign-functions=16 
-falign-labels=16 
-falign-jumps=16 

which are apparently all useful on the G5.

Bill.


Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:31:48

I've now also run the profiles on my old G3 laptop. It really 
screams........ typically 30-40% faster than the old code; and there 
are a number of regions where it's consistently 60-100% faster (i.e. 
1.6 - 2.0x faster). There don't appear to be any regions at all where 
it's consistently slower.
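
For reference, the conversion between "percent faster" and a speed ratio used above can be sketched as follows. This assumes the percentages are computed from the ratio of old running time to new running time, which the message does not state explicitly; the function names are hypothetical.

```c
#include <assert.h>

/* Ratio of old running time to new running time (2.0 means "2.0x faster"). */
double speedup_ratio(double old_time, double new_time)
{
    assert(old_time > 0.0 && new_time > 0.0);
    return old_time / new_time;
}

/* The same quantity expressed as "percent faster": a ratio of 1.6
   corresponds to 60%, a ratio of 2.0 to 100%. */
double percent_faster(double old_time, double new_time)
{
    return (speedup_ratio(old_time, new_time) - 1.0) * 100.0;
}
```

Under this convention, "10% slower" corresponds to a ratio of 0.9, that is, a negative value from `percent_faster`.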

david

On Apr 16, 2007, at 11:41 PM, David Harvey wrote:

> I've been running some profiles of some of the new development ssfft 
> code (in the ssfft3 branch) against the trunk version.
>
> It's generally looking pretty good. I've done profiles on sage (= 
> sage.math), martinj (= jason martin's machine), bsd (= william stein's 
> xeon), and my G5 (haven't got results for that yet). The target 
> function is ssfft_fft_iterative(), which is designed to handle 
> L1-sized transforms, using a plain iterative FFT. In particular it's 
> intended for use with short coefficients where bitshift factoring is 
> inappropriate, and small enough transforms that FFT factoring is not 
> necessary. I've been running it for transform lengths M = 16, 32, 64, 
> 128, 256, 512, 1024, with a range of truncation parameters, and 
> coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.
>
> For lengths >= 256, it is unconditionally faster than the old code, on 
> all platforms (modulo a few random data points). The speedups are 
> typically:
>
> sage: 5-15%
> bsd: 15-25%
> martinj: 15-25%
>
> Some combinations get speedups of up to 40%.
>
> For lengths 64 and 128, there are a few problem areas, particularly 
> for n = 3 on all platforms, although mostly it's still ahead of the 
> old code.
>
> Length 32 and below is really a mixed bag. I find this surprising. 
> This code should work particularly well on small problems.
>
> On all platforms, apart from a few outliers, the new code was never 
> worse than 10% slower than the old code.
>
> I still have some investigation to do to figure out what's going on in 
> the slower regions. But generally I'm pretty happy so far.
>
> david
>

Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 17:26:26

Are you compiling with all the G5 compiler options:

-mcpu=970 -mtune=970 -mpowerpc64 

Also, what version of gcc do you have on your G5? 

Apparently at the Apple developer site you can
download a version specially tuned for the G5. Dunno
if it is better than just the latest gcc from the web
though. Never used a MAC.

Bill.


[Fastlibnt-devel] another test

From: Development l. f. F. <fas...@li...> - 2007-04-17 12:03:25

hi I'm just testing I can send from my math.harvard account now....

david

Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 12:00:44

On Apr 17, 2007, at 7:52 AM, Development list for FLINT wrote:

> Have you got the L1 cache size set correctly for the
> G5? Isn't it 32000.

That's not the issue. The G5 speed problems are across the board,  
including cases that easily fit into L1.

I just ran a subset of test cases again on the G5 just as a sanity  
check. Got similar results.

Interestingly, on the G5, the case n = 3 is actually pretty good. On  
the other platforms, n = 3 was generally quite poor.

David

Re: [Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 11:52:47

Have you got the L1 cache size set correctly for the
G5? Isn't it 32000?

Bill.


[Fastlibnt-devel] ssfft progress report

From: Development l. f. F. <fas...@li...> - 2007-04-17 11:31:04

I've been running some profiles of some of the new development ssfft  
code (in the ssfft3 branch) against the trunk version.

The results are a little weird. The target function is  
ssfft_fft_iterative(), which is designed to handle L1-sized  
transforms, using a plain iterative FFT. In particular it's intended  
for use with short coefficients where bitshift factoring is  
inappropriate, and small enough transforms that FFT factoring is not  
necessary. I've been running it for transform lengths M = 16, 32, 64,  
128, 256, 512, 1024, with a range of truncation parameters, and  
coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.

I've done profiles on sage (= sage.math), martinj (= jason martin's  
machine), bsd (= william stein's xeon), and my G5.

======== sage, martinj, bsd ========

For lengths >= 256, it is unconditionally faster than the old code, on  
the above platforms (modulo a few random data points). The speedups  
are typically:

sage: 5-15%
bsd: 15-25%
martinj: 15-25%

Some combinations get speedups of up to 40%. So this is great.

For lengths 64 and 128, there are a few problem areas, particularly  
for n = 3, although mostly it's still ahead of the old code.

Length 32 and below is really a mixed bag. I find this surprising.  
This code should work particularly well on small problems.

On the above platforms, apart from a few outliers, the new code was  
never worse than 10% slower than the old code.

======== G5 ========

On my powerpc g5 machine, things looked BAD. The new code is  
typically 15-20% SLOWER than the old code. It's sometimes as much as  
10% faster, but usually it's slower.

I have absolutely no idea why this is happening.

david

[Fastlibnt-devel] testing email

From: Development l. f. F. <fas...@li...> - 2007-04-17 00:35:22

can anyone hear me?

david

[Fastlibnt-devel] Test list message

From: Development l. f. F. <fas...@li...> - 2007-04-17 00:33:28

The list is working. All messages about FLINT
development should now be sent to the list with
appropriate subject lines describing the subject of
the message for future reference.

Bill Hart.


3 messages have been excluded from this view by a project administrator.
