|
From: David H. <dmh...@ma...> - 2007-04-17 18:19:51
|
On Apr 17, 2007, at 1:57 PM, Development list for FLINT wrote:

> I just had a look at your code and one big performance
> hit is the lack of static inline functions.

I have some "inline" functions which are not "static inline". What is the difference between "static inline" and just "inline"? I thought it was only a linkage issue. What does plain "inline" do?

Of the remaining functions, I think the only candidates for inlining, which are not already marked inline, are the ones starting with "coeff". Some of these look a bit long to be inlined (code bloat), but I agree I should try inlining some of the shorter ones.

But this doesn't answer the basic question about the slowness of the code on the G5. Perhaps I need to explain what's going on with the new code, so you can see why I am perplexed.

The ssfft code has basically three layers. The bottom layer is the functions starting with "basic". These are very low level coefficient operations on raw blocks of memory, like rotations, and bitshifts with carry handling. The middle layer is the functions starting with "coeff". These are allowed to do things like swap buffers; they make decisions about how to decompose large rotations into bitshifts and limbshifts etc. Finally the top layer consists of functions that call the coefficient operations in some appropriate order to carry out FFTs.

Now the bottom and middle layers have NOT changed between the trunk version and my new version. I am only fiddling with the top layer for this new code. (There is one minor change I want to make to some middle layer code at some point, but I haven't got to that yet.) In particular the old code has just as much inlining going on as the new code.

In fact I would argue the new code is *better* inlined, for the following reason. The old code used a table lookup to decide which of the 16 variants of the radix-4 transform to call on each block. So it was using function pointers all over the place. Surely function pointers are the arch-nemesis of inlining. The new code being profiled is basically all in one function: ssfft_fft_iterative(). It doesn't call any other FFT functions, it calls directly into the middle and bottom layers to do everything. There should be much less function call overhead than before.

david |
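For readers following the inlining question, here is a minimal sketch in C. It is not FLINT code: `limb_t`, `rotate_left`, `rotate_all_direct`, `rotate_all_indirect` and `limb_op` are hypothetical names chosen for illustration. Under C99 rules a plain `inline` definition does not by itself provide an external definition of the function, while GCC's older gnu89 mode treats plain `inline` differently. `static inline` gives the function internal linkage and behaves the same way in both modes, which is probably why it is the usual recommendation. The sketch also contrasts a direct call with a call through a function pointer, the pattern the message above describes for the old table-driven code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned long limb_t; /* hypothetical limb type, not FLINT's */

/* `static inline`: internal linkage, so the definition is private to this
   file and the compiler may inline every call to it. */
static inline limb_t rotate_left(limb_t x, unsigned bits)
{
    const unsigned width = (unsigned)(sizeof(limb_t) * CHAR_BIT);
    bits %= width;
    return bits ? (x << bits) | (x >> (width - bits)) : x;
}

/* Direct call: the callee is known at compile time, so the compiler is
   free to inline it into the loop. Returns the XOR of the results. */
limb_t rotate_all_direct(limb_t *data, size_t len, unsigned bits)
{
    limb_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        data[i] = rotate_left(data[i], bits);
        acc ^= data[i];
    }
    return acc;
}

typedef limb_t (*limb_op)(limb_t, unsigned);

/* Table-style dispatch: the callee is chosen at run time through a
   pointer, which usually prevents inlining unless the compiler can
   prove which function the pointer refers to. */
limb_t rotate_all_indirect(limb_t *data, size_t len, unsigned bits, limb_op op)
{
    limb_t acc = 0;
    for (size_t i = 0; i < len; i++) {
        data[i] = op(data[i], bits);
        acc ^= data[i];
    }
    return acc;
}

/* Self-check: both call styles must compute the same values. */
void check_dispatch_agrees(void)
{
    limb_t a[4] = {1, 2, 3, 4};
    limb_t b[4] = {1, 2, 3, 4};
    assert(rotate_all_direct(a, 4, 5) == rotate_all_indirect(b, 4, 5, rotate_left));
    for (size_t i = 0; i < 4; i++)
        assert(a[i] == b[i]);
}
```

Both loops compute the same thing. Whether the direct form is measurably faster depends on the compiler and the processor, so it is worth profiling rather than assuming.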
|
From: William H. <ha...@ya...> - 2007-04-17 18:04:50
|
Don't worry about it for now then. We eventually need to do something similar to what Victor does, with different versions for processors which are susceptible to certain kinds of problems. But for now, if it is better on most processors, put it into the trunk so we can start using it.

Bill.

--- David Harvey <dmh...@ma...> wrote:

> On Apr 17, 2007, at 1:47 PM, Development list for FLINT wrote:
>
> > Apparently the G5 also likes you to access data as
> > soon as possible before it is "used" (whatever that
> > ....
>
> Damn all that stuff you mention sounds really annoying. If this kind of
> thing is really the problem, I don't think I have the time now to be
> spending on reorganising loops for a specific processor, *especially*
> one that is probably going to disappear in less than a few years. These
> are mostly the kinds of things a compiler should be able to work out
> itself, if it knows the processor well enough. The code is running much
> better on every other chip.
>
> david
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Fastlibnt-devel mailing list
> Fas...@li...
> https://lists.sourceforge.net/lists/listinfo/fastlibnt-devel

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com |
|
From: David H. <dmh...@ma...> - 2007-04-17 18:01:47
|
On Apr 17, 2007, at 1:47 PM, Development list for FLINT wrote:

> Apparently the G5 also likes you to access data as
> soon as possible before it is "used" (whatever that
> ....

Damn all that stuff you mention sounds really annoying. If this kind of thing is really the problem, I don't think I have the time now to be spending on reorganising loops for a specific processor, *especially* one that is probably going to disappear in less than a few years. These are mostly the kinds of things a compiler should be able to work out itself, if it knows the processor well enough. The code is running much better on every other chip.

david |
|
From: William H. <ha...@ya...> - 2007-04-17 18:00:16
|
Yet another test. Please ignore.

--- Development list for FLINT <fas...@li...> wrote:

> hi I'm just testing I can send from my math.harvard
> account now....
>
> david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:57:08
|
I just had a look at your code and one big performance
hit is the lack of static inline functions.
Every function which does not need to be accessed
outside the module should be static. Probably the
compiler does this automatically where possible, but
in your case your test code probably wants to access
it, making static impossible. The difference is the
overhead in function calls. Functions which can be
made static have less overhead.
The way to get around this is to introduce wrapper
functions for your static functions. The wrapper
functions can be called by your test code and they in
turn will call your static functions. Obviously other
functions in the same module will not call the wrapper
functions, but call the functions directly, since they
are in the same file.
Everything that is short (say less than 10 lines)
should be static inlined. This is apparently
particularly important on the G5.
Bill.
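The wrapper arrangement described above can be sketched as follows. This is an illustration under assumptions, not FLINT code: `add_with_carry`, `add_with_carry_test_wrapper`, `add_limbs` and `check_wrapper` are hypothetical names. The short helper is `static inline`, an externally visible wrapper exists only so that test code in another file can reach it, and callers in the same file use the static function directly.

```c
#include <assert.h>
#include <stddef.h>

/* Short internal helper: static inline, so it is private to this file
   and a candidate for inlining at every call site. */
static inline unsigned long add_with_carry(unsigned long a, unsigned long b,
                                           unsigned long *carry_out)
{
    unsigned long sum = a + b;   /* unsigned arithmetic wraps on overflow */
    *carry_out = (sum < a);      /* 1 if the addition wrapped, else 0 */
    return sum;
}

/* External wrapper: has external linkage so test code in other files can
   call it. It simply forwards to the static helper. */
unsigned long add_with_carry_test_wrapper(unsigned long a, unsigned long b,
                                          unsigned long *carry_out)
{
    return add_with_carry(a, b, carry_out);
}

/* Same-file caller: uses the static helper directly, not the wrapper.
   Adds the n-limb number b into a and returns the final carry. */
unsigned long add_limbs(unsigned long *a, const unsigned long *b, size_t n)
{
    unsigned long carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long c1, c2;
        unsigned long t = add_with_carry(a[i], b[i], &c1);
        a[i] = add_with_carry(t, carry, &c2);
        carry = c1 | c2;         /* at most one of c1, c2 can be set */
    }
    return carry;
}

/* What test code would do: exercise the helper through the wrapper. */
void check_wrapper(void)
{
    unsigned long carry;
    assert(add_with_carry_test_wrapper(2, 3, &carry) == 5);
    assert(carry == 0);
}
```

The wrapper costs one extra call, but only the test code pays for it. Whether making the helper static actually reduces overhead in the hot path depends on the compiler, so this is a pattern to try and measure, not a guaranteed win.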
--- Development list for FLINT
<fas...@li...> wrote:
>
> --- Development list for FLINT
> <fas...@li...> wrote:
>
> >
> > On Apr 17, 2007, at 1:26 PM, Development list for
> > FLINT wrote:
> >
> > > Are you compiling with all the G5 compiler
> > options:
> > >
> > > -mcpu=970 -mtune=970 -mpowerpc64
> >
> > No. I've been using
> >
> > -m64 -funroll-loops -fexpensive-optimizations -O3
>
> You should still use -funroll-loops and -O3
>
> >
> > I will try your suggestion tonight at home.
> >
> > > Also, what version of gcc do you have on your
> G5?
> >
> > Can't recall; I'll find out tonight.
> >
> > > Apparently at the Apple developer site you can
> > > download a version specially tuned for the G5.
> > Dunno
> > > if it is better than just the latest gcc from
> the
> > web
> > > though. Never used a MAC.
> >
> > I believe I'm using the gcc that came installed,
> so
> > it should already
> > be the apple version.
>
> Apparently it is not. According to the apple
> developer
> website, the one that comes with it is not the
> specially tuned one.
>
> >
> > Keep in mind, I'm using the same compiler settings
> > for profiling the
> > old ssfft code too.
>
> Apparently the G5 also likes you to access data as
> soon as possible before it is "used" (whatever that
> means). It likes data to be accessed sequentially in
> order and data that can be accessed outside a loop
> instead of inside the loop will speed things up. I
> think what this means is to have a variable which
> you
> load with the data from memory, then do the loop
> acting on the data from the variable rather than
> loading it from memory every time the loop executes.
>
> You should not use type conversions unless you
> absolutely need to, nor global variables (though I
> am
> sure you don't have any of those). Also the G5 is
> particularly susceptible to slowdowns from branch
> mispredictions. It is much better to do:
>
> do B;
> if (cond)
> {
> undo B;
> do A;
> }
>
> than to do:
>
> if (cond)
> {
> do A;
> } else
> {
> do B;
> }
>
> if B should be done most of the time.
>
> Apart from these things, I can't see any sensible
> guidelines for developing on the G5.
>
> The Apple website says that Apple has written
> special
> code for doing FFT's on the G5 because many
> developers
> have handwritten code for the G5 and been sorely
> disappointed that it runs way slower on the G5. I
> think there are quite a lot of assembly
> optimizations
> for the G5 not employed by gcc according to the
> documentation.
>
> Bill.
|
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:47:53
|
--- Development list for FLINT
<fas...@li...> wrote:
>
> On Apr 17, 2007, at 1:26 PM, Development list for
> FLINT wrote:
>
> > Are you compiling with all the G5 compiler
> options:
> >
> > -mcpu=970 -mtune=970 -mpowerpc64
>
> No. I've been using
>
> -m64 -funroll-loops -fexpensive-optimizations -O3
You should still use -funroll-loops and -O3
>
> I will try your suggestion tonight at home.
>
> > Also, what version of gcc do you have on your G5?
>
> Can't recall; I'll find out tonight.
>
> > Apparently at the Apple developer site you can
> > download a version specially tuned for the G5.
> Dunno
> > if it is better than just the latest gcc from the
> web
> > though. Never used a MAC.
>
> I believe I'm using the gcc that came installed, so
> it should already
> be the apple version.
Apparently it is not. According to the apple developer
website, the one that comes with it is not the
specially tuned one.
>
> Keep in mind, I'm using the same compiler settings
> for profiling the
> old ssfft code too.
Apparently the G5 also likes you to access data as
soon as possible before it is "used" (whatever that
means). It likes data to be accessed sequentially in
order and data that can be accessed outside a loop
instead of inside the loop will speed things up. I
think what this means is to have a variable which you
load with the data from memory, then do the loop
acting on the data from the variable rather than
loading it from memory every time the loop executes.
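A minimal sketch of the load-hoisting idea in the paragraph above, with hypothetical function names (`scale_reload`, `scale_hoisted`) that do not come from FLINT. In the first version the compiler may have to reload `*factor` on every iteration, because it cannot always prove that writes through `out` leave `*factor` unchanged. The second version reads it once into a local variable before the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Reads *factor inside the loop. If out and factor might alias, the
   compiler may be forced to reload the value on each iteration. */
void scale_reload(unsigned long *out, const unsigned long *in, size_t n,
                  const unsigned long *factor)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *factor;
}

/* Loads *factor once into a local before the loop, then works from the
   local copy, as the message suggests. */
void scale_hoisted(unsigned long *out, const unsigned long *in, size_t n,
                   const unsigned long *factor)
{
    const unsigned long f = *factor;   /* single load, outside the loop */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * f;
}

/* Self-check: for non-overlapping buffers both versions agree. */
void check_scale_variants_agree(void)
{
    const unsigned long in[4] = {1, 2, 3, 4};
    const unsigned long factor = 10;
    unsigned long a[4], b[4];
    scale_reload(a, in, 4, &factor);
    scale_hoisted(b, in, 4, &factor);
    for (size_t i = 0; i < 4; i++)
        assert(a[i] == b[i] && a[i] == in[i] * 10);
}
```

The two versions are only equivalent when `out` does not overlap `factor`. An optimizing compiler will often do this hoisting itself when it can prove that, so the manual version mainly helps where it cannot.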
You should not use type conversions unless you
absolutely need to, nor global variables (though I am
sure you don't have any of those). Also the G5 is
particularly susceptible to slowdowns from branch
mispredictions. It is much better to do:
do B;
if (cond)
{
undo B;
do A;
}
than to do:
if (cond)
{
do A;
} else
{
do B;
}
if B should be done most of the time.
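The two shapes above can be written out as compilable C. This is only a sketch: the operations standing in for "A" (multiply by 3) and "B" (add 1) and the function names are made up for illustration. It relies on B being cheap to undo.

```c
#include <assert.h>

/* if/else form: one of two paths is taken, chosen by the condition. */
unsigned long step_if_else(unsigned long x, int rare)
{
    if (rare) {
        x *= 3;      /* do A */
    } else {
        x += 1;      /* do B */
    }
    return x;
}

/* do-then-undo form: B runs unconditionally; only the uncommon case
   enters the branch body, where B is undone and A is done instead. */
unsigned long step_do_then_undo(unsigned long x, int rare)
{
    x += 1;          /* do B */
    if (rare) {
        x -= 1;      /* undo B */
        x *= 3;      /* do A */
    }
    return x;
}

/* Self-check: both forms give the same result for either condition. */
void check_step_variants_agree(void)
{
    for (unsigned long x = 0; x < 100; x++) {
        assert(step_if_else(x, 0) == step_do_then_undo(x, 0));
        assert(step_if_else(x, 1) == step_do_then_undo(x, 1));
    }
}
```

Both functions return the same values. Whether the second form is actually faster on a given processor and compiler is something to profile; compilers may transform either form into the other.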
Apart from these things, I can't see any sensible
guidelines for developing on the G5.
The Apple website says that Apple has written special
code for doing FFT's on the G5 because many developers
have handwritten code for the G5 and been sorely
disappointed that it runs way slower on the G5. I
think there are quite a lot of assembly optimizations
for the G5 not employed by gcc according to the
documentation.
Bill.
|
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:35:16
|
On Apr 17, 2007, at 1:26 PM, Development list for FLINT wrote:

> Are you compiling with all the G5 compiler options:
>
> -mcpu=970 -mtune=970 -mpowerpc64

No. I've been using

-m64 -funroll-loops -fexpensive-optimizations -O3

I will try your suggestion tonight at home.

> Also, what version of gcc do you have on your G5?

Can't recall; I'll find out tonight.

> Apparently at the Apple developer site you can
> download a version specially tuned for the G5. Dunno
> if it is better than just the latest gcc from the web
> though. Never used a MAC.

I believe I'm using the gcc that came installed, so it should already be the apple version.

Keep in mind, I'm using the same compiler settings for profiling the old ssfft code too.

david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:33:11
|
Also try:

-falign-loops=16 -falign-functions=16 -falign-labels=16 -falign-jumps=16

which are apparently all useful on the G5.

Bill.

--- Development list for FLINT <fas...@li...> wrote:

> Are you compiling with all the G5 compiler options:
>
> -mcpu=970 -mtune=970 -mpowerpc64
>
> Also, what version of gcc do you have on your G5?
>
> Apparently at the Apple developer site you can
> download a version specially tuned for the G5. Dunno
> if it is better than just the latest gcc from the web
> though. Never used a MAC.
>
> Bill.
>
> --- Development list for FLINT <fas...@li...> wrote:
>
> > On Apr 17, 2007, at 7:52 AM, Development list for FLINT wrote:
> >
> > > Have you got the L1 cache size set correctly for the
> > > G5? Isn't it 32000.
> >
> > That's not the issue. The G5 speed problems are across the board,
> > including cases that easily fit into L1.
> >
> > I just ran a subset of test cases again on the G5 just as a sanity
> > check. Got similar results.
> >
> > Interestingly, on the G5, the case n = 3 is actually pretty good. On
> > the other platforms, n = 3 was generally quite poor.
> >
> > David |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:31:48
|
I've now also run the profiles on my old G3 laptop.

It really screams........ typically 30-40% faster than the old code; and there are a number of regions where it's consistently 60-100% faster (i.e. 1.6 - 2.0x faster). There don't appear to be any regions at all where it's consistently slower.

david

On Apr 16, 2007, at 11:41 PM, David Harvey wrote:

> I've been running some profiles of some of the new development ssfft
> code (in the ssfft3 branch) against the trunk version.
>
> It's generally looking pretty good. I've done profiles on sage (=
> sage.math), martinj (= jason martin's machine), bsd (= william stein's
> xeon), and my G5 (haven't got results for that yet). The target
> function is ssfft_fft_iterative(), which is designed to handle
> L1-sized transforms, using a plain iterative FFT. In particular it's
> intended for use with short coefficients where bitshift factoring is
> inappropriate, and small enough transforms that FFT factoring is not
> necessary. I've been running it for transform lengths M = 16, 32, 64,
> 128, 256, 512, 1024, with a range of truncation parameters, and
> coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.
>
> For lengths >= 256, it is unconditionally faster than the old code, on
> all platforms (modulo a few random data points). The speedups are
> typically:
>
> sage: 5-15%
> bsd: 15-25%
> martinj: 15-25%
>
> Some combinations get speedups of up to 40%.
>
> For lengths 64 and 128, there are a few problem areas, particularly
> for n = 3 on all platforms, although mostly it's still ahead of the
> old code.
>
> Length 32 and below is really a mixed bag. I find this surprising.
> This code should work particularly well on small problems.
>
> On all platforms, apart from a few outliers, the new code was never
> worse than 10% slower than the old code.
>
> I still have some investigation to do to figure out what's going on in
> the slower regions. But generally I'm pretty happy so far.
>
> david |
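The percentages in this message can be related to timings with a small sketch. Everything here is an assumption made for illustration: the message does not say how the profiles were collected, `transform_fn` is a hypothetical signature (not FLINT's API), and `noop_transform` is a placeholder for the routine under test. The only parts taken from the thread are the parameter grid (M and n values) and the reading of "60-100% faster" as "1.6 - 2.0x faster", which corresponds to `old_time / new_time - 1`. The truncation parameter is left out of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Parameter grid quoted in the thread. */
static const size_t transform_lengths[] = {16, 32, 64, 128, 256, 512, 1024};
static const size_t coeff_limbs[] = {1, 2, 3, 4, 6, 8};

enum {
    NUM_LENGTHS = sizeof transform_lengths / sizeof transform_lengths[0],
    NUM_LIMBS = sizeof coeff_limbs / sizeof coeff_limbs[0]
};

/* Hypothetical signature for a routine to be timed. */
typedef void (*transform_fn)(size_t M, size_t n);

/* Placeholder standing in for a real transform; does nothing. */
void noop_transform(size_t M, size_t n)
{
    (void)M;
    (void)n;
}

/* CPU seconds spent in `reps` calls of fn(M, n), measured with clock(). */
double time_transform(transform_fn fn, size_t M, size_t n, unsigned reps)
{
    clock_t start = clock();
    for (unsigned r = 0; r < reps; r++)
        fn(M, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* "P% faster" in the sense used above: 100 means 2.0x faster. */
double percent_faster(double old_seconds, double new_seconds)
{
    return (old_seconds / new_seconds - 1.0) * 100.0;
}

/* Self-check of the convention and of the grid size. */
void check_percent_faster(void)
{
    assert(percent_faster(2.0, 1.0) == 100.0);   /* 2.0x faster */
    assert(percent_faster(1.0, 1.0) == 0.0);     /* no change */
    assert(NUM_LENGTHS * NUM_LIMBS == 42);       /* 7 lengths x 6 sizes */
}
```

A full sweep would loop over `transform_lengths` and `coeff_limbs`, time the old and new routines at each grid point with enough repetitions to rise above the clock resolution, and report `percent_faster` for each pair.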
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:26:26
|
Are you compiling with all the G5 compiler options:

-mcpu=970 -mtune=970 -mpowerpc64

Also, what version of gcc do you have on your G5?

Apparently at the Apple developer site you can download a version specially tuned for the G5. Dunno if it is better than just the latest gcc from the web though. Never used a MAC.

Bill.

--- Development list for FLINT <fas...@li...> wrote:

> On Apr 17, 2007, at 7:52 AM, Development list for FLINT wrote:
>
> > Have you got the L1 cache size set correctly for the
> > G5? Isn't it 32000.
>
> That's not the issue. The G5 speed problems are across the board,
> including cases that easily fit into L1.
>
> I just ran a subset of test cases again on the G5 just as a sanity
> check. Got similar results.
>
> Interestingly, on the G5, the case n = 3 is actually pretty good. On
> the other platforms, n = 3 was generally quite poor.
>
> David |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 12:03:25
|
hi I'm just testing I can send from my math.harvard account now.... david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 12:00:44
|
On Apr 17, 2007, at 7:52 AM, Development list for FLINT wrote:

> Have you got the L1 cache size set correctly for the
> G5? Isn't it 32000.

That's not the issue. The G5 speed problems are across the board, including cases that easily fit into L1.

I just ran a subset of test cases again on the G5 just as a sanity check. Got similar results.

Interestingly, on the G5, the case n = 3 is actually pretty good. On the other platforms, n = 3 was generally quite poor.

David |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 11:52:47
|
Have you got the L1 cache size set correctly for the G5? Isn't it 32000.

Bill.

--- Development list for FLINT <fas...@li...> wrote:

> I've been running some profiles of some of the new development ssfft
> code (in the ssfft3 branch) against the trunk version.
>
> The results are a little weird. The target function is
> ssfft_fft_iterative(), which is designed to handle L1-sized
> transforms, using a plain iterative FFT. In particular it's intended
> for use with short coefficients where bitshift factoring is
> inappropriate, and small enough transforms that FFT factoring is not
> necessary. I've been running it for transform lengths M = 16, 32, 64,
> 128, 256, 512, 1024, with a range of truncation parameters, and
> coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.
>
> I've done profiles on sage (= sage.math), martinj (= jason martin's
> machine), bsd (= william stein's xeon), and my G5.
>
> ======== sage, martinj, bsd ========
>
> For lengths >= 256, it is unconditionally faster than the old code, on
> the above platforms (modulo a few random data points). The speedups
> are typically:
>
> sage: 5-15%
> bsd: 15-25%
> martinj: 15-25%
>
> Some combinations get speedups of up to 40%. So this is great.
>
> For lengths 64 and 128, there are a few problem areas, particularly
> for n = 3, although mostly it's still ahead of the old code.
>
> Length 32 and below is really a mixed bag. I find this surprising.
> This code should work particularly well on small problems.
>
> On the above platforms, apart from a few outliers, the new code was
> never worse than 10% slower than the old code.
>
> ======== G5 ========
>
> On my powerpc g5 machine, things looked BAD. The new code is
> typically 15-20% SLOWER than the old code. It's sometimes as much as
> 10% faster, but usually it's slower.
>
> I have absolutely no idea why this is happening.
>
> david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 11:31:04
|
I've been running some profiles of some of the new development ssfft code (in the ssfft3 branch) against the trunk version.

The results are a little weird. The target function is ssfft_fft_iterative(), which is designed to handle L1-sized transforms, using a plain iterative FFT. In particular it's intended for use with short coefficients where bitshift factoring is inappropriate, and small enough transforms that FFT factoring is not necessary. I've been running it for transform lengths M = 16, 32, 64, 128, 256, 512, 1024, with a range of truncation parameters, and coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.

I've done profiles on sage (= sage.math), martinj (= jason martin's machine), bsd (= william stein's xeon), and my G5.

======== sage, martinj, bsd ========

For lengths >= 256, it is unconditionally faster than the old code, on the above platforms (modulo a few random data points). The speedups are typically:

sage: 5-15%
bsd: 15-25%
martinj: 15-25%

Some combinations get speedups of up to 40%. So this is great.

For lengths 64 and 128, there are a few problem areas, particularly for n = 3, although mostly it's still ahead of the old code.

Length 32 and below is really a mixed bag. I find this surprising. This code should work particularly well on small problems.

On the above platforms, apart from a few outliers, the new code was never worse than 10% slower than the old code.

======== G5 ========

On my powerpc g5 machine, things looked BAD. The new code is typically 15-20% SLOWER than the old code. It's sometimes as much as 10% faster, but usually it's slower.

I have absolutely no idea why this is happening.

david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 00:35:22
|
can anyone hear me? david |
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 00:33:28
|
The list is working. All messages about FLINT development should now be sent to the list with appropriate subject lines describing the subject of the message for future reference.

Bill Hart. |