|
From: Development l. f. F. <fas...@li...> - 2007-04-17 11:31:04
|
I've been running some profiles of some of the new development ssfft code (in the ssfft3 branch) against the trunk version. The results are a little weird.

The target function is ssfft_fft_iterative(), which is designed to handle L1-sized transforms using a plain iterative FFT. In particular, it's intended for use with short coefficients, where bitshift factoring is inappropriate, and with transforms small enough that FFT factoring is not necessary. I've been running it for transform lengths M = 16, 32, 64, 128, 256, 512, 1024, with a range of truncation parameters, and coefficient lengths n = 1, 2, 3, 4, 6, 8 limbs.

I've done profiles on sage (= sage.math), martinj (= jason martin's machine), bsd (= william stein's xeon), and my G5.

======== sage, martinj, bsd ========

For lengths >= 256, it is unconditionally faster than the old code on the above platforms (modulo a few random data points). The speedups are typically:

sage: 5-15%
bsd: 15-25%
martinj: 15-25%

Some combinations get speedups of up to 40%. So this is great.

For lengths 64 and 128, there are a few problem areas, particularly for n = 3, although mostly it's still ahead of the old code.

Length 32 and below is really a mixed bag. I find this surprising; this code should work particularly well on small problems. On the above platforms, apart from a few outliers, the new code was never worse than 10% slower than the old code.

======== G5 ========

On my powerpc G5 machine, things looked BAD. The new code is typically 15-20% SLOWER than the old code. It's sometimes as much as 10% faster, but usually it's slower.

I have absolutely no idea why this is happening.

david
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:31:48
|
I've now also run the profiles on my old G3 laptop. It really screams: typically 30-40% faster than the old code, and there are a number of regions where it's consistently 60-100% faster (i.e. 1.6-2.0x faster). There don't appear to be any regions at all where it's consistently slower.

david

On Apr 16, 2007, at 11:41 PM, David Harvey wrote:

> I've been running some profiles of some of the new development ssfft
> code (in the ssfft3 branch) against the trunk version.
>
> It's generally looking pretty good.
> [...]
> I still have some investigation to do to figure out what's going on in
> the slower regions. But generally I'm pretty happy so far.
>
> david
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 11:52:47
|
Have you got the L1 cache size set correctly for the G5? Isn't it 32000?

Bill.

--- Development list for FLINT <fas...@li...> wrote:

> I've been running some profiles of some of the new development ssfft
> code (in the ssfft3 branch) against the trunk version.
> [...]
> On my powerpc G5 machine, things looked BAD. The new code is
> typically 15-20% SLOWER than the old code.
> [...]
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 12:00:44
|
On Apr 17, 2007, at 7:52 AM, Development list for FLINT wrote:

> Have you got the L1 cache size set correctly for the
> G5? Isn't it 32000?

That's not the issue. The G5 speed problems are across the board, including cases that easily fit into L1.

I just ran a subset of test cases again on the G5 just as a sanity check. Got similar results.

Interestingly, on the G5, the case n = 3 is actually pretty good. On the other platforms, n = 3 was generally quite poor.

David
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:26:26
|
Are you compiling with all the G5 compiler options:

-mcpu=970 -mtune=970 -mpowerpc64

Also, what version of gcc do you have on your G5?

Apparently at the Apple developer site you can download a version specially tuned for the G5. Dunno if it is better than just the latest gcc from the web though. Never used a Mac.

Bill.

--- Development list for FLINT <fas...@li...> wrote:
> [...]
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:33:11
|
Also try:

-falign-loops=16 -falign-functions=16 -falign-labels=16 -falign-jumps=16

which are apparently all useful on the G5.

Bill.

--- Development list for FLINT <fas...@li...> wrote:
> [...]
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:35:16
|
On Apr 17, 2007, at 1:26 PM, Development list for FLINT wrote:

> Are you compiling with all the G5 compiler options:
>
> -mcpu=970 -mtune=970 -mpowerpc64

No. I've been using

-m64 -funroll-loops -fexpensive-optimizations -O3

I will try your suggestion tonight at home.

> Also, what version of gcc do you have on your G5?

Can't recall; I'll find out tonight.

> Apparently at the Apple developer site you can download a version
> specially tuned for the G5. Dunno if it is better than just the latest
> gcc from the web though. Never used a Mac.

I believe I'm using the gcc that came installed, so it should already be the Apple version.

Keep in mind, I'm using the same compiler settings for profiling the old ssfft code too.

david
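(For concreteness, the combined invocation under discussion would look something like the following; -c ssfft.c is just an illustrative compilation target, not the actual profiling build command.)

gcc -O3 -funroll-loops -fexpensive-optimizations -m64 \
    -mcpu=970 -mtune=970 -mpowerpc64 \
    -c ssfft.c -o ssfft.o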
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:47:53
|
--- Development list for FLINT <fas...@li...> wrote:
>
> On Apr 17, 2007, at 1:26 PM, Development list for FLINT wrote:
>
> > Are you compiling with all the G5 compiler options:
> >
> > -mcpu=970 -mtune=970 -mpowerpc64
>
> No. I've been using
>
> -m64 -funroll-loops -fexpensive-optimizations -O3

You should still use -funroll-loops and -O3.

> I will try your suggestion tonight at home.
>
> > Also, what version of gcc do you have on your G5?
>
> Can't recall; I'll find out tonight.
>
> > Apparently at the Apple developer site you can download a version
> > specially tuned for the G5. Dunno if it is better than just the latest
> > gcc from the web though. Never used a Mac.
>
> I believe I'm using the gcc that came installed, so it should already
> be the Apple version.

Apparently it is not. According to the Apple developer website, the one that comes with it is not the specially tuned one.

> Keep in mind, I'm using the same compiler settings
> for profiling the old ssfft code too.

Apparently the G5 also likes you to access data as soon as possible before it is "used" (whatever that means). It likes data to be accessed sequentially, in order, and hoisting accesses out of a loop rather than repeating them inside it will speed things up. I think what this means is: load the data from memory into a variable once, then have the loop act on that variable, rather than reloading it from memory on every iteration.

You should not use type conversions unless you absolutely need to, nor global variables (though I am sure you don't have any of those). Also, the G5 is particularly susceptible to slowdowns from branch mispredictions. It is much better to do:
   do B;
   if (cond)
   {
      undo B;
      do A;
   }

than to do:

   if (cond)
   {
      do A;
   }
   else
   {
      do B;
   }

if B should be done most of the time.
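For concreteness, a compilable sketch of that pattern (the operation and names here are invented, not taken from ssfft):

/* Hypothetical example only: "B" = add the increment (the common case),
   "A" = reset to zero (the rare case).  Doing B unconditionally and
   undoing it inside the rarely taken branch keeps a hard-to-predict
   branch off the common path. */
static inline long accumulate(long acc, long inc, int reset)
{
   acc += inc;        /* do B unconditionally */
   if (reset)         /* rarely true */
   {
      acc -= inc;     /* undo B */
      acc = 0;        /* do A */
   }
   return acc;
}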
Apart from these things, I can't see any sensible guidelines for developing on the G5.

The Apple website says that Apple has written special code for doing FFTs on the G5, because many developers have hand-written code for it and been sorely disappointed to find it running way slower than expected. According to the documentation, I think there are quite a lot of assembly-level optimizations for the G5 that gcc doesn't employ.
Bill.
|
|
From: Development l. f. F. <fas...@li...> - 2007-04-17 17:57:08
|
I just had a look at your code and one big performance
hit is the lack of static inline functions.
Every function which does not need to be accessed
outside the module should be static. Probably the
compiler does this automatically where possible, but
in your case your test code probably wants to access
it, making static impossible. The difference is the
overhead in function calls. Functions which can be
made static have less overhead.
The way to get around this is to introduce wrapper functions for your static functions. The wrapper functions can be called by your test code, and they in turn call your static functions. Obviously, other functions in the same module would not call the wrapper functions, but would call the static functions directly, since they are in the same file.

Everything that is short (say, less than 10 lines) should be made static inline. This is apparently particularly important on the G5.
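For example, a minimal sketch of the wrapper idea (the function names are made up for illustration, not actual ssfft functions):

#include <gmp.h>

/* Inside the module (hypothetical names): a short operation that gets
   inlined at every internal call site. */
static inline void coeff_op_core(mp_limb_t * x, unsigned long n)
{
   x[0] += n;   /* stand-in for a short coefficient operation */
}

/* Wrapper with external linkage, so the test code can still call it;
   other functions in this file call coeff_op_core() directly. */
void coeff_op(mp_limb_t * x, unsigned long n)
{
   coeff_op_core(x, n);
}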
Bill.
--- Development list for FLINT <fas...@li...> wrote:
> [...]
|
|
From: David H. <dmh...@ma...> - 2007-04-17 18:19:51
|
On Apr 17, 2007, at 1:57 PM, Development list for FLINT wrote:

> I just had a look at your code and one big performance
> hit is the lack of static inline functions.

I have some "inline" functions which are not "static inline". What is the difference between "static inline" and just "inline"? I thought it was only a linkage issue. What does plain "inline" do?

Of the remaining functions, I think the only candidates for inlining which are not already marked inline are the ones starting with "coeff". Some of these look a bit long to be inlined (code bloat), but I agree I should try inlining some of the shorter ones.

But this doesn't answer the basic question about the slowness of the code on the G5. Perhaps I need to explain what's going on with the new code, so you can see why I am perplexed.

The ssfft code has basically three layers. The bottom layer is the functions starting with "basic". These are very low-level coefficient operations on raw blocks of memory, like rotations, and bitshifts with carry handling. The middle layer is the functions starting with "coeff". These are allowed to do things like swap buffers; they make decisions about how to decompose large rotations into bitshifts and limbshifts, etc. Finally, the top layer consists of functions that call the coefficient operations in some appropriate order to carry out FFTs.

Now the bottom and middle layers have NOT changed between the trunk version and my new version. I am only fiddling with the top layer for this new code. (There is one minor change I want to make to some middle-layer code at some point, but I haven't got to that yet.) In particular, the old code has just as much inlining going on as the new code.

In fact I would argue the new code is *better* inlined, for the following reason. The old code used a table lookup to decide which of the 16 variants of the radix-4 transform to call on each block, so it was using function pointers all over the place. Surely function pointers are the arch-nemesis of inlining. The new code being profiled is basically all in one function: ssfft_fft_iterative(). It doesn't call any other FFT functions; it calls directly into the middle and bottom layers to do everything. There should be much less function call overhead than before.

david
|
From: William H. <ha...@ya...> - 2007-04-18 02:10:13
|
I just had a look at ssfft_fft_iterative() and I am wondering if you know which parts of it are taking longer than the old version. Have you profiled the individual parts to see which is taking all the time?

The innermost double for loops worry me. The compiler cannot do much about unrolling these loops, since it has no idea how long any of them are going to be. If some of the loops are definitely multiples of 4 or something, it would be best to unroll them by hand by a factor of 4. Also, I am wondering if the double for loops might work better made into a single for loop, or even better, a do..while loop if it is known that at least one iteration must execute.

The 2*half expressions that appear throughout also worry me. It would be better to make half twice as big and do a comparison (2*z <= half) so that the multiplication occurs outside the loops.

Probably there are almost no cycles spent in the outer layers anyway, and all the cycles are taken inside those functions called inside the for loops. But they *have* to be made inline. An inlined function incurs no function call overhead. I definitely wouldn't worry about code bloat here. You need the speed, not a smaller program. You are going to be going nowhere near the code cache size, and slightly larger functions aren't going to take up all the memory on the machine.

Definitely you should replace things like:

   &x[start+half+i], &x[start+i]

with statements:

   mp_limb_t ** x_start = x + start;
   mp_limb_t ** x_start_half = x + start + half;

outside the innermost for loop, and use:

   x_start_half + i, x_start + i

inside the for loop. This could be a holdup for a G5 machine. Probably the compiler sees through your &x[i] notation, which should just be x+i, but loading start every iteration probably incurs some time.

The only other thing I can suggest, if the new version is still way slower than the old one on the G5, is to count the number of times each of the other functions is called from ssfft_fft_iterative() and make sure the proportion hasn't changed in a subtle way.

Hopefully I didn't just analyse an out-of-date version of the code. I can't recall when I last updated the file.

Bill.
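(A schematic rendering of the hoisting suggested above; loop_sketch and butterfly_op are made-up names, not ssfft functions.)

#include <gmp.h>

/* Schematic only: butterfly_op stands in for whatever coefficient
   operation the real inner loop calls.  The address arithmetic
   x + start and x + start + half is done once, outside the loop,
   instead of recomputing &x[start+i] and &x[start+half+i] from
   start on every iteration. */
void loop_sketch(mp_limb_t ** x, unsigned long start, unsigned long half,
                 void (*butterfly_op)(mp_limb_t **, mp_limb_t **))
{
   mp_limb_t ** x_start      = x + start;
   mp_limb_t ** x_start_half = x + start + half;
   unsigned long i;

   for (i = 0; i < half; i++)
      butterfly_op(x_start_half + i, x_start + i);
}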
|
From: David H. <dmh...@ma...> - 2007-04-18 02:38:13
|
On Apr 17, 2007, at 10:10 PM, William Hart wrote:

> I just had a look at ssfft_fft_iterative() and I am
> wondering if you know which parts of it are taking
> longer than the old version. Have you profiled the
> individual parts to see which is taking all the time?

No, unfortunately the question doesn't really make sense, since the strategy is completely different. I can compare the whole thing, but there aren't any parts to match up and compare separately.

I'm halfway through another profile with all the compiler switches you suggested. Initial results look quite promising. We'll know more tomorrow.

> The innermost double for loops worry me. The compiler
> cannot do much about unrolling these loops, since it
[...]

These all sound like good ideas. I will add a note to myself in ssfft.c to come back to this email and think about them later on. For now I want to concentrate on getting some of the other functions written. The most interesting new idea (the bitshift factoring trick) will hopefully be deployed soon, and I'm feeling lucky :-)

Only one comment I have for the moment:

> Definitely you should replace things like:
>
>    &x[start+half+i], &x[start+i]
>
> with statements:
>
>    mp_limb_t ** x_start = x + start;
>    mp_limb_t ** x_start_half = x + start + half;
>
> outside the innermost for loop, and use:
>
>    x_start_half + i, x_start + i
>
> inside the for loop. This could be a holdup for a G5
> machine.

In my heart of hearts I agree with this advice. It makes 100% perfect sense. HOWEVER... I tried this kind of thing many times on sage.math, with various kinds of code (small prime FFT, matrix transpose, etc.), and every single time I tried it, the compiler laughed in my face and the code got slow. I never understood why. I still don't. So I have to admit I'm wary.

david
|
From: David H. <dmh...@ma...> - 2007-04-18 02:58:19
|
OK, well the G5 profile just finished and I'm still awake, and I'll simply say that the compiler flags well and truly made the problem go away.

The flags slow down the old code a bit (hmm, a little strange), and speed up the new code a lot. The new code with the flags is significantly faster than the old code with or without the flags, across all regimes being profiled. So in other words, it looks like the new code is highly susceptible to these kinds of optimisations, and the old code doesn't seem to be; actually it seems to suffer a bit.

I'm going to make a note in todo.txt to remind us to put these flags into the makefile for the G5 architecture, whenever we get around to writing a proper build script.

Bill, you are a star. Thanks very much.

david
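(Roughly what that makefile note has in mind; the FLINT_ARCH variable and the switch mechanism are hypothetical, since the build script doesn't exist yet.)

# Hypothetical sketch only: there is no real build script yet, and
# FLINT_ARCH is an invented variable name.
ifeq ($(FLINT_ARCH),G5)
CFLAGS += -mcpu=970 -mtune=970 -mpowerpc64 \
          -falign-loops=16 -falign-functions=16 \
          -falign-labels=16 -falign-jumps=16
endif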
|
From: William H. <ha...@ya...> - 2007-04-18 03:15:43
|
Great news then. Now I'm wondering where else compiler flags might help things.

As for static inline: presumably the static is redundant if the compiler is indeed able to inline it. But if not...

Actually, my comment about these things was probably more about making functions static if they are *not* inlined but are not accessed from outside the module. If the only place they are accessed from outside the module is the test code, then you should still make them static and write wrapper functions as mentioned. The compiler will simply ignore static if a function is accessed from outside the module. Actually, compilers these days probably universally ignore static. So it is the principle more than the actual word static that is important, i.e. make sure the test code is not the only thing outside the module which is accessing the function.

Oh wait, I'm talking rubbish again. You don't need wrapper functions. If you aren't linking against the test code, there won't be anything outside the module accessing the function. Therefore probably what I said about static is irrelevant.

But definitely some of those functions need to be made inline. If they were library functions, you might only inline the ones that are just a few lines long. But for internal functions like this, sometimes half-page functions should be inlined.

I just had a sense of déjà vu. I think we discussed this in detail before and came to the same conclusions.

Bill.

--- David Harvey <dmh...@ma...> wrote:
> [...]
|
From: David H. <dmh...@ma...> - 2007-04-17 18:01:47
|
On Apr 17, 2007, at 1:47 PM, Development list for FLINT wrote:

> Apparently the G5 also likes you to access data as
> soon as possible before it is "used" [...]

Damn, all that stuff you mention sounds really annoying. If this kind of thing is really the problem, I don't think I have the time now to spend reorganising loops for a specific processor, *especially* one that is probably going to disappear in less than a few years. These are mostly the kinds of things a compiler should be able to work out itself, if it knows the processor well enough. The code is running much better on every other chip.

david
|
From: William H. <ha...@ya...> - 2007-04-17 18:04:50
|
Don't worry about it for now then. We eventually need to do something similar to what Victor does, with different versions for processors which are susceptible to certain kinds of problems. But for now, if it is better on most processors, put it into the trunk so we can start using it.

Bill.

--- David Harvey <dmh...@ma...> wrote:
> [...]
|
From: David H. <dmh...@ma...> - 2007-04-17 18:23:22
|
On Apr 17, 2007, at 2:04 PM, William Hart wrote:

> Don't worry about it for now then. We eventually need
> to do something similar to what Victor does, with
> different versions for processors which are susceptible
> to certain kinds of problems. But for now, if it is
> better on most processors, put it into the trunk so we
> can start using it.

I'm not pushing it into the trunk for a while yet, for a couple of reasons:

(1) the calling interface has changed slightly, and is not totally solid yet;
(2) it's only suitable for a specific type of transform, and it will be horrible for larger transforms;
(3) it's only one piece of a larger coherent rewrite I have in mind.

I think it would be better to wait until I have written the other pieces. Otherwise it will get too confusing to maintain it all. Perhaps once I have finished all the forward transform components it will be OK to push into the trunk, and then I can work on the inverse transform separately.

david
|
From: William H. <ha...@ya...> - 2007-04-17 22:46:54
|
I can't believe this is an unsigned type. So when we use it for limbs in Zpoly_mpn_t, the sign limb can't be compared to 0 to see if it is negative. Instead I have to specifically do a comparison with -1L. That's kind of crap, because it means we can't just use any old negative number for a negative sign; it has to be a specific number so that we can do an easy comparison.

Bill.
|
From: David H. <dmh...@ma...> - 2007-04-18 00:43:37
|
On Apr 17, 2007, at 6:46 PM, William Hart wrote:

> I can't believe this is an unsigned type. So when we
> use it for limbs in Zpoly_mpn_t, the sign limb can't be
> compared to 0 to see if it is negative. Instead I have
> to specifically do a comparison with -1L.

Well, there's also mp_limb_signed_t (or perhaps it's mp_signed_limb_t? I can't remember), but that doesn't really help you, since the array has to be of a single type.

Perhaps it's worth having macros COEFF(poly, n), which returns a pointer to the limbs of the nth coefficient of poly, and also SIGN(poly, n), which returns the sign limb of the nth coefficient, cast to a signed type. Not sure if those are the right names for the macros, but something like that would certainly simplify a lot of the Zpoly_mpn code.

david
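(A minimal sketch of such macros, assuming a hypothetical Zpoly_mpn_t layout in which each coefficient occupies limbs + 1 limbs of a flat coeffs array with the sign limb stored first; the field names are guesses for illustration only.)

#include <gmp.h>

/* Assumed (hypothetical) layout: poly->coeffs is a flat mp_limb_t array,
   each coefficient occupying (poly->limbs + 1) limbs, sign limb first.
   Adjust to the real Zpoly_mpn_t layout. */

#define COEFF(poly, n) \
   ((poly)->coeffs + (n) * ((poly)->limbs + 1) + 1)

#define SIGN(poly, n) \
   ((mp_limb_signed_t) (poly)->coeffs[(n) * ((poly)->limbs + 1)])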