From: Patrik J. <co...@fa...> - 2011-07-01 04:29:47
|
Hi everyone,

There is a new page with loop plots for r1845 that I just pushed at http://governator.ucsc.edu/filer/blitzbench_r1845/blitzcomp.html. There are descriptions on those pages but the short summary is that blitz performance with icpc is now pretty much comparable to Fortran (compiled with ifort) across all sizes. It's still a bit slower for small arrays, but not outrageously so.

I have also successfully used the vectorized version in my radiation-transfer code and the results are the same, so it's passing some decidedly nontrivial tests in addition to the test suite.

I'm about to go on vacation, so this will be the last update from me for a while.

cheers,

/Patrik
|
From: Patrick G. <pa...@mn...> - 2011-07-22 13:57:29
|
Hi Patrik,

I am compiling with the latest update (4) of icpc 12 (12.0.4 20110427). Unfortunately I can no longer use v11.1 of icpc because of a compatibility problem with gcc, described here: http://origin-software.intel.com/en-us/forums/showthread.php?t=74691&p=2&o=d&s=lr

I have now compared compilation of one of my simulations with the "old" separate ET for Array and TinyVector and the "new" unified one. I think there are several problems to be addressed with the "new" ET machinery before it can be accepted.

Compiling my code with the same (optimised) options takes 2 mins using the "old" ET of blitz, while it takes 25 mins with the "new" one. As far as I remember it also used to compile in a few minutes with icpc 11.0 and 11.1.

There seems to be a big difference between old and new ET in the counts of loops that can and can't be vectorized, according to the reported diagnostics:

LOOP WAS VECTORIZED: old 70, new 946
loop was not vectorized: old 2192, new 4277

But I am not sure how to interpret that. Any idea?

Running a simple test case with the new version returns different numbers compared to the old version. On another test case the new ET crashes while the old ET returns what is expected. I am looking into this by compiling with the debug option, but with -g -O2 it takes even longer to compile: over 1.5 hours now and still not finished.

As I reported previously, another code which used to compile now exits with a fatal compilation error: Out of memory asking for 8200.

Does anyone have similar experience? Patrik, do you have any idea how to proceed?

Cheers,
Patrick

On 07/22/2011 01:06 AM, Patrik Jonsson wrote:
> Hmm, this wouldn't be with icpc 12, would it? I've noticed that 12
> seems a lot slower than 11.1, to the point that the "multicomponent"
> test case which takes 30s to compile w v11, had not completed after 18
> hours(!) on v12.
>
> But yes, I think the reason it takes longer is that currently it still
> instantiates the full expression evaluation even for tinyvector-only
> expressions that don't use it. This should be an easy improvement.
>
> /P.
>
> On Thu, Jul 21, 2011 at 6:02 PM, Patrick Guio <pa...@mn...> wrote:
>>
>> Ok using a constructor fix it but I have another problem. I got the
>> following message when compiling
>>
>> Fatal compilation error: Out of memory asking for 8200.
>> xiar: error #10014: problem during multi-file optimization compilation
>> (code 1)
>> xiar: error #10014: problem during multi-file optimization compilation
>> (code 1)
>>
>> and the code used to compile earlier with the same options ("-xSSE4.2
>> -ansi -std=c++0x -O3 -ipo -restrict -vec-report1 -no-prec-div
>> -no-ansi-alias"). I also noticed that the compilation time is much
>> longer. Could there be more overhead with the new machinery?
>>
>> Best,
>> Patrick
>>
>> On 07/21/2011 10:04 PM, Patrik Jonsson wrote:
>>> On Thu, Jul 21, 2011 at 4:59 PM, Patrick Guio <pa...@mn...> wrote:
>>>>
>>>> Hi again Patrik,
>>>>
>>>> I have another problem with another code, it seems that the last
>>>> constructor does not compile any longer
>>>>
>>>> blitz::Array<double,3> F(10,10,10);
>>>> blitz::TinyVector<int, 3> size(F.shape());
>>>> blitz::Array<double,3> F1(size+1);
>>>>
>>>> Any idea?
>>>
>>> because size+1 is an expression and not a tinyvector, probably.
>>>
>>> you'll have to use TinyVector<int,3>(size+1) to evaluate it, just like
>>> has already been necessary for arrays for a long time. the alternative
>>> is to add versions of the Array constructor taking expressions, but
>>> that'll be messy... especially since those expressions also can be
>>> array expressions.
>>>
>>> /P.
>>
>>
|
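For readers following the constructor problem quoted above: the point is that `size+1` has an expression type rather than being a `TinyVector`, so a constructor that only accepts a vector will not take it until the expression is evaluated explicitly. The following is a minimal, self-contained sketch of that effect. It is not Blitz++ code: `Vec3`, `AddScalar`, `Grid` and `make_padded` are hypothetical stand-ins for `TinyVector<int,3>`, the expression object, `Array` and the user's code, and the real library's types and constructors differ in detail.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only, not the Blitz++ classes.
struct Vec3;

// Expression object produced by "vector + scalar"; nothing is computed
// until an element is requested.
struct AddScalar {
    const Vec3& v;
    int s;
    int operator[](std::size_t i) const;
};

struct Vec3 {
    int d[3];
    Vec3(int a, int b, int c) : d{a, b, c} {}
    // Evaluating constructor. It is explicit, so an expression is never
    // silently converted into a vector.
    explicit Vec3(const AddScalar& e) : d{e[0], e[1], e[2]} {}
    int operator[](std::size_t i) const { return d[i]; }
};

inline int AddScalar::operator[](std::size_t i) const { return v[i] + s; }

// operator+ returns the expression type, not a Vec3.
inline AddScalar operator+(const Vec3& v, int s) { return AddScalar{v, s}; }

// Stand-in for an array whose constructor accepts an extent vector only.
struct Grid {
    Vec3 extent;
    explicit Grid(const Vec3& e) : extent(e) {}
    int size() const { return extent[0] * extent[1] * extent[2]; }
};

// Grid g(size + 1);   // would not compile here: AddScalar is not a Vec3
inline Grid make_padded(const Vec3& size) {
    return Grid(Vec3(size + 1));  // evaluate the expression explicitly
}
```

In this sketch `make_padded(Vec3(10, 10, 10))` plays the role of `Array<double,3> F1(TinyVector<int,3>(size+1))` from the quoted reply. Adding a `Grid` constructor that accepts any expression type would be the alternative mentioned there.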
From: Patrik J. <co...@fa...> - 2011-07-22 16:20:18
|
On Fri, Jul 22, 2011 at 9:57 AM, Patrick Guio <pa...@mn...> wrote:
>
> Hi Patrik,
>
> I am compiling with the latest update (4) of icpc 12 (12.0.4 20110427).
> Unfortunately I cannot use v11.1 of icpc any longer as I have the
> following compatibility problem with gcc described here
> http://origin-software.intel.com/en-us/forums/showthread.php?t=74691&p=2&o=d&s=lr
>
> I have now compared compilation of one of my simulation with the "old"
> separated ET for Array and TinyVectors and the "new" unified one.
> I think there are several problems to be addressed with the "new" ET
> machinery before it can be accepted.
>
> Compiling my code with the same (optimised) options using the "old" ET
> of blitz takes 2 mins while it takes 25 mins with the "new" one.
> As far as I remember it used to compile also in about a few minutes with
> icpc 11.0 and 11.1.

Yes, there's some interaction between the new code and v12 that makes compilation very slow. It's not simply the new code, because compiling with v11 (which I used when developing it) does not have this problem. My hunch is that it has to do with inlining. Could you try commenting out all lines with '#pragma forceinline' and see how that changes your compilation times?

In general, however, I think compilation times will go up, because the "new ET" machinery is more complicated than the old one that TinyVector used to use.

> There seems to be a big difference of the counts of loops that can and
> can't be vectorized according to the reported diagnostics between old
> and new ET:
> LOOP WAS VECTORIZED: old 70, new 946
> loop was not vectorized: old 2192, new 4277
> But I am not sure how to interpret that. Any idea?

Well, there are many loops in the code, and the only one that's vectorized is the inner loop for unit stride stack evaluations, so it makes sense both that the number of loops has gone up and that most of them are not vectorized (because they are not inner loops, or do not use the construct that allows the compiler to vectorize).

The count has probably gone up because the new code has more paths depending on the alignment, stride and type of expression, and the vectorized array expressions redirect to TinyVector evaluations, so there are recursive loops.

> Running a simple test case with the new version returns different
> numbers compared to the old version. On another test case the new ET
> crashes while the old ET returns what is expected.
> I am looking into this with compiling with debug option but with -g -O2
> takes even longer to compile, over 1.5 hours now and still not finished.

All tests in the testsuite pass for me (except storage, but that's because the iterators don't work for arrays with non-ascending storage). If you have your own tests that don't pass, please add them to the testsuite.

cheers,

/Patrik
|
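To illustrate the remark above that only the unit-stride inner loop is vectorized while the other evaluation paths are not, here is a minimal sketch of an evaluator with a contiguous fast path and a general strided path. The function name `add_strided` and its parameters are hypothetical and are not taken from the Blitz++ sources. The real code has more paths (alignment, expression type), which is consistent with the larger loop counts in the vectorization report.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical evaluator: dst[i*sd] = a[i*sa] + b[i*sb] for i in [0, n).
inline void add_strided(double* dst, std::ptrdiff_t sd,
                        const double* a, std::ptrdiff_t sa,
                        const double* b, std::ptrdiff_t sb,
                        std::ptrdiff_t n) {
    if (sd == 1 && sa == 1 && sb == 1) {
        // Unit-stride path: contiguous access, the kind of inner loop a
        // compiler can typically vectorize.
        for (std::ptrdiff_t i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];
    } else {
        // General path: same arithmetic, but strided access usually
        // prevents or discourages vectorization, so a report would list
        // this loop as "not vectorized".
        for (std::ptrdiff_t i = 0; i < n; ++i)
            dst[i * sd] = a[i * sa] + b[i * sb];
    }
}
```

Each source-level expression instantiates both loops, so a compiler report can show many more "not vectorized" loops than vectorized ones even when the hot path is the vectorized one.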
From: Patrick G. <pa...@mn...> - 2011-07-22 14:56:14
|
On 07/22/2011 03:26 PM, Patrik Jonsson wrote:
> On Fri, Jul 22, 2011 at 9:57 AM, Patrick Guio <pa...@mn...> wrote:
>> Hi Patrik,
>>
>> I am compiling with the latest update (4) of icpc 12 (12.0.4 20110427).
>> Unfortunately I cannot use v11.1 of icpc any longer as I have the
>> following compatibility problem with gcc described here
>> http://origin-software.intel.com/en-us/forums/showthread.php?t=74691&p=2&o=d&s=lr
>>
>> I have now compared compilation of one of my simulation with the "old"
>> separated ET for Array and TinyVectors and the "new" unified one.
>> I think there are several problems to be addressed with the "new" ET
>> machinery before it can be accepted.
>>
>> Compiling my code with the same (optimised) options using the "old" ET
>> of blitz takes 2 mins while it takes 25 mins with the "new" one.
>> As far as I remember it used to compile also in about a few minutes with
>> icpc 11.0 and 11.1.
>
> Yes, there's some interaction between the new code and v12 that makes
> compilation very slow. It's not simply the new code, because compiling
> with v11 (which I used when developing it) does not have this problem.
> My hunch is that it has to do with inlining. Could you try commenting
> out all lines with '#pragma forceinline' and see how that changes your
> compilation times?

OK, then perhaps you should contact Intel about that problem? Or perhaps you have already done so? What about GNU g++? Is there also a large difference in compilation time between the two versions?

> In general, however, I think compilation times will go up, because the
> "new ET" machinery is more complicated than the old one that
> TinyVector used to use.

I am not sure what other people think, but I don't think it is acceptable to have a compilation time 10x larger. Note also that after 2 hours the compilation I started with -g -O exited with the same fatal memory error! So it is difficult to sort out why the code is crashing...

>
>> There seems to be a big difference of the counts of loops that can and
>> can't be vectorized according to the reported diagnostics between old
>> and new ET:
>> LOOP WAS VECTORIZED: old 70, new 946
>> loop was not vectorized: old 2192, new 4277
>> But I am not sure how to interpret that. Any idea?
>
> Well, there are many loops in the code, and the only one that's
> vectorized is the inner loop for unit stride stack evaluations, so it
> makes sense both that the number of loops has gone up and that most of
> them are not vectorized (because they are not inner loops or use the
> construct that allows the compiler to vectorize).
>
> The count has probably gone up because the new code has more paths
> depending on the alignment, stride and type of expression, and the
> vectorized array expressions redirects to TinyVector evaluations, so
> there are recursive loops.
>
>> Running a simple test case with the new version returns different
>> numbers compared to the old version. On another test case the new ET
>> crashes while the old ET returns what is expected.
>> I am looking into this with compiling with debug option but with -g -O2
>> takes even longer to compile, over 1.5 hours now and still not finished.
>
> All tests in the testsuite pass for me (except storage but that's
> because the iterators don't work for arrays with non-ascending
> storage). If you have your own tests that don't pass, please add them
> to the testsuite.

Sorry, what I meant by "simple test case" is a run of my simulation code in a simple configuration which I can compare with published results. The results differ in the new version with full optimization, while it works fine with the old one. I am not really sure what to do.

Cheers,
Patrick

> cheers,
>
> /Patrik
|
From: Patrik J. <co...@fa...> - 2011-07-22 14:58:25
|
By the way, in your tests that take very long to compile, are you by any chance using TinyMatrix? As of now, there is no code path that does simplified evaluation of the TinyMatrix class, so I think that's why the multicomponent test cases (which test things like Arrays of TinyMatrix of TinyVector) take so long to compile.

cheers,

/Patrik
|
From: Patrick G. <pa...@mn...> - 2011-07-22 14:58:33
|
The only place I use TinyMatrix is in another code that I am not able to compile due to these memory issues.
|
From: Patrik J. <co...@fa...> - 2011-07-22 16:22:25
|
Hi Patrick,

Can you try the latest update? I've now completed the compile-time selection of evaluation routines for TinyVector and TinyMatrix so it doesn't instantiate the full evaluation in those cases. I also removed some cases of "#pragma forceinline recursive" which apparently is not recognized by v11 but is by v12.

With these changes, compile time for my code was cut in half and for the more complicated test cases from 18+ hours to 2 minutes. Hopefully this will fix your out of memory error too.

cheers,

/Patrik
|
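As a rough illustration of what "compile-time selection of evaluation routines" can look like, here is a minimal tag-dispatch sketch. It is an assumption about the general technique, not the actual Blitz++ implementation: the tags, `FixedVec`, `DynVec` and `scale` are hypothetical names. The point is that the general evaluator template is never instantiated for the small fixed-size type, which is the kind of saving in compile time described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags and types; the names are not taken from Blitz++.
struct small_fixed_tag {};  // length known at compile time
struct general_tag {};      // length known only at run time

template <std::size_t N>
struct FixedVec {
    double d[N];
    typedef small_fixed_tag eval_tag;
    static const std::size_t length = N;
};

struct DynVec {
    double* d;
    std::size_t n;
    typedef general_tag eval_tag;
};

// Simple evaluator: only instantiated for small_fixed_tag types.
template <typename T>
int scale(T& x, double f, small_fixed_tag) {
    for (std::size_t i = 0; i < T::length; ++i) x.d[i] *= f;
    return 1;  // reports which path ran
}

// Full evaluator: only instantiated for general_tag types. Its body
// would not even compile for FixedVec (no member n), which shows that
// it is never instantiated for that type.
template <typename T>
int scale(T& x, double f, general_tag) {
    for (std::size_t i = 0; i < x.n; ++i) x.d[i] *= f;
    return 2;
}

// Overload resolution on the tag picks the routine at compile time.
template <typename T>
int scale(T& x, double f) {
    return scale(x, f, typename T::eval_tag());
}
```

Calling `scale` on a `FixedVec<3>` selects the first overload and on a `DynVec` the second, with no run-time branch between them.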
From: Paul H. <pph...@gm...> - 2011-07-24 05:21:13
|
Hi Patrik,

thanks for the update, I could now successfully compile blitz and my program (ifort 11.1 and ifort 12.0, excluding last update).

I tried to reproduce some of my previous simulations with blitz-0.10 and compare to blitz-0.9. The results are slightly different, but surprisingly blitz-0.10 gives the better results :D. It seems that rounding errors are better handled with the new version. Do you have an idea why?

Here are the graphs:

[ Roughly, the graphs show the mode powers (FFT transformation) of the domain for a gyro-kinetic simulation. For blitz-0.10 the fluctuations are less dominant; I will try to find out why ... ]

(blitz-0.9, gcc-4.1, 14961s) http://www.2shared.com/photo/kUQKsMuK/110724_Blitz_blitz09_gcc4_1_Po.html
(blitz-0.9, ifort-11.1, 14567s) http://www.2shared.com/photo/y5NwLfbz/110724_Blitz_blitz09_ifort11_1.html
(blitz-0.10, ifort-11.1, 16668s) http://www.2shared.com/photo/g-HHg0tb/110724_Blitz_blitz10_ifort11_1.html
(blitz-0.10, ifort-12.0, 17673s) http://www.2shared.com/photo/XTDBctsW/110724_Blitz_blitz10_ifort12_0.html

It also seems that the new version is around 10% slower! All programs were compiled with -O3, and for blitz-0.10 I had a simd width of 16. Do you also have some real-world benchmarking of your code? I wonder if I chose some wrong compile parameters.

thanks again and best wishes,

Paul
|
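One possible, unconfirmed contributor to small numerical differences between a scalar and a vectorized build is that floating-point sums may be accumulated in a different order. The sketch below is a generic illustration of that effect only; it is not a diagnosis of the runs above, and `sum_sequential` and `sum_lanes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Sum left to right, as a plain scalar loop would.
inline float sum_sequential(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Sum with W independent partial sums that are combined at the end,
// mimicking the order of a SIMD reduction of width W.
// Assumes n is a multiple of W.
template <std::size_t W>
float sum_lanes(const float* x, std::size_t n) {
    float lane[W] = {};
    for (std::size_t i = 0; i < n; i += W)
        for (std::size_t k = 0; k < W; ++k) lane[k] += x[i + k];
    float s = 0.0f;
    for (std::size_t k = 0; k < W; ++k) s += lane[k];
    return s;
}
```

For the input `{1e8f, 1.0f, -1e8f, 1.0f}` the exact sum is 2. With IEEE single precision and no value-changing optimizations, the sequential order loses the first `1.0f` when it is added to `1e8f`, while the two-lane order keeps the large values in one lane and the small ones in the other, so the two functions return different results. Neither order is better in general; which one is closer to the exact value depends on the data.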
From: Patrick G. <pa...@mn...> - 2011-07-25 16:21:11
|
Hi Patrik,

Thank you, it seems to have fixed the issue of extremely long compilation time. I need to do more thorough testing, but the interprocedural optimizations (-ip and -ipo) do not seem to work correctly any longer :-( I will let you know when I have done more testing.

Cheers,
Patrick
|