From: Roy S. <roy...@ic...> - 2009-06-12 22:04:31
|
On Fri, 12 Jun 2009, Tim Kroeger wrote: > Roy: Did you have a look at the patch that I sent you yesterday? Just got to it now. I don't want to commit it until I can track down what looks like a bug - on ex6 I got it to report that build_inf_elem() takes 0.2663 seconds without subroutines included or 0.0000 seconds with them included. But also, I'd like to wait and see if any other developers have strong opinions. This patch makes the PerfLog much more useful, but also stretches the output to (typically) 107-110 characters. Personally I think we should just assume everyone has a 120+ char terminal, add still another column for "Percent of Active Time With Sub" (hopefully abbreviated better) and commit it. --- Roy |
From: Tim K. <tim...@ce...> - 2009-06-15 07:49:19
Attachments:
patch
|
Dear Roy, On Fri, 12 Jun 2009, Roy Stogner wrote: > On Fri, 12 Jun 2009, Tim Kroeger wrote: > >> Roy: Did you have a look at the patch that I sent you yesterday? > > Just got to it now. I don't want to commit it until I can track down > what looks like a bug - on ex6 I got it to report that > build_inf_elem() takes 0.2663 seconds without subroutines included or > 0.0000 seconds with them included. Good point. > Personally I > think we should just assume everyone has a 120+ char terminal, add > still another column for "Percent of Active Time With Sub" (hopefully > abbreviated better) and commit it. Good point. I have taken care of both points now; see the attachment. (The attachment also adds a log item in PetscVector::map_global_to_local_index(), which might help determine whether my performance loss really lies entirely in that method. You might or might not want to commit that as well.) Best Regards, Tim -- Dr. Tim Kroeger tim...@me... Phone +49-421-218-7710 tim...@ce... Fax +49-421-218-4236 Fraunhofer MEVIS, Institute for Medical Image Computing Universitaetsallee 29, 28359 Bremen, Germany |
From: John P. <pet...@cf...> - 2009-06-12 22:13:41
|
On Fri, Jun 12, 2009 at 5:03 PM, Roy Stogner<roy...@ic...> wrote: > > On Fri, 12 Jun 2009, Tim Kroeger wrote: > >> Roy: Did you have a look at the patch that I sent you yesterday? > > Just got to it now. I don't want to commit it until I can track down > what looks like a bug - on ex6 I got it to report that > build_inf_elem() takes 0.2663 seconds without subroutines included or > 0.0000 seconds with them included. > > But also, I'd like to wait and see if any other developers have strong > opinions. This patch makes the PerfLog much more useful, but also > stretches the output to (typically) 107-110 characters. Personally I > think we should just assume everyone has a 120+ char terminal, add > still another column for "Percent of Active Time With Sub" (hopefully > abbreviated better) and commit it. That's fine with me. There's no reason we need to stick to a certain number of columns... -- John |
From: Derek G. <fri...@gm...> - 2009-06-13 04:54:00
|
Agreed. Derek On Jun 12, 2009, at 4:10 PM, John Peterson wrote: > On Fri, Jun 12, 2009 at 5:03 PM, Roy > Stogner<roy...@ic...> wrote: >> >> On Fri, 12 Jun 2009, Tim Kroeger wrote: >> >>> Roy: Did you have a look at the patch that I sent you yesterday? >> >> Just got to it now. I don't want to commit it until I can track down >> what looks like a bug - on ex6 I got it to report that >> build_inf_elem() takes 0.2663 seconds without subroutines included or >> 0.0000 seconds with them included. >> >> But also, I'd like to wait and see if any other developers have >> strong >> opinions. This patch makes the PerfLog much more useful, but also >> stretches the output to (typically) 107-110 characters. Personally I >> think we should just assume everyone has a 120+ char terminal, add >> still another column for "Percent of Active Time With Sub" (hopefully >> abbreviated better) and commit it. > > That's fine with me. There's no reason we need to stick to a certain > number of columns... > > -- > John |
From: Tim K. <tim...@ce...> - 2009-06-15 06:50:31
Attachments:
log.no-ghost-8-4-2
|
Dear Jed and Roy, Now the corresponding results for non-ghosted vectors. It seems as if nearly all of the performance difference is in the assembly methods. Well, this is actually not very surprising, because the other big part of the application, namely solving the systems, does not deal with ghosted vectors at all. The performance loss in the system assembly due to ghosted vectors is almost exactly the factor of 3 that I gain back through the possibility to use 6 cores per node rather than 2. Does anyone have an idea what can be done? Best Regards, Tim |
From: Jed B. <je...@59...> - 2009-06-15 13:05:08
Attachments:
signature.asc
|
Tim Kroeger wrote: > Dear Jed and Roy, > > Now the according result for non-ghosted vectors. > > It seems as if nearly all of the performance difference is in the > assembly methods. Well, this is actually not very surprising, because > the other big part of the application, that is solving the systems, does > not deal with ghosted vectors at all. Yes, there is a bit more communication with ghosted vectors, which is not normally needed, but that isn't using nearly enough time to contribute. It looks like all the time difference is in some unprofiled events in assembly. Are you sure there is nothing in your code that branches based on whether ghosted vectors are being used? In situations like these, I fall back to using conventional profiling tools. Unfortunately gprof can produce very misleading output, so I almost never use it. Since the time seems to not be in communication, you could use callgrind (valgrind --tool=callgrind) and still have relevant numbers when looking at the assembly function. This option, viewed with kcachegrind, is probably the most user-friendly thing and should immediately show you where the time is spent. You get source annotation of the number of cycles spent on each line with a nice graphical browser. However, valgrind will be about 1000 times slower than normal so you have to make the problem size truly tiny (and this would disrupt things if there was e.g. an O(n^2) algorithm lurking somewhere). An alternative that requires no recompilation is Oprofile which you can just start on one compute node, run the code, and look at the output. Use opcontrol's --callgraph option. This profiler is based on kernel interrupts and has very little overhead (usually less than 5%). Because of the level it operates on, you need root to start the profiler. You can view the results in kcachegrind, but it's not as well integrated as callgrind. Regardless, you get very useful source annotation. 
I think Eclipse has plugins for both valgrind and oprofile if you use that. Many clusters have commercial profiling tools as well; you may want to ask the staff. Helping track down these sorts of performance bugs is often part of their job description. Sun Studio is free and has a performance analyzer that is supposed to be pretty good, but I don't have any direct experience with it. Do yourself a favor and add a command-line option to stop the simulation after N steps. Do all your profiling with N<10 (a few minutes is plenty of time to get accurate profiling). Good luck. Jed |
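Jed's suggestion of a step cap for profiling runs can be sketched as a small command-line parser; the option name `--max-steps` and the surrounding time loop are illustrative assumptions, not part of any code in this thread.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Sketch of a step cap for profiling runs. The "--max-steps" option name
// is made up here; any existing option-parsing framework would do as well.
int parse_max_steps(int argc, char** argv, int fallback)
{
  for (int i = 1; i + 1 < argc; ++i)
    if (std::strcmp(argv[i], "--max-steps") == 0)
      return std::atoi(argv[i + 1]);
  return fallback;
}

// The cap then simply bounds the time loop, e.g.:
//   for (int step = 0; step < max_steps; ++step) { assemble(); solve(); }
```

With such a switch in place, a profiling run under callgrind or oprofile only needs `--max-steps 5` or so instead of a full production run.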
From: Tim K. <tim...@ce...> - 2009-06-15 13:25:30
|
Dear Jed, On Mon, 15 Jun 2009, Jed Brown wrote: > Tim Kroeger wrote: >> >> It seems as if nearly all of the performance difference is in the >> assembly methods. Well, this is actually not very surprising, because >> the other big part of the application, that is solving the systems, does >> not deal with ghosted vectors at all. > > In situations like these, I fall back to using conventional profiling > tools. Thank you for your detailed hints about profiling tools. I'm somehow always very reluctant to use them, for a number of reasons. Scaling the whole application down so that it is faster by a factor of 1000 could be hard work -- and could also spoil the results. Being root will be practically impossible. The admin might be willing to help me, but he is located quite far away, so any help offered is by email (or possibly telephone), and he won't supply the root password to me. Well, at least I have now enabled the already existing option of stopping the simulation after a small number of steps. The problem is that even 10 steps already take about 5 hours. I've added quite fine-grained START_LOG()/STOP_LOG() pairs inside one of my assembly functions (that is, the one that is most probably responsible for the poor performance) and configured the application to run 4 steps only (which I expect to take about two hours). I'll keep you informed about the results. Best Regards, Tim |
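The START_LOG()/STOP_LOG() pattern Tim mentions can be mimicked with a minimal event timer; this toy class is an illustrative stand-in, not libMesh's actual PerfLog interface.

```cpp
#include <chrono>
#include <map>
#include <string>

// Toy stand-in for the START_LOG()/STOP_LOG() pattern: accumulate
// wall-clock seconds per named event. Name and API are illustrative only.
class MiniPerfLog
{
  using Clock = std::chrono::steady_clock;

public:
  void start(const std::string& event) { started_[event] = Clock::now(); }

  void stop(const std::string& event)
  {
    std::chrono::duration<double> dt = Clock::now() - started_[event];
    total_seconds_[event] += dt.count();  // accumulate across start/stop pairs
  }

  double seconds(const std::string& event) const
  {
    auto it = total_seconds_.find(event);
    return it == total_seconds_.end() ? 0.0 : it->second;
  }

private:
  std::map<std::string, Clock::time_point> started_;
  std::map<std::string, double> total_seconds_;
};
```

Wrapping each suspect section of an assembly function in such start/stop pairs is exactly the "poor man's profiler" Jed describes below: clumsy, but it works without root access or recompiling under valgrind.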
From: Jed B. <je...@59...> - 2009-06-16 02:11:25
Attachments:
signature.asc
|
Tim Kroeger wrote: > Thank you for your detailed hints about profiling tools. I'm somehow > always very reluctant towards them for a number of reasons. Scaling the > whole application down such that it is faster by a factor of 1000 could > be hard work -- and could also spoil the results. My experience is that this always pays off. You aren't running the small cases for the results they produce; you are running them to check for code correctness. My top three guidelines for developing parallel code are 1. Have a single runtime parameter for the size of the problem; it should scale between <1 second of runtime and a full production run. (This can be an input mesh or some other parameter controlling problem size.) 2. Make sure it works correctly in serial before trying in parallel. 3. Run in parallel on your workstation (small problem size) to check correctness before moving to the cluster. Also, I think this one is critical for any PDE solver: * Manufacture solutions so that you have exact solutions to compare to. This is really easy, even for very complex codes, if you write the problem as F(u)=0 or F(u',u,t) = 0 and your code can accommodate arbitrary forcing terms. Choose the solution u(x,t) *before* choosing the domain, boundary conditions, or forcing. The only requirement is that it have sufficiently rich derivatives. Products of transcendental functions like tanh are good. Then, using a symbolic algebra package (Mathematica, Maple, Maxima, Sympy), apply your nonlinear differential operators to manufacture a forcing term and print this as C code (these packages can do this; don't worry if the expressions are pages long, you never have to read them). Paste the exact solutions and forcing terms into your code and use the exact solutions for inhomogeneous boundary conditions. Now you can compare to highly nontrivial exact solutions.
Don't worry that they don't look anything like your real solutions because the forcing terms are highly nonphysical; they *will* test that your code is correct (converges to highly nontrivial exact solutions at the correct rate). This is way more useful than the nearly degenerate (physical) exact solutions that are commonly used for testing correctness. [/pulpit] > Being root will be practically impossible. The admin might be willing > to help me, but he is located quite far away, so any help offered is > based on email (or possible telephone), and he won't supply the root > password to me. It's definitely worth asking him what profiling tools are available. Gprof is also an option. > Well, at least I have now enabled the already existing option of > stopping the simulation after a small number of steps. The problem is > that even 10 steps already take about 5 hours. This doesn't make sense to me. The full run was 576 steps in 18 hours (ghosted case), which works out to about 2 minutes per step. (Well, that many assemblies; are there many assemblies per step?) > I've added quite fine-grained START_LOG()/STOP_LOG() pairs now inside > one of my assembly functions (that is, the one that is most probably > responsible for the poor performance) This is the poor man's profiler. It's a bit clumsy, but should work. Jed |
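Jed's manufactured-solution recipe can be made concrete with a tiny worked example for the 1-D heat equation u_t - u_xx = f; the choice u(x,t) = tanh(x)*cos(t) and the hand-derived forcing below are illustrative assumptions, not anything from the thread.

```cpp
#include <cmath>

// Manufactured solution u(x,t) = tanh(x)*cos(t) for u_t - u_xx = f.
double exact_u(double x, double t) { return std::tanh(x) * std::cos(t); }

// Forcing derived symbolically, f = u_t - u_xx, with
//   u_t  = -tanh(x) sin(t)
//   u_xx = -2 sech^2(x) tanh(x) cos(t)
double forcing(double x, double t)
{
  const double sech = 1.0 / std::cosh(x);
  return -std::tanh(x) * std::sin(t)
         + 2.0 * sech * sech * std::tanh(x) * std::cos(t);
}

// PDE residual at (x,t) via centered finite differences; it should vanish
// up to discretization/roundoff error if forcing() matches exact_u().
double residual(double x, double t, double h = 1e-4)
{
  const double ut  = (exact_u(x, t + h) - exact_u(x, t - h)) / (2.0 * h);
  const double uxx = (exact_u(x + h, t) - 2.0 * exact_u(x, t)
                      + exact_u(x - h, t)) / (h * h);
  return ut - uxx - forcing(x, t);
}
```

In practice the forcing term would come from a symbolic package as Jed describes; checking that the residual of the *exact* solution is near zero, as above, catches transcription errors before the forcing ever reaches the solver.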
From: Tim K. <tim...@ce...> - 2009-06-18 08:40:49
Attachments:
patch
|
Dear Jed and Roy, On Tue, 16 Jun 2009, Jed Brown wrote: >> Anyway, the main question is: Is it inefficient to call PETSc's >> VecGetArray()/VecRestoreArray() (and, in the ghosted case, also >> VecGhostGetLocalForm()/VecGhostRestoreLocalForm()) for each call to >> PetscVector::operator() if a large number of such calls are being done? >> I guess it is not, and I suspect that doing a more efficient thing here >> could speed up my simulation considerably. Jed, do you agree? > > Yes, calling these in an inner loop is expensive. The VecGhost > functions do RTTI (string comparison). Calling them once per element > won't be a big deal, but it is definitely expensive when you use it for > every basis function at every quadrature point. You should really just > get these arrays (as STL vectors if you like). Hence, I have attached a patch for this purpose. It includes a change in ex9 that demonstrates how to use the new API. I tested it, and it turned out to speed up ex9 by about 10% (one processor, devel mode, ghosted enabled). Roy, what do you think? I'll see how much it speeds up my application, both with and without ghosted vectors. I expect a speedup of much more than 10% since I have numerous loops of this type. (This, of course, also means more work to replace all the loops...) Best Regards, Tim |
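The cost pattern under discussion (one VecGetArray()-style round trip per entry versus one batched fetch) can be sketched with a self-contained toy; the class and its `get()` method are illustrative stand-ins for the patch's API, not the real PetscVector code.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a vector whose operator() pays a per-call setup cost
// (the analogue of VecGetArray()/VecRestoreArray() on every access).
class ToyVector
{
public:
  explicit ToyVector(std::vector<double> v) : data_(std::move(v)) {}

  // Slow path: one "round trip" per entry accessed.
  double operator()(std::size_t i) const { ++n_round_trips_; return data_[i]; }

  // Fast path, mirroring the patch's batched API: fetch all requested
  // entries in a single round trip.
  void get(const std::vector<std::size_t>& idx, std::vector<double>& out) const
  {
    ++n_round_trips_;
    out.resize(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k)
      out[k] = data_[idx[k]];
  }

  std::size_t round_trips() const { return n_round_trips_; }

private:
  std::vector<double> data_;
  mutable std::size_t n_round_trips_ = 0;
};
```

Summing 100 entries via operator() costs 100 round trips, while one `get()` call costs a single round trip; when each round trip involves RTTI or a PETSc call per basis function per quadrature point, that ratio is where the observed 10% comes from.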
From: Roy S. <roy...@ic...> - 2009-06-18 16:01:26
|
On Thu, 18 Jun 2009, Tim Kroeger wrote: > On Tue, 16 Jun 2009, Jed Brown wrote: > >> Yes, calling these in an inner loop is expensive. The VecGhost >> functions do RTTI (string comparison). Calling them once per element >> won't be a big deal, but it is definitely expensive when you use it for >> every basis function at every quadrature point. You should really just >> get these arrays (as STL vectors if you like). > > Hence, I have attached a patch for this purpose. It includes a change in ex9 > that demonstrates how to use of the new API. I tested it, and it turned out > to speed up ex9 by about 10% (one processor, devel mode, ghosted enabled). > Roy, what do you think? I definitely like the results, but the API feels like "giving up". Ideally I'd like to fix the speed problems with operator(); in theory by doing that (and by adding a compile-time option to make basic NumericVector calls inline instead of virtual; more on that later) we ought to be able to get rid of the operator() overhead entirely without requiring users to change their code. If we can't do that, then your patch is certainly better than nothing, but I'd like to wait before adding it - I don't want to encourage users to change their code until we're certain it's necessary. --- Roy |
From: Roy S. <roy...@ic...> - 2009-06-19 15:43:23
|
On Fri, 19 Jun 2009, Tim Kroeger wrote: > I tried to do this now, see attached patch. At first glance, the core of this patch looks exactly like what I had in mind. I'll want to spend some time testing (and ideally get some reassurance from Jed) before making such a major change to such a critical class, but this looks good. > Also, I fixed something that looked like a bug in > PetscVector::swap(). Ah - we weren't swapping the status flags! That never bit me since I always swapped vectors of the same type and in the same state, but it could definitely have behaved horribly in future code. Good catch! I don't like making NumericVector::swap non-pure-virtual, though, since it's incomplete as is and has to be rederived in all subclasses. I'd prefer to make failure-to-rederive into an obvious compile-time error rather than a subtle run-time error, even if that means a couple of redundant lines of code here and there. On the other hand, how often do we add a whole new linear algebra package? I guess it's fine as is; we just need to remember to start calling it from the Trilinos and Laspack subclasses too. > Also, inside PetscVector::map_global_to_local_index(), I called > VecGetOwnershipRange() directly once (rather than calling it via an > additional virtual function call twice), which makes a small but > observable speedup (I tested that separately). Good idea. > However, I currently don't see how you want to get rid of the virtual > function call overhead. I kept my new API for now, as it is > still somewhat faster (although the difference is less pronounced). Just an idea I ran by Ben recently over the phone; I hadn't mentioned it on the list yet. Although I don't want to get rid of the ability to do (limited) mixing and matching of different linear algebra packages in the same code, most users want to just pick one package and use it for every relevant object.
We could move the current base classes to NumericVectorBase and SparseMatrixBase, make NumericVector and SparseMatrix into compile-time-selected typedefs of e.g. PetscVector and PetscMatrix, and then declare all the leaf class functions non-virtual. AFAIK this still allows a C++ compiler to call those functions virtually from pointers/references to the base class but also allows the compiler to inline them when they're called from pointers/references to the leaf class. This idea will need testing to make sure that it works and it's worth anything, naturally. --- Roy |
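Roy's proposal above can be sketched in a few lines; the class names ending in "Sketch", the `package()` method, and the `LIBMESH_DEFAULT_TRILINOS` macro are all hypothetical illustrations of the idea, not libMesh code.

```cpp
#include <string>

// Sketch of the proposal: virtual interface in a *Base class, concrete leaf
// classes per linear algebra package, and a configure-time typedef selecting
// the default. All names here are illustrative.
struct NumericVectorBaseSketch
{
  virtual ~NumericVectorBaseSketch() = default;
  virtual std::string package() const = 0;
};

struct PetscVectorSketch : NumericVectorBaseSketch
{
  std::string package() const override { return "petsc"; }
};

struct TrilinosVectorSketch : NumericVectorBaseSketch
{
  std::string package() const override { return "trilinos"; }
};

// Chosen at configure time, e.g. via a hypothetical -DLIBMESH_DEFAULT_TRILINOS:
#ifdef LIBMESH_DEFAULT_TRILINOS
using NumericVectorSketch = TrilinosVectorSketch;
#else
using NumericVectorSketch = PetscVectorSketch;
#endif
```

Code written against `NumericVectorSketch` names the leaf type directly, giving the compiler at least the *opportunity* to inline, while base-class pointers keep working for mixed-package code; whether the compiler actually devirtualizes leaf-typed calls is exactly the open question in the following messages.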
From: Jed B. <je...@59...> - 2009-06-19 22:56:33
Attachments:
signature.asc
|
Tim Kroeger wrote: > I assumed that functions like VecGetOwnershipRange(), VecGetSize(), and > VecGetLocalSize() are allowed to be called between VecGetArray() and > VecRestoreArray() (or, likewise, between VecGetLocalForm() and > VecRestoreLocalForm()), although I'm not completely sure about that. > Jed, can you comment on that? These functions do not mutate any state and do not look at the values in the array, so I can't think of any case where this would fail. (I haven't read the patch carefully, but the assumptions you mention above are fine.) Calling functions like VecScale or VecNorm while the array/local form is gotten may not be safe. Jed |
From: Tim K. <tim...@ce...> - 2009-06-22 12:06:22
|
Dear Roy/Jed, I'm just running my complete application with the new patch installed, and it seems to me that it has become *slower*. (It's not finished yet, but I can extrapolate the time.) My question: Could it be that --enable-perflog and/or -log_summary reduce performance substantially (that is, by 10% or more)? Otherwise, things seem to me similar to George Orwell's "1984", where the weekly chocolate ration is *increased* from 30 grams to 25 grams. (-: Best Regards, Tim |
From: Roy S. <roy...@ic...> - 2009-06-22 13:13:28
|
On Mon, 22 Jun 2009, Tim Kroeger wrote: > I'm just running my complete application with the new patch installed, and it > seems to me that it has become *slower*. (It's not finished yet, but I can > extrapolate the time.) > > My question: Could it be that --enable-perflog and/or -log_summary reduce > performance substantially (that is, by 10% or more)? 10% seems a little high, but yes, --enable-perflog reduces performance; you're doing string matching every time you call it, and that can be expensive if you're calling it in a frequently used and otherwise fast function. --- Roy |
From: Roy S. <roy...@ic...> - 2009-06-22 14:17:32
|
On Mon, 22 Jun 2009, Tim Kroeger wrote: > Okay. In any case, what will happen to the new method NumericVector::get() > that is still in the patch? Will you keep it or not? I'll probably leave out the ex3 usage, or maybe move that to a more advanced example, but there's no reason not to keep the method in the library itself. It's noticeably more efficient in every use case right now, and it still would be in some use cases in the future even if my wacky idea worked. > I'm not quite sure about this. In particular, I would think that inlining is > only possible if you're calling from a local instance of the leaf class, but > not from a pointer/reference to such a class. The reason is that I don't see > any possibility for the compiler to see that your leaf class is really a leaf > class, i.e. any user code could inherit further down and overload virtual > functions and pass a pointer/reference to such a class to the basic library. > C++ is missing a syntax for explicitly disallowing inheritance from a class, > isn't it? Yeah, it looks like you're right. I'd assumed that removing the "virtual" keyword in a leaf class would be enough to tell the compiler to use non-virtual dispatch when calling that method through a pointer to the leaf class, but when actually testing that out, it doesn't work that way. Of course, we could get the same behavior by making a configure-time option that, in addition to renaming classes, turns some of their virtual keywords on or off as appropriate, but that's much more work than what I'd originally planned, so it'll go on the back burner. > Even if it works theoretically, I see another problem: Of what type will, > say, LinearImplicitSystem::solution be? Will it be (an AutoPtr to) > NumericVectorBase or NumericVector? In either case, you can't have both the > flexibility to mix linear algebra packages and the possibility to inline > virtual functions. Not at the same time; you'd have to choose one or the other from ./configure. --- Roy |
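Tim's remark about the missing syntax was accurate for C++ as of this thread (2009); C++11 later added the `final` specifier, which provides exactly the guarantee discussed here. A minimal sketch, with illustrative class names:

```cpp
// C++11's `final` forbids further derivation, so a compiler knows that a
// call through a Leaf* or Leaf& cannot be overridden and may devirtualize
// and inline it. This language feature postdates the 2009 discussion.
struct Base
{
  virtual ~Base() = default;
  virtual int f() const { return 1; }
};

struct Leaf final : Base
{
  int f() const override { return 2; }  // no subclass can override this
};

// struct Deeper : Leaf {};  // would be a compile-time error
```

With `final` on the leaf classes, the configure-time typedef idea no longer needs to strip `virtual` keywords to enable inlining.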
From: Roy S. <roy...@ic...> - 2009-07-02 21:05:14
|
On Mon, 29 Jun 2009, Tim Kroeger wrote: > Now, I have performed a couple of test runs of my applications with the new > implementation of PetscVector, both with and without ghosted vectors. All > runs were performed in devel mode and without performance logging (to make it > comparable to the runs that were performed earlier). I prepared a list for > you here of these runs as well as the earlier runs (with the old PetscVector > implementation): Sorry to get back to you so late on this. I think my subconscious was stalling, because I can still hardly write the necessary reply without wincing: Can you run all those again, in opt mode? METHOD=devel is intended to be a compromise between the "libmesh assertions are disabled and debugging the code takes forever" behavior of METHOD=opt and the "g++ STL checking is enabled and running the code takes forever" behavior of METHOD=dbg. It's great for catching issues that trip a libmesh_assert() before they segfault in an out-of-bounds vector access. But it's probably worthless for performance timing. It's not nearly as slow as dbg mode, but it's significantly slower than normal, and the degree of slowdown depends in part on just how paranoid the asserts in a particular code path are. > It seems as if the new PetscVector implementation makes the non-ghosted code > slower. I have no explanation for this. I'll bet this behavior goes away with METHOD=opt. If so, then I'd agree that we should make the ghosted code the default. --- Roy |
From: Roy S. <roy...@ic...> - 2009-07-04 01:21:35
|
On Fri, 3 Jul 2009, Tim Kroeger wrote: > Dear Roy, > > On Thu, 2 Jul 2009, Roy Stogner wrote: > >> Sorry to get back to you so late on this. I think my subconscious was >> stalling, because I can still hardly write the necessary reply without >> wincing: >> >> Can you run all those again, in opt mode? > > Seems as if you are right. I will do this, but it will of course take some > time. Luckily enough, I didn't carry on working on my application in the > meantime. Thanks. Oh, and just to make sure: you'll need to build PETSc in optimized mode too to get reasonable timing results. Just this afternoon I discovered that my code has been running very slowly on one system because I'd been mistakenly defaulting to a debug mode PETSc module. --- Roy |
From: Tim K. <tim...@ce...> - 2009-07-06 06:41:32
|
Dear Roy, On Fri, 3 Jul 2009, Roy Stogner wrote: > On Fri, 3 Jul 2009, Tim Kroeger wrote: > >> Seems as if you are right. I will do this, but it will of course take some >> time. Luckily enough, I didn't carry on working on my application in the >> meantime. First result: Ghosted enabled, new PetscVector implementation, 18 cores on 3 nodes, opt mode: 8:03:03. (Was 8:31:03 in devel mode.) > Oh, and just to make sure: you'll need to build PETSc in optimized > mode too to get reasonable timing results. Just this afternoon I > discovered that my code has been running very slowly on one system > because I'd been mistakenly defaulting to a debug mode PETSc module. Good that you mention it. Of course I had forgotten that. Actually, I have never used PETSc in optimized mode so far. My application is far from being in a final state, and during development the debug version of PETSc is quite useful for backtracking crashes. Anyway, for sensible timing results, you're probably right that PETSc's optimized mode should be preferred (although, as we have seen, my application does not actually spend much time inside PETSc at all). Anything else that I might have forgotten? Best Regards, Tim |
From: Tim K. <tim...@ce...> - 2009-07-20 08:31:13
|
Dear Roy, Now, the more-or-less-ultimate results. Both PETSc and libMesh were compiled in optimized mode.

PetVec  ghost  #cores  #nodes  hh:mm:ss
----------------------------------------
new     ghost    18      3     07:02:54
new     ghost     9      3     08:19:55
new     no        9      3     08:35:28
old     no        9      3     08:21:52
old     ghost     9      3     crashed
old     ghost    18      3     crashed

Hence, with ghosted vectors disabled, the new and old PetscVector implementations are about equally fast. Also, with the new PetscVector implementation, enabling ghosted vectors on the same number of cores and nodes keeps the performance comparable, and increasing the number of cores speeds things up substantially. Strange is the observation that the old implementation with ghosted vectors enabled now crashes my application. The crash occurs in the post-processing step at the end (which I had temporarily disabled, but have now re-enabled). I'll try to track down what happened. Best Regards, Tim |