From: Tim K. <tim...@ce...> - 2009-06-05 09:06:20
Dear Jed,

On Tue, 26 May 2009, Jed Brown wrote:

> Tim Kroeger wrote:
>
>> Not really nice. In particular not essentially faster than the
>> unghosted version on a comparable number of nodes.
>
> It is important to know what preconditioners are being used
> (preconditioners always change in parallel, though not as much when
> there is a coarse level as in multigrid).

Seems as if I'm using BLOCK_JACOBI in parallel. Also, the
preconditioner is not applied to the original matrix (it's the issue
with the ShellMatrix, which you certainly remember), but this can't be
the problem here because the matrix is independent of the number of
CPUs.

> Also, memory performance (especially bandwidth) is usually the
> overwhelming issue for implicit solvers (e.g. you are very lucky to
> get 4% of peak FPU performance for MatVec on Core 2 Quad). Thus
> using more cores frequently does not help, and you need more sockets
> to improve performance.

I must admit that I'm not familiar with these hardware issues at all,
so your explanations are all Greek to me. Well, since you were talking
about sockets, I tried this (on one of the nodes):

tkroeger@node033:~> cat /proc/net/sockstat
sockets: used 239
TCP: inuse 31 orphan 0 tw 21 alloc 39 mem 2
UDP: inuse 19
RAW: inuse 0
FRAG: inuse 0 memory 0

Is this the kind of information that can help find what is going on?
If not, please let me know how I can find the required information.

>> I'll perform the log output that Jed suggested and see what he says.
>> Perhaps there is some easy possibility to make it faster.
>
> FWIW, I always run nontrivial jobs with -log_summary. It does not
> add measurable overhead and that output is really useful.

I've enabled it now and started the application. I have not yet
reduced the complexity (number of time steps) because, for the moment,
it doesn't bother me if I get the result on Monday. If it should turn
out that more tests have to be made, I will think about that again.
(BTW, in the last two weeks, I was very busy with some other work that
had a tight deadline, but now I should have more time again to work on
this issue.)

Anyway, I'll send you the output of -log_summary as soon as I have it.
Thank you very much in advance for your assistance!

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Jed B. <je...@59...> - 2009-06-05 11:37:16
Attachments:
signature.asc
Tim Kroeger wrote:
> Dear Jed,
>
> On Tue, 26 May 2009, Jed Brown wrote:
>
>> Tim Kroeger wrote:
>>
>>> Not really nice. In particular not essentially faster than the
>>> unghosted version on a comparable number of nodes.
>>
>> It is important to know what preconditioners are being used
>> (preconditioners always change in parallel, though not as much when
>> there is a coarse level as in multigrid).
>
> Seems as if I'm using BLOCK_JACOBI in parallel.

This is the default and it changes a lot when you change the number of
processors. You could easily be needing a lot more iterations in the
linear solve when you use more processors. Comparing -log_summary
output would tell us if this is so. Unless you use a direct solve or
have multilevel domain decomposition (e.g. multigrid or iterative
substructuring like FETI-DP or BDDC) working for you, the number of
iterations will increase when you add processors. There is an
asymptotic theory, but in practice this effect is highly
problem-dependent.

> Also, the preconditioner is not applied to the original matrix (it's
> the issue with the ShellMatrix, which you certainly remember), but
> this can't be the problem here because the matrix is independent of
> the number of CPUs.

This is normal; I never have the same Jacobian as preconditioning
matrix.

>> Also, memory performance (especially bandwidth) is usually the
>> overwhelming issue for implicit solvers (e.g. you are very lucky to
>> get 4% of peak FPU performance for MatVec on Core 2 Quad). Thus
>> using more cores frequently does not help, and you need more
>> sockets to improve performance.
>
> I must admit that I'm not familiar with these hardware issues at
> all, so your explanations are all Greek to me. Well, since you were
> talking about sockets, I tried this (on one of the nodes):
>
> tkroeger@node033:~> cat /proc/net/sockstat
> sockets: used 239
> TCP: inuse 31 orphan 0 tw 21 alloc 39 mem 2
> UDP: inuse 19
> RAW: inuse 0
> FRAG: inuse 0 memory 0
>
> Is this the kind of information that can help find what is going on?
> If not, please let me know how I can find the required information.

Heh, I was referring to hardware sockets, not network sockets.

So most CPUs these days are multicore, but the cores share the same
memory bus (at least partially). When you say 8 CPUs on 3 nodes or 24
CPUs on 4 nodes, I assume you mean 8 cores on 3 nodes, 24 cores on 4
nodes. This is a different number of cores per node, but the
configuration of these nodes is relevant. For instance, 4 dual-core is
almost always faster than 2 quad-core (although the latter is probably
cheaper and uses less power). Sparse matrix kernels are typically much
faster per process if there is only one process per socket (i.e. per
CPU, not per core).

The floating point unit (FPU) on a single core is so much faster than
memory that even if you only use one core (on Core 2) you will be
happy to see 10% of peak FPU throughput. If you use all 4 cores and
have custom sparse matrix kernels, you might see 4% of peak. Now
4*4% = 16% is more than 10%, so you are getting more work done by using
those extra cores, but nowhere near four times more work. Getting 4%
requires a huge amount of effort. In particular, threads need to be
aware of what the other cores have in L1 cache. This doesn't work in
MPI and is not portable, so you'll actually see somewhat less
performance when using any portable library. This is why CPU speed is
nearly irrelevant for sparse linear algebra and you are much better off
to buy memory bandwidth. (Explicit methods, residual evaluation, and
Jacobian assembly often benefit from a faster CPU, so it's not always a
waste to buy a fast CPU; it just doesn't matter much inside the linear
solver.)

Knowing your hardware (/proc/cpuinfo and network) and MPI configuration
(which MPI and which device it is using) is useful.

Jed
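The iteration growth Jed describes can also be checked directly in code
rather than by comparing -log_summary output. A minimal sketch,
assuming access to the PETSc KSP object used for the solve (in libMesh
it can be reached through the PetscLinearSolver); the function name is
illustrative:

    #include <petscksp.h>

    /* Report how many Krylov iterations a solve took, so runs on
       different processor counts can be compared directly. */
    PetscErrorCode solve_and_report(KSP ksp, Vec b, Vec x)
    {
      PetscErrorCode     ierr;
      PetscInt           its;
      KSPConvergedReason reason;

      ierr = KSPSolve(ksp, b, x);                 CHKERRQ(ierr);
      ierr = KSPGetIterationNumber(ksp, &its);    CHKERRQ(ierr);
      ierr = KSPGetConvergedReason(ksp, &reason); CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD,
                         "linear solve: %D iterations (reason %d)\n",
                         its, (int)reason);       CHKERRQ(ierr);
      return 0;
    }

The runtime options -ksp_monitor and -ksp_converged_reason print the
same information without any code changes.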
From: Jed B. <je...@59...> - 2009-06-16 15:16:05
Attachments:
signature.asc
Tim Kroeger wrote:
> Well, there are in my case two or three parameters for this, but
> it's always again difficult to figure out where they have to be set,
> and it's always difficult not to forget to restore them once the
> profiling has been finished.

*runtime* parameters. There is nothing to restore.

> I principally know this idea, but in my case it's really difficult,
> because I have fixed boundary values on parts of a complicated
> geometry. Parts of the geometry are hardcoded, and changing them
> for test purposes would mean to replace the code with a different
> one and could introduce additional bugs. Also, external forcing
> terms are not supported in my application since they do not
> physically arise. Rather, things are driven by the boundary
> conditions (which are constant Dirichlet conditions on a complicated
> geometry), non-constant coefficients, and the fact that one of the
> equations gets a right hand side that is a function of the result of
> another equation.

It doesn't matter what the domain is. You have to be able to
accommodate external forcing, but that is usually not so much work.
The method is plenty general and can definitely be applied to your
problem, although I understand that it may be a nontrivial amount of
work due to legacy. FWIW, I have no confidence in my own codes until
this step has been performed. I'll stop trying to convince you now.

> In my assemble method, I have an element loop, and inside this a
> loop over the quadrature points. main-loop-01-get-params is
> performed once for each quadrature point and will compute the value
> of the old solution at this point as well as the value of a couple
> of material parameters and other quantities (all of which depend on
> the solution of this or other systems but have been precomputed and
> stored in ExplicitSystem objects on their own). This works mainly
> as follows:
>
>   Number val1 = 0.0;
>   for (unsigned int l=0; l<phi_val1.size(); l++)
>     {
>       val1 += phi_val1[l][qp]
>               * (*system1.current_local_solution)(dof_indices_val1[l]);
>     }
>
> There are about ten loops of this type.

[snip]

> Anyway, the main question is: Is it inefficient to call PETSc's
> VecGetArray()/VecRestoreArray() (and, in the ghosted case, also
> VecGhostGetLocalForm()/VecGhostRestoreLocalForm()) for each call to
> PetscVector::operator() if a large number of such calls are being
> done? I guess it is not, and I suspect that doing a more efficient
> thing here could speed up my simulation considerably. Jed, do you
> agree?

Yes, calling these in an inner loop is expensive. The VecGhost
functions do RTTI (string comparison). Calling them once per element
won't be a big deal, but it is definitely expensive when you use it for
every basis function at every quadrature point. You should really just
get these arrays (as STL vectors if you like).

Jed
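Roughly, "just get these arrays" means hoisting the get/restore pair
out of the element and quadrature loops. A sketch, assuming a ghosted
PETSc Vec; local_index() stands in for whatever maps a global dof index
to a position in the local (ghosted) array, which is
application-specific:

    Vec          local_form;
    PetscScalar *u;

    /* Fetch the raw array once per assembly ... */
    VecGhostGetLocalForm(ghosted_vec, &local_form);
    VecGetArray(local_form, &u);

    /* ... index it directly inside the element/quadrature loops ... */
    Number val1 = 0.0;
    for (unsigned int l=0; l<phi_val1.size(); l++)
      val1 += phi_val1[l][qp] * u[local_index(dof_indices_val1[l])];

    /* ... and restore it once at the end of assembly. */
    VecRestoreArray(local_form, &u);
    VecGhostRestoreLocalForm(ghosted_vec, &local_form);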
From: Roy S. <roy...@ic...> - 2009-06-16 16:09:46
On Tue, 16 Jun 2009, Jed Brown wrote:

>> Anyway, the main question is: Is it inefficient to call PETSc's
>> VecGetArray()/VecRestoreArray() (and, in the ghosted case, also
>> VecGhostGetLocalForm()/VecGhostRestoreLocalForm()) for each call to
>> PetscVector::operator() if a large number of such calls are being
>> done? I guess it is not, and I suspect that doing a more efficient
>> thing here could speed up my simulation considerably. Jed, do you
>> agree?
>
> Yes, calling these in an inner loop is expensive. The VecGhost
> functions do RTTI (string comparison). Calling them once per element
> won't be a big deal, but it is definitely expensive when you use it
> for every basis function at every quadrature point. You should
> really just get these arrays (as STL vectors if you like).

I think the PetscVector code is just being paranoid about guessing what
the get/restore "rules" are - knowing that we should call a restore
function "once one is finished using the object" or "when you no longer
need access to the array" seems unhelpful - *those* conditions
generally don't occur until the program is about to exit (or at least
until the vector is about to be destroyed) anyway. It wouldn't be hard
to cache VecGetArray and VecGhostGetLocalForm values, if we knew how
long we can safely cache them for. Does the cache need to be renewed
when VecAssembly/GhostUpdate Begin/End are called? I presume it
wouldn't be invalidated by VecSetValues, VecAXPY, VecCopy, etc.?
---
Roy
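To make the caching idea concrete, a minimal sketch of a
PetscVector-like wrapper that fetches the array lazily and reuses it;
the invalidation policy (when restore_array() must be forced) is
exactly the open question above, and the member names are illustrative:

    #include <petscvec.h>

    class CachedPetscVector
    {
    public:
      PetscScalar operator() (PetscInt i) const
      {
        if (!_array_is_present)       // fetch once, reuse thereafter
          {
            VecGetArray (_vec, &_array);
            _array_is_present = true;
          }
        return _array[i];
      }

      void restore_array () const     // must run before solver calls
      {
        if (_array_is_present)
          {
            VecRestoreArray (_vec, &_array);
            _array_is_present = false;
          }
      }

    private:
      Vec                  _vec;
      mutable PetscScalar *_array;
      mutable bool         _array_is_present;
    };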
From: Jed B. <je...@59...> - 2009-06-16 23:59:08
Attachments:
signature.asc
Roy Stogner wrote:
> On Tue, 16 Jun 2009, Jed Brown wrote:
>
>>> Anyway, the main question is: Is it inefficient to call PETSc's
>>> VecGetArray()/VecRestoreArray() (and, in the ghosted case, also
>>> VecGhostGetLocalForm()/VecGhostRestoreLocalForm()) for each call
>>> to PetscVector::operator() if a large number of such calls are
>>> being done? I guess it is not, and I suspect that doing a more
>>> efficient thing here could speed up my simulation considerably.
>>> Jed, do you agree?
>>
>> Yes, calling these in an inner loop is expensive. The VecGhost
>> functions do RTTI (string comparison). Calling them once per
>> element won't be a big deal, but it is definitely expensive when
>> you use it for every basis function at every quadrature point. You
>> should really just get these arrays (as STL vectors if you like).
>
> I think the PetscVector code is just being paranoid about guessing
> what the get/restore "rules" are - knowing that we should call a
> restore function "once one is finished using the object" or "when
> you no longer need access to the array" seems unhelpful - *those*
> conditions generally don't occur until the program is about to exit
> (or at least until the vector is about to be destroyed) anyway. It
> wouldn't be hard to cache VecGetArray and VecGhostGetLocalForm
> values, if we knew how long we can safely cache them for. Does the
> cache need to be renewed when VecAssembly/GhostUpdate Begin/End are
> called? I presume it wouldn't be invalidated by VecSetValues,
> VecAXPY, VecCopy, etc.?

After modifying values, you absolutely MUST restore the arrays. The
main reason for this is that some values (like norms) can be cached (it
makes the solvers much simpler to write without unnecessarily
recomputing norms). With PETSc native vector types, the vectors are
backed by an array and the array doesn't move, but PETSc is extensible
and you cannot guarantee that any Vec you end up with is backed by an
array. I could also imagine Vecs that are array-backed, but the array
is on a GC'd heap and could thus be relocated.

You should get the array at the beginning of residual evaluation/matrix
assembly/whatever and restore it at the end. Between getting and
restoring the array, you should not be calling Vec* functions (although
it is technically not a problem as long as you are not modifying
values). Following these guidelines is normally not hard because you
have to update ghost values at these points anyway.

Jed
From: Tim K. <tim...@ce...> - 2009-06-05 14:15:39
Attachments:
cpuinfo
Dear Jed,

On Fri, 5 Jun 2009, Jed Brown wrote:

>>> Also, memory performance (especially bandwidth) is usually the
>>> overwhelming issue for implicit solvers (e.g. you are very lucky
>>> to get 4% of peak FPU performance for MatVec on Core 2 Quad).
>>> Thus using more cores frequently does not help, and you need more
>>> sockets to improve performance.
>>
>> I must admit that I'm not familiar with these hardware issues at
>> all, so your explanations are all Greek to me. Well, since you
>> were talking about sockets, I tried this (on one of the nodes):
>>
>> tkroeger@node033:~> cat /proc/net/sockstat
>> sockets: used 239
>> TCP: inuse 31 orphan 0 tw 21 alloc 39 mem 2
>> UDP: inuse 19
>> RAW: inuse 0
>> FRAG: inuse 0 memory 0
>
> Heh, I was referring to hardware sockets, not network sockets.

Well, I told you I have no idea what I'm talking about. (-:

> So most CPUs these days are multicore, but the cores share the same
> memory bus (at least partially). When you say 8 CPUs on 3 nodes or
> 24 CPUs on 4 nodes, I assume you mean 8 cores on 3 nodes, 24 cores
> on 4 nodes. This is a different number of cores per node, but the
> configuration of these nodes is relevant. For instance, 4 dual-core
> is almost always faster than 2 quad-core (although the latter is
> probably cheaper and uses less power).

/proc/cpuinfo is attached. Seems as if I've got 2 quad-core CPUs on
each node, right?

> Knowing your hardware (/proc/cpuinfo and network) and MPI
> configuration (which MPI and which device it is using) is useful.

Hmm, how do I find out about the network hardware?

There is a pre-installed MPI version, that is:

ParaStation 5.0.9-1 (Wed Nov 12 18:27:53 CET 2008)
config args: '--prefix=/opt/parastation/mpi2' '--with-confset=default'
mpich2 version: 1.0.7

But for some cumbersome reasons I can't use it. Well, perhaps I just
haven't tried hard enough. I installed another MPI myself, and that is
mpich2 1.0.8, without any options (other than the installation path).
I'll send you its config.log off-list (180KB). Well, now that I think
about it, I wonder whether I could simply supply the above stated
config arguments manually to my local MPI installation. Would that be
a promising idea?

Also, it might be useful to note that the cluster has two different
queues, one of which is called "ethernet" and the other "infiniband".
They use different nodes. These names seem to indicate some
information about the hardware (let's assume they have been chosen on
purpose). I understand that "infiniband" refers to the "better"
hardware in some sense, but the "infiniband" queue has a much smaller
number of nodes. Also, the admin told me that using my manually
installed MPI, I would not be able to benefit from the advantages that
the "infiniband" queue has. Hence, I was using the "ethernet" queue
all the time. /proc/cpuinfo does not differ between the nodes of the
two queues (except for changes in some minor decimal places of the
values of "cpu MHz" and "bogomips").

Anyway, I'll send you my log output on Monday.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
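For reference, the socket/core layout can be read off /proc/cpuinfo
without guessing, e.g. on Linux:

    grep "physical id" /proc/cpuinfo | sort -u | wc -l   # sockets per node
    grep -c "^processor" /proc/cpuinfo                   # logical cores per node
    grep "cpu cores" /proc/cpuinfo | sort -u             # cores per socket

Two distinct "physical id" values together with "cpu cores : 4" would
confirm the 2 x quad-core reading.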
From: Jed B. <je...@59...> - 2009-06-05 14:28:18
Attachments:
signature.asc
Tim Kroeger wrote:
> There is a pre-installed MPI version, that is:
>
> ParaStation 5.0.9-1 (Wed Nov 12 18:27:53 CET 2008)
> config args: '--prefix=/opt/parastation/mpi2' '--with-confset=default'
> mpich2 version: 1.0.7
>
> But for some cumbersome reasons I can't use it. Well, perhaps I
> just haven't tried hard enough. I installed another MPI myself, and
> that is mpich2 1.0.8, without any options (other than the
> installation path). I'll send you its config.log off-list (180KB).
> Well, now that I think about it, I wonder whether I could simply
> supply the above stated config arguments manually to my local MPI
> installation. Would that be a promising idea?

I don't know what confset is, but your install is probably fine. So
the default device through 1.0.8 was ch3:sock, which uses sockets (the
low-level API for moving information over the network; these are the
sockets that sockstat referred to ;-) for all communication. This is
fine, but you would probably get better performance with ch3:nemesis
(default now in 1.1, released on Wed) which uses shared memory locally
and sockets (or Myrinet if you have it) to talk to remote processes.

> Also, it might be useful to note that the cluster has two different
> queues, one of which is called "ethernet" and the other
> "infiniband". They use different nodes. These names seem to
> indicate some information about the hardware (let's assume they have
> been chosen on purpose). I understand that "infiniband" refers to
> the "better" hardware in some sense, but the "infiniband" queue has
> a much smaller number of nodes. Also, the admin told me that using
> my manually installed MPI, I would not be able to benefit from the
> advantages that the "infiniband" queue has. Hence, I was using the
> "ethernet" queue all the time. /proc/cpuinfo does not differ
> between the nodes of the two queues (except for changes in some
> minor decimal places of the values of "cpu MHz" and "bogomips").

MPICH2 doesn't support IB; you would need MVAPICH2 or Open MPI or a
vendor-supplied MPI (likely to be a derivative of one of these). With
gigabit Ethernet, we expect latencies around 100 microseconds, which is
a lot more than IB (< 2us). Depending on the decomposition and size of
the subdomains, this latency could already be an issue. But since you
were having issues running out of memory, I'm assuming that your
subdomains are pretty big, in which case network latency shouldn't be
crippling (there is an algorithmic bug if it is).

Do you have any of the usual libmesh profiling output? The granularity
is pretty coarse, but it would indicate if most of your time is spent
in the linear solve or assembly or whatever.

Jed
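If rebuilding MPICH2 to try nemesis, the device is selected at
configure time, roughly along these lines (assuming MPICH2 1.1's
configure options; the prefix is illustrative):

    ./configure --prefix=$HOME/mpich2-1.1 --with-device=ch3:nemesis

PETSc, libmesh, and the application would then need to be rebuilt
against the new MPI, as Tim notes below.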
From: Tim K. <tim...@ce...> - 2009-06-08 06:57:41
Dear Jed,

To start with the bad news: There is no log output yet. The
application hasn't finished. It reduced the time step size to 2e-6 for
some reason. I can't understand this because I'm not aware of any
changes I made since the last run -- except for enabling
"-log_summary", but that should not change the behaviour, should it? I
will try to find out today what happened.

On Fri, 5 Jun 2009, Jed Brown wrote:

> Tim Kroeger wrote:
>> There is a pre-installed MPI version, that is:
>>
>> ParaStation 5.0.9-1 (Wed Nov 12 18:27:53 CET 2008)
>> config args: '--prefix=/opt/parastation/mpi2' '--with-confset=default'
>> mpich2 version: 1.0.7
>>
>> But for some cumbersome reasons I can't use it. Well, perhaps I
>> just haven't tried hard enough. I installed another MPI myself,
>> and that is mpich2 1.0.8, without any options (other than the
>> installation path). I'll send you its config.log off-list (180KB).
>> Well, now that I think about it, I wonder whether I could simply
>> supply the above stated config arguments manually to my local MPI
>> installation. Would that be a promising idea?
>
> I don't know what confset is, but your install is probably fine.

Good to hear.

> So the default device through 1.0.8 was ch3:sock, which uses sockets
> (the low-level API for moving information over the network; these
> are the sockets that sockstat referred to ;-) for all communication.
> This is fine, but you would probably get better performance with
> ch3:nemesis (default now in 1.1, released on Wed) which uses shared
> memory locally and sockets (or Myrinet if you have it) to talk to
> remote processes.

So you would recommend updating to mpich2 1.1? (If I do, I think I
will have to recompile PETSc, libmesh, and my application as well.)

>> Also, it might be useful to note that the cluster has two different
>> queues, one of which is called "ethernet" and the other
>> "infiniband". [...]
>
> MPICH2 doesn't support IB; you would need MVAPICH2 or Open MPI or a
> vendor-supplied MPI (likely to be a derivative of one of these).
> With gigabit Ethernet, we expect latencies around 100 microseconds,
> which is a lot more than IB (< 2us). Depending on the decomposition
> and size of the subdomains, this latency could already be an issue.
> But since you were having issues running out of memory, I'm assuming
> that your subdomains are pretty big, in which case network latency
> shouldn't be crippling (there is an algorithmic bug if it is).

Okay, so I will continue using the "ethernet" queue.

> Do you have any of the usual libmesh profiling output?

No, I haven't at the moment. Does it suffice to do "--enable-perflog"
as a configure option for that? I'll give it a try. But first, I
think I should find out why the application behaves so differently now
from what it did a couple of weeks ago.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Tim K. <tim...@ce...> - 2009-06-09 13:05:52
Dear Jed,

On Mon, 8 Jun 2009, Tim Kroeger wrote:

> To start with the bad news: There is no log output yet. The
> application hasn't finished. It reduced the time step size to 2e-6
> for some reason. I can't understand this because I'm not aware of
> any changes I made since the last run -- except for enabling
> "-log_summary", but that should not change the behaviour, should it?

Today's good news: It's running "correctly" now.

Today's fair news: No log output yet, but to be expected tomorrow.

Today's bad news: The term "correctly" in the above sentence means
"wrong". The problem that I stated yesterday was due to a bugfix I
performed in the meantime. I reverted that fix now, but without that
fix, the result that my application gets is likely to not have anything
to do with the thing it was supposed to compute. Actually, I can't
understand why the application keeps producing results that look
sensible. Anyway, I decided to postpone that problem for a while and
concentrate now on the performance issues.

I haven't updated to mpich2 1.1 yet; let's leave that as an option for
later. However, I did --enable-perflog and -log_summary.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Tim K. <tim...@ce...> - 2009-06-10 08:48:40
Attachments:
log
Dear Jed,

On Tue, 9 Jun 2009, Tim Kroeger wrote:

> Today's fair news: No log output yet, but to be expected tomorrow.

Please find attached the log output now.

As I see it, there are two obvious things that I could do. That is
(1) update to mpich2 1.1, and (2) recompile PETSc with
"--with-debugging=no". Would you recommend doing so?

On the other hand, a main issue was that the ghosted code is about a
factor of 1.5 slower than the non-ghosted code. That shouldn't have
anything to do with debugging options, because they were enabled for
both versions.

Let me know what you think and what you can read from the logs.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Jed B. <je...@59...> - 2009-06-10 13:09:10
Attachments:
signature.asc
log.ex4.n2
Tim Kroeger wrote:
> Please find attached the log output now.

First, I'm confused about why all the detailed output is missing. See
attached for the sort of output I expect. In particular, there is
fine-grained output on vector operations and solver internals with
details about load balancing.

I see about 30% of the time spent in assemble, which is about what I'd
expect, but solve time is relatively low, especially when compared to
some of the FE ops. I don't recall, is a significant part of your code
explicit? inverse_map is getting called a ton.

> As I see it, there are two obvious things that I could do. That is
> (1) update to mpich2 1.1, and (2) recompile PETSc with
> "--with-debugging=no". Would you recommend doing so?

1. Time for MPI_Barrier/MPI_Send looks a bit high; that might be
improved with ch3:nemesis, but would mainly be due to network
configuration.

2. This can make a significant difference on the PETSc operations, but
it's not clear that much time is being spent in PETSc anyway. If you
do rebuild, maybe you could also upgrade to PETSc-3.0.0.

> On the other hand, a main issue was that the ghosted code is about a
> factor of 1.5 slower than the non-ghosted code. That shouldn't have
> anything to do with debugging options, because they were enabled for
> both versions.

Do you have a log from an equivalent non-ghosted run?

Jed
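The PETSc rebuild mentioned here is also a configure-time switch; for
PETSc of that era it would look roughly like this (optimization flags
are illustrative):

    ./config/configure.py --with-debugging=0 COPTFLAGS=-O3 FOPTFLAGS=-O3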
From: Tim K. <tim...@ce...> - 2009-06-22 06:53:29
On Fri, 19 Jun 2009, Roy Stogner wrote:

> On Fri, 19 Jun 2009, Tim Kroeger wrote:
>
>> I tried to do this now, see attached patch.
>
> At first glance, the core of this patch looks exactly like what I
> had in mind. I'll want to spend some time testing (and ideally get
> some reassurance from Jed) before making such a major change to such
> a critical class, but this looks good.

Okay. In any case, what will happen to the new method
NumericVector::get() that is still in the patch? Will you keep it or
not? I'm asking because I already use it in part of my application.
If you will not include it in the library, I should revert the changes
in my application as soon as possible (since in a couple of weeks I
might have forgotten what I did exactly). On the other hand, if you
keep it, I see no reason for reverting my code since the new method is
in any case faster (although, if you manage to avoid the virtual
function call overhead, the difference will reduce to the check that
_array_is_present is already true, which should really be negligible).

>> However, I currently don't see how you want to get rid of the
>> virtual function call overhead. I kept my new API for the time
>> being, as it is still somehow faster (although the difference is
>> less pronounced).
>
> Just an idea I ran by Ben recently over the phone; I hadn't
> mentioned it on the list yet.
>
> Although I don't want to get rid of the ability to do (limited)
> mixing and matching of different linear algebra packages in the same
> code, most users want to just pick one package and use it for every
> relevant object. We could move the current base classes to
> NumericVectorBase and SparseMatrixBase, make NumericVector and
> SparseMatrix into compile-time-selected typedefs of e.g. PetscVector
> and PetscMatrix, and then declare all the leaf class functions
> non-virtual. AFAIK this still allows a C++ compiler to call those
> functions virtually from pointers/references to the base class but
> also allows the compiler to inline them when they're called from
> pointers/references to the leaf class.

I'm not quite sure about this. In particular, I would think that
inlining is only possible if you're calling from a local instance of
the leaf class, but not from a pointer/reference to such a class. The
reason is that I don't see any possibility for the compiler to see that
your leaf class is really a leaf class, i.e. any user code could
inherit further down, overload virtual functions, and pass a
pointer/reference to such a class to the basic library. C++ is missing
a syntax for explicitly disallowing inheritance from a class, isn't it?

Even if it works theoretically, I see another problem: Of what type
will, say, LinearImplicitSystem::solution be? Will it be (an AutoPtr
to) NumericVectorBase or NumericVector? In either case, you can't have
both the flexibility to mix linear algebra packages and the possibility
to inline virtual functions.

Anyway, I'd like to be proven wrong. Also, compile-time selection of
the linear algebra package would be fine for me since I'm only using
PETSc anyway.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
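A compressed sketch of the design being debated here, with hypothetical
names; the comments mark where Tim's objection bites (the compiler can
only devirtualize and inline where it can prove the dynamic type):

    #include <vector>
    typedef double Number;

    class NumericVectorBase
    {
    public:
      virtual ~NumericVectorBase () {}
      virtual Number el (unsigned int i) const = 0;
    };

    class PetscVector : public NumericVectorBase
    {
    public:
      PetscVector () : _values (1, 0.) {}
      // Implicitly virtual because the base declares it virtual.
      Number el (unsigned int i) const { return _values[i]; }
    private:
      std::vector<Number> _values;
    };

    // Compile-time selection of the linear algebra package:
    typedef PetscVector NumericVector;

    Number through_reference (const NumericVector & v)
    {
      // Still virtual dispatch in general: v could refer to a user
      // class derived further from PetscVector.
      return v.el(0);
    }

    Number through_local ()
    {
      NumericVector v;
      // The dynamic type is provably PetscVector here, so the call
      // can be devirtualized and inlined.
      return v.el(0);
    }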
From: Tim K. <tim...@ce...> - 2009-06-11 08:00:28
Dear Jed and Lorenzo,

On Wed, 10 Jun 2009, Jed Brown wrote:

> Tim Kroeger wrote:
>
>> Please find attached the log output now.
>
> First, I'm confused about why all the detailed output is missing.
> See attached for the sort of output I expect. In particular, there
> is fine-grained output on vector operations and solver internals
> with details about load balancing.

I have no idea what happened here. I should note that I supplied the
option "-log_summary" using PetscOptionsSetValue rather than directly
on the command line, but that shouldn't be the point, should it?

> I see about 30% of the time spent in assemble, which is about what
> I'd expect, but solve time is relatively low, especially when
> compared to some of the FE ops. I don't recall, is a significant
> part of your code explicit?

There are some explicit parts, but I wouldn't expect them to be
significant.

When I watch the output while the application is running, it feels as
if it is "most of the time" either inside assemble/solve of one of the
implicit parts or inside EquationSystems::reinit().

> inverse_map is getting called a ton.

Yes. I use a PointLocator, but that is at the end, after the main
computation has finished. The PointLocator is used very heavily then,
and this part of the code runs in serial (only on core #0) and takes
about 35 minutes.

I see that this contradicts the fact that over 90 minutes are spent
inside inverse_map(). I have currently no idea where else I am using
that.

>> As I see it, there are two obvious things that I could do. That is
>> (1) update to mpich2 1.1, and (2) recompile PETSc with
>> "--with-debugging=no". Would you recommend doing so?
>
> 1. Time for MPI_Barrier/MPI_Send looks a bit high; that might be
> improved with ch3:nemesis, but would mainly be due to network
> configuration.
>
> 2. This can make a significant difference on the PETSc operations,
> but it's not clear that much time is being spent in PETSc anyway.
> If you do rebuild, maybe you could also upgrade to PETSc-3.0.0.

Okay, so I won't rebuild now but keep this in mind.

>> On the other hand, a main issue was that the ghosted code is about
>> a factor of 1.5 slower than the non-ghosted code. That shouldn't
>> have anything to do with debugging options, because they were
>> enabled for both versions.
>
> Do you have a log from an equivalent non-ghosted run?

Good point. (-:

I don't have one right now. Actually, running on 24 cores on 4 nodes
won't be possible without ghosted (due to memory shortage). I will
perform two computations (one ghosted and one non-ghosted) with 8 cores
on 4 nodes and send you the logs. That will probably be tomorrow.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Jed B. <je...@59...> - 2009-06-11 10:29:20
Attachments:
signature.asc
Tim Kroeger wrote:
> Dear Jed and Lorenzo,
>
> On Wed, 10 Jun 2009, Jed Brown wrote:
>
>> Tim Kroeger wrote:
>>
>>> Please find attached the log output now.
>>
>> First, I'm confused about why all the detailed output is missing.
>> See attached for the sort of output I expect. In particular, there
>> is fine-grained output on vector operations and solver internals
>> with details about load balancing.
>
> I have no idea what happened here. I should note that I supplied
> the option "-log_summary" using PetscOptionsSetValue rather than
> directly on the command line, but that shouldn't be the point,
> should it?

This is not okay; that just registers the option in the options
database, but the profiling functions don't check there (for
performance reasons). The option either needs to be present at
PetscInitialize, i.e. any of

1. Passed in on the command-line
2. Specified in the file that is the third argument of PetscInitialize
3. Put in $HOME/.petscrc
4. Put in $PWD/.petscrc

OR, if you must, call PetscLogBegin() to start logging and make sure
that -log_summary is in the options database by the time PetscFinalize
is called.

>> I see about 30% of the time spent in assemble, which is about what
>> I'd expect, but solve time is relatively low, especially when
>> compared to some of the FE ops. I don't recall, is a significant
>> part of your code explicit?
>
> There are some explicit parts, but I wouldn't expect them to be
> significant.
>
> When I watch the output while the application is running, it feels
> as if it is "most of the time" either inside assemble/solve of one
> of the implicit parts or inside EquationSystems::reinit().

According to the libmesh profiling, the sum of these steps is only half
the runtime. It should be essentially everything (especially since
function evaluation is not a separate step). But, it appears that the
final profiling output for, say, assemble(), does not include
sub-events (so there is no way for us to know the total amount of time
spent in "assembly" without defining PerfLog objects for the various
stages). This is different than PETSc's convention, so keep that in
mind.

>> inverse_map is getting called a ton.
>
> Yes. I use a PointLocator, but that is at the end, after the main
> computation has finished. The PointLocator is used very heavily
> then, and this part of the code runs in serial (only on core #0) and
> takes about 35 minutes.
>
> I see that this contradicts the fact that over 90 minutes are spent
> inside inverse_map(). I have currently no idea where else I am
> using that.

It's also called in assembly somewhere and seems to be a significant
bottleneck. Roy et al., could you remind me why this is required?

>>> As I see it, there are two obvious things that I could do. That
>>> is (1) update to mpich2 1.1, and (2) recompile PETSc with
>>> "--with-debugging=no". Would you recommend doing so?
>>
>> 1. Time for MPI_Barrier/MPI_Send looks a bit high; that might be
>> improved with ch3:nemesis, but would mainly be due to network
>> configuration.
>>
>> 2. This can make a significant difference on the PETSc operations,
>> but it's not clear that much time is being spent in PETSc anyway.
>> If you do rebuild, maybe you could also upgrade to PETSc-3.0.0.
>
> Okay, so I won't rebuild now but keep this in mind.
>
>>> On the other hand, a main issue was that the ghosted code is about
>>> a factor of 1.5 slower than the non-ghosted code. That shouldn't
>>> have anything to do with debugging options, because they were
>>> enabled for both versions.
>>
>> Do you have a log from an equivalent non-ghosted run?
>
> Good point. (-:
>
> I don't have one right now. Actually, running on 24 cores on 4
> nodes won't be possible without ghosted (due to memory shortage). I
> will perform two computations (one ghosted and one non-ghosted) with
> 8 cores on 4 nodes and send you the logs. That will probably be
> tomorrow.

Great.

Jed
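In code, the fallback Jed describes amounts to two calls right after
initialization (PETSc 3.0-era API; later PETSc versions renamed
PetscLogBegin to PetscLogDefaultBegin):

    #include <petsc.h>

    /* Start logging even though -log_summary was not visible to
       PetscInitialize, and put the option in the database so that
       PetscFinalize actually prints the summary. */
    PetscLogBegin();
    PetscOptionsSetValue("-log_summary", PETSC_NULL);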
From: Tim K. <tim...@ce...> - 2009-06-29 08:50:07
Dear Roy and Jed,

Now, I have performed a couple of test runs of my applications with the
new implementation of PetscVector, both with and without ghosted
vectors. All runs were performed in devel mode and without performance
logging (to make it comparable to the runs that were performed
earlier). I prepared a list for you here of these runs as well as the
earlier runs (with the old PetscVector implementation):

PetVec  ghost  #cores  #nodes  hh:mm:ss
----------------------------------------
old     no        8       3    11:34:10
old     no        8       3    11:35:54
old     ghost     8       3    17:25:28
old     ghost     8       3    16:33:23
old     ghost    24       4    10:42:34
old     ghost    24       4    11:05:30
old     ghost    24       4    11:24:17
new     no        8       3    12:10:44
new     no        9       3    12:01:35
new     ghost     8       3    11:59:33
new     ghost     9       3    11:21:11
new     ghost    18       3    08:31:03
new     ghost    24       4    08:37:35

I also looked at the results, and they coincide up to the usual
numerical differences.

It seems as if the new PetscVector implementation makes the non-ghosted
code slower. I have no explanation for this. On the other hand, the
ghosted code is much faster now than before and compares well with the
non-ghosted plus old PetscVector. And for applications where the
number of employed cores per node is limited due to memory requirements
and where a major part of the time is spent in system assembly (both of
which are true for my application), an essential speedup is possible
now using ghosted plus new PetscVector (and thus more cores per node).

I would vote for making both the default soon. Any opinions?

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-06-11 14:07:37
On Thu, 11 Jun 2009, Jed Brown wrote:

> Tim Kroeger wrote:
>
>> I see that this contradicts the fact that over 90 minutes are spent
>> inside inverse_map(). I have currently no idea where else I am
>> using that.
>
> It's also called in assembly somewhere and seems to be a significant
> bottleneck. Roy et al., could you remind me why this is required?

It's used in assembly of DG systems - when taking jump terms on element
sides we take an inverse map of the physical coordinates found on one
element to get to the master coordinates on the neighbor element.
Likewise for CG systems using jump error estimators.

It's used in elem->contains_point(), which a PointLocator or
MeshFunction would be hitting a lot.

We're using it unnecessarily and dangerously in find_point_neighbors;
that won't be an issue for Tim, but I ought to fix that before I start
heavily using patch recovery estimators again...

And it's used a lot in System::project_vector, which is probably where
Tim's code is hitting it. I'm still stunned that it would be taking up
90 minutes, though! On his affine hex meshes that call ought to be
converging in one step after a couple dozen flops...
---
Roy
From: Tim K. <tim...@ce...> - 2009-06-11 14:15:16
Dear Jed,

On Thu, 11 Jun 2009, Jed Brown wrote:

> Tim Kroeger wrote:
>
>> I have no idea what happened here. I should note that I supplied
>> the option "-log_summary" using PetscOptionsSetValue rather than
>> directly on the command line, but that shouldn't be the point,
>> should it?
>
> This is not okay; that just registers the option in the options
> database, but the profiling functions don't check there (for
> performance reasons). The option either needs to be present at
> PetscInitialize, i.e. any of
>
> 1. Passed in on the command-line
> 2. Specified in the file that is the third argument of PetscInitialize
> 3. Put in $HOME/.petscrc
> 4. Put in $PWD/.petscrc
>
> OR, if you must, call PetscLogBegin() to start logging and make sure
> that -log_summary is in the options database by the time
> PetscFinalize is called.

Ah, I didn't know that. I will call PetscLogBegin() directly and
restart my application. This will postpone the logfiles until Monday
(except for a small chance that I might go to work on the weekend).

>> When I watch the output while the application is running, it feels
>> as if it is "most of the time" either inside assemble/solve of one
>> of the implicit parts or inside EquationSystems::reinit().
>
> According to the libmesh profiling, the sum of these steps is only
> half the runtime. It should be essentially everything (especially
> since function evaluation is not a separate step). But, it appears
> that the final profiling output for, say, assemble(), does not
> include sub-events (so there is no way for us to know the total
> amount of time spent in "assembly" without defining PerfLog objects
> for the various stages). This is different than PETSc's convention,
> so keep that in mind.

Yes, that's the point. I think this was different in earlier times,
and they changed it then, because in some cases it's more useful the
way it is done now. Unfortunately not for all. I've now manipulated
the PerfLog class such that it does *both*.

Roy: I have attached a patch concerning this point. What do you think
about it? I would appreciate it if it could be committed. You see, it
doesn't remove functionality; however, I understand that it makes the
output of PerfLog wider by 24 characters and thus possibly harder to
read.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
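For reference, the kind of sub-event instrumentation Jed refers to
looks roughly like this with libMesh's PerfLog (a sketch assuming the
push/pop interface; the labels and function name are illustrative):

    #include "perf_log.h"

    void assemble_my_system ()
    {
      // Static so the totals accumulate over all calls and are
      // printed when the object is destroyed at program exit.
      static PerfLog perf_log ("MySystem assembly");

      perf_log.push ("get-params");
      // ... evaluate old solutions / material parameters at the
      //     quadrature points ...
      perf_log.pop ("get-params");

      perf_log.push ("matrix-fill");
      // ... accumulate the local element contributions ...
      perf_log.pop ("matrix-fill");
    }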
From: Tim K. <tim...@ce...> - 2009-06-11 14:16:54
Attachments:
patch
Sorry, forgot the patch. Here it is.

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Tim K. <tim...@ce...> - 2009-06-11 14:32:26
Dear Roy,

On Thu, 11 Jun 2009, Roy Stogner wrote:

> On Thu, 11 Jun 2009, Jed Brown wrote:
>
>> Tim Kroeger wrote:
>>
>>> I see that this contradicts the fact that over 90 minutes are
>>> spent inside inverse_map(). I have currently no idea where else I
>>> am using that.
>>
>> It's also called in assembly somewhere and seems to be a
>> significant bottleneck. Roy et al., could you remind me why this
>> is required?
>
> It's used in assembly of DG systems - when taking jump terms on
> element sides we take an inverse map of the physical coordinates
> found on one element to get to the master coordinates on the
> neighbor element. Likewise for CG systems using jump error
> estimators.

Okay, I'm not using those systems. (Actually, I don't even know what
they do. Sometimes I think I should use them...)

> It's used in elem->contains_point(), which a PointLocator or
> MeshFunction would be hitting a lot.

I do that after the computation (as described before), but that's only
35 minutes total time.

> We're using it unnecessarily and dangerously in
> find_point_neighbors; that won't be an issue for Tim, but I ought to
> fix that before I start heavily using patch recovery estimators
> again...

I can't say anything about this since it seems to be some libMesh
internal thing. Well, Roy seems to know that I don't use that, so it's
fine. (-:

> And it's used a lot in System::project_vector, which is probably
> where Tim's code is hitting it. I'm still stunned that it would be
> taking up 90 minutes, though! On his affine hex meshes that call
> ought to be converging in one step after a couple dozen flops...

Hmm. I've now removed that part of the code in my application that
uses the MeshFunction after the main computation (see above). The new
log outputs should give us more insight. In particular, we will see
exactly how much time is spent inside those inverse_map() calls that
are performed in System::project_vector().

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Jed B. <je...@59...> - 2009-06-11 14:49:09
Attachments:
signature.asc
Tim Kroeger wrote:
> Dear Roy,
>
> On Thu, 11 Jun 2009, Roy Stogner wrote:
>
>> On Thu, 11 Jun 2009, Jed Brown wrote:
>>
>>> Tim Kroeger wrote:
>>>
>>>> I see that this contradicts the fact that over 90 minutes are
>>>> spent inside inverse_map(). I have currently no idea where else
>>>> I am using that.
>>>
>>> It's also called in assembly somewhere and seems to be a
>>> significant bottleneck. Roy et al., could you remind me why this
>>> is required?
>>
>> It's used in assembly of DG systems - when taking jump terms on
>> element sides we take an inverse map of the physical coordinates
>> found on one element to get to the master coordinates on the
>> neighbor element. Likewise for CG systems using jump error
>> estimators.
>
> Okay, I'm not using those systems. (Actually, I don't even know
> what they do. Sometimes I think I should use them...)
>
>> It's used in elem->contains_point(), which a PointLocator or
>> MeshFunction would be hitting a lot.
>
> I do that after the computation (as described before), but that's
> only 35 minutes total time.
>
>> We're using it unnecessarily and dangerously in
>> find_point_neighbors; that won't be an issue for Tim, but I ought
>> to fix that before I start heavily using patch recovery estimators
>> again...
>
> I can't say anything about this since it seems to be some libMesh
> internal thing. Well, Roy seems to know that I don't use that, so
> it's fine. (-:
>
>> And it's used a lot in System::project_vector, which is probably
>> where Tim's code is hitting it. I'm still stunned that it would be
>> taking up 90 minutes, though! On his affine hex meshes that call
>> ought to be converging in one step after a couple dozen flops...
>
> Hmm.

It looks like it is called for every quadrature point, probably the
most expensive part of FE::reinit.

(gdb) bt
#0  FE<3u, (libMeshEnums::FEFamily)0>::inverse_map (elem=0x20f00a0,
    physical_point=@0x255b820, tolerance=9.9999999999999995e-07,
    secure=true) at src/fe/fe_map.C:635
#1  0x00007f48c7ac893e in FE<3u, (libMeshEnums::FEFamily)0>::inverse_map
    (elem=0x20f00a0, physical_points=@0x252dbf0,
    reference_points=@0x7fffd0b06bf0, tolerance=9.9999999999999995e-07,
    secure=true) at src/fe/fe_map.C:985
#2  0x00007f48c79e9fcf in FE<3u, (libMeshEnums::FEFamily)0>::reinit
    (this=0x252dbe0, elem=0x20f00a0, s=0,
    tolerance=9.9999999999999995e-07) at src/fe/fe_boundary.C:135
#3  0x000000000044717f in assemble_poisson (es=@0x7fffd0b088f0,
    system_name=@0x215d578) at ex4.C:521
#4  0x00007f48c8229d6f in System::user_assembly (this=0x215d520)
    at src/solvers/system.C:1054
#5  0x00007f48c8223661 in System::assemble (this=0x215d520)
    at src/solvers/system.C:339
#6  0x00007f48c820b824 in ImplicitSystem::assemble (this=0x215d520)
    at src/solvers/implicit_system.C:163
#7  0x00007f48c821072b in LinearImplicitSystem::solve (this=0x215d520)
    at src/solvers/linear_implicit_system.C:82
#8  0x000000000044560a in main (argc=5, argv=0x7fffd0b08cf8) at ex4.C:239

But this is not dependent on whether ghosted vectors are used, so I'm
really curious to see that output.

Jed
From: Tim K. <tim...@ce...> - 2009-07-03 06:36:20
Dear Roy,

On Thu, 2 Jul 2009, Roy Stogner wrote:

> Sorry to get back to you so late on this. I think my subconscious
> was stalling, because I can still hardly write the necessary reply
> without wincing:
>
> Can you run all those again, in opt mode?

Seems as if you are right. I will do this, but it will of course take
some time. Luckily enough, I didn't carry on working on my application
in the meantime.

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
From: Roy S. <roy...@ic...> - 2009-06-11 15:35:02
On Thu, 11 Jun 2009, Jed Brown wrote:

> It looks like it is called for every quadrature point, probably the
> most expensive part of FE::reinit.

But look at which overloaded function is getting called:

> #2 0x00007f48c79e9fcf in FE<3u, (libMeshEnums::FEFamily)0>::reinit
>    (this=0x252dbe0, elem=0x20f00a0, s=0,
>    tolerance=9.9999999999999995e-07) at src/fe/fe_boundary.C:135

That's every quadrature point on reinit(elem, side), not on the
interior reinit(elem). The inverse map gets used on sides to transform
a quadrature rule on a d-1 dimensional side element into coordinates on
the d-dimensional master element. That may get expensive for DG/jump
calculations where every internal side is being integrated on, but I'd
be quite surprised if it was a major expense for a CG assembly where
the only side terms are on the domain boundary.
---
Roy
From: Jed B. <je...@59...> - 2009-06-11 15:57:48
Attachments:
signature.asc
log.ex4.n2
Roy Stogner wrote:
> On Thu, 11 Jun 2009, Jed Brown wrote:
>
>> It looks like it is called for every quadrature point, probably the
>> most expensive part of FE::reinit.
>
> But look at which overloaded function is getting called:
>
>> #2 0x00007f48c79e9fcf in FE<3u, (libMeshEnums::FEFamily)0>::reinit
>>    (this=0x252dbe0, elem=0x20f00a0, s=0,
>>    tolerance=9.9999999999999995e-07) at src/fe/fe_boundary.C:135
>
> That's every quadrature point on reinit(elem, side), not on the
> interior reinit(elem). The inverse map gets used on sides to
> transform a quadrature rule on a d-1 dimensional side element into
> coordinates on the d-dimensional master element. That may get
> expensive for DG/jump calculations where every internal side is
> being integrated on, but I'd be quite surprised if it was a major
> expense for a CG assembly where the only side terms are on the
> domain boundary.

Run ex4 and look at the profiling output. An example run is attached.
This shows 1 second total spent in element setup with almost half of
that in inverse_map. It's remarkably expensive, and called multiple
times (not an even multiple) per boundary face. That mesh has 6*32^2 =
6144 boundary faces; inverse_map is called more than 4 times this much
(in one assembly). It still only amounts to 10% of assembly time, so
this isn't an issue anywhere near the scale Tim is running into.

Note that inverse_map could easily be a major bottleneck for CG in
complex geometry, say bone structure analysis.

In any case, I doubt this has anything to do with ghosted vectors.

Jed
From: Roy S. <roy...@ic...> - 2009-06-11 19:09:55
On Thu, 11 Jun 2009, Jed Brown wrote:

>> That's every quadrature point on reinit(elem, side), not on the
>> interior reinit(elem). The inverse map gets used on sides to
>> transform a quadrature rule on a d-1 dimensional side element into
>> coordinates on the d-dimensional master element. That may get
>> expensive for DG/jump calculations where every internal side is
>> being integrated on, but I'd be quite surprised if it was a major
>> expense for a CG assembly where the only side terms are on the
>> domain boundary.
>
> Run ex4 and look at the profiling output. An example run is
> attached. This shows 1 second total spent in element setup with
> almost half of that in inverse_map. It's remarkably expensive, and
> called multiple times (not an even multiple) per boundary face.
> That mesh has 6*32^2 = 6144 boundary faces; inverse_map is called
> more than 4 times this much (in one assembly).

It's called once per boundary quadrature point (note that the
vectorized inverse_map() just calls the one-point inverse_map() in a
loop) - that's 9 times per boundary face by default in ex4, I think,
which seems to square with your observations of ~4.5 local calls per
global boundary face when running on 2 processors.

> It still only amounts to 10% of assembly time, so this isn't an
> issue anywhere near the scale Tim is running into.

No. Tim's also running some larger scale problems, where any
inefficiency on the boundary will scale up more slowly than the rest of
the assembly.

> Note that inverse_map could easily be a major bottleneck for CG in
> complex geometry, say bone structure analysis.

Possibly. But it's been years since Ben decided that using inverse_map
for boundary quadrature was a good way to optimize for programmer time
at the expense of CPU time; IIRC Lorenzo just recently became the first
person to really disagree.

> In any case, I doubt this has anything to do with ghosted vectors.

No.
---
Roy
From: Tim K. <tim...@ce...> - 2009-06-12 09:52:34
Attachments:
log.ghost-8-4-2
Dear Jed and Roy,

Please find attached now the result for 8 cores on 4 nodes. Ghosted
was still enabled. I will now start the non-ghosted equivalent to
this.

Roy: Did you have a look at the patch that I sent you yesterday?

Best Regards,

Tim

--
Dr. Tim Kroeger        tim...@me...        Phone +49-421-218-7710
                       tim...@ce...        Fax   +49-421-218-4236
Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany