Screenshot instructions:
Windows
Mac
Red Hat Linux
Ubuntu
Click URL instructions:
Rightclick on ad, choose "Copy Link", then paste here →
(This may not be possible with some types of ads)
From: David Fuentes <fuentesdt@gm...>  20100624 13:56:01

Hello, My initial work with libMesh was based on the ex13 approach to solving my nonlinear system of equations. I really like the FEMSystem design and am thinking of switching to the FEMSystem approach seen in ex18 with petsc's nonlinear solvers. Have you seen any significant performance differences between matrix vector assembly in the ex13 and ex18 type approaches ? Most computationally intensive portions seen in profiles ? Seems that there may be a few more functions calls in FEMSystem approach but may be irrelevant ? thank you, David 
From: Roy Stogner <roystgnr@ic...>  20100624 17:39:28

On Thu, 24 Jun 2010, David Fuentes wrote: > My initial work with libMesh was based on the ex13 approach to solving > my nonlinear system of equations. I really like the FEMSystem design and > am thinking of switching to the FEMSystem approach seen in ex18 with > petsc's nonlinear solvers. Have you seen any significant performance > differences between matrix vector assembly in the ex13 and ex18 type > approaches ? Most computationally intensive portions seen in profiles ? > Seems that there may be a few more functions calls in FEMSystem approach > but may be irrelevant ? It was completely irrelevant for the problems where I cared enough about performance to repeatedly oprofile everything... once you're up to a dozen or two DoFs per cell, calculating the element jacobian dwarfs a few extra virtual function calls along the way. I'm told by another ICES person who tried some optimization work that the situation for ex18 is similar. I suspect that if you're doing a few lowapproximationorder variables per System (whether that's because your nonsmooth solutions don't have many variables or because you're doing things decoupled) then FEMSystem might be noticeably slower. The FEMSystem numeric jacobians are definitely slower than they need to be. That was partly intentional design decision (I only use them for prototyping and verifying analytic jacobians, and in both cases central differencing is worth the extra evaluations) and partly laziness (I'm not going to code faster alternatives that I wouldn't use myself). You mostly run stuff realtime, though, right? In that case you wouldn't even be using efficient FDM jacobians either. I wish there was a way to make these decisions irrelevant. I'd like to be able to tell code that a function is virtual, have it usually behave that way for flexibility, but then compile it with some slightlychanged header that specifies a single concrete subclass. E.g. for most of our users for most of the time it would be reasonable to tell the code "every NumericVector is a PetscVector". For some of our users some of the time it would be worth going to the effort of building a whole library as "every FEMSystem is a MySpecificSystem". I don't know how to get C++ to do that, though... and it probably wouldn't be worth it just to get rid of virtual function lookups; you'd need to compile with intercompilationunit inlining as well.  Roy 
From: David Fuentes <fuentesdt@gm...>  20100624 18:23:40

Thank you Roy, I typically use 12, p=1, variables per system. I'll try it out. df On Thu, 24 Jun 2010, Roy Stogner wrote: > > On Thu, 24 Jun 2010, David Fuentes wrote: > >> My initial work with libMesh was based on the ex13 approach to solving >> my nonlinear system of equations. I really like the FEMSystem design and >> am thinking of switching to the FEMSystem approach seen in ex18 with >> petsc's nonlinear solvers. Have you seen any significant performance >> differences between matrix vector assembly in the ex13 and ex18 type >> approaches ? Most computationally intensive portions seen in profiles ? >> Seems that there may be a few more functions calls in FEMSystem approach >> but may be irrelevant ? > > It was completely irrelevant for the problems where I cared enough > about performance to repeatedly oprofile everything... once you're up > to a dozen or two DoFs per cell, calculating the element jacobian > dwarfs a few extra virtual function calls along the way. I'm told by > another ICES person who tried some optimization work that the > situation for ex18 is similar. > > I suspect that if you're doing a few lowapproximationorder variables > per System (whether that's because your nonsmooth solutions don't > have many variables or because you're doing things decoupled) then > FEMSystem might be noticeably slower. > > The FEMSystem numeric jacobians are definitely slower than they need > to be. That was partly intentional design decision (I only use them > for prototyping and verifying analytic jacobians, and in both cases > central differencing is worth the extra evaluations) and partly > laziness (I'm not going to code faster alternatives that I wouldn't > use myself). You mostly run stuff realtime, though, right? In that > case you wouldn't even be using efficient FDM jacobians either. > > > I wish there was a way to make these decisions irrelevant. I'd like > to be able to tell code that a function is virtual, have it usually > behave that way for flexibility, but then compile it with some > slightlychanged header that specifies a single concrete subclass. > E.g. for most of our users for most of the time it would be reasonable to > tell the code "every NumericVector is a PetscVector". For some of our > users some of the time it would be worth going to the effort of > building a whole library as "every FEMSystem is a MySpecificSystem". > I don't know how to get C++ to do that, though... and it probably > wouldn't be worth it just to get rid of virtual function lookups; > you'd need to compile with intercompilationunit inlining as well. >  > Roy > 
From: Roy Stogner <roystgnr@ic...>  20100624 18:29:58

On Thu, 24 Jun 2010, David Fuentes wrote: > I typically use 12, p=1, variables per system. That is just about the worstcase scenario. ;) > I'll try it out. Thanks! Let me know what the profiling says. There are a couple ugly template tricks we could experiment with if the virtual function calls actually do turn out to be a problem.  Roy 
From: David Fuentes <fuentesdt@gm...>  20100624 18:51:41

cool thanks. I'll keep you posted. I'll move my current assembly routines and overwrite the FEMSystem::assembly in a derived class to compare. df On Thu, 24 Jun 2010, Roy Stogner wrote: > > On Thu, 24 Jun 2010, David Fuentes wrote: > >> I typically use 12, p=1, variables per system. > > That is just about the worstcase scenario. ;) > >> I'll try it out. > > Thanks! Let me know what the profiling says. There are a couple ugly > template tricks we could experiment with if the virtual function calls > actually do turn out to be a problem. >  > Roy > 
From: Jed Brown <jed@59...>  20100624 18:45:29

On Thu, 24 Jun 2010 13:23:37 0500 (CDT), David Fuentes <fuentesdt@...> wrote: > Thank you Roy, > > > I typically use 12, p=1, variables per system. In 2D or 3D? Virtual calls compile to "mov, mov, jmp", the indirect call typically costs 5 or 6 cycles (because data dependence interferes with OoO), compared to 2 (partly hidden by OoO) for a static call. Evaluating a 3D basis function still costs enough that the indirect call should not be a very serious hit. It probably looks nontrivial compared to evaluating 2D basis functions. None of this matters if your quadrature involves nontrivial physics or the solver isn't very good. Jed 
From: David Fuentes <fuentesdt@gm...>  20100624 18:59:29

I typically use Petsc Nonlinear Solvers in 3D and my bottle neck is typically in the assembly with Petsc SNESSolve taking ~10% of the time about ~50% in the jacobian, ~30% in the residual, and the rest is distributed. On Thu, 24 Jun 2010, Jed Brown wrote: > On Thu, 24 Jun 2010 13:23:37 0500 (CDT), David Fuentes <fuentesdt@...> wrote: >> Thank you Roy, >> >> >> I typically use 12, p=1, variables per system. > > In 2D or 3D? Virtual calls compile to "mov, mov, jmp", the indirect > call typically costs 5 or 6 cycles (because data dependence interferes > with OoO), compared to 2 (partly hidden by OoO) for a static call. > Evaluating a 3D basis function still costs enough that the indirect call > should not be a very serious hit. It probably looks nontrivial compared > to evaluating 2D basis functions. None of this matters if your > quadrature involves nontrivial physics or the solver isn't very good. > > Jed > 
From: Jed Brown <jed@59...>  20100624 19:13:31

On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes <fuentesdt@...> wrote: > > I typically use Petsc Nonlinear Solvers in 3D and my bottle neck is > typically in the assembly with Petsc SNESSolve taking ~10% of the time > about ~50% in the jacobian, ~30% in the residual, and the > rest is distributed. So the linear solves are really easy. Are you caching a lot of stuff in the residual evaluation, it's not normal for it to be so much compared to Jacobian assembly unless you don't use an analytic Jacobian (e.g. snes_mf_operator). If residual evaluation can be made cheaper or you are already only assembling a preconditioner, you could probably lag it (snes_mf_operator snes_lag_jacobian) to amortize assembly costs. This of course won't change the cost of residuals, but it's usually a good sign (to those of us working on solvers) if you spend most of your time in "physics". Jed 
From: David Fuentes <fuentesdt@gm...>  20100624 19:28:10

On 6/24/10, Jed Brown <jed@...> wrote: > On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes > <fuentesdt@...> wrote: >> >> I typically use Petsc Nonlinear Solvers in 3D and my bottle neck is >> typically in the assembly with Petsc SNESSolve taking ~10% of the time >> about ~50% in the jacobian, ~30% in the residual, and the >> rest is distributed. > > So the linear solves are really easy. Are you caching a lot of stuff in > the residual evaluation, it's not normal for it to be so much compared > to Jacobian assembly unless you don't use an analytic Jacobian > (e.g. snes_mf_operator). Not sure, what do you typically cache ? > If residual evaluation can be made cheaper or > you are already only assembling a preconditioner, you could probably lag > it (snes_mf_operator snes_lag_jacobian) to amortize assembly costs. > This of course won't change the cost of residuals, but it's usually a > good sign (to those of us working on solvers) if you spend most of your > time in "physics". > > Jed > 
From: Jed Brown <jed@59...>  20100624 19:43:26

On Thu, 24 Jun 2010 14:28:03 0500, David Fuentes <fuentesdt@...> wrote: > On 6/24/10, Jed Brown <jed@...> wrote: > > On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes > > <fuentesdt@...> wrote: > >> > >> I typically use Petsc Nonlinear Solvers in 3D and my bottle neck is > >> typically in the assembly with Petsc SNESSolve taking ~10% of the time > >> about ~50% in the jacobian, ~30% in the residual, and the > >> rest is distributed. > > > > So the linear solves are really easy. Are you caching a lot of stuff in > > the residual evaluation, it's not normal for it to be so much compared > > to Jacobian assembly unless you don't use an analytic Jacobian > > (e.g. snes_mf_operator). > > Not sure, what do you typically cache ? I cache a local linearization at quadrature points, but that is for fast matrixfree Jacobian application of highorder operators. It doesn't pay off in terms of time or storage for Q1 or P1, even in 3D. But if you had e.g. an expensive constitutive relation involving lookup tables, and you weren't overly concerned about using the minimum possible memory, but didn't want to do the extra work to integrate and insert the element matrices (because there was a high chance of the line search shortening the step), then you might cache the local linearization even for lowestorder elements. It sounds like this is not the case. What does log_summary show? Are you doing a lot more function evaluations than Jacobian assemblies? It's surprising to me that they would cost almost the same amount per call, perhaps there is a hot spot somewhere in your residual evaluation. Jed 
From: David Fuentes <fuentesdt@gm...>  20100625 00:09:24

thanks jed. I can't seem to find a stored profile. I'd have to recreate one. But i'm thinking roughly twice as many functions evaluations as jacobian evaluations. On Thu, 24 Jun 2010, Jed Brown wrote: > On Thu, 24 Jun 2010 14:28:03 0500, David Fuentes <fuentesdt@...> wrote: >> On 6/24/10, Jed Brown <jed@...> wrote: >>> On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes >>> <fuentesdt@...> wrote: >>>> >>>> I typically use Petsc Nonlinear Solvers in 3D and my bottle neck is >>>> typically in the assembly with Petsc SNESSolve taking ~10% of the time >>>> about ~50% in the jacobian, ~30% in the residual, and the >>>> rest is distributed. >>> >>> So the linear solves are really easy. Are you caching a lot of stuff in >>> the residual evaluation, it's not normal for it to be so much compared >>> to Jacobian assembly unless you don't use an analytic Jacobian >>> (e.g. snes_mf_operator). >> >> Not sure, what do you typically cache ? > > I cache a local linearization at quadrature points, but that is for fast > matrixfree Jacobian application of highorder operators. It doesn't > pay off in terms of time or storage for Q1 or P1, even in 3D. But if > you had e.g. an expensive constitutive relation involving lookup tables, > and you weren't overly concerned about using the minimum possible > memory, but didn't want to do the extra work to integrate and insert the > element matrices (because there was a high chance of the line search > shortening the step), then you might cache the local linearization even > for lowestorder elements. It sounds like this is not the case. > > What does log_summary show? Are you doing a lot more function > evaluations than Jacobian assemblies? It's surprising to me that they > would cost almost the same amount per call, perhaps there is a hot spot > somewhere in your residual evaluation. > > Jed > 
From: Kirk, Benjamin (JSCEG311) <benjamin.kirk1@na...>  20100625 00:30:33

In my case I cache what I can, but still the behavior is as follows: 1) calculate residual 2) calculate jacobian. solve for update using (1) as rhs. 3) compute residual again and check against (1) In my case it is common to do this once at each time step  that is, solve the nonlinear problem very approximately. In this case, why bother with the second residual evaluation? Is (3) necessary? (Yes, Jed, I'm asking a question on the libMesh mailing list.) Thanks, Ben On Jun 24, 2010, at 7:09 PM, "David Fuentes" 9<fuentesdt@...> wrote: > > > thanks jed. I can't seem to find a stored profile. I'd have to > recreate > one. But i'm thinking roughly twice as many functions evaluations as > jacobian evaluations. > > > > > > On Thu, 24 Jun 2010, Jed Brown wrote: > >> On Thu, 24 Jun 2010 14:28:03 0500, David Fuentes <fuentesdt@... >> > wrote: >>> On 6/24/10, Jed Brown <jed@...> wrote: >>>> On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes >>>> <fuentesdt@...> wrote: >>>>> >>>>> I typically use Petsc Nonlinear Solvers in 3D and my bottle neck >>>>> is >>>>> typically in the assembly with Petsc SNESSolve taking ~10% of >>>>> the time >>>>> about ~50% in the jacobian, ~30% in the residual, and the >>>>> rest is distributed. >>>> >>>> So the linear solves are really easy. Are you caching a lot of >>>> stuff in >>>> the residual evaluation, it's not normal for it to be so much >>>> compared >>>> to Jacobian assembly unless you don't use an analytic Jacobian >>>> (e.g. snes_mf_operator). >>> >>> Not sure, what do you typically cache ? >> >> I cache a local linearization at quadrature points, but that is for >> fast >> matrixfree Jacobian application of highorder operators. It doesn't >> pay off in terms of time or storage for Q1 or P1, even in 3D. But if >> you had e.g. an expensive constitutive relation involving lookup >> tables, >> and you weren't overly concerned about using the minimum possible >> memory, but didn't want to do the extra work to integrate and >> insert the >> element matrices (because there was a high chance of the line search >> shortening the step), then you might cache the local linearization >> even >> for lowestorder elements. It sounds like this is not the case. >> >> What does log_summary show? Are you doing a lot more function >> evaluations than Jacobian assemblies? It's surprising to me that >> they >> would cost almost the same amount per call, perhaps there is a hot >> spot >> somewhere in your residual evaluation. >> >> Jed >> > >  >  >  >  > ThinkGeek and WIRED's GeekDad team up for the Ultimate > GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the > lucky parental unit. See the prize list and enter to win: > http://p.sf.net/sfu/thinkgeekpromo > _______________________________________________ > Libmeshusers mailing list > Libmeshusers@... > https://lists.sourceforge.net/lists/listinfo/libmeshusers 
From: Jed Brown <jed@59...>  20100625 09:13:10

On Thu, 24 Jun 2010 19:28:10 0500, "Kirk, Benjamin (JSCEG311)" <benjamin.kirk1@...> wrote: > In my case I cache what I can, but still the behavior is as follows: > > 1) calculate residual > 2) calculate jacobian. solve for update using (1) as rhs. > 3) compute residual again and check against (1) > > In my case it is common to do this once at each time step  that is, > solve the nonlinear problem very approximately. > > In this case, why bother with the second residual evaluation? Is (3) > necessary? It is only necessary if you want to know whether it converged (even to your weak tolerance). The default is of course to check, but you can snes_max_it 1 # Do a maximum of one linear solve snes_ls basicnonorms # Don't do a line search and don't even # compute the residual (3) snes_convergence_test skip # Skip the convergence test before # evaluating the Jacobian. SNES will # normally bail out early if it starts # with a sufficiently small residual. or with petscdev, snes_type ksponly. Note that linearly implicit methods may have significantly worse stability and accuracy properties, but hopefully you understand your system well enough to know how it behaves. Jed 
From: David Fuentes <fuentesdt@gm...>  20100625 14:09:43

Hi Ben, Could you please give an example of some of the things you typically cache in libMesh ? thanks, df On Thu, 24 Jun 2010, Kirk, Benjamin (JSCEG311) wrote: > In my case I cache what I can, but still the behavior is as follows: > > 1) calculate residual > 2) calculate jacobian. solve for update using (1) as rhs. > 3) compute residual again and check against (1) > > In my case it is common to do this once at each time step  that is, > solve the nonlinear problem very approximately. > > In this case, why bother with the second residual evaluation? Is (3) > necessary? > > (Yes, Jed, I'm asking a question on the libMesh mailing list.) > > Thanks, > > Ben > > On Jun 24, 2010, at 7:09 PM, "David Fuentes" 9<fuentesdt@...> > wrote: > >> >> >> thanks jed. I can't seem to find a stored profile. I'd have to >> recreate >> one. But i'm thinking roughly twice as many functions evaluations as >> jacobian evaluations. >> >> >> >> >> >> On Thu, 24 Jun 2010, Jed Brown wrote: >> >>> On Thu, 24 Jun 2010 14:28:03 0500, David Fuentes <fuentesdt@... >>>> wrote: >>>> On 6/24/10, Jed Brown <jed@...> wrote: >>>>> On Thu, 24 Jun 2010 13:59:21 0500 (CDT), David Fuentes >>>>> <fuentesdt@...> wrote: >>>>>> >>>>>> I typically use Petsc Nonlinear Solvers in 3D and my bottle neck >>>>>> is >>>>>> typically in the assembly with Petsc SNESSolve taking ~10% of >>>>>> the time >>>>>> about ~50% in the jacobian, ~30% in the residual, and the >>>>>> rest is distributed. >>>>> >>>>> So the linear solves are really easy. Are you caching a lot of >>>>> stuff in >>>>> the residual evaluation, it's not normal for it to be so much >>>>> compared >>>>> to Jacobian assembly unless you don't use an analytic Jacobian >>>>> (e.g. snes_mf_operator). >>>> >>>> Not sure, what do you typically cache ? >>> >>> I cache a local linearization at quadrature points, but that is for >>> fast >>> matrixfree Jacobian application of highorder operators. It doesn't >>> pay off in terms of time or storage for Q1 or P1, even in 3D. But if >>> you had e.g. an expensive constitutive relation involving lookup >>> tables, >>> and you weren't overly concerned about using the minimum possible >>> memory, but didn't want to do the extra work to integrate and >>> insert the >>> element matrices (because there was a high chance of the line search >>> shortening the step), then you might cache the local linearization >>> even >>> for lowestorder elements. It sounds like this is not the case. >>> >>> What does log_summary show? Are you doing a lot more function >>> evaluations than Jacobian assemblies? It's surprising to me that >>> they >>> would cost almost the same amount per call, perhaps there is a hot >>> spot >>> somewhere in your residual evaluation. >>> >>> Jed >>> >> >>  >>  >>  >>  >> ThinkGeek and WIRED's GeekDad team up for the Ultimate >> GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the >> lucky parental unit. See the prize list and enter to win: >> http://p.sf.net/sfu/thinkgeekpromo >> _______________________________________________ >> Libmeshusers mailing list >> Libmeshusers@... >> https://lists.sourceforge.net/lists/listinfo/libmeshusers > 