## [Libmesh-devel] Efficiency concerns

 [Libmesh-devel] Efficiency concerns From: Lei Shi - 2012-08-29 06:01:48
```
Dear All,

I have some concerns about efficiency after reading the example misc.ex5.
They may be due to my misunderstanding of libMesh or of the DG scheme, so
please help me clear them up.

In this example the Laplace equation is solved using DG with the interior
penalty method, so the main job is to assemble the matrix from the element
volume integrals and the surface jump integrals. This is done element by
element with the help of the element iterator. I know all the elements are
organized as a quad-forest and only the active ones need to be looped over.
Does this mean that to reach the next active element,
const_element_iterator++ has to walk through the quad-forest, passing
several inactive parent elements, before getting the pointer to the next
active element? If so, isn't that too expensive a way to find an element
inside a solver?

The other thing is the computation of mapping metrics such as the Jacobian,
xi_x, xi_y, eta_x, eta_y, the face normals, etc. Are those metrics computed
several times using FE::FEMap? If the mesh is fixed in space, we only need
to compute them once and store them somewhere. The same applies to the
quadrature points in the physical domain and to the shape function values
and derivatives at those points in the reference domain. I think those are
computed in

    fe->reinit(elem)

Those values do not change during the calculation and should be computed
only once; they sit in the innermost loop and will eat a lot of CPU time.

Last one: computing the flux jump at flux points on the common face of two
cells at different h levels. The coordinates and normals of the flux
quadrature points on both sides of the face are computed several times if
we use a residual-based scheme, even though they do not change at each
refinement level. Is there any efficient way to handle this kind of
situation, i.e. flux jump calculation at a face with a hanging node? Or are
the examples for demonstration only? These efficiency problems get even
worse for a scheme that needs the residual Jacobian at each time step. Do
we need to use FEMap explicitly and store the metrics for each
element/face, so that we don't compute them more than once?

Thanks a lot for your time.

Sincerely Yours,
Lei Shi
----------
```
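Lei's proposal above (compute the fixed-mesh mapping metrics once, store them per element, and reuse them in every assembly sweep) can be sketched in plain C++. Everything here is a hypothetical illustration: the struct, the function names, and the affine-triangle setup are stand-ins, not libMesh's FEMap API.

```cpp
#include <unordered_map>

// Hypothetical per-element cache of mapping metrics. For an affine
// triangle the Jacobian determinant is constant over the element, so
// it can be computed once and reused at every residual evaluation.
struct Point { double x, y; };

struct MappingCache {
    double jac_det;  // |J| of the reference->physical map
};

// Jacobian determinant of the affine map from the reference triangle
// (0,0)-(1,0)-(0,1) to the physical triangle (v0, v1, v2).
double triangle_jac_det(const Point& v0, const Point& v1, const Point& v2) {
    return (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
}

// Fill the cache lazily, e.g. on the first assembly sweep after mesh
// construction; later sweeps hit the map while the geometry is fixed.
double cached_jac_det(std::unordered_map<int, MappingCache>& cache,
                      int elem_id,
                      const Point& v0, const Point& v1, const Point& v2) {
    auto it = cache.find(elem_id);
    if (it == cache.end())
        it = cache.emplace(elem_id,
                           MappingCache{triangle_jac_det(v0, v1, v2)}).first;
    return it->second.jac_det;
}
```

Whether this kind of cache actually pays off depends on the memory-versus-recompute tradeoff discussed in the replies below; for a metric this cheap, recomputing may well win.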

 Re: [Libmesh-devel] [Libmesh-users] Efficiency concerns From: Lei Shi - 2012-08-29 20:29:56
```
Lorenzo, Roy, Derek and Jed,

Thanks a lot for your comments. To be honest, I hadn't thought about cache
misses before. I know they are very important for scientific code; I just
don't know how to reason about them while writing it. I will do some
experiments to compare the two approaches: recompute every cheap quantity
when needed, versus store everything that is needed.

The other question is about the element iterator. When we write ++iter,
what happens behind that operator? Does the iterator have to climb the
quad-forest and descend to the lowest child? Could we instead construct a
map of all active elements (elementIndex -> Elem*) and loop over that map
or vector directly?

I also know there is no anisotropic refinement in the library. How much
effort would it take to add it? I mean, what percentage of the current code
is based on the isotropic assumption?

Thanks a lot. I really appreciate the opportunity to discuss these topics
with you.

Sincerely Yours,
Lei Shi
----------

On Wed, Aug 29, 2012 at 12:52 PM, Jed Brown wrote:
> I'll also comment that aggressive caching is pessimal from a
> modern-hardware and energy perspective. Memory and memory bandwidth take
> an increasing part of the power and acquisition budget. On modern
> hardware, for operations that can be vectorized, you should usually plan
> to recompute everything that takes less than 50 flops per scalar value.
> If you have a computation that takes more than 50 flops to recompute, you
> may want to store it, but be aware that reloading it may displace more
> useful cache lines, and if you aren't careful about cache locality (e.g.
> if the element is later visited by a thread running on a different
> socket), performance results may be very bad and/or not reproducible.
>
> Lei, I suggest being very conservative when selecting what may be worth
> caching. Also, depending on your application, there may be much larger
> gains to be had by looking elsewhere.
>
> On Wed, Aug 29, 2012 at 12:04 PM, Derek Gaston wrote:
>> With MOOSE we have the full gamut of options. The default is to
>> recompute everything. We also have the option to cache one FE reinit in
>> the case where you have a regular grid. We also have the option of
>> caching EVERY FE reinit for every element so that you can reuse them
>> until the mesh changes.
>>
>> I'll say that that last one is not used very often... it eats up SO
>> much damn memory... especially in the cases where it would be useful
>> (like on higher-order elements in 3D). So it's kind of a self-defeating
>> optimization...
>>
>> Derek
>>
>> Sent from my iPhone
>>
>> On Aug 29, 2012, at 5:56 PM, Roy Stogner wrote:
>>>
>>> On Wed, 29 Aug 2012, Lorenzo Alessio Botti wrote:
>>>
>>>> You can always store quantities that you need several times, but this
>>>> might eat up a lot of memory. Accessing memory might be expensive and
>>>> sometimes it might even be faster to recompute.
>>>
>>> This can absolutely be true - on modern processors some problems end
>>> up, even in the assembly, bottlenecked by memory bandwidth rather than
>>> CPU speed. At that point anything you can compute based on data that's
>>> already in the processor cache is "free".
>>>
>>>> The libMesh approach is to recompute everything.
>>>
>>> That's the libMesh examples approach, anyway, but that's intended to
>>> keep the examples simple more than anything else, I thought. When
>>> profiling real applications I've found that finding ways to cache
>>> transcendental function evaluations (e.g. by doing nodal quadrature or
>>> reinterpolation of such terms) can give a decent savings in runtime.
>>>
>>> Of course the efficiency discussion is influenced by the fact that all
>>> the main libMesh developers are profiling based on our own app codes:
>>> implicit solves with intricate constitutive models. If you're doing an
>>> explicit calculation then it probably makes sense to cache even
>>> mappings and shape functions; if you're solving a linear problem then
>>> you could do much better with Rob Kirby's element matrix
>>> transformation tricks than with plain quadrature; etc.
>>> ---
>>> Roy
```
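Derek's "caches EVERY FE reinit" warning is easy to quantify with a back-of-the-envelope estimate. The counts below (a 27-node hexahedron with a 27-point quadrature rule, storing shape values, gradients, JxW, and physical point locations) are assumptions for the sketch, not measured libMesh numbers.

```cpp
#include <cstdint>

// Rough memory footprint of caching one FE reinit per element:
// for each (shape function, quadrature point) pair we store the value
// plus 3 gradient components, and per quadrature point we store JxW
// plus the 3 physical coordinates. All counts are illustrative.
std::uint64_t bytes_per_element(int n_shape, int n_qp) {
    const int doubles_per_pair = 1 /* phi */ + 3 /* dphi components */;
    std::uint64_t doubles =
        std::uint64_t(n_shape) * std::uint64_t(n_qp) * doubles_per_pair;
    doubles += std::uint64_t(n_qp) * (1 /* JxW */ + 3 /* xyz */);
    return doubles * sizeof(double);
}
```

Under these assumptions a HEX27 element with 27 quadrature points costs about 24 KB of cached data, so a million-element mesh would need on the order of 24 GB, which is Derek's "self-defeating optimization" in numbers.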
 Re: [Libmesh-devel] [Libmesh-users] Efficiency concerns From: Roy Stogner - 2012-08-29 21:45:16
```
On Wed, 29 Aug 2012, Lei Shi wrote:

> I will do some experiments to compare the results based on computing
> every cheap thing when needed and store everything needed.

Let us know the results! Speculating about optimization ideas is a poor
substitute for profiling optimization ideas.

> The other question is the element iterator. When we write ++iter, what
> happens behind this operator? Does the current element need to climb
> along the quad forest and go down to the lowest child? Can we just
> construct a map with all active elementIndex -> element * mappings and
> directly use this map or vector to loop the elements?

In a SerialMesh, ++iter just increments through an STL vector,
potentially testing predicates (e.g. whether an element is active) along
the way. ParallelMesh is similar but with an STL map under the hood.

Searches through a quadtree/octree only happen when using a PointLocator
object, and most library code doesn't use those. The big exception is
with PeriodicBoundary support; Derek or the other INL people could tell
you whether or not that turns out to be a performance hit.

> I know there is no anisotropic refinement in the lib. How much effort
> would I need to put in to add it? I mean, what percentage of the
> current code is based on the isotropic assumption?

For starters, "egrep '(->|\.)level\(' */*/*.{C,h} | wc" shows about a
hundred lines of code, the majority of which implicitly assume that the
idea of an isotropic "refinement level" makes sense. Fixing that to
enable non-conforming anisotropic refinement would be a serious effort.

Conforming anisotropic refinement might not be too bad. Add the right
edge splitting algorithms for triangles and tets, don't bother creating
a tree structure, and you could get solutions without changing anything
else. You'd want to create a new project_vector() codepath to get
efficient solutions.
---
Roy
```
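Roy's description of SerialMesh iteration (a flat STL vector plus a predicate test, with no tree traversal) can be mimicked in a few lines. `Elem` and `ActiveIter` below are minimal stand-ins, not libMesh types.

```cpp
#include <cstddef>
#include <vector>

// Toy element: just an id and an activity flag. In a refined mesh the
// vector holds active and inactive (coarsened-away or parent) elements
// interleaved in storage order.
struct Elem {
    int id;
    bool active;
};

// An "active iterator" is an index into the flat vector that skips any
// entry failing the predicate. operator++ is a linear scan, not a walk
// up and down a quad-forest.
class ActiveIter {
public:
    ActiveIter(const std::vector<Elem>& v, std::size_t i)
        : v_(v), i_(i) { skip(); }
    ActiveIter& operator++() { ++i_; skip(); return *this; }
    const Elem& operator*() const { return v_[i_]; }
    bool done() const { return i_ >= v_.size(); }
private:
    void skip() { while (i_ < v_.size() && !v_[i_].active) ++i_; }
    const std::vector<Elem>& v_;
    std::size_t i_;
};

// Count active elements the way an assembly loop would visit them.
int count_active(const std::vector<Elem>& v) {
    int n = 0;
    for (ActiveIter it(v, 0); !it.done(); ++it) ++n;
    return n;
}
```

Since the skip is a cheap flag test on contiguous storage, the prebuilt "active elementIndex -> Elem*" map Lei proposes would mostly save branch tests, at the cost of keeping a second container in sync across refinements.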
 Re: [Libmesh-devel] [Libmesh-users] Efficiency concerns From: Lei Shi - 2012-08-30 23:19:27
```
Dear Roy,

Thanks a lot for your reply. I will try to write a basic solver first and
test those approaches; then I can discuss what I find with you.

Sincerely Yours,
Lei Shi
----------

On Wed, Aug 29, 2012 at 4:45 PM, Roy Stogner wrote:
>
> On Wed, 29 Aug 2012, Lei Shi wrote:
>
>> I will do some experiments to compare the results based on computing
>> every cheap thing when needed and store everything needed.
>
> Let us know the results! Speculating about optimization ideas is a poor
> substitute for profiling optimization ideas.
>
>> The other question is the element iterator. When we write ++iter, what
>> happens behind this operator? Does the current element need to climb
>> along the quad forest and go down to the lowest child? Can we just
>> construct a map with all active elementIndex -> element * mappings and
>> directly use this map or vector to loop the elements?
>
> In a SerialMesh, ++iter just increments through an STL vector,
> potentially testing predicates (e.g. whether an element is active)
> along the way. ParallelMesh is similar but with an STL map under the
> hood.
>
> Searches through a quadtree/octree only happen when using a
> PointLocator object, and most library code doesn't use those. The big
> exception is with PeriodicBoundary support; Derek or the other INL
> people could tell you whether or not that turns out to be a performance
> hit.
>
>> I know there is no anisotropic refinement in the lib. How much effort
>> would I need to put in to add it? I mean, what percentage of the
>> current code is based on the isotropic assumption?
>
> For starters, "egrep '(->|\.)level\(' */*/*.{C,h} | wc" shows about a
> hundred lines of code, the majority of which implicitly assume that the
> idea of an isotropic "refinement level" makes sense. Fixing that to
> enable non-conforming anisotropic refinement would be a serious effort.
>
> Conforming anisotropic refinement might not be too bad. Add the right
> edge splitting algorithms for triangles and tets, don't bother creating
> a tree structure, and you could get solutions without changing anything
> else. You'd want to create a new project_vector() codepath to get
> efficient solutions.
> ---
> Roy
```
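The "edge splitting algorithms for triangles" Roy mentions can be illustrated with a longest-edge bisection sketch in plain coordinates. This is a hypothetical toy with no connection to libMesh's refinement machinery: it splits one triangle across its longest edge into two children sharing the new midpoint node.

```cpp
#include <array>
#include <cmath>

struct P2 { double x, y; };
using Tri = std::array<P2, 3>;

// Unsigned area via the cross product of two edge vectors.
double area(const Tri& t) {
    return 0.5 * std::fabs((t[1].x - t[0].x) * (t[2].y - t[0].y) -
                           (t[2].x - t[0].x) * (t[1].y - t[0].y));
}

double len2(const P2& a, const P2& b) {
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

// Bisect across the longest edge; both children keep the opposite
// vertex, so the split is conforming as long as the neighbor sharing
// that edge is split at the same midpoint.
std::array<Tri, 2> bisect_longest_edge(const Tri& t) {
    int e = 0;  // edge (v[e], v[e+1])
    double best = len2(t[0], t[1]);
    const double l1 = len2(t[1], t[2]);
    const double l2 = len2(t[2], t[0]);
    if (l1 > best) { best = l1; e = 1; }
    if (l2 > best) { e = 2; }
    const P2 a = t[e];
    const P2 b = t[(e + 1) % 3];
    const P2 c = t[(e + 2) % 3];  // opposite vertex
    const P2 mid{0.5 * (a.x + b.x), 0.5 * (a.y + b.y)};
    return {{Tri{{a, mid, c}}, Tri{{mid, b, c}}}};
}
```

Midpoint bisection always produces two children of equal area, which is one reason the conforming route avoids hanging-node constraints entirely.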