As long as we're on this subject let me throw a few things out there....

I've recently spent a good amount of time optimizing our variable value / gradient / second derivative evaluation code.  I've been trying to get as much of it to vectorize as possible using the Intel compiler.  This is important to us because we have a _lot_ of variables and a lot of quadrature points (because we sometimes use high order FEs).  When I say a lot... I just had one of our users come in and say that he is going to be using 4000 variables soon.  Also, I just ran a simulation with 1,024 variables... with the whole simulation totaling over 100 Million DoFs.  When you have this many variables, just computing their values and gradients at the quadrature points becomes a significant burden...

What I've run into though is that there is no good way to get gradient evaluation vectorized because we are working with vectors of Gradient objects... meaning that the values themselves are not contiguous in memory.  So when we're doing something like this:

  for (unsigned int i=0; i < num_dofs; i++)
  {
    soln_local = current_solution(dof_indices[i]);

    for (unsigned int qp=0; qp < nqp; qp++)
      grad_u_qp[qp].add_scaled(grad_phi[i][qp], soln_local);
  }

that inner for loop can't be vectorized because those grad_phi[i][qp] objects aren't contiguous in memory.

I tried reading them out into a flat vector and then vectorizing that operation... and it did work (the actual multiply by the solution took only microseconds), but the time to read the gradients out into a flat vector and then put them back into Gradient objects outweighed the gain over the normal, unvectorized code (although not by as much as you would expect!).
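To make the tradeoff concrete, here's a minimal sketch of that gather / vectorized multiply-add / scatter experiment. The `Gradient` struct and `add_scaled_flat` name are stand-ins for illustration, not the real libMesh types:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for libMesh's Gradient: an array-of-structs
// layout, so components of successive gradients are interleaved in memory.
struct Gradient { double data[3]; };

// Gather grad_phi[i][qp] for one shape function i into a contiguous
// buffer, do the vectorizable multiply-add, then scatter back.
void add_scaled_flat(const std::vector<Gradient> & grad_phi_i, // indexed by qp
                     double soln_local,
                     std::vector<Gradient> & grad_u)           // indexed by qp
{
  const std::size_t nqp = grad_phi_i.size();
  std::vector<double> flat(3*nqp), accum(3*nqp);

  for (std::size_t qp = 0; qp < nqp; ++qp)  // gather (the expensive part)
    for (int d = 0; d < 3; ++d)
    {
      flat[3*qp+d]  = grad_phi_i[qp].data[d];
      accum[3*qp+d] = grad_u[qp].data[d];
    }

  for (std::size_t k = 0; k < 3*nqp; ++k)   // contiguous: this loop vectorizes
    accum[k] += soln_local * flat[k];

  for (std::size_t qp = 0; qp < nqp; ++qp)  // scatter back into Gradients
    for (int d = 0; d < 3; ++d)
      grad_u[qp].data[d] = accum[3*qp+d];
}
```

The middle loop is the only part the compiler can vectorize; the gather and scatter loops are exactly the copy overhead that ate the savings.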

If we're redesigning this stuff... is there any way we could come up with an interface that would give us flat vectors?  I would love to get back one long vector containing all of the gradients evaluated at each quadrature point for each shape function... this would make it much more amenable to vectorization AND to GPU computation.  Vectorization is becoming much more important these days: the new Sandy Bridge architecture from Intel and Interlagos from AMD can both double the number of FLOPs per cycle compared to previous-generation processors for multiply-add operations (which is exactly what we're doing!) but only if you're utilizing the vector ops...
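As a strawman, the kind of interface I'm imagining might look something like this (every name here -- `FlatGradients`, `at`, the indexing convention -- is made up for discussion, not a proposed final API):

```cpp
#include <vector>
#include <cstddef>

// Hypothetical accessor: instead of phi[i][qp] objects, the FE hands back
// one contiguous array with a documented indexing convention.
struct FlatGradients
{
  std::vector<double> data;   // laid out as [(i*nqp + qp)*dim + d]
  std::size_t nqp, dim;

  double & at(std::size_t i, std::size_t qp, std::size_t d)
  { return data[(i*nqp + qp)*dim + d]; }
};

// With that layout, evaluating grad(u_h) at all quadrature points is one
// flat multiply-add loop per shape function -- and trivially offloadable.
void eval_grad_u(const FlatGradients & g,
                 const std::vector<double> & soln_local,  // per-dof values
                 std::vector<double> & grad_u)            // size nqp*dim, zeroed
{
  for (std::size_t i = 0; i < soln_local.size(); ++i)
  {
    const double * gi = &g.data[i * g.nqp * g.dim];
    for (std::size_t k = 0; k < g.nqp * g.dim; ++k) // contiguous: vectorizes
      grad_u[k] += soln_local[i] * gi[k];
  }
}
```

The point is just that the inner loop runs over contiguous doubles with no pointer chasing, which is what the vector units (and GPUs) want.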

Another thing:

We now support pieces of mesh of multiple dimensions in the same mesh (i.e. we have 1D elements and 3D elements in the same mesh... not sharing nodes, although they do transfer information back and forth).  In order to do that seamlessly we have to copy the shape function values in and out of internal data structures.  It would be awesome if we could _pass in_ the vectors to be filled by libMesh instead of having libMesh own the data and pass us back references.  Then we could avoid a lot of copying by passing the same vector to multiple FE objects (of different dimensions) and asking them to fill it when we're on the corresponding elements...
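Roughly, a caller-owned-storage shape of API is what I have in mind -- a loose sketch only, with made-up names (`PhiTable`, `reinit_into`) and a fake shape-function evaluation standing in for the real thing:

```cpp
#include <vector>

// Caller-owned storage: the application allocates the table, and any FE
// object fills it, so one buffer serves elements of every dimension.
struct PhiTable { std::vector<std::vector<double>> phi; }; // phi[i][qp]

class FESketch
{
public:
  explicit FESketch(unsigned dim) : _dim(dim) {}

  // Instead of "const std::vector<...> & get_phi()" returning a reference
  // to FE-owned data, the FE fills storage the caller passes in; no copy
  // into application-side structures is needed afterwards.
  void reinit_into(PhiTable & out, unsigned n_sf, unsigned nqp) const
  {
    out.phi.assign(n_sf, std::vector<double>(nqp, 0.));
    for (unsigned i = 0; i < n_sf; ++i)
      for (unsigned qp = 0; qp < nqp; ++qp)
        out.phi[i][qp] = fake_value(i, qp);  // placeholder evaluation
  }

private:
  // Stand-in for a real shape-function evaluation.
  double fake_value(unsigned i, unsigned qp) const
  { return _dim + i + 0.1*qp; }

  unsigned _dim;
};
```

Usage would be: keep one `PhiTable`, and on each element hand it to whichever FE object (1D or 3D) matches that element's dimension.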

Just some junk to think about while we're redesigning some of this stuff.


On Thu, May 10, 2012 at 4:37 PM, Roy Stogner <> wrote:

Paul Bauman has an application that's motivating him to put
vector-valued FE capabilities into libMesh, and I've got a
shame-this-wasn't-done-years-ago itch that's motivating me to help, so we're
hashing out design ideas now.

Two design decisions that've come up, submitted for general commentary:

1.:  For vector-valued data, it would be good to return shape values as
a vector-of-vector-of-Gradients instead of
vector-of-vector-of-Numbers, and shape derivatives would be a
vector-of-vector-of-Tensors.  (We'll probably postpone second
derivative support and add a Rank3Tensor class or some such for it.)

All agreed?  The alternative (returning
vector-of-vector-of-vector-of-Numbers, etc) would be much uglier for
user code.
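To illustrate why the Gradient version is nicer for users, here's a residual-style loop under proposal 1, with a toy `Gradient` carrying the dot product user code would lean on (libMesh's real type is richer; names here are simplified):

```cpp
#include <vector>
#include <cstddef>

// Toy Gradient with the dot product user code would rely on.
struct Gradient
{
  double c[3];
  double operator*(const Gradient & o) const    // dot product
  { return c[0]*o.c[0] + c[1]*o.c[1] + c[2]*o.c[2]; }
};

// Forcing-term assembly under proposal 1: phi[i][qp] is a Gradient, so
// each contribution is one dot product per (i, qp) pair -- versus an
// explicit extra component loop with vector<vector<vector<Number>>>.
std::vector<double>
assemble_f(const std::vector<std::vector<Gradient>> & phi,  // phi[i][qp]
           const std::vector<Gradient> & f,                 // f at each qp
           const std::vector<double> & JxW)                 // quadrature weights
{
  std::vector<double> Fi(phi.size(), 0.);
  for (std::size_t i = 0; i < phi.size(); ++i)
    for (std::size_t qp = 0; qp < f.size(); ++qp)
      Fi[i] += JxW[qp] * (phi[i][qp] * f[qp]);
  return Fi;
}
```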

The catch with this is that the current FEBase already stores
vector-of-vector-of-Numbers, and returns it directly via inline accessor functions.

So, 2.:  Do we create FEAbstract, which doesn't include *phi* data or
get_*phi* methods, then derive FEBase and FEVectorBase from it?  Or do
we just add get_vec_*phi* methods (and vec_*phi data) to FEBase?

With the latter option, we'd have a bunch of redundant (albeit empty)
vectors in every FEBase, and trying to access scalar data from a
vector element type or vice versa would be a runtime error (or
possibly only an assert), which is not very type safe.

With the former option, we'd end up having to turn get_*phi* from
inline functions into virtual functions.  This is what we're leaning
towards.  The cost of a virtual function is relatively trivial when
amortized.  I benchmark a few percent overhead when compared to even
dead simple bilinear-fe linear residual-type loops, a few thousandths
of a percent overhead when compared to biquadratic slightly nonlinear
jacobian-type loops.

The cost of preventing crazy optimizations can be huge: in the inlined
case icpc originally detected that my fakey benchmark couldn't be
changing phi in-between get_phi calls, reordered my loops to make the
element loop the inner loop, then vectorized and combined my equations
to chop out 90% of the FLOPs... but even a slight added code
complication (a do-nothing but non-const method called on the fake FE
object) took things back to normal; I can't imagine anything similar
affecting a real code where we call FE::reinit on every element.
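For concreteness, a bare-bones sketch of how the former option's hierarchy could split the storage (class and member names are illustrative only; the real classes carry mapping data, quadrature state, etc., and the virtual-cost question is about dispatching `get_*phi*` through the common base):

```cpp
#include <vector>

struct Gradient { double c[3]; };

// FEAbstract holds no phi data and exposes no get_*phi* accessors;
// only the genuinely shared interface lives here.
class FEAbstract
{
public:
  virtual ~FEAbstract() = default;
  virtual void reinit() = 0;   // geometry/mapping work would live here
};

// Scalar FEs own vector-of-vector-of-Numbers...
class FEBase : public FEAbstract
{
public:
  const std::vector<std::vector<double>> & get_phi() const { return _phi; }
  void reinit() override
  { _phi.assign(1, std::vector<double>(1, 1.)); }  // fake fill-in
private:
  std::vector<std::vector<double>> _phi;
};

// ...and vector-valued FEs own vector-of-vector-of-Gradients, so asking
// the wrong class for the wrong data is a compile error, not a runtime one.
class FEVectorBase : public FEAbstract
{
public:
  const std::vector<std::vector<Gradient>> & get_vec_phi() const
  { return _vec_phi; }
  void reinit() override
  { _vec_phi.assign(1, std::vector<Gradient>(1, Gradient{{1., 0., 0.}})); }
private:
  std::vector<std::vector<Gradient>> _vec_phi;
};
```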

Libmesh-devel mailing list