From: Boris B. <bor...@bu...> - 2019-03-21 23:52:57
On 3/20/19 2:13 PM, Stogner, Roy H wrote:
> On Mon, 18 Mar 2019, Boris Boutkov wrote:
>
>> Out of some curiosity I recently rebased my GMG implementation onto the
>> upcoming NBX changes in PR #1965 to do some weak scaling analysis.
>>
>> I ran an np 256 Poisson problem using GMG with ~10k dofs/proc, and in
>> short, it seems like the NBX changes provide some solid improvement,
>> bringing my total runtime from something like ~19s down to ~16s, so a
>> nice step on the road to weak GMG scaling. Pre-NBX, a good chunk (30%
>> total w/sub) of my time was spent in alltoall(), which post-NBX is down
>> to 2%! This came with a fairly large number of calls to
>> possibly_receive(), which now accounts for 15% of my total time, but
>> the overall timing seems to be a win, so thanks much for this work!
>
> Thanks for the update!
>
> Greedy question: could you try the same timings at, say, np 16? I was
> pretty confident np 256 would be a big win, since the asymptotic
> scaling is improved, but it'd be nice to have data points at lower
> processor counts too.

Sure. I've updated to include the np 16 results, which can be found at:

https://drive.google.com/file/d/1X8U1XcZNNEAOK-z33jFFfKuM6zYRjsji/view?usp=sharing

The short of it is that the overall timing is nearly indistinguishable at
np 16. Also, similar to before, the 10% of time spent in alltoall() got
offloaded to possibly_receive(), and the heavy performance hits are still
the same culprits - but it's worth noting that they are slightly 'heavier'
at np 256 than at np 16, which eventually manifests in the total time
increase. Anyway, I'd say that at np 16 the changes are neutral for this
use case.

>> Despite these improvements, the weak scaling for the GMG implementation
>> is still a bit lacking unfortunately, as np1 = ~1s.
>> I ran these tests through gperf in order to gain some more insight, and
>> it looks to me that the major components slowing down the setup time are
>> still refining/coarsening/distributing_dofs, which in turn do a lot of
>> nodal parallel consistency adjusting and setting of nonlocal_dof_objects,
>> and I am wondering if there is maybe some low-hanging fruit to improve on
>> around those calls.
>
> There almost certainly is. Could I get comparable results from your
> new fem_system_ex1 settings (with more coarse refinements, I mean) to
> test with?

I ran these studies on a Poisson problem with quad4s, so I think that,
outside of the increased cost of the projections and refinements of the
second-order information, and if we ignore the solve time increase, the
relatively expensive functions in init_and_attach_petscdm() will similarly
show up for fem_system_ex1 under increasing mg levels. The other option
would be a direct comparison using the soon-to-be-merged multigrid examples
in GRINS, which is basically what's presented in the attachment.

Either way, I'd certainly be interested to learn how this all behaves on
other machines, because in the past I've seen situations where MPI-related
optimizations were more pessimistic on my local cluster than on other
systems.

- Boris
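
P.S. For anyone reading the archive who hasn't followed the NBX work: the
idea is to replace a dense alltoall() with sparse point-to-point sends plus
a nonblocking barrier for termination detection. Below is a minimal
single-process sketch of that termination logic in plain Python. It assumes
instantaneous message delivery, and the function name `nbx_exchange` and
its signature are illustrative only - this is not the libMesh/TIMPI API.

```python
from collections import defaultdict, deque

def nbx_exchange(nranks, sends):
    """Toy simulation of the NBX sparse data exchange: each rank posts its
    sends, then repeatedly probes for incoming messages (the
    possibly_receive() analogue); once its sends complete it enters a
    nonblocking barrier, and it stops probing only after every rank has
    entered the barrier.

    sends: dict mapping source rank -> list of (dest_rank, payload).
    Returns: dict mapping rank -> list of (source_rank, payload) received.
    """
    mailbox = {r: deque() for r in range(nranks)}
    received = defaultdict(list)
    in_barrier = set()   # ranks that have called the MPI_Ibarrier analogue
    done = set()         # ranks whose barrier has completed

    # Post all sends up front (models MPI_Issend; delivery is
    # instantaneous in this single-process sketch).
    for src, msgs in sends.items():
        for dest, payload in msgs:
            mailbox[dest].append((src, payload))

    while len(done) < nranks:
        for r in range(nranks):
            if r in done:
                continue
            # Drain any pending messages: the possibly_receive() step.
            while mailbox[r]:
                received[r].append(mailbox[r].popleft())
            # This rank's sends are complete, so it enters the barrier.
            in_barrier.add(r)
            # The barrier completes only once every rank has entered it.
            if len(in_barrier) == nranks and not mailbox[r]:
                done.add(r)
    return dict(received)
```

The point is that no rank ever needs global knowledge of who is sending to
whom, which is exactly why the dense alltoall() cost disappears from the
profile, at the price of many cheap probe calls showing up instead.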