From: Boris B. <bor...@bu...> - 2019-03-21 23:52:57
On 3/20/19 2:13 PM, Stogner, Roy H wrote:
> On Mon, 18 Mar 2019, Boris Boutkov wrote:
>
>> Out of some curiosity I recently rebased my GMG implementation onto the
>> upcoming NBX changes in PR #1965 to do some weak scaling analysis.
>>
>> I ran an np 256 Poisson problem using GMG with ~10k dofs/proc, and in
>> short, it seems like the NBX changes provide some solid improvement,
>> bringing my total runtime from something like ~19s down to ~16s, so a
>> nice step on the road to weak GMG scaling. Pre-NBX, a good chunk (30%
>> total w/sub) of my time was spent in alltoall(), which post-NBX is down
>> to 2%! This came with a fairly large number of calls to
>> possibly_receive(), which now accounts for 15% of my total time, but
>> the overall timing seems to be a win, so thanks much for this work!
>
> Thanks for the update!
>
> Greedy question: could you try the same timings at, say, np 16? I was
> pretty confident np 256 would be a big win, since the asymptotic
> scaling is improved, but it'd be nice to have data points at lower
> processor counts too.

Sure. I've updated to include the np 16 results, which can be found at:

https://drive.google.com/file/d/1X8U1XcZNNEAOK-z33jFFfKuM6zYRjsji/view?usp=sharing

The short of it is that the overall timing is nearly indistinguishable at
np 16. Also, similar to before, the 10% of time spent in alltoall() got
offloaded to possibly_receive(), and the heavy performance hits are still
the same culprits - but it's worth noting that they are slightly 'heavier'
at np 256 than at np 16, which eventually manifests in the total time
increase. Anyway, I'd say that at np 16 the changes are neutral for this
use case.

>> Despite these improvements, the weak scaling for the GMG implementation
>> is still a bit lacking unfortunately, as np1 = ~1s.
>> I ran these tests through gperf in order to gain some more insight, and
>> it looks to me that the major components slowing down the setup time are
>> still refining/coarsening/distributing_dofs, which in turn do a lot of
>> nodal parallel consistency adjusting and setting of nonlocal_dof_objects,
>> and I am wondering if there is maybe some low-hanging fruit to improve on
>> around those calls.
>
> There almost certainly is. Could I get comparable results from your
> new fem_system_ex1 settings (with more coarse refinements, I mean) to
> test with?

I ran these studies on a Poisson problem with quad4s, so I think that,
outside of the increased cost of the projections and refinements of the
second-order information, and if we ignore the solve time increase, the
relatively expensive functions in init_and_attach_petscdm() will similarly
show up for fem_system_ex1 under increasing mg levels. The other option
would be a direct comparison using the soon-to-be-merged multigrid examples
in GRINS, which is basically what's presented in the attachment.

Either way, I'd certainly be interested to learn how this all behaves on
other machines, because in the past I've seen situations where MPI-related
optimizations were more pessimistic on my local cluster than on other
systems.

- Boris
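
P.S. For anyone reading the archive who hasn't followed the NBX work: the
idea is to replace a dense alltoall() with sparse point-to-point sends plus
a nonblocking barrier for termination detection. Below is a minimal
single-process sketch of that termination logic in plain Python. It assumes
instantaneous message delivery, and the function name `nbx_exchange` and
its signature are illustrative only - this is not the libMesh/TIMPI API.

```python
from collections import defaultdict, deque

def nbx_exchange(nranks, sends):
    """Toy simulation of the NBX sparse data exchange: each rank posts its
    sends, then repeatedly probes for incoming messages (the
    possibly_receive() analogue); once its sends complete it enters a
    nonblocking barrier, and it stops probing only after every rank has
    entered the barrier.

    sends: dict mapping source rank -> list of (dest_rank, payload).
    Returns: dict mapping rank -> list of (source_rank, payload) received.
    """
    mailbox = {r: deque() for r in range(nranks)}
    received = defaultdict(list)
    in_barrier = set()   # ranks that have called the MPI_Ibarrier analogue
    done = set()         # ranks whose barrier has completed

    # Post all sends up front (models MPI_Issend; delivery is
    # instantaneous in this single-process sketch).
    for src, msgs in sends.items():
        for dest, payload in msgs:
            mailbox[dest].append((src, payload))

    while len(done) < nranks:
        for r in range(nranks):
            if r in done:
                continue
            # Drain any pending messages: the possibly_receive() step.
            while mailbox[r]:
                received[r].append(mailbox[r].popleft())
            # This rank's sends are complete, so it enters the barrier.
            in_barrier.add(r)
            # The barrier completes only once every rank has entered it.
            if len(in_barrier) == nranks and not mailbox[r]:
                done.add(r)
    return dict(received)
```

The point is that no rank ever needs global knowledge of who is sending to
whom, which is exactly why the dense alltoall() cost disappears from the
profile, at the price of many cheap probe calls showing up instead.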