On Tue, 16 Mar 2010, John Peterson wrote:
> Is there something up with our Parallel::max() implementation? In a
> recent code I ran on 256 processors, each call to Parallel::max
> apparently required 24 seconds, orders of magnitude longer than
> something like gather, with presumably way more communication?!
> (You may want to view this PerfLog table snippet in fixed-width fonts.)
> | allgather()   8     0.4039      0.050487    0.4286      0.053570    0.00   0.00  |
> | broadcast()   251   0.4242      0.001690    0.4242      0.001690    0.00   0.00  |
> | gather()      481   0.3723      0.000774    0.3723      0.000774    0.00   0.00  |
> | max()         125   3050.9712   24.407770   3050.9712   24.407770   11.78  11.78 |
> I searched briefly on the devel message list but didn't see this issue
> discussed previously.
"Each call to Parallel::max" taking that long is unlikely, too, since
the templated nature of that function means that some calls (the ones
with vector args) should take much longer than others.
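
To illustrate the point about overloads: a single "max()" row in the
timing table lumps scalar and vector calls together, so the 24-second
figure is only an average over all 125 calls. Below is a minimal sketch
of charging the two kinds of call to separate labels. Everything in it
is hypothetical (Timings, timed, and local_max are illustration names,
not libMesh API), and local_max is a serial stand-in with no MPI, so it
shows only the bookkeeping, not real communication cost.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical accumulator: wall-clock seconds and call count per label.
struct Timings {
  std::map<std::string, double> seconds;
  std::map<std::string, std::size_t> calls;
};

// Runs f() and charges its elapsed wall-clock time to `label`.
template <typename F>
void timed(Timings& t, const std::string& label, F f) {
  const auto start = std::chrono::steady_clock::now();
  f();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  t.seconds[label] += elapsed.count();
  ++t.calls[label];
}

// Serial stand-in for a scalar max: `other` plays the role of a value
// contributed by another processor (assumption; no communication here).
template <typename T>
void local_max(T& value, const T& other) {
  value = std::max(value, other);
}

// Serial stand-in for a vector max: elementwise, so the work grows with
// the vector length. Chosen over the scalar template for std::vector
// arguments because it is the more specialized overload.
template <typename T>
void local_max(std::vector<T>& values, const std::vector<T>& other) {
  const std::size_t n = std::min(values.size(), other.size());
  for (std::size_t i = 0; i < n; ++i)
    values[i] = std::max(values[i], other[i]);
}
```

Wrapping each call site as, e.g.,
timed(t, "max(vector)", [&] { local_max(v, w); }) would then give one
row per kind of call rather than one averaged row.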
Not sure where a vector max() occurs (except in debug mode in places
like MeshTools::libmesh_assert_valid_blah*), though. Let me see...
No, hunting through the code I don't see a single other place that's
doing a max() on more than a tiny vector. And the implementation is
utterly straightforward; max() shouldn't be taking longer than
anything else that requires all-to-all communication.
Any chance you can add some timing output with stack traces and rerun
to see where the offending max() call(s) are? 24 seconds is