From: Roy S. <roy...@ic...> - 2012-03-31 02:24:45
Ha - I'm not the only one doing libMesh stuff on a Friday night, huh?

Any thoughts on our default and overridden threading grainsizes? I'm playing with them on MIC, and even for a 2D problem with Q2/Q1 Navier-Stokes, I can go from 1000 all the way down to 100 with no performance penalty. Even going down to 10 elements in the smallest range only adds about 10% overhead, worst case.

I'm wondering if I ought to just check n_local_elem() each time we build a range and pick a grain size of n_local_elem()/n_threads()/N for some small integer N.

--- Roy
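[The adaptive grain size Roy proposes could be sketched roughly as below. The function name, the default N, and the minimum-grainsize floor are illustrative assumptions for this thread, not libMesh API:]

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of "grain size = n_local_elem / n_threads / N for some small
// integer N", clamped to a floor so a tiny partition doesn't produce
// degenerate one-element chunks. Names here are hypothetical.
std::size_t adaptive_grainsize(std::size_t n_local_elem,
                               std::size_t n_threads,
                               std::size_t N = 4,
                               std::size_t min_grainsize = 10)
{
  const std::size_t g = n_local_elem / (n_threads * N);
  return std::max(g, min_grainsize);
}
```

[With 10,000 local elements on 8 threads and N = 4 this yields a grain of 312; a 100-element partition clamps to the floor of 10.]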
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-03-31 02:46:31
> Ha - I'm not the only one doing libMesh stuff on a Friday night, huh?
>
> Any thoughts on our default and overridden threading grainsizes? I'm
> playing with them on MIC, and even for a 2D problem with Q2/Q1
> Navier-Stokes, I can go from 1000 all the way down to 100 with no
> performance penalty. Even going down to 10 elements in the smallest
> range only adds about 10% overhead worst case.
>
> I'm wondering if I ought to just check n_local_elem() each time we
> build a range and pick a grain size of n_local_elem()/n_threads()/N
> for some small integer N.

And therein lies the rub - what to pick??

We had some issues with the original implementation because our predicated iterators are not random access, so splitting a range was nontrivial. I can't remember if anything in the underlying storage would change to make this better, but 10 is surprising, and it certainly would be no good for linear triangles on a scalar problem.

Maybe it's because of this automake exercise, but I've been thinking of libMesh as becoming pretty mature - hey, there is a Debian package! If we can get it into yum then we'll be in business!

But I say that because the thought came to mind that we could create a libMesh::DefaultAlgorithmAttributes (or something more verbose ;-)) where we stash things like this, and then provide a method during initialization so that they can also be overridden from environment variables. I could see a case for this when someone does a make install on a system - there is really no telling what the optimal size will be for each problem.

-Ben
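[A minimal sketch of Ben's idea - a central attributes object whose defaults can be overridden from the environment at initialization. The class name, field, and `LIBMESH_GRAINSIZE` variable are invented for illustration:]

```cpp
#include <cstdlib>
#include <string>

// Hypothetical central store for tunable algorithm defaults.
// A real version would hold more than one knob and be consulted
// wherever ranges are built.
struct DefaultAlgorithmAttributes
{
  unsigned int grainsize = 1000;  // compiled-in default

  // Called once during library initialization; an environment
  // variable, if set, wins over the compiled-in default.
  void init_from_environment()
  {
    if (const char * s = std::getenv("LIBMESH_GRAINSIZE"))
      grainsize = static_cast<unsigned int>(std::stoul(s));
  }
};
```

[This keeps per-install tuning out of the source: after a make install, a user could export the variable to match their machine without rebuilding.]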
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2012-03-31 03:31:37
>> Ha - I'm not the only one doing libMesh stuff on a Friday night, huh?
>>
>> Any thoughts on our default and overridden threading grainsizes? I'm
>> playing with them on MIC, and even for a 2D problem with Q2/Q1
>> Navier-Stokes, I can go from 1000 all the way down to 100 with no
>> performance penalty. Even going down to 10 elements in the smallest
>> range only adds about 10% overhead worst case.
>>
>> I'm wondering if I ought to just check n_local_elem() each time we
>> build a range and pick a grain size of n_local_elem()/n_threads()/N
>> for some small integer N.
>
> And therein lies the rub - what to pick??
>
> We had some issues with the original implementation because our predicated
> iterators are not random access, so splitting a range was nontrivial. I
> can't remember if anything in the underlying storage would change to make
> this better, but 10 is surprising and certainly would be no good for linear
> triangles on a scalar problem.

And I think it depends too on the total number of elements versus how small you are splitting: splitting 10,000 elements into a chunk size of 10 is not as bad as doing the same for 1,000,000, I seem to recall.

But anyway, I think the environment-variable approach would be pretty cool.

-Ben
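[Ben's scaling point can be made concrete with some back-of-the-envelope arithmetic: a fixed grain size g produces roughly n/g chunks, so the fixed per-chunk scheduling cost grows linearly with the partition size. This is illustrative arithmetic only, not libMesh code:]

```cpp
#include <cstddef>

// Number of chunks a range of n_elem elements splits into at a given
// grain size (ceiling division). A grain of 10 means 1,000 chunks on a
// 10,000-element partition but 100,000 chunks on a 1,000,000-element
// one - 100x the scheduling overhead for the same grain.
std::size_t n_chunks(std::size_t n_elem, std::size_t grainsize)
{
  return (n_elem + grainsize - 1) / grainsize;
}
```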