|
From: Derek G. <fri...@gm...> - 2013-08-16 23:10:49
|
Guys,

We're seeing hard locks on some machines in Singleton::Setup::Setup()!

The problem is that it's trying to create a scoped_lock using a mutex that is defined in that file.

Apparently that mutex is not guaranteed to have been initialized at the point where we're calling that function (or something) and it is just hanging while trying to acquire that lock!

We've seen this on two different machines... one of them was an Intel Phi card though, so it was hard to diagnose - but we just had it happen again on a regular Xeon box.

I commented out that scoped_lock line and then the binary runs just fine.

Why do we need to lock in those functions? Surely the Singleton::Setup stuff is NOT going to get called in a loop.

How do we want to proceed?

Derek |
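The suspected ordering problem can be sketched in a few lines. This is only an illustration under stated assumptions, not the libMesh code: `FakeMutex`, `Registrar`, and `mutex_ready` are hypothetical names, and `FakeMutex` merely stands in for a mutex wrapper whose constructor has to run before the lock is usable. Within a single translation unit, objects with static storage duration are dynamically initialized in definition order, so the sketch is deterministic; across translation units the relative order is not specified, which is where a constructor might end up touching a mutex that has not been constructed yet.

```cpp
namespace order_sketch {

// Constant-initialized before any dynamic initialization runs,
// so it is safe to read from any static constructor.
bool mutex_ready = false;

// Hypothetical stand-in for a mutex wrapper whose constructor does
// real work (for example, calling an init function on the lock).
struct FakeMutex
{
  FakeMutex() { mutex_ready = true; }
};

// Records whether the mutex had already been constructed when this
// object's own constructor ran.
struct Registrar
{
  bool saw_ready_mutex;
  Registrar() : saw_ready_mutex(mutex_ready) {}
};

// Dynamic initialization follows definition order within this file:
Registrar early_registrar; // runs first: the mutex is not constructed yet
FakeMutex the_mutex;       // the mutex constructor runs here
Registrar late_registrar;  // runs last: the mutex is ready

} // namespace order_sketch
```

Here `early_registrar` observes an unconstructed mutex and `late_registrar` observes a constructed one. A real lock attempted from the first position would be operating on storage that has only been zero-initialized.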
|
From: Roy S. <roy...@ic...> - 2013-08-17 02:13:53
|
On Fri, 16 Aug 2013, Derek Gaston wrote:

> We're seeing hard locks on some machines in Singleton::Setup::Setup()!
>
> The problem is that it's trying to create a scoped_lock using a mutex
> that is defined in that file.
>
> Apparently that mutex is not guaranteed to have been initialized at the
> point where we're calling that function (or something) and it is just
> hanging while trying to acquire that lock!

Hmm... remote_elem_mtx should only get constructed at static initialization time before main() gets called, and RemoteElem::create() should only get called from LibMeshInit::LibMeshInit() afterwards.

You're not creating a global LibMeshInit object, are you?

> I commented out that scoped_lock line and then the binary runs just fine.

Hmmm... would you replace that global mutex with two locals? Maybe there's some problem with a mutex constructor being called before we init TBB?

> Why do we need to lock in those functions? Surely the
> Singleton::Setup stuff is NOT going to get called in a loop.

You're right; the Setup constructor should be called at static init time and the setup() call should be at LibMeshInit constructor time.

> How do we want to proceed?

It looks like we've got redundant locks that we can safely get rid of... but I'd like to actually *understand* the problem too, and that hasn't happened for me yet.

---
Roy |
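One possible reading of "replace that global mutex with locals" is the construct-on-first-use idiom, sketched below. It is a minimal illustration under assumptions rather than the actual libMesh implementation: the `first_use_sketch` namespace, `setup_mutex()`, `setup_cache()`, `run_all_setup()`, and `CountingSetup` are all hypothetical, and `std::mutex` (C++11) is used in place of the thread's own mutex type. A function-local static is constructed the first time control reaches its declaration, so the mutex exists no matter which static constructor asks for it first.

```cpp
#include <mutex>
#include <vector>

namespace first_use_sketch {

class Setup;

// Function-local statics are constructed on first use, so these
// accessors are safe to call from a static constructor in any
// translation unit.
inline std::mutex & setup_mutex()
{
  static std::mutex mtx;
  return mtx;
}

inline std::vector<Setup *> & setup_cache()
{
  static std::vector<Setup *> cache;
  return cache;
}

// Each Setup object registers itself when it is constructed.
class Setup
{
public:
  Setup()
  {
    std::lock_guard<std::mutex> lock(setup_mutex());
    setup_cache().push_back(this);
  }
  virtual ~Setup() = default;
  virtual void setup() = 0;
};

// Intended to be called once, after static initialization has finished.
inline void run_all_setup()
{
  std::lock_guard<std::mutex> lock(setup_mutex());
  for (Setup * s : setup_cache())
    s->setup();
}

// Hypothetical concrete Setup that counts how often setup() is invoked.
struct CountingSetup : Setup
{
  int calls = 0;
  void setup() override { ++calls; }
};

// Constructed at static initialization time, like the objects in the thread.
CountingSetup example_setup;

} // namespace first_use_sketch
```

Calling `first_use_sketch::run_all_setup()` from `main()` would then invoke `setup()` once on `example_setup`. If registration really only happens during single-threaded static initialization, dropping the lock altogether, as proposed in the thread, is the simpler option.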
|
From: Derek G. <fri...@gm...> - 2013-08-17 05:35:36
|
A couple of things:

1. This only happens with pthreads (i.e. it doesn't happen with TBB).
2. I can confirm that it is the same symptom on the Intel Phi card.
3. I can confirm that it is alleviated by commenting out those scoped_lock lines (for both machines).

Pthread locks work by being 0 in their unlocked state and anything else when they are locked. I can only guess that the constructor for those mutexes hasn't yet been called to set the initial value of the lock... therefore it's waiting there.

I won't have access to either of those machines to do more testing until next week.

For now, I vote for removing those locks as they are unnecessary.

BTW - we're not creating a global LibMeshInit. It gets created in main like normal.

Derek |
|
From: Kirk, B. (JSC-EG311) <ben...@na...> - 2013-08-17 13:20:56
|
On Aug 17, 2013, at 12:35 AM, Derek Gaston <fri...@gm...> wrote:

> Pthread locks work by being 0 in their unlocked state and anything
> else when they are locked. I can only guess that the constructor for
> those mutexes hasn't yet been called to set the initial value of the
> lock... therefore it's waiting there.
>
> I won't have access to either of those machines to do more testing
> until next week.

Agreed that sounds like the issue. Could you check that in down the line when you get access?

> For now, I vote for removing those locks as they are unnecessary.

Agreed on this point too. Please confirm though this is only the Singleton::Setup::Setup() mutex, and not also the one a few lines below in void Singleton::setup ()?

-Ben |
|
From: Roy S. <roy...@ic...> - 2013-08-19 21:50:15
|
On Fri, 16 Aug 2013, Derek Gaston wrote:

> I won't have access to either of those machines to do more testing
> until next week.
>
> For now, I vote for removing those locks as they are unnecessary.

Apologies if I stepped on your toes, but I just went ahead and pushed a patch which removed the most gratuitous offender. That was enough to fix the problem which I managed to reproduce in a pthreads configuration here; not sure if it's the only lock that was killing you with MOOSE.

---
Roy |
|
From: Derek G. <fri...@gm...> - 2013-08-20 01:15:55
|
Thanks! I'm traveling at the moment so I hadn't gotten around to it.

We'll double check to see if that is sufficient when I get back (like Thursday).

Thanks again!

Derek |