SourceForge has been redesigned. Learn more.
Close

#358 winpthreads: busy-wait + deadlocks since r5491

v1.0 (example)
closed-fixed
winpthreads (2)
5
2013-11-13
2013-11-07
Nils Maier
No

r5491 removed some special handling of EINVAL from sem.c, which turned
out to cause something what looks like a period of busy waiting followed
by a deadlock in some code using pthreads. Code in question is a library
using ffmpeg as a pthreads caller, without using pthreads itself.

After git-bisecting this to r5491, and adding back the removed lines,
the error goes way.

I have to admit that I was too lazy to actually understand the code in question... I just bisected and added the lines back. I don't see any reason given why they were removed in the first place... Maybe they were removed in error?
I also am not currently set up to produce stacks and/or give simple steps-to-reproduce, unfortunately.

FWIW, here is some information about my stack:

  • GCC 4.8.2 cross-compiler on Debian lenny (self-compiled according to my Makefile):
$ i686-w64-mingw32-gcc -v
Using built-in specs.
COLLECT_GCC=i686-w64-mingw32-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/i686-w64-mingw32/4.8.2/lto-wrapper
Target: i686-w64-mingw32
Configured with: ../../gcc-4.8.2/configure --target=i686-w64-mingw32 --enable-languages=c,c++,objc,obj-c++ --disable-nls --disable-multilib --enable-lto --with-isl=/usr/local --with-cloog=/usr/local CC=gcc-4.7 CXX=g++-4.7 CXXCPP='g++-4.7 -E ' CFLAGS='-O3 -msse2 -mfpmath=sse -mtune=generic -funroll-loops -funswitch-loops -fomit-frame-pointer' CXXFLAGS='-O3 -msse2 -mfpmath=sse -mtune=generic -funroll-loops -funswitch-loops -fomit-frame-pointer'
Thread model: win32
gcc version 4.8.2 (GCC) 
  • Binutils 2.23.2:
$ i686-w64-mingw32-as --version
GNU assembler (GNU Binutils) 2.23.2
  • configure line for winpthreads (trunk):
$ ../../mingw-w64/mingw-w64-libraries/winpthreads/configure --host=i686-w64-mingw32 --prefix=/usr/local/i686-w64-mingw32 --disable-shared CFLAGS="-O3 -mtune=generic"
1 Attachments

Discussion

  • Nils Maier

    Nils Maier - 2013-11-07

    I spoke too fast. I only tried a test application before. The real application still goes into infinite busy-waiting.

    Bisecting makes r5724 the culprit, in particular _spin_lite_lock, which looks like this when compiled ('-O2'/trunk):

    000001a0 <__spin_lite_lock>:
     1a0:   55                      push   %ebp
     1a1:   b8 16 00 00 00          mov    $0x16,%eax
     1a6:   89 e5                   mov    %esp,%ebp
     1a8:   57                      push   %edi
     1a9:   56                      push   %esi
     1aa:   53                      push   %ebx
     1ab:   83 ec 1c                sub    $0x1c,%esp
     1ae:   8b 5d 08                mov    0x8(%ebp),%ebx
     1b1:   85 db                   test   %ebx,%ebx
     1b3:   74 72                   je     227 <__spin_lite_lock+0x87>
     1b5:   8d 7b 0c                lea    0xc(%ebx),%edi
     1b8:   31 d2                   xor    %edx,%edx
     1ba:   f0 83 6b 0c 01          lock subl $0x1,0xc(%ebx)
     1bf:   0f 94 c2                sete   %dl
     1c2:   84 d2                   test   %dl,%dl
     1c4:   8b 35 00 00 00 00       mov    0x0,%esi
     1ca:   75 26                   jne    1f2 <__spin_lite_lock+0x52>
     1cc:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
     1d0:   31 d2                   xor    %edx,%edx
     1d2:   f0 83 07 01             lock addl $0x1,(%edi)
     1d6:   0f 94 c2                sete   %dl
     1d9:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
     1e0:   ff d6                   call   *%esi
     1e2:   31 d2                   xor    %edx,%edx
     1e4:   83 ec 04                sub    $0x4,%esp
     1e7:   f0 83 2f 01             lock subl $0x1,(%edi)
     1eb:   0f 94 c2                sete   %dl
     1ee:   84 d2                   test   %dl,%dl
     1f0:   74 de                   je     1d0 <__spin_lite_lock+0x30>
     1f2:   8b 3b                   mov    (%ebx),%edi
     1f4:   8b 03                   mov    (%ebx),%eax
     1f6:   83 c0 01                add    $0x1,%eax
     1f9:   89 03                   mov    %eax,(%ebx)
     1fb:   31 c0                   xor    %eax,%eax
     1fd:   f0 83 43 0c 01          lock addl $0x1,0xc(%ebx)
     202:   0f 94 c0                sete   %al
     205:   8b 43 04                mov    0x4(%ebx),%eax
     208:   8b 35 00 00 00 00       mov    0x0,%esi
     20e:   39 c7                   cmp    %eax,%edi
     210:   76 13                   jbe    225 <__spin_lite_lock+0x85>
     212:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
     219:   ff d6                   call   *%esi
     21b:   8b 43 04                mov    0x4(%ebx),%eax
     21e:   83 ec 04                sub    $0x4,%esp
     221:   39 c7                   cmp    %eax,%edi
     223:   77 ed                   ja     212 <__spin_lite_lock+0x72>
     225:   31 c0                   xor    %eax,%eax
     227:   8d 65 f4                lea    -0xc(%ebp),%esp
     22a:   5b                      pop    %ebx
     22b:   5e                      pop    %esi
     22c:   5f                      pop    %edi
     22d:   5d                      pop    %ebp
     22e:   c3                      ret    
     22f:   90                      nop
    

    Stuffing a single OutputDebugString in this loop makes the problem go way, for whatever reason.

    I wonder why there is a custom spinlock implementation in the first place and why is this not just a wrapper around CRITIICAL_SECTION instead, which is tried and tested a spinlock as well?

     
    Last edit: Nils Maier 2013-11-07
  • Kai Tietz

    Kai Tietz - 2013-11-07

    This looks interesting. Well, I had choosen the implementation via inline-assembler for making code light-weight. But indeed there is a chance of dead-locking due race.
    I will give you patch some testing.

     
  • Kai Tietz

    Kai Tietz - 2013-11-07

    Hmm, your patch seems to be not ok. I guess it is related to the fact that you are using pthread-api at low-level part locking, which needs to be independent of it.

    See for tests the tests_pthread (eg run via 'make CROSS=x86_64-w64-mingw32 GC'). There are even deadlocks, which aren't there without this patch.

     
  • Nils Maier

    Nils Maier - 2013-11-07

    Yeah, the previous patch was a mess :p

    Here is a polished, new one:

    Implement spinlocks using CRITICAL_SECTION.

    This fixes some issues on i686, where there were instanced observed where
    the spinlock incorrectly got into an infinite, busy loop.

    Also, this fixes code generation issues on i686 when compiling with
    -msse2 or similar yielded incorrect register allocation in the inline
    assembler code (the code is gone now).

    The test_pthread test suite passes (tested x86_64 GC and GC-static), with
    one change to fix the bogus spin3 test.
    The create2 test is still wonky, but it was so even before this change.

     
  • snarfle

    snarfle - 2013-11-08

    I just did a quick read-through of your second patch. Did you consider using InitializeCriticalSectionAndSpinCount instead of InitializeCriticalSection?

     
  • Kai Tietz

    Kai Tietz - 2013-11-10

    I would avoid InitializeCriticalSectionAndSpinCount ... at least it isn't to me absolutely clear, if this API is present on servers, too.

    Niels: I tested your patch for 32-bit and 64-bit, and applied it at rev 6367. Thanks for the work.

     
  • snarfle

    snarfle - 2013-11-10

    According to msdn, this has been available on servers since S2003. They also explicitly mention S2008 and S2012. Does your concern arise from any actual experience or docs?

    Using spin instead of a kernel transition can be a real performance win when locks are acquired and released quickly. It seems an odd decision to not use them in an object named spinlock.

     
  • Nils Maier

    Nils Maier - 2013-11-11

    First of all, critical sections are more lightweight than the other, kernel mode synchronization primitives, and e.g. usually outperform a Windows kernel mutex by a few orders of magnitude.

    But snarfle is right, of course: It would be highly preferable to make use of CS spin counts where available, as e.g. recommended by Microsoft itself and Intel. This could have a significant performance especially in cases where there is a lot of lock contention.

    Here is a patch that will detect the presence of SetCriticalSectionSpinCount at runtime (in the constructor function) and use it where available.
    I choose 4000 as the spin count value, as http://msdn.microsoft.com/en-us/library/windows/desktop/ms686197%28v=vs.85%29.aspx states that this is a good value for spin counts in general and most worst case scenarios, and MS uses it itself. It is impossible to come up with good benchmarks to determine a value, as there are simply too many different workloads, so short of writing a lot of code to auto-tune the value, using an MS recommended value seems reasonable.

    If there is any need to implement the same for platforms lacking a working SetCriticalSectionSpinCount, then the same could be achieved quite easily by spinning manually, calling TryEnterCriticialSection in the spin loop until success, and falling back to EnterCriticialSection if the spin loop doesn't yield a lock. However, I concur with snarfle that it is unlikely to encounter such platforms in the wild, so I'm not quite sure if it is worth implementing that at all? What you do think?

     
  • snarfle

    snarfle - 2013-11-12

    A few points:

    • When this code calls GetProcAddress, it looks for "djSetCriticalSectionSpinCount". What does the "dj" prefix mean?
    • Using TryEnterCriticalSection for earlier platforms wouldn't help, since it also requires XP. I don't know what the minimum supported platform for mingw-w64 is.
    • Since InitializeCriticalSection has the same minimum requirement as SetCriticalSectionSpinCount, does using the GetProcAddress really buy us anything? In fact, if we are using critical sections at all, why not just use InitializeCriticalSectionAndSpinCount in the first place, since it (like InitializeCriticalSection itself) also requires XP?
    • To clarify: the reason I made reference to kernel mode transitions is that's what EnterCriticalSection does if it can't get the lock. Adding the spincount allows you to delay/avoid that expensive operation.
     
  • Nils Maier

    Nils Maier - 2013-11-12

    When this code calls GetProcAddress, it looks for "djSetCriticalSectionSpinCount". What does the "dj" prefix mean?

    Good catch. That should have not gotten in the patch, in the first place. What is it with me and sf?! I added it after the fact, to intensionally break the GetProcAddress call to check that the code still well-behaves when there is no SetCriticalSectionSpinCount (since I lack a platform where this code path would have been taken "natively").
    Anyway, here is a new patch.

    Using TryEnterCriticalSection for earlier platforms wouldn't help, since it also requires XP. I don't know what the minimum supported platform for mingw-w64 is.

    Don't trust MSDN on minimum required version statements. Most of the stuff marked "XP" actually is available before that. It's just that XP right now is the oldest OS still supported, so they mass-changed their docs to say "XP", when it did say Win2000 or even earlier before.

    For example, see OpenFile, which is there since Windows 1.0 IIRC. Still, the docs say "XP".

    Since InitializeCriticalSection has the same minimum requirement as SetCriticalSectionSpinCount, does using the GetProcAddress really buy us anything? In fact, if we are using critical sections at all, why not just use InitializeCriticalSectionAndSpinCount in the first place, since it (like InitializeCriticalSection itself) also requires XP?

    See my previous comment. But IIRC SetCriticalSectionSpinCount actually has a minimum required version of XP, so the docs aren't lying this time. Still, I remember using critical sections long before that. So the patch will make winpthreads work on machines running Windows older than XP, at least with regards to spinlocks.c. I'd imagine there are still a bunch of Windows 2000 and earlier boxes out there...

    To clarify: the reason I made reference to kernel mode transitions is that's what EnterCriticalSection does if it can't get the lock. Adding the spincount allows you to delay/avoid that expensive operation.

    You're absolutely right in this regard.

     
  • snarfle

    snarfle - 2013-11-12

    Don't trust MSDN on minimum required version statements

    I hadn't noticed that before, but you are absolutely correct.

    But IIRC SetCriticalSectionSpinCount actually has a minimum required version of XP

    Turns out there is an import for it back to (at least) the WinMe kernel32.dll (although I'm pretty sure it just returns a not-implemented error). This actually makes sense when you realize that spinning while waiting for a lock is really only a useful strategy on computers that have more than 1 processor. So OS's that didn't support multiple processors don't need to support this function, beyond allowing programs that link to the symbol to load.

    While I said before I didn't know what the requirements were for mingw-w64, there's a link.

    With this in mind:

    My resources say the exports for the kernel32.dll in w2000 lists both SetCriticalSectionSpinCount and InitializeCriticalSectionAndSpinCount. Based on the fact that MSDN refers to _WIN32_WINNT as 0x0403 for InitializeCriticalSectionAndSpinCount, I assume it has been in the SDK headers since NT4 sp3.

    So I expect you should be able to link directly to InitializeCriticalSectionAndSpinCount on any platform supported by mingw-w64. I don't feel strongly that you "must" do this, it just seem to me that it makes for (slightly) cleaner code.

    Hopefully kai will accept this second patch.

     
  • Kai Tietz

    Kai Tietz - 2013-11-12

    I am fine by using SetCriticalSectionSpinCount. Well, AFAI found is SetCriticalSectionSpinCount present for Windows 2000, back to NT4 SP3, too. So I don't see why we shouldn't use here instead direct import-linkage of this symbol. No need for LoadLibrary call, and if we would need something like this GetModuleHandle should be used here instead.

     
  • snarfle

    snarfle - 2013-11-13

    Looks good to me.

     
  • Kai Tietz

    Kai Tietz - 2013-11-13
    • status: open --> closed-fixed
     
  • Kai Tietz

    Kai Tietz - 2013-11-13

    Thanks, Niels. I applied your patch at revision 6369 on trunk.

    As issue is fixed now, I closed this thread.

    Thanks,
    Kai

     

Log in to post a comment.