Menu

#103 [AIX] Application seems to be locked on a mutex

v1.0.4
closed
None
4
2015-02-19
2010-08-31
No

Hi all,

we encountered this situation not very often but only on AIX 5.3: the
application seems to be lock when calling _global_lock_common

0xd010b824 _waitlock(??, ??) + 0x27c  
0xd010bf28 _local_lock_common(??, ??, ??) + 0x138  
0xd01195b8 _mutex_lock(??, ??, ??) + 0x258  
0xdb07afa8 log4cplus::thread::Guard::Guard(pthread_mutex_t*)(0x315d49d8,
0x30215990) + 0x28  
0xdb07b180
log4cplus::helpers::AppenderAttachableImpl::appendLoopOnAppenders(const
log4cplus::spi::InternalLoggingEvent&) const(0x30215830, 0x315d4af0) +
0x48  
0xdb2332e0 log4cplus::spi::LoggerImpl::callAppenders(const
log4cplus::spi::InternalLoggingEvent&)(0x30225320, 0x315d4af0) + 0x68  
0xdb231bd8 log4cplus::spi::LoggerImpl::forcedLog(int,const
std::basic_string<char,std::char_traits<char>,std::allocator<char>
>&,const char*,int)(0x30225320, 0x4e20, 0x315d4c10, 0xdc44ae28, 0x44) +
0xc0  
0xdb068884 log4cplus::Logger::forcedLog(int,const
std::basic_string<char,std::char_traits<char>,std::allocator<char>
>&,const char*,int)(0x305d9f8c, 0x4e20, 0x315d4c10, 0xdc44ae28, 0x44) +
0x6c  
... 

This thread seems to be lock by:

0xd010c4c0 _global_lock_common(??, ??, ??) + 0x408  
0xd0381610 _rec_mutex_lock(??) + 0x160  
0xd01fb99c std::_Lock::_Wait()(??) + 0x5c  
0xd01fb8dc std::_Lockit::_Lockit(int)(??, ??) + 0x1c  
0xdb0945b4 std::basic_ostream<char,std::char_traits<char>
>::sentry::sentry(std::basic_ostream<char,std::char_traits<char>
>&)(0x5d3d1570, 0x3022c438) + 0x48  
0xdb0dae74 std::basic_ostream<char,std::char_traits<char> >&
std::operator<<<char,std::char_traits<char>,std::allocator<char>
>(std::basic_ostream<char,std::char_traits<char> >&,const
std::basic_string<char,std::char_traits<char>,std::allocator<char>
>&)(0x3022c438, 0x5d3d1600) + 0xc0  
0xdb1df660
log4cplus::pattern::PatternConverter::formatAndAppend(std::basic_ostream<char,std::char_traits<char>
>&,const log4cplus::spi::InternalLoggingEvent&)(0x3022d230, 0x3022c438,
0x5d3d1940) + 0x208  
0xdb1df7c0
log4cplus::PatternLayout::formatAndAppend(std::basic_ostream<char,std::char_traits<char>
>&,const log4cplus::spi::InternalLoggingEvent&)(0x3022cd10, 0x3022c438,
0x5d3d1940) + 0x70  
...

There are many other threads that are waiting for _global_lock_common. The situation looks like the bug 2961084 , but no SIGNAL handling there. I can't determine if the lock is due to log4cplus or AIX OS himself (or other). Please, find in the attachment file, the whole stack.

Thanks you for the help,
Mikael Tintinger

Discussion

1 2 > >> (Page 1 of 2)
  • Mikael Tintinger

     
  • Václav Haisman

    Václav Haisman - 2010-09-01
    • assigned_to: nobody --> wilx
     
  • Mikael Tintinger

    Reproduct with 1.0.4-rc10 version.
    It seems to be due to log4cplus because the issue occurs very quickly if the log level is increased from INFO to DEBUG (much more output).

     
  • Václav Haisman

    Václav Haisman - 2010-09-04

    I took a look at the call stacks you have provided. I cannot see anything wrong in log4cplus code though that does not mean there is nothing wrong with it. However your call stacks have raised some questions.

    I see two lines like "2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0" in the pstack_192812.log file. Does that mean that there are call stacks of two processes? Are you doing a fork() and not doing exec() afterwards in a threaded process?

     
  • Václav Haisman

    Václav Haisman - 2010-09-04
    • milestone: 877523 --> v1.0.4
     
  • Václav Haisman

    Václav Haisman - 2010-09-08

    Another idea, can you try your tests with GCC instead of IBM's AIX compiler? I cannot find any problem in log4cplus itself.

     
  • Mikael Tintinger

    Sorry, in the log file, we have the call stack of two processes. The only one that causes the issue is the second process:
    "2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0"

     
  • Václav Haisman

    Václav Haisman - 2010-10-15

    I have committed a change in revision 1465 to the PRODUCTION_1_0_x branch that makes all mutexes explicitly recursive. Please test the branch if it by a chance does not fix your problem, too.

     
  • Václav Haisman

    Václav Haisman - 2010-10-16
    • status: open --> closed-fixed
     
  • Václav Haisman

    Václav Haisman - 2010-10-16
    • status: closed-fixed --> open
     
  • Václav Haisman

    Václav Haisman - 2010-10-18

    I took another look at the call stacks. I do not think any more that the revision 1465 will fix it. Though, please try it anyway.

     
  • Mikael Tintinger

    Tested with PRODUCTION_1_0_x and rc11.
    Same behaviour occured.

     
  • Mikael Tintinger

    Coul you please do this change to configure file ?

     
  • Václav Haisman

    Václav Haisman - 2010-12-02

    I do not think that removing all locking from all IO streams is a good idea. I also do not understand what kind of live-lock can occur there. It seems more like a workaround or a hack rather than a real fix.

    I am going to add a note to the README file linking back to this bug report but I am not going to add the defines that you propose to the default configuration.

     
  • Václav Haisman

    Václav Haisman - 2010-12-02
    • priority: 5 --> 4
    • status: open --> open-postponed
     
  • Václav Haisman

    Václav Haisman - 2010-12-02
    • summary: Application seems to be locked on a mutex --> [AIX] Application seems to be locked on a mutex
     
  • anokl

    anokl - 2010-12-02

    We cant find exactly what causes this deadlock. It is not trivial. Definitely there is a conflict between our application which uses output streams quite extensively and log4cplus. When we compile either log4cplus with the mentioned option or our application in both cases the deadlock disappears. We prefer compile the log4cplus since it seems to be less dangerous. In your point of view, wilx, could the compilation with this option cause the crashes of whatever inside the log4cplus?

    Thanks

     
  • anokl

    anokl - 2010-12-02

    By the way it is not the live lock that we encounter it is the completely DEAD lock. No processor time consumption at all.

     
  • Václav Haisman

    Václav Haisman - 2010-12-02

    It sounds more like some sort of memory corruption or dying threads that leave the mutexes locked. It would be a deadlock if we could find cyclic dependencies between some mutexes/functions. But I was not able to find any such thing. Also, the deadlock that you observe happens in the code outside log4cplus, if it were a deadlock caused by log4cplus then the call stacks would have their tops somewhere inside log4cplus and not in the IBM C++ run time library.

    If you remove the locking from std::cout using the defines then you could get into trouble (undefined behaviour) in e.g. situations where one of your threads is logging to ConsoleAppender (which is implemented using std::cout) and another thread just printing anything outside log4cplus to console using std::cout.

    Reading the IBM docs, I have noticed they use xlC_r to compile. Are you using xlC_r to compile both your application and log4cplus? Google reveals that the _r part is important for threaded applications.

     
  • Mikael Tintinger

    lock with dbx

     
  • Mikael Tintinger

    Dbx stack attached (more accurate than procstack), if it can help us to find out the correct fix.
    Thanks for your help wilx !

     
  • Václav Haisman

    Václav Haisman - 2010-12-02

    In thread 12:

    pth_spinlock._waitlock(0xdeadbeef, 0xdeadbeef) at 0xd010b584

    Is this real or is it some DBX artifact?

     
  • anokl

    anokl - 2010-12-02

    It's real :) It is widly used in AIX. For instance not initialised registers are filed with dead beef as well.

     
  • Václav Haisman

    Václav Haisman - 2010-12-02

    Does it mean that the mutex is uninitialized? Or maybe that it is being used after it has been destroyed?

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.