Hi all,
we encountered this situation not very often but only on AIX 5.3: the
application seems to be lock when calling _global_lock_common
0xd010b824 _waitlock(??, ??) + 0x27c 0xd010bf28 _local_lock_common(??, ??, ??) + 0x138 0xd01195b8 _mutex_lock(??, ??, ??) + 0x258 0xdb07afa8 log4cplus::thread::Guard::Guard(pthread_mutex_t*)(0x315d49d8, 0x30215990) + 0x28 0xdb07b180 log4cplus::helpers::AppenderAttachableImpl::appendLoopOnAppenders(const log4cplus::spi::InternalLoggingEvent&) const(0x30215830, 0x315d4af0) + 0x48 0xdb2332e0 log4cplus::spi::LoggerImpl::callAppenders(const log4cplus::spi::InternalLoggingEvent&)(0x30225320, 0x315d4af0) + 0x68 0xdb231bd8 log4cplus::spi::LoggerImpl::forcedLog(int,const std::basic_string<char,std::char_traits<char>,std::allocator<char> >&,const char*,int)(0x30225320, 0x4e20, 0x315d4c10, 0xdc44ae28, 0x44) + 0xc0 0xdb068884 log4cplus::Logger::forcedLog(int,const std::basic_string<char,std::char_traits<char>,std::allocator<char> >&,const char*,int)(0x305d9f8c, 0x4e20, 0x315d4c10, 0xdc44ae28, 0x44) + 0x6c ...
This thread seems to be lock by:
0xd010c4c0 _global_lock_common(??, ??, ??) + 0x408 0xd0381610 _rec_mutex_lock(??) + 0x160 0xd01fb99c std::_Lock::_Wait()(??) + 0x5c 0xd01fb8dc std::_Lockit::_Lockit(int)(??, ??) + 0x1c 0xdb0945b4 std::basic_ostream<char,std::char_traits<char> >::sentry::sentry(std::basic_ostream<char,std::char_traits<char> >&)(0x5d3d1570, 0x3022c438) + 0x48 0xdb0dae74 std::basic_ostream<char,std::char_traits<char> >& std::operator<<<char,std::char_traits<char>,std::allocator<char> >(std::basic_ostream<char,std::char_traits<char> >&,const std::basic_string<char,std::char_traits<char>,std::allocator<char> >&)(0x3022c438, 0x5d3d1600) + 0xc0 0xdb1df660 log4cplus::pattern::PatternConverter::formatAndAppend(std::basic_ostream<char,std::char_traits<char> >&,const log4cplus::spi::InternalLoggingEvent&)(0x3022d230, 0x3022c438, 0x5d3d1940) + 0x208 0xdb1df7c0 log4cplus::PatternLayout::formatAndAppend(std::basic_ostream<char,std::char_traits<char> >&,const log4cplus::spi::InternalLoggingEvent&)(0x3022cd10, 0x3022c438, 0x5d3d1940) + 0x70 ...
There are many other threads that are waiting for _global_lock_common. The situation looks like the bug 2961084 , but no SIGNAL handling there. I can't determine if the lock is due to log4cplus or AIX OS himself (or other). Please, find in the attachment file, the whole stack.
Thanks you for the help,
Mikael Tintinger
Reproduct with 1.0.4-rc10 version.
It seems to be due to log4cplus because the issue occurs very quickly if the log level is increased from INFO to DEBUG (much more output).
I took a look at the call stacks you have provided. I cannot see anything wrong in log4cplus code though that does not mean there is nothing wrong with it. However your call stacks have raised some questions.
I see two lines like "2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0" in the pstack_192812.log file. Does that mean that there are call stacks of two processes? Are you doing a fork() and not doing exec() afterwards in a threaded process?
Another idea, can you try your tests with GCC instead of IBM's AIX compiler? I cannot find any problem in log4cplus itself.
Sorry, in the log file, we have the call stack of two processes. The only one that causes the issue is the second process:
"2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0"
I have committed a change in revision 1465 to the PRODUCTION_1_0_x branch that makes all mutexes explicitly recursive. Please test the branch if it by a chance does not fix your problem, too.
I took another look at the call stacks. I do not think any more that the revision 1465 will fix it. Though, please try it anyway.
Tested with PRODUCTION_1_0_x and rc11.
Same behaviour occured.
Issue found : it was a compilation issue on AIX.
You have to define the compilation option -DHAVE_CONFIG_H -D__NOLOCK_ON_OUTPUT.
FYI : http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/index.jsp?topic=%2Fcom.ibm.xlcpp8a.doc%2Fproguide%2Fref%2Fthreadsafe_streams.htm
Coul you please do this change to configure file ?
I do not think that removing all locking from all IO streams is a good idea. I also do not understand what kind of live-lock can occur there. It seems more like a workaround or a hack rather than a real fix.
I am going to add a note to the README file linking back to this bug report but I am not going to add the defines that you propose to the default configuration.
We cant find exactly what causes this deadlock. It is not trivial. Definitely there is a conflict between our application which uses output streams quite extensively and log4cplus. When we compile either log4cplus with the mentioned option or our application in both cases the deadlock disappears. We prefer compile the log4cplus since it seems to be less dangerous. In your point of view, wilx, could the compilation with this option cause the crashes of whatever inside the log4cplus?
Thanks
By the way it is not the live lock that we encounter it is the completely DEAD lock. No processor time consumption at all.
It sounds more like some sort of memory corruption or dying threads that leave the mutexes locked. It would be a deadlock if we could find cyclic dependencies between some mutexes/functions. But I was not able to find any such thing. Also, the deadlock that you observe happens in the code outside log4cplus, if it were a deadlock caused by log4cplus then the call stacks would have their tops somewhere inside log4cplus and not in the IBM C++ run time library.
If you remove the locking from std::cout using the defines then you could get into trouble (undefined behaviour) in e.g. situations where one of your threads is logging to ConsoleAppender (which is implemented using std::cout) and another thread just printing anything outside log4cplus to console using std::cout.
Reading the IBM docs, I have noticed they use xlC_r to compile. Are you using xlC_r to compile both your application and log4cplus? Google reveals that the _r part is important for threaded applications.
lock with dbx
Dbx stack attached (more accurate than procstack), if it can help us to find out the correct fix.
Thanks for your help wilx !
In thread 12:
pth_spinlock._waitlock(0xdeadbeef, 0xdeadbeef) at 0xd010b584
Is this real or is it some DBX artifact?
It's real :) It is widly used in AIX. For instance not initialised registers are filed with dead beef as well.
Does it mean that the mutex is uninitialized? Or maybe that it is being used after it has been destroyed?