log4cplus / Bugs / #103 [AIX] Application seems to be locked on a mutex

Mikael Tintinger - 2010-08-31

pstack_192812.log


Václav Haisman - 2010-09-01

assigned_to: nobody --> wilx

Mikael Tintinger - 2010-09-03

Reproduced with version 1.0.4-rc10.
It seems to be caused by log4cplus, because the issue occurs very quickly when the log level is raised from INFO to DEBUG (much more output).


Václav Haisman - 2010-09-04

I took a look at the call stacks you have provided. I cannot see anything wrong in the log4cplus code, though that does not mean there is nothing wrong with it. However, your call stacks have raised some questions.

I see two lines like "2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0" in the pstack_192812.log file. Does that mean that there are call stacks of two processes? Are you doing a fork() and not doing exec() afterwards in a threaded process?
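
For context on the fork() question: after fork() in a multithreaded process, only the calling thread exists in the child, and any mutex that another thread held at that moment stays locked in the child forever. The usual safe pattern is to call exec() right away. A minimal sketch of that pattern, assuming a POSIX system; `run_child` is a hypothetical helper, not something from log4cplus or from the reporter's application:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: start `path` in a child process and return its exit
// status, or -1 if the child could not be created or did not exit normally.
int run_child(const char* path, char* const argv[]) {
    assert(path != nullptr && argv != nullptr);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;  // fork failed
    }
    if (pid == 0) {
        // Child: only the forking thread was copied. Locks owned by other
        // threads at fork time are still locked here, so do no logging,
        // no iostream output and no allocation before exec.
        execv(path, argv);
        _exit(127);  // reached only if execv failed
    }

    // Parent: wait for the child, retrying if a signal interrupts the wait.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the child keeps running application code instead of calling exec(), any logging call it makes can block on a lock that no thread in the child will ever release.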


Václav Haisman - 2010-09-04

milestone: 877523 --> v1.0.4

Václav Haisman - 2010-09-08

Another idea: can you try your tests with GCC instead of IBM's AIX compiler? I cannot find any problem in log4cplus itself.


Mikael Tintinger - 2010-09-21

Sorry, yes: the log file contains the call stacks of two processes. Only the second process shows the issue:
"2101386: /APPLIS/DECALOG/7.0/AAC-2.0_INT/manager/opt/CS/bin/aix/IDCS -name POST_NAV 0"


Václav Haisman - 2010-10-15

I have committed a change in revision 1465 to the PRODUCTION_1_0_x branch that makes all mutexes explicitly recursive. Please test the branch in case it happens to fix your problem, too.
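
For readers unfamiliar with the term, an explicitly recursive mutex is one that the owning thread may lock again without blocking on itself. A minimal sketch with plain POSIX threads follows; the helper names `init_recursive_mutex` and `lock_twice` are hypothetical, and this is not the code from revision 1465:

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical helper: initialise `m` as a recursive mutex.
// Returns 0 on success or a pthread error number.
int init_recursive_mutex(pthread_mutex_t* m) {
    assert(m != nullptr);

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    // Request the recursive type explicitly instead of relying on the
    // platform default, which POSIX leaves unspecified.
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Locks the same mutex twice from one thread, then releases it twice.
// Returns 0 if every call succeeded.
int lock_twice(pthread_mutex_t* m) {
    assert(m != nullptr);

    int rc = pthread_mutex_lock(m);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutex_lock(m);  // succeeds only because the mutex is recursive
    if (rc != 0) {
        pthread_mutex_unlock(m);
        return rc;
    }
    pthread_mutex_unlock(m);
    return pthread_mutex_unlock(m);
}
```

A recursive mutex only removes self-deadlock of a single thread. It does not help when two different threads wait on each other, or when a lock is never released.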


Václav Haisman - 2010-10-16

status: open --> closed-fixed

Václav Haisman - 2010-10-16

status: closed-fixed --> open

Václav Haisman - 2010-10-18

I took another look at the call stacks. I no longer think that revision 1465 will fix it. Please try it anyway, though.


Mikael Tintinger - 2010-11-25

Tested with PRODUCTION_1_0_x and rc11.
The same behaviour occurred.


Mikael Tintinger - 2010-12-02

Issue found: it was a compilation issue on AIX.
You have to define the compilation options -DHAVE_CONFIG_H -D__NOLOCK_ON_OUTPUT.

FYI: http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/index.jsp?topic=%2Fcom.ibm.xlcpp8a.doc%2Fproguide%2Fref%2Fthreadsafe_streams.htm


Mikael Tintinger - 2010-12-02

Could you please make this change to the configure file?


Václav Haisman - 2010-12-02

I do not think that removing all locking from all IO streams is a good idea. I also do not understand what kind of live-lock can occur there. It seems more like a workaround or a hack rather than a real fix.

I am going to add a note to the README file linking back to this bug report but I am not going to add the defines that you propose to the default configuration.


Václav Haisman - 2010-12-02

priority: 5 --> 4

status: open --> open-postponed

Václav Haisman - 2010-12-02

summary: Application seems to be locked on a mutex --> [AIX] Application seems to be locked on a mutex

anokl - 2010-12-02

We can't find exactly what causes this deadlock. It is not trivial. There is definitely a conflict between our application, which uses output streams quite extensively, and log4cplus. When we compile either log4cplus or our application with the mentioned option, the deadlock disappears. We prefer to compile log4cplus with it, since that seems less dangerous. In your view, wilx, could compiling with this option cause crashes inside log4cplus?

Thanks


anokl - 2010-12-02

By the way, what we encounter is not a live-lock; it is a completely DEAD lock. No processor time is consumed at all.


Václav Haisman - 2010-12-02

It sounds more like some sort of memory corruption, or dying threads that leave the mutexes locked. It would be a deadlock if we could find cyclic dependencies between some mutexes/functions, but I was not able to find any such thing. Also, the deadlock that you observe happens in code outside log4cplus. If it were a deadlock caused by log4cplus, then the call stacks would have their tops somewhere inside log4cplus and not in the IBM C++ run time library.

If you remove the locking from std::cout using the defines, then you could get into trouble (undefined behaviour), e.g. in situations where one of your threads is logging to ConsoleAppender (which is implemented using std::cout) and another thread is printing anything to the console using std::cout outside log4cplus.
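
To illustrate the hazard: once the stream's own locking is compiled out, the application has to serialise its writes itself. The following is only a sketch, assuming a C++11 compiler; `LockedWriter` and `count_lines_written` are hypothetical names, and a wrapper like this covers only writes that go through it, not the writes that ConsoleAppender makes inside log4cplus:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical application-side helper: every write to the wrapped stream
// goes through one mutex, so lines from different threads cannot interleave.
class LockedWriter {
public:
    explicit LockedWriter(std::ostream& out) : out_(out) {}

    void write_line(const std::string& line) {
        std::lock_guard<std::mutex> guard(mutex_);
        out_ << line << '\n';
    }

private:
    std::ostream& out_;
    std::mutex mutex_;
};

// Starts `threads` writer threads that each emit `lines_per_thread` lines
// into a shared in-memory stream, then returns the number of lines received.
std::size_t count_lines_written(int threads, int lines_per_thread) {
    assert(threads >= 0 && lines_per_thread >= 0);

    std::ostringstream sink;
    LockedWriter writer(sink);

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&writer, t, lines_per_thread] {
            for (int i = 0; i < lines_per_thread; ++i) {
                writer.write_line("thread " + std::to_string(t) +
                                  " line " + std::to_string(i));
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }

    // Count the complete lines that reached the stream.
    std::istringstream in(sink.str());
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}
```

Without the lock_guard, the concurrent writes to the shared stream would be a data race, which is the undefined behaviour mentioned above.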

Reading the IBM docs, I have noticed they use xlC_r to compile. Are you using xlC_r to compile both your application and log4cplus? Google reveals that the _r part is important for threaded applications.


Mikael Tintinger - 2010-12-02

lock with dbx

dbx.txt


Mikael Tintinger - 2010-12-02

Dbx stack attached (more accurate than procstack), in case it helps us find the correct fix.
Thanks for your help, wilx!


Václav Haisman - 2010-12-02

In thread 12:

pth_spinlock._waitlock(0xdeadbeef, 0xdeadbeef) at 0xd010b584

Is this real or is it some DBX artifact?


anokl - 2010-12-02

It's real :) It is widely used on AIX. For instance, uninitialised registers are filled with dead beef as well.


Václav Haisman - 2010-12-02

Does it mean that the mutex is uninitialized? Or maybe that it is being used after it has been destroyed?

