|
From: Leif M. <le...@ta...> - 2010-11-26 11:52:18
|
Johannesburg,
That looks like a big clue.
What is the timestamp in the lines after the "03:20:19" line? Does the time go back immediately?
To get more output, can you try also setting the following?
Wrapper.loop_output=true
wrapper.timer_output=true
I am on my mobile but I think those are correct. I am wondering if the time is being set high, then back again. The logging code queries the system time for each line. It immediately shows changes when there are time adjustments.
Something like ntp changes seems unlikely as it is always right after the Wrapper starts though.
- Leif
On 2010/11/26, at 19:22, "Johannes Nicolai" <jni...@co...> wrote:
> Hi Leif,
>
> Yes, once the process starts to hang (after ten seconds to one minute), both threads are always in the nanonsleep syscall.
> These are the last lines in wrapper.log:
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 03:20:19 | Sleep: awake
>
> The last output that differs from this pattern is
> DEBUG | wrapperp | 2010/11/26 02:06:39 | send a packet PING : ping
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: awake
> INFO | jvm 1 | 2010/11/26 02:06:39 | WrapperManager Debug: Received a packet PING : ping
> INFO | jvm 1 | 2010/11/26 02:06:39 | WrapperManager Debug: Send a packet PING : ping
> DEBUG | wrapperp | 2010/11/26 02:06:39 | read a packet PING : ping
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: awake
> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms
>
> What I find surprising is the huge jump in time from
> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms
> to
> STATUS | wrapper | 2010/11/26 03:20:19 | Sleep: awake
>
> If you look at the timestamps. The current date at this machine now is 02:19:44, so the time for the last awake has not been reached by now.
>
> Best, Johannes
>
> -----Original Message-----
> From: le...@ta... [mailto:le...@ta...] On Behalf Of Leif Mortenson
> Sent: Friday, November 26, 2010 11:03 AM
> To: wra...@li...
> Cc: Johannes Nicolai
> Subject: Re: [Wrapper-user] ServiceWrapper starts to hang in wrapperSleep after some seconds
>
> Johannes,
> When you get the hang, are you consistently seeing that both of the threads are getting stuck in the nanosleep calls?
>
> We have a low level debug mode that can be enabled to show calls to sleep. This was originally added to debug problems with usleep many years ago. It is not possible to go back to usleep because it has
> many problems with multithreads and interrupts. Please try setting
> the following and reproducing the problem.
>
> wrapper.sleep_output=TRUE
>
> It will only give us new information if any nanosleep calls are failing for any reason however. Otherwise it will simply say when the calls start and stop.
>
> We are continuing to look into this.
>
> Cheers,
> Leif
>
>
> On Fri, Nov 26, 2010 at 11:58 AM, Christian Mueller <chr...@ta...> wrote:
>> Christopher & Johannes,
>>
>> nanosleep is kind of very low-level system function, for which I can't
>> see any problems like this to be common.
>>
>> There are some hooks in your mail I'd like to get closer, you said you
>> are running a 2CPU instance and on vmware. So i'm wondering whether
>> sth. with this configuration is not working as expected.
>>
>> could you please run "cat /proc/<pid>/status" on the machine where
>> <pid> is the and send us also the output from there?
>> Furthermore, is NTP activated on the host?
>>
>> I did some tests on our virtual machines but were not able to
>> reproduce the behavior you described yet...
>>
>> Cheers,
>> Christian
>>
>>
>>
>> On Wed, Nov 24, 2010 at 5:26 PM, Leif Mortenson
>> <lei...@ta...> wrote:
>>> Christopher & Johannes,
>>> Thank you for the detailed report. This is not a problem which I am
>>> aware of to date. We set up a 2 CPU VM today with cent OS and are
>>> running some tests locally. So far we have not been able to
>>> reproduce the problem.
>>>
>>>> From what you said, it sounds like this problem only happens on
>>> shutdown, and can be reproduced easily?
>>>
>>> Could you please try and reproduce this with the following settings
>>> included in the wrapper.conf and then send me the resulting
>>> wrapper.log file directly? It would be a bit large for the list.
>>> wrapper.debug=true
>>> wrapper.state_output=TRUE
>>>
>>> It sounds like the nanosleep implementation is getting stuck for some
>>> reason. It is interesting that an interrupt caused by a signal is
>>> waking it up. We do not currently have a way to change to usleep
>>> without recompiling.
>>>
>>> Cheers,
>>> Leif
>>>
>>>
>>> On Tue, Nov 23, 2010 at 7:31 PM, Christopher Taylor <ct...@co...> wrote:
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I'm forwarding this on behalf of Johannes, he seems to be having
>>>> problems with his list subscription. I apologize if this is a double
>>>> post, I forgot to add the subject on the first mail.
>>>>
>>>>
>>>>
>>>> --- snip ---
>>>>
>>>>
>>>>
>>>> I am using ServiceWrapper (stable version 3.5.6) under 32bit CentOS
>>>> 5 in a VM with 2 CPUs.
>>>>
>>>>
>>>>
>>>> Whenever I stopped a Java program with the wrapper (different
>>>> configurations), the wrapper stopped pinging the JVM after some
>>>> seconds (which eventually leads to the JVM ending itself because it
>>>> does not receive ping packages from the wrapper any more).
>>>>
>>>>
>>>>
>>>> The last system call that never returns according to strace is
>>>> always
>>>> nanosleep:
>>>>
>>>>
>>>>
>>>> gettimeofday({1290172014, 465559}, NULL) = 0
>>>>
>>>> read(5, 0x8895070, 1024) = -1 EAGAIN (Resource
>>>> temporarily
>>>> unavailable)
>>>>
>>>> gettimeofday({1290172014, 465610}, NULL) = 0
>>>>
>>>> recv(7, 0xbffa98cb, 1, 0) = -1 EAGAIN (Resource
>>>> temporarily
>>>> unavailable)
>>>>
>>>> waitpid(13065, 0xbffa98a8, WNOHANG|WSTOPPED) = 0
>>>>
>>>> nanosleep({0, 100000000}, NULL) = 0
>>>>
>>>> gettimeofday({1290172014, 566505}, NULL) = 0
>>>>
>>>> read(5, 0x8895070, 1024) = -1 EAGAIN (Resource
>>>> temporarily
>>>> unavailable)
>>>>
>>>> gettimeofday({1290172014, 566552}, NULL) = 0
>>>>
>>>> recv(7, 0xbffa98cb, 1, 0) = -1 EAGAIN (Resource
>>>> temporarily
>>>> unavailable)
>>>>
>>>> waitpid(13065, 0xbffa98a8, WNOHANG|WSTOPPED) = 0
>>>>
>>>> nanosleep({0, 100000000},
>>>>
>>>>
>>>>
>>>> If I call pstack on the wrapper, you can see that two threads
>>>> currently hang in the nanosleep method:
>>>>
>>>>
>>>>
>>>> Thread 2 (Thread -1210287216 (LWP 13115)):
>>>>
>>>> #0 0x00994410 in __kernel_vsyscall ()
>>>>
>>>> #1 0x00d22506 in __nanosleep_nocancel () from /lib/libpthread.so.0
>>>>
>>>> #2 0x0805b0d0 in wrapperSleep ()
>>>>
>>>> #3 0x0805b420 in timerRunner ()
>>>>
>>>> #4 0x00d1b2db in start_thread () from /lib/libpthread.so.0
>>>>
>>>> #5 0x00c7512e in clone () from /lib/libc.so.6
>>>>
>>>> Thread 1 (Thread -1208169792 (LWP 13114)):
>>>>
>>>> #0 0x00994410 in __kernel_vsyscall ()
>>>>
>>>> #1 0x00d22506 in __nanosleep_nocancel () from /lib/libpthread.so.0
>>>>
>>>> #2 0x0805b0d0 in wrapperSleep ()
>>>>
>>>> #3 0x08059f6c in wrapperEventLoop ()
>>>>
>>>> #4 0x08056628 in wrapperRunConsole ()
>>>>
>>>> #5 0x0805cce3 in main ()
>>>>
>>>> #0 0x00994410 in __kernel_vsyscall ()
>>>>
>>>>
>>>>
>>>> If I send a signal to the wrapper, it reacts again, but before, it
>>>> hangs forever in the nanosleep method.
>>>>
>>>> Is this a known problem of ServiceWrapper running with multiple CPUs?
>>>>
>>>> Do I have to recompile with usleep support instead or is there an
>>>> option to always use usleep?
>>>>
>>>>
>>>>
>>>> Best,
|