|
From: Leif M. <le...@ta...> - 2010-12-01 03:04:33
|
Johannes, A few more questions and testing ideas. 1) What is the version of vmware ESX that you were using? 2) We have been looking into how nanosleep works on CentOS. All Linux platforms use the CLOCK_MONOTONIC clock rather than the CLOCK_REALTIME, so they should be fine. Even if they were using CLOCK_REALTIME, the nanosleep system call is supposed to take system time changes into account and always use a relative time. http://www.kernel.org/doc/man-pages/online/pages/man2/clock_settime.2.html BUT. There is a note at the bottom of the above page which says that each CPU in an SMP system could be using a different clock source. This can result in them being slightly out of sync and have a slightly different frequency. One problem that I have noticed with vmware in the past is that when you are not using ntp, the guest system time can drift drastically if the guest is being starved of CPU because of other processes on the host. I have seem drifts of several hours within a single day. The problem is that the Guest has no way of telling that it is behind. I am wondering if it is possible that one of your CPU clocks is drifting much more than the other. This is not something that could happen on a physical machine. Then if the Wrapper's threads are jumping between the two CPUs, that might explain why the time is changing. This is just a guess as we have not been able to reproduce this here. Here are some docs on timekeeping with vmware. There are several settings in ESX 4 to improve things: http://www.vmware.com/pdf/vmware_timekeeping.pdf http://www.fjc.net/linux/linux-and-vmware-related-issues/linux-2-6-kernels-and-vmware-clock-drift-issues 3) Along the lines above, I am wondering what happens if you restrict the Wrapper to run on a specific CPU rather than allowing it to balance between the two. This can be done with the taskset command. http://www.cyberciti.biz/faq/taskset-cpu-affinity-command/ Would it be possible for you to try this out to force the Wrapper to a single CPU? Thanks in advance, Leif On Tue, Nov 30, 2010 at 12:38 AM, Leif Mortenson <le...@ta...> wrote: > Johannes, > Thank you for the update on ntpd. Good to be able to remove that as a > possibility. > > This time difference is also 4398 seconds. >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 07:16:23 | Sleep: awake > > We will keep working on this on this end as well. > > Cheers, > Leif > > On Mon, Nov 29, 2010 at 11:08 PM, Johannes Nicolai <jni...@co...> wrote: >> Hi Leif, >> >> I shutdown ntpd and still get a similar error: >> >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle wrapper state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle JVM state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: sleep >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: awake >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: awake >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: maintain logger >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: process JVM output >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: process socket >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: maintain logger(2) >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle wrapper state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle JVM state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: sleep >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: awake >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: awake >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: maintain logger >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: process JVM output >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: process socket >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: maintain logger(2) >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle wrapper state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: handle JVM state: STARTED >> STATUS | wrapper | 2010/11/29 06:03:05 | Loop: sleep >> STATUS | wrapper | 2010/11/29 06:03:05 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 07:16:23 | Sleep: awake >> STATUS | wrapper | 2010/11/29 07:16:23 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/29 07:16:23 | Sleep: awake >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: maintain logger >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: process JVM output >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: process socket >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: maintain logger(2) >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: handle wrapper state: STARTED >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: handle JVM state: STARTED >> STATUS | wrapper | 2010/11/29 07:16:23 | Loop: sleep >> >> I will ask ops for the time difference between host and vm. >> >> Best, Johannes >> >> -----Original Message----- >> From: Leif Mortenson [mailto:le...@ta...] >> Sent: Monday, November 29, 2010 5:39 AM >> To: Johannes Nicolai >> Cc: wra...@li... >> Subject: Re: [Wrapper-user] ServiceWrapper starts to hang in wrapperSleep after some seconds >> >> Johannes, >> Thank you this latest log. It is showing that the system time that the Wrapper sees is being changed. The change is not happening when we are in a nanosleep so I don't think they are related. >> >> The first event shows a jump forwards of 4398 seconds: >> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms >> STATUS | wrapper | 2010/11/26 03:20:19 | Sleep: awake >> >> The second event ALSO shows a jump forwards of 4398 seconds: >> STATUS | wrapper | 2010/11/26 03:43:31 | Loop: process socket >> STATUS | wrapper | 2010/11/26 04:56:49 | Loop: maintain logger(2) >> >> The fact that they are identical seems a little odd. I need to look >> into this a bit, but could you double check the system time of the host of your VM server. Is it possible that vmware is trying to sync your host and guest, at the same time that your ntp is doing the same. >> >> The Wrapper seems to be working fine when the time jumps forward, but nanosleep is getting stuck later when something else happens. I am wondering if that "something" is the return of the system time to its previous value. We are running some tests along these lines here locally. >> >> The 4398 seconds number is 0x112e. So it does not appear to be a significant number in and of itself. >> >> Cheers, >> Leif >> >> >> On Fri, Nov 26, 2010 at 8:48 PM, Johannes Nicolai <jni...@co...> wrote: >>> Hi Leif, >>> >>>>What is the timestamp in the lines after the "03:20:19" line? Does >>>>the >>> time go back immediately? >>> This is the last line in wrapper.log >>> >>> After this one, the wrapper starts to hang. >>> >>> I added wrapper.loop_output=true to my config and now see the >>> following last lines before the wrapper starts to hang: >>> >>> STATUS | wrapper | 2010/11/26 03:43:30 | Loop: sleep STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: awake STATUS | wrapper | >>> 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | wrapper | >>> 2010/11/26 03:43:30 | Sleep: awake STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: maintain logger STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: process JVM output STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: process socket STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: maintain logger(2) STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: handle wrapper >>> state: STARTED >>> STATUS | wrapper | 2010/11/26 03:43:30 | Loop: handle JVM state: >>> STARTED >>> STATUS | wrapper | 2010/11/26 03:43:30 | Loop: sleep STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: awake STATUS | wrapper | >>> 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | wrapper | >>> 2010/11/26 03:43:30 | Sleep: awake STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: maintain logger STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: process JVM output STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: process socket STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: maintain logger(2) STATUS | wrapper | 2010/11/26 >>> 03:43:30 | Loop: handle wrapper >>> state: STARTED >>> STATUS | wrapper | 2010/11/26 03:43:30 | Loop: handle JVM state: >>> STARTED >>> STATUS | wrapper | 2010/11/26 03:43:30 | Loop: sleep STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | >>> wrapper | 2010/11/26 03:43:30 | Sleep: awake STATUS | wrapper | >>> 2010/11/26 03:43:30 | Sleep: nanosleep 100ms STATUS | wrapper | >>> 2010/11/26 03:43:31 | Sleep: awake STATUS | wrapper | 2010/11/26 >>> 03:43:31 | Loop: maintain logger STATUS | wrapper | 2010/11/26 >>> 03:43:31 | Loop: process JVM output STATUS | wrapper | 2010/11/26 >>> 03:43:31 | Loop: process socket STATUS | wrapper | 2010/11/26 >>> 04:56:49 | Loop: maintain logger(2) STATUS | wrapper | 2010/11/26 >>> 04:56:49 | Loop: handle wrapper >>> state: STARTED >>> STATUS | wrapper | 2010/11/26 04:56:49 | Loop: handle JVM state: >>> STARTED >>> STATUS | wrapper | 2010/11/26 04:56:49 | Loop: sleep STATUS | >>> wrapper | 2010/11/26 04:56:49 | Sleep: nanosleep 100ms STATUS | >>> wrapper | 2010/11/26 04:56:49 | Sleep: awake >>> >>> Current time on my machine is 03:45:12 >>> >>> Best, Johannes >>> >>> -----Original Message----- >>> From: Leif Mortenson [mailto:le...@ta...] >>> Sent: Friday, November 26, 2010 12:27 PM >>> To: Johannes Nicolai >>> Cc: Leif Mortenson; <wra...@li...> >>> Subject: Re: [Wrapper-user] ServiceWrapper starts to hang in >>> wrapperSleep after some seconds >>> >>> Johannesburg, >>> That looks like a big clue. >>> >>> What is the timestamp in the lines after the "03:20:19" line? Does >>> the time go back immediately? >>> >>> To get more output, can you try also setting the following? >>> >>> Wrapper.loop_output=true >>> wrapper.timer_output=true >>> >>> I am on my mobile but I think those are correct. I am wondering if >>> the time is being set high, then back again. The logging code queries >>> the system time for each line. It immediately shows changes when >>> there are time adjustments. >>> >>> Something like ntp changes seems unlikely as it is always right after >>> the Wrapper starts though. >>> >>> - Leif >>> >>> >>> On 2010/11/26, at 19:22, "Johannes Nicolai" <jni...@co...> wrote: >>> >>>> Hi Leif, >>>> >>>> Yes, once the process starts to hang (after ten seconds to one >>> minute), both threads are always in the nanonsleep syscall. >>>> These are the last lines in wrapper.log: >>>> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms >>>> STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: awake STATUS | >>>> wrapper | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms STATUS | >>>> wrapper | 2010/11/26 02:07:01 | Sleep: awake STATUS | wrapper | >>>> 2010/11/26 02:07:01 | Sleep: nanosleep 100ms STATUS | wrapper | >>>> 2010/11/26 02:07:01 | Sleep: awake STATUS | wrapper | 2010/11/26 >>>> 02:07:01 | Sleep: nanosleep 100ms STATUS | wrapper | 2010/11/26 >>>> 02:07:01 | Sleep: awake STATUS | wrapper | 2010/11/26 02:07:01 | >>>> Sleep: nanosleep 100ms STATUS | wrapper | 2010/11/26 02:07:01 | >>>> Sleep: awake STATUS | wrapper | 2010/11/26 02:07:01 | Sleep: >>>> nanosleep 100ms STATUS | wrapper | 2010/11/26 03:20:19 | Sleep: >>>> awake >>>> >>>> The last output that differs from this pattern is DEBUG | wrapperp | >>>> 2010/11/26 02:06:39 | send a packet PING : ping STATUS | wrapper | >>>> 2010/11/26 02:06:39 | Sleep: nanosleep 100ms STATUS | wrapper | >>>> 2010/11/26 02:06:39 | Sleep: awake STATUS | wrapper | 2010/11/26 >>>> 02:06:39 | Sleep: nanosleep 100ms STATUS | wrapper | 2010/11/26 >>>> 02:06:39 | Sleep: awake INFO | jvm 1 | 2010/11/26 02:06:39 | >>>> WrapperManager Debug: >>> Received a packet PING : ping >>>> INFO | jvm 1 | 2010/11/26 02:06:39 | WrapperManager Debug: Send >>>> a >>> packet PING : ping >>>> DEBUG | wrapperp | 2010/11/26 02:06:39 | read a packet PING : ping >>>> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms >>>> STATUS | wrapper | 2010/11/26 02:06:39 | Sleep: awake STATUS | >>>> wrapper | 2010/11/26 02:06:39 | Sleep: nanosleep 100ms STATUS | >>>> wrapper | 2010/11/26 02:06:39 | Sleep: awake STATUS | wrapper | >>>> 2010/11/26 02:06:39 | Sleep: nanosleep 100ms >>>> >>>> What I find surprising is the huge jump in time from STATUS | wrapper >>>> | 2010/11/26 02:07:01 | Sleep: nanosleep 100ms to STATUS | >>>> wrapper | 2010/11/26 03:20:19 | Sleep: awake >>>> >>>> If you look at the timestamps. The current date at this machine now >>>> is >>> 02:19:44, so the time for the last awake has not been reached by now. >>>> >>>> Best, Johannes >>>> >>>> -----Original Message----- >>>> From: le...@ta... [mailto:le...@ta...] On >>>> Behalf Of Leif Mortenson >>>> Sent: Friday, November 26, 2010 11:03 AM >>>> To: wra...@li... >>>> Cc: Johannes Nicolai >>>> Subject: Re: [Wrapper-user] ServiceWrapper starts to hang in >>>> wrapperSleep after some seconds >>>> >>>> Johannes, >>>> When you get the hang, are you consistently seeing that both of the >>> threads are getting stuck in the nanosleep calls? >>>> >>>> We have a low level debug mode that can be enabled to show calls to >>> sleep. This was originally added to debug problems with usleep many >>> years ago. It is not possible to go back to usleep because it has >>>> many problems with multithreads and interrupts. Please try setting >>>> the following and reproducing the problem. >>>> >>>> wrapper.sleep_output=TRUE >>>> >>>> It will only give us new information if any nanosleep calls are >>> failing for any reason however. Otherwise it will simply say when the >>> calls start and stop. >>>> >>>> We are continuing to look into this. >>>> >>>> Cheers, >>>> Leif >>>> >>>> >>>> On Fri, Nov 26, 2010 at 11:58 AM, Christian Mueller >>> <chr...@ta...> wrote: >>>>> Christopher & Johannes, >>>>> >>>>> nanosleep is kind of very low-level system function, for which I >>>>> can't see any problems like this to be common. >>>>> >>>>> There are some hooks in your mail I'd like to get closer, you said >>>>> you are running a 2CPU instance and on vmware. So i'm wondering >>>>> whether sth. with this configuration is not working as expected. >>>>> >>>>> could you please run "cat /proc/<pid>/status" on the machine where >>>>> <pid> is the and send us also the output from there? >>>>> Furthermore, is NTP activated on the host? >>>>> >>>>> I did some tests on our virtual machines but were not able to >>>>> reproduce the behavior you described yet... >>>>> >>>>> Cheers, >>>>> Christian >>>>> >>>>> >>>>> >>>>> On Wed, Nov 24, 2010 at 5:26 PM, Leif Mortenson >>>>> <lei...@ta...> wrote: >>>>>> Christopher & Johannes, >>>>>> Thank you for the detailed report. This is not a problem which I >>>>>> am >>> >>>>>> aware of to date. We set up a 2 CPU VM today with cent OS and are >>>>>> running some tests locally. So far we have not been able to >>>>>> reproduce the problem. >>>>>> >>>>>>> From what you said, it sounds like this problem only happens on >>>>>> shutdown, and can be reproduced easily? >>>>>> >>>>>> Could you please try and reproduce this with the following settings >>>>>> included in the wrapper.conf and then send me the resulting >>>>>> wrapper.log file directly? It would be a bit large for the list. >>>>>> wrapper.debug=true >>>>>> wrapper.state_output=TRUE >>>>>> >>>>>> It sounds like the nanosleep implementation is getting stuck for >>>>>> some reason. It is interesting that an interrupt caused by a >>>>>> signal >>> >>>>>> is waking it up. We do not currently have a way to change to usleep >>>>>> without recompiling. >>>>>> >>>>>> Cheers, >>>>>> Leif >>>>>> >>>>>> >>>>>> On Tue, Nov 23, 2010 at 7:31 PM, Christopher Taylor >>> <ct...@co...> wrote: >>>>>>> Hi all, >>>>>>> >>>>>>> >>>>>>> >>>>>>> I'm forwarding this on behalf of Johannes, he seems to be having >>>>>>> problems with his list subscription. I apologize if this is a >>>>>>> double post, I forgot to add the subject on the first mail. >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- snip --- >>>>>>> >>>>>>> >>>>>>> >>>>>>> I am using ServiceWrapper (stable version 3.5.6) under 32bit >>>>>>> CentOS >>>>>>> 5 in a VM with 2 CPUs. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Whenever I stopped a Java program with the wrapper (different >>>>>>> configurations), the wrapper stopped pinging the JVM after some >>>>>>> seconds (which eventually leads to the JVM ending itself because >>>>>>> it >>> >>>>>>> does not receive ping packages from the wrapper any more). >>>>>>> >>>>>>> >>>>>>> >>>>>>> The last system call that never returns according to strace is >>>>>>> always >>>>>>> nanosleep: >>>>>>> >>>>>>> >>>>>>> >>>>>>> gettimeofday({1290172014, 465559}, NULL) = 0 >>>>>>> >>>>>>> read(5, 0x8895070, 1024) = -1 EAGAIN (Resource >>>>>>> temporarily >>>>>>> unavailable) >>>>>>> >>>>>>> gettimeofday({1290172014, 465610}, NULL) = 0 >>>>>>> >>>>>>> recv(7, 0xbffa98cb, 1, 0) = -1 EAGAIN (Resource >>>>>>> temporarily >>>>>>> unavailable) >>>>>>> >>>>>>> waitpid(13065, 0xbffa98a8, WNOHANG|WSTOPPED) = 0 >>>>>>> >>>>>>> nanosleep({0, 100000000}, NULL) = 0 >>>>>>> >>>>>>> gettimeofday({1290172014, 566505}, NULL) = 0 >>>>>>> >>>>>>> read(5, 0x8895070, 1024) = -1 EAGAIN (Resource >>>>>>> temporarily >>>>>>> unavailable) >>>>>>> >>>>>>> gettimeofday({1290172014, 566552}, NULL) = 0 >>>>>>> >>>>>>> recv(7, 0xbffa98cb, 1, 0) = -1 EAGAIN (Resource >>>>>>> temporarily >>>>>>> unavailable) >>>>>>> >>>>>>> waitpid(13065, 0xbffa98a8, WNOHANG|WSTOPPED) = 0 >>>>>>> >>>>>>> nanosleep({0, 100000000}, >>>>>>> >>>>>>> >>>>>>> >>>>>>> If I call pstack on the wrapper, you can see that two threads >>>>>>> currently hang in the nanosleep method: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thread 2 (Thread -1210287216 (LWP 13115)): >>>>>>> >>>>>>> #0 0x00994410 in __kernel_vsyscall () >>>>>>> >>>>>>> #1 0x00d22506 in __nanosleep_nocancel () from >>>>>>> /lib/libpthread.so.0 >>>>>>> >>>>>>> #2 0x0805b0d0 in wrapperSleep () >>>>>>> >>>>>>> #3 0x0805b420 in timerRunner () >>>>>>> >>>>>>> #4 0x00d1b2db in start_thread () from /lib/libpthread.so.0 >>>>>>> >>>>>>> #5 0x00c7512e in clone () from /lib/libc.so.6 >>>>>>> >>>>>>> Thread 1 (Thread -1208169792 (LWP 13114)): >>>>>>> >>>>>>> #0 0x00994410 in __kernel_vsyscall () >>>>>>> >>>>>>> #1 0x00d22506 in __nanosleep_nocancel () from >>>>>>> /lib/libpthread.so.0 >>>>>>> >>>>>>> #2 0x0805b0d0 in wrapperSleep () >>>>>>> >>>>>>> #3 0x08059f6c in wrapperEventLoop () >>>>>>> >>>>>>> #4 0x08056628 in wrapperRunConsole () >>>>>>> >>>>>>> #5 0x0805cce3 in main () >>>>>>> >>>>>>> #0 0x00994410 in __kernel_vsyscall () >>>>>>> >>>>>>> >>>>>>> >>>>>>> If I send a signal to the wrapper, it reacts again, but before, it >>>>>>> hangs forever in the nanosleep method. >>>>>>> >>>>>>> Is this a known problem of ServiceWrapper running with multiple >>> CPUs? >>>>>>> >>>>>>> Do I have to recompile with usleep support instead or is there an >>>>>>> option to always use usleep? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Best, |