Menu

#48 Wrapper 3.1 Timer Problems

v3.1.0
closed-fixed
Misc (42)
5
2014-08-21
2004-05-07
Marc Lennox
No

We use wrapper to run our CruiseControl build loop
execution as a windows service. We recently upgraded
to 3.1 from 3.0.5 and have noticed a number of problems
that have forced us to revert to 3.0.5.

Initially, we did not make any configuration changes to
wrapper, we only simply upgraded to the latest release.
During our long overnight build and test, we started
getting errors such as:

Wrapper Process has not received any CPU time for
2787374857 seconds. Extending timeouts.

As well, the build process would simply hang
approximately 3 hours into it.

Next, I tried tweaking some of the new properties to use
the tick timer instead of the clock timer. With these
parameters, I started getting the following:

The system clock fell behind the timer by 69100ms.
FATAL | wrapper | 2004/05/07 03:26:42 | encountered
a fatal error in Wrapper
FATAL | wrapper | 2004/05/07 03:26:42 |
exceptionCode = EXCEPTION_ACCESS_VIOLATION
FATAL | wrapper | 2004/05/07 03:26:42 |
exceptionFlag =
EXCEPTION_NONCONTINUABLE_EXCEPTION
FATAL | wrapper | 2004/05/07 03:26:42 |
exceptionAddress = 0040E42A
FATAL | wrapper | 2004/05/07 03:26:42 | Write
access exception to 00000000
FATAL | wrapper | 2004/05/07 03:26:42 | Fatal error
in the Timer thread.

Again, this happened approximately 2 - 3 hours into the
process.

As background, the Wrapper is controlling the
CruiseControl java process, however CruiseControl is
configured to launch another JVM to perform the build
loop, which takes 5 hours.

It seems that the new version of wrapper has a problem
with "waiting" for the other JVM to exit for longer than a
certain time.

We have never seen these types of issues with 3.0.5

Thank you.

Discussion

1 2 > >> (Page 1 of 2)
  • Leif Mortenson

    Leif Mortenson - 2004-05-15

    Logged In: YES
    user_id=228081

    Marc,
    Ouch. That sounds like a pair of bad bugs. I will start
    looking at the code to try and figure this out.

    What platform are you running on? (Windows?)

    Also could you please try running your test with the
    following property enabled it will give me much more useful
    information. It will generate a very large log file
    however. If you could attach about the 3-5 minutes before
    the problem occurrs in each timer case it would be very
    helpful in tracking this down.

    wrapper.state_output=TRUE

    Also please post your full wrapper.conf file. Files can be
    attached at the bottom of this issue form.

    Thanks,
    Leif

     
  • Leif Mortenson

    Leif Mortenson - 2004-05-15
    • priority: 5 --> 9
    • assigned_to: nobody --> mortenson
     
  • Marc Lennox

    Marc Lennox - 2004-05-19

    Logged In: YES
    user_id=505589

    We are running on Windows XP. I will enable the property
    you have given me hopefully this week. I have to wait for a
    night where we don't need an overnight build, but I will get
    you this information as soon as I can.

    Thanks very much.

    As for the wrapper.conf, I will attach it now.

     
  • Marc Lennox

    Marc Lennox - 2004-05-19

    Wrapper Configuration File

     
  • Marc Lennox

    Marc Lennox - 2004-05-24

    Logged In: YES
    user_id=505589

    Well, isn't that always the case. I simply can't reproduce this
    problem now. I've got the overnight build/test runs working
    fine with wrapper 3.1.0 for the past 3 nights. I'll keep it going
    to see if this problem shows up again.

     
  • Leif Mortenson

    Leif Mortenson - 2004-05-24

    Logged In: YES
    user_id=228081

    Marc,
    Because it works in most cases, this is most likely a timing
    problem so the addition of that output may just be changing
    things enough to make the probem go away.

    That state info would have been a dream come true in
    tracking this down, but lets try it with just the regular
    debug output. That may still get me enough info.

    wrapper.state_output=FALSE # the default
    wrapper.debug=TRUE

    Also you are running XP as am I. Could you please tell me a
    little more about that system. multi CPU? memory? CPU
    speed? Can you guess at the system load at the time of
    this problem?

    Thanks,
    Leif.

     
  • Marc Lennox

    Marc Lennox - 2004-05-26

    Logged In: YES
    user_id=505589

    After turning the state output off, it worked fine for one
    night, then exhibited the same error the next night. I then
    turned on the debugging flag as you asked, and it worked fine
    last night. I'll leave it on tonight and see if it fails.

    As for the system specs:

    Windows XP Pro SP-1
    Dell P4 3.0 GHz
    2 GB RAM

    At the time when the error happened the system would have
    been highly loaded on the CPU... lots of memory to spare
    though.

     
  • Marc Lennox

    Marc Lennox - 2004-05-28

    Logged In: YES
    user_id=505589

    Well, I've turned on the debugging that you asked, and now it
    won't fail... it has worked for 2 nights straight. I'll continue
    with debugging on, but it doesn't look like this thing is going
    to fail with any useful debugging turned on.

     
  • Leif Mortenson

    Leif Mortenson - 2004-05-31

    Logged In: YES
    user_id=228081

    Marc,
    I don't see how this would be causing this problem. But I
    just noticed that you have configured the Wrapper to pass in
    the -Xrs parameter to JVM. The Wrapper's signal handling
    features will not
    work correctly when this is enabled. Were you encountering
    a problem that prompted you to add this or is it just a
    carryover from the original batch files used to launch your
    application.

    The -Xrs feature should not be needed with the Wrapper
    because the Wrapper already protects the JVM from receiving
    such signals.

    Still looking into what could be causing the problems you
    are seeing :-/

    Cheers,
    Leif

     
  • Leif Mortenson

    Leif Mortenson - 2004-05-31

    Logged In: YES
    user_id=228081

    Marc,
    I wanted to confirm that the following message was copied
    directly
    from the log and not entered by hand.

    Wrapper Process has not received any CPU time for
    2787374857 seconds. Extending timeouts.

    The problem is that that number of seconds is a623f709h, or
    it is using all 32 bits of the 32 bit number. The problem
    is that the number is returned by a function which divided
    another number by 1000 just before returning it. I have
    not been able to think of a way that you could have gotten
    such a large number... There may be some memory
    corruption going on... But it is stumping me at the moment...

    Cheers,
    Leif

     
  • Marc Lennox

    Marc Lennox - 2004-05-31

    Logged In: YES
    user_id=505589

    I think I added the -Xrs because "it seemed like the right
    thing to do"... I'll remove it.

    As for the message, my apologies, I had lost the original log
    file, so I did it from memory. I remember that the number was
    very large (same approximate number of digits as shown).

    Also keep in mind that this error was happening when not
    using the tick timer, which I have not done from some time.
    I've been using the tick timer option for a while now, and
    can's seem to reproduce that problem.

     
  • Marc Lennox

    Marc Lennox - 2004-06-01

    Logged In: YES
    user_id=505589

    Turned off all debugging, got the error again... doesn't look
    like I'll be able to provide you with any useful debug
    information.

    INFO | jvm 1 | 2004/06/01 04:34:37 | The system clock
    fell behind the timer by 62200ms.
    FATAL | wrapper | 2004/06/01 04:34:37 | encountered a
    fatal error in Wrapper
    FATAL | wrapper | 2004/06/01 04:34:37 |
    exceptionCode = EXCEPTION_ACCESS_VIOLATION
    FATAL | wrapper | 2004/06/01 04:34:37 | exceptionFlag
    = EXCEPTION_NONCONTINUABLE_EXCEPTION
    FATAL | wrapper | 2004/06/01 04:34:37 |
    exceptionAddress = 0040E42A
    FATAL | wrapper | 2004/06/01 04:34:37 | Write access
    exception to 00000000
    FATAL | wrapper | 2004/06/01 04:34:37 | Fatal error in the
    Timer thread.

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-02

    Logged In: YES
    user_id=228081

    Marc,
    I have now been able to reproduce this on two machines.
    The first was purely by accident. I went to upgrade an
    application at one of my customers using the newest version
    of the Wrapper and it would crash 9 times out of 10 within a
    few seconds of launching. It only happens when the new tick
    timer is enabled. It is a large multi CPU Sparc system.
    The problem is that I could not find another sparc system
    that would reproduce the problem and that particular machine
    runs all of their mission critical applications. It is not
    possible for me to do any debugging on it.

    I did get several runs with wrapper.debug=true and
    wrapper.state_output=true enabled. Several times the
    Wrapper crashed before a single message was even outputted
    to the log. Other times it would happen right as the JVM
    was being launched or immediately there after. This makes
    me think that there is some interference going on between
    the main thread and the tick timer thread. What that is
    however, I am still unsure of. The tick timer thread is by
    design extremely simple... The problem does not appear to
    be related to some normal state change with the operation of
    the Wrapper. This is not surprising as all of that code is
    by design identical no matter which timing method is being used.

    The second machine is a dual CPU W2K box. I got the exact
    same error message, down to the memory address as you
    reported. It also happened about 3 hours after being
    started. I did not however get the message about the system
    clock being behind by 69 seconds.

    That part actually has me a little puzzled. It is expected
    that the system clock will get ahead of the tick timer if
    the system is under heavy load, but the only way that I have
    seen the opposite is if the system time was modified while
    the Wrapper is running. Do you have any software running
    that might be doing that? Maybe synchronizing the time with
    another server?

    Anyway. I have a version of the Wrapper running on that
    same server which was compiled with DEBUG enabled. I am
    hoping to get some more useful information that way.

    Making a little progress, but still don't have much of a
    clue as to what is causing this...

    Thanks for all your help trying to track this down. For
    now go back to using the system tick timer and let me know
    if you are able to reproduce that message about not getting
    any CPU for a large number of seconds. I was able to get
    quite a bit of output from that sparc system and I don't
    think that any more from you would be all that helpful in
    tracking this down. As long as I am able to reproduce this
    more than once on my W2K box, I should be able to track it
    down. Just may take a little while. I will let you know
    if there is anything else that I need help with.

    Will keep you posted.
    Cheers,
    Leif

     
  • Marc Lennox

    Marc Lennox - 2004-06-02

    Logged In: YES
    user_id=505589

    I'll switch back to the system clock timer later this week. We
    have to get out some critical builds this week, so it will have
    to wait.

    As for the system clock, keep in mind that we're using
    Windows XP, which does have the automatic time
    synchronization service... could this be causing the problem?

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-06

    Logged In: YES
    user_id=228081

    Marc,
    Good news. I think I got this fixed on Windows. There was
    a synchronization problem in the logging code. The Windows
    APIs for accessing file handles all show that they should be
    thread safe. But they occasionally give various heap
    access errors deep within the ntdll.dll code. I was able
    to find other references on the web about that code being
    buggy in multi threaded environments.

    The fix was to wrap the code in a Mutex to guarantee that
    only one thread is ever performing logging at any given
    instant. I am still doing long term testing, but the
    problem that you reported is now fixed. After I have done
    a little more testing here I will put up a version that you
    may wish to help test out with prior to the next release? :-)

    The problem is that the code should have been Ok on UNIX
    platforms. I went ahead and added the equivalent code for
    those platforms and it appears to work fine. But I am not
    convinced that the crashes that I was seeing on my
    customer's Solaris box were being caused by this section of
    code. I'll try out this new version on their box the next
    time it is safe to do so. I am also working on getting
    access to a multi CPU solaris box for testing. Just about
    have that lined up. Then I am hoping that I can reproduce
    the problem with 3.1.0 before testing to see whether this
    new fix has affected it.

    Cheers,
    Leif

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-06
    • labels: --> Misc
    • status: open --> open-fixed
     
  • thc

    thc - 2004-06-08

    Logged In: YES
    user_id=1001884

    We changed from 3.0.5 to 3.1.0 for about one month, and
    observed similar problem on one of the machines running the
    wrapper. It happend several times over the period of time:

    INFO | jvm 1 | 2004/05/03 14:06:00 | JVM Process has
    not received any CPU time for 429492510 seconds. Extending
    timeouts.
    INFO | jvm 1 | 2004/05/03 16:53:59 | JVM Process has
    not received any CPU time for 429496729 seconds. Extending
    timeouts.
    INFO | jvm 1 | 2004/06/01 11:58:49 | JVM Process has
    not received any CPU time for 429496468 seconds. Extending
    timeouts.
    INFO | jvm 1 | 2004/06/02 16:47:43 | JVM Process has
    not received any CPU time for 429496729 seconds. Extending
    timeouts.
    INFO | jvm 1 | 2004/06/07 13:50:40 | JVM Process has
    not received any CPU time for 429496504 seconds. Extending
    timeouts.

    I'm turning on the DEBUG mode to see if any additional
    messages would come up when this happens.

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-14

    Logged In: YES
    user_id=228081

    I am still not sure where you are getting those large
    "429492510" second messages. Thanks for another report.
    The exact numbers should hep me track it down however.

    I am wondering if this is maybe this is being caused by the
    same synchronization problems that were leading to the
    crash. Would it be possible for you both to give a
    development build a try? I have placed version 3.1.1c up on
    my server at:
    http://wrapper.tanukisoftware.org/tmp/wrapper_win32_3.1.1c.zip

    Thanks,
    Leif

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-14

    Logged In: YES
    user_id=228081

    Update:
    I have also figured out the cause of the crashes on startup
    on Solaris systems. It was actually a completely unrelated
    problem. When running with the tick timer on a multi-CPU
    system, the main wrapper process would sometimes receive a
    SIGALRM signal within a few milliseconds of the JVM being
    launched. It would not happen if wrapper.startup.delay was
    set > 0. I am not sure where the signal is coming from,
    but everything works fine if I simply catch and ignore it.
    I will do some asking around to see if anyone knows why I
    might be receiving it.

    Cheers,
    Leif

     
  • thc

    thc - 2004-06-14

    Logged In: YES
    user_id=1001884

    Leif,

    Thanks for your response.

    I would love to try a dev-build, but our software is running
    on a RedHat 9.0. In addition, we found out later the large
    timeout value happened when the system was actaully under
    heavy load. The reported large number is not a problem to
    our system, but the process is terminated by the wrapper and
    not coming back until a manual restart, which would be an
    issue to us:

    INFO | jvm 1 | 2004/06/07 13:50:40 | JVM Process has
    not received any CPU time for 429496504 seconds. Extending
    timeouts.
    INFO | jvm 1 | 2004/06/07 13:51:39 | Wrapper Manager:
    The Wrapper code did not ping the JVM for 59 seconds. Quit
    and let the Wrapper resynch.
    INFO | jvm 1 | 2004/06/07 13:51:39 | Wrapper Manager:
    Unable to unregister shutdown hook: null
    ERROR | wrapper | 2004/06/07 13:51:48 | Shutdown failed:
    Timed out waiting for the JVM to terminate.
    ERROR | wrapper | 2004/06/07 13:51:53 | JVM did not exit
    on request, terminated with SIGKILL
    STATUS | wrapper | 2004/06/07 13:51:54 | <-- Wrapper Stopped

    Thanks again.
    THC

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-15

    Logged In: YES
    user_id=228081

    Ok, I think I got the last of this thread figured out. The
    very large cpu timeout values were being caused by an error
    in the way I was calculating the difference between two tick
    counts when the result should be negative. Even if you are
    not using the new tick timer, starting with version 3.1.0,
    the system time is converted to ticks so that there is
    almost no redundant code between the system time and tick
    based timers.

    THC, I believe you were seeing that large CPU timeout
    because you are using the system timer and the actual system
    time got set back by at least a few seconds. Many linux
    systems have a daemon running which synchronizes the system
    clock with an external server.

    At any rate, this should be fixed now.

    If you could both go back and test out the issues covered in
    this thread, I would greatly appreciate it. I have put up
    another development build at the following location:

    http://wrapper.tanukisoftware.org/tmp/wrapper_3.1.1d_src.tar.gz
    http://wrapper.tanukisoftware.org/tmp/wrapper_3.1.1d_src.zip
    http://wrapper.tanukisoftware.org/tmp/wrapper_linux_3.1.1d.tar.gz
    http://wrapper.tanukisoftware.org/tmp/wrapper_win32_3.1.1d.zip

    Thanks,
    Leif

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-15
    • milestone: --> v3.1.0
     
  • Marc Lennox

    Marc Lennox - 2004-06-18

    Logged In: YES
    user_id=505589

    OK, it seems to be working fine for us now with the 3.1.1d
    version.

     
  • thc

    thc - 2004-06-25

    Logged In: YES
    user_id=1001884

    It also appears to be running fine on our system so far.
    Thanks.

     
  • Leif Mortenson

    Leif Mortenson - 2004-06-26

    Logged In: YES
    user_id=228081

    Thanks guys. Glad it is now fixed. I am hoping to get the
    release out soon. I was ready buy a FreeBSD problem was
    discovered during the release process. As soon as that is
    fixed the release goes out.

    Cheers,
    Leif

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB