Java Service Wrapper / Bugs / #48 Wrapper 3.1 Timer Problems

Leif Mortenson - 2004-05-15

Logged In: YES
user_id=228081

Marc,
Ouch. That sounds like a pair of bad bugs. I will start
looking at the code to try and figure this out.

What platform are you running on? (Windows?)

Also could you please try running your test with the
following property enabled it will give me much more useful
information. It will generate a very large log file
however. If you could attach about the 3-5 minutes before
the problem occurrs in each timer case it would be very
helpful in tracking this down.

wrapper.state_output=TRUE

Also please post your full wrapper.conf file. Files can be
attached at the bottom of this issue form.

Thanks,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-05-15

priority: 5 --> 9

assigned_to: nobody --> mortenson
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-19

Logged In: YES
user_id=505589

We are running on Windows XP. I will enable the property
you have given me hopefully this week. I have to wait for a
night where we don't need an overnight build, but I will get
you this information as soon as I can.

Thanks very much.

As for the wrapper.conf, I will attach it now.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-19

Wrapper Configuration File

mxmatrix-build.1.19.properties

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-24

Logged In: YES
user_id=505589

Well, isn't that always the case. I simply can't reproduce this
problem now. I've got the overnight build/test runs working
fine with wrapper 3.1.0 for the past 3 nights. I'll keep it going
to see if this problem shows up again.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-05-24

Logged In: YES
user_id=228081

Marc,
Because it works in most cases, this is most likely a timing
problem so the addition of that output may just be changing
things enough to make the probem go away.

That state info would have been a dream come true in
tracking this down, but lets try it with just the regular
debug output. That may still get me enough info.

wrapper.state_output=FALSE # the default
wrapper.debug=TRUE

Also you are running XP as am I. Could you please tell me a
little more about that system. multi CPU? memory? CPU
speed? Can you guess at the system load at the time of
this problem?

Thanks,
Leif.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-26

Logged In: YES
user_id=505589

After turning the state output off, it worked fine for one
night, then exhibited the same error the next night. I then
turned on the debugging flag as you asked, and it worked fine
last night. I'll leave it on tonight and see if it fails.

As for the system specs:

Windows XP Pro SP-1
Dell P4 3.0 GHz
2 GB RAM

At the time when the error happened the system would have
been highly loaded on the CPU... lots of memory to spare
though.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-28

Logged In: YES
user_id=505589

Well, I've turned on the debugging that you asked, and now it
won't fail... it has worked for 2 nights straight. I'll continue
with debugging on, but it doesn't look like this thing is going
to fail with any useful debugging turned on.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-05-31

Logged In: YES
user_id=228081

Marc,
I don't see how this would be causing this problem. But I
just noticed that you have configured the Wrapper to pass in
the -Xrs parameter to JVM. The Wrapper's signal handling
features will not
work correctly when this is enabled. Were you encountering
a problem that prompted you to add this or is it just a
carryover from the original batch files used to launch your
application.

The -Xrs feature should not be needed with the Wrapper
because the Wrapper already protects the JVM from receiving
such signals.

Still looking into what could be causing the problems you
are seeing :-/

Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-05-31

Logged In: YES
user_id=228081

Marc,
I wanted to confirm that the following message was copied
directly
from the log and not entered by hand.

Wrapper Process has not received any CPU time for
2787374857 seconds. Extending timeouts.

The problem is that that number of seconds is a623f709h, or
it is using all 32 bits of the 32 bit number. The problem
is that the number is returned by a function which divided
another number by 1000 just before returning it. I have
not been able to think of a way that you could have gotten
such a large number... There may be some memory
corruption going on... But it is stumping me at the moment...

Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-05-31

Logged In: YES
user_id=505589

I think I added the -Xrs because "it seemed like the right
thing to do"... I'll remove it.

As for the message, my apologies, I had lost the original log
file, so I did it from memory. I remember that the number was
very large (same approximate number of digits as shown).

Also keep in mind that this error was happening when not
using the tick timer, which I have not done from some time.
I've been using the tick timer option for a while now, and
can's seem to reproduce that problem.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-06-01

Logged In: YES
user_id=505589

Turned off all debugging, got the error again... doesn't look
like I'll be able to provide you with any useful debug
information.

INFO | jvm 1 | 2004/06/01 04:34:37 | The system clock
fell behind the timer by 62200ms.
FATAL | wrapper | 2004/06/01 04:34:37 | encountered a
fatal error in Wrapper
FATAL | wrapper | 2004/06/01 04:34:37 |
exceptionCode = EXCEPTION_ACCESS_VIOLATION
FATAL | wrapper | 2004/06/01 04:34:37 | exceptionFlag
= EXCEPTION_NONCONTINUABLE_EXCEPTION
FATAL | wrapper | 2004/06/01 04:34:37 |
exceptionAddress = 0040E42A
FATAL | wrapper | 2004/06/01 04:34:37 | Write access
exception to 00000000
FATAL | wrapper | 2004/06/01 04:34:37 | Fatal error in the
Timer thread.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-02

Logged In: YES
user_id=228081

Marc,
I have now been able to reproduce this on two machines.
The first was purely by accident. I went to upgrade an
application at one of my customers using the newest version
of the Wrapper and it would crash 9 times out of 10 within a
few seconds of launching. It only happens when the new tick
timer is enabled. It is a large multi CPU Sparc system.
The problem is that I could not find another sparc system
that would reproduce the problem and that particular machine
runs all of their mission critical applications. It is not
possible for me to do any debugging on it.

I did get several runs with wrapper.debug=true and
wrapper.state_output=true enabled. Several times the
Wrapper crashed before a single message was even outputted
to the log. Other times it would happen right as the JVM
was being launched or immediately there after. This makes
me think that there is some interference going on between
the main thread and the tick timer thread. What that is
however, I am still unsure of. The tick timer thread is by
design extremely simple... The problem does not appear to
be related to some normal state change with the operation of
the Wrapper. This is not surprising as all of that code is
by design identical no matter which timing method is being used.

The second machine is a dual CPU W2K box. I got the exact
same error message, down to the memory address as you
reported. It also happened about 3 hours after being
started. I did not however get the message about the system
clock being behind by 69 seconds.

That part actually has me a little puzzled. It is expected
that the system clock will get ahead of the tick timer if
the system is under heavy load, but the only way that I have
seen the opposite is if the system time was modified while
the Wrapper is running. Do you have any software running
that might be doing that? Maybe synchronizing the time with
another server?

Anyway. I have a version of the Wrapper running on that
same server which was compiled with DEBUG enabled. I am
hoping to get some more useful information that way.

Making a little progress, but still don't have much of a
clue as to what is causing this...

Thanks for all your help trying to track this down. For
now go back to using the system tick timer and let me know
if you are able to reproduce that message about not getting
any CPU for a large number of seconds. I was able to get
quite a bit of output from that sparc system and I don't
think that any more from you would be all that helpful in
tracking this down. As long as I am able to reproduce this
more than once on my W2K box, I should be able to track it
down. Just may take a little while. I will let you know
if there is anything else that I need help with.

Will keep you posted.
Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-06-02

Logged In: YES
user_id=505589

I'll switch back to the system clock timer later this week. We
have to get out some critical builds this week, so it will have
to wait.

As for the system clock, keep in mind that we're using
Windows XP, which does have the automatic time
synchronization service... could this be causing the problem?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-06

Logged In: YES
user_id=228081

Marc,
Good news. I think I got this fixed on Windows. There was
a synchronization problem in the logging code. The Windows
APIs for accessing file handles all show that they should be
thread safe. But they occasionally give various heap
access errors deep within the ntdll.dll code. I was able
to find other references on the web about that code being
buggy in multi threaded environments.

The fix was to wrap the code in a Mutex to guarantee that
only one thread is ever performing logging at any given
instant. I am still doing long term testing, but the
problem that you reported is now fixed. After I have done
a little more testing here I will put up a version that you
may wish to help test out with prior to the next release? :-)

The problem is that the code should have been Ok on UNIX
platforms. I went ahead and added the equivalent code for
those platforms and it appears to work fine. But I am not
convinced that the crashes that I was seeing on my
customer's Solaris box were being caused by this section of
code. I'll try out this new version on their box the next
time it is safe to do so. I am also working on getting
access to a multi CPU solaris box for testing. Just about
have that lined up. Then I am hoping that I can reproduce
the problem with 3.1.0 before testing to see whether this
new fix has affected it.

Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-06

labels: --> Misc

status: open --> open-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

thc - 2004-06-08

Logged In: YES
user_id=1001884

We changed from 3.0.5 to 3.1.0 for about one month, and
observed similar problem on one of the machines running the
wrapper. It happend several times over the period of time:

INFO | jvm 1 | 2004/05/03 14:06:00 | JVM Process has
not received any CPU time for 429492510 seconds. Extending
timeouts.
INFO | jvm 1 | 2004/05/03 16:53:59 | JVM Process has
not received any CPU time for 429496729 seconds. Extending
timeouts.
INFO | jvm 1 | 2004/06/01 11:58:49 | JVM Process has
not received any CPU time for 429496468 seconds. Extending
timeouts.
INFO | jvm 1 | 2004/06/02 16:47:43 | JVM Process has
not received any CPU time for 429496729 seconds. Extending
timeouts.
INFO | jvm 1 | 2004/06/07 13:50:40 | JVM Process has
not received any CPU time for 429496504 seconds. Extending
timeouts.

I'm turning on the DEBUG mode to see if any additional
messages would come up when this happens.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-14

Logged In: YES
user_id=228081

I am still not sure where you are getting those large
"429492510" second messages. Thanks for another report.
The exact numbers should hep me track it down however.

I am wondering if this is maybe this is being caused by the
same synchronization problems that were leading to the
crash. Would it be possible for you both to give a
development build a try? I have placed version 3.1.1c up on
my server at:
http://wrapper.tanukisoftware.org/tmp/wrapper_win32_3.1.1c.zip

Thanks,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-14

Logged In: YES
user_id=228081

Update:
I have also figured out the cause of the crashes on startup
on Solaris systems. It was actually a completely unrelated
problem. When running with the tick timer on a multi-CPU
system, the main wrapper process would sometimes receive a
SIGALRM signal within a few milliseconds of the JVM being
launched. It would not happen if wrapper.startup.delay was
set > 0. I am not sure where the signal is coming from,
but everything works fine if I simply catch and ignore it.
I will do some asking around to see if anyone knows why I
might be receiving it.

Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

thc - 2004-06-14

Logged In: YES
user_id=1001884

Leif,

Thanks for your response.

I would love to try a dev-build, but our software is running
on a RedHat 9.0. In addition, we found out later the large
timeout value happened when the system was actaully under
heavy load. The reported large number is not a problem to
our system, but the process is terminated by the wrapper and
not coming back until a manual restart, which would be an
issue to us:

INFO | jvm 1 | 2004/06/07 13:50:40 | JVM Process has
not received any CPU time for 429496504 seconds. Extending
timeouts.
INFO | jvm 1 | 2004/06/07 13:51:39 | Wrapper Manager:
The Wrapper code did not ping the JVM for 59 seconds. Quit
and let the Wrapper resynch.
INFO | jvm 1 | 2004/06/07 13:51:39 | Wrapper Manager:
Unable to unregister shutdown hook: null
ERROR | wrapper | 2004/06/07 13:51:48 | Shutdown failed:
Timed out waiting for the JVM to terminate.
ERROR | wrapper | 2004/06/07 13:51:53 | JVM did not exit
on request, terminated with SIGKILL
STATUS | wrapper | 2004/06/07 13:51:54 | <-- Wrapper Stopped

Thanks again.
THC

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-15

Logged In: YES
user_id=228081

Ok, I think I got the last of this thread figured out. The
very large cpu timeout values were being caused by an error
in the way I was calculating the difference between two tick
counts when the result should be negative. Even if you are
not using the new tick timer, starting with version 3.1.0,
the system time is converted to ticks so that there is
almost no redundant code between the system time and tick
based timers.

THC, I believe you were seeing that large CPU timeout
because you are using the system timer and the actual system
time got set back by at least a few seconds. Many linux
systems have a daemon running which synchronizes the system
clock with an external server.

At any rate, this should be fixed now.

If you could both go back and test out the issues covered in
this thread, I would greatly appreciate it. I have put up
another development build at the following location:

http://wrapper.tanukisoftware.org/tmp/wrapper_3.1.1d_src.tar.gz
http://wrapper.tanukisoftware.org/tmp/wrapper_3.1.1d_src.zip
http://wrapper.tanukisoftware.org/tmp/wrapper_linux_3.1.1d.tar.gz
http://wrapper.tanukisoftware.org/tmp/wrapper_win32_3.1.1d.zip

Thanks,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-15

milestone: --> v3.1.0
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marc Lennox - 2004-06-18

Logged In: YES
user_id=505589

OK, it seems to be working fine for us now with the 3.1.1d
version.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

thc - 2004-06-25

Logged In: YES
user_id=1001884

It also appears to be running fine on our system so far.
Thanks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Leif Mortenson - 2004-06-26

Logged In: YES
user_id=228081

Thanks guys. Glad it is now fixed. I am hoping to get the
release out soon. I was ready buy a FreeBSD problem was
discovered during the release process. As soon as that is
fixed the release goes out.

Cheers,
Leif

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Wrapper 3.1 Timer Problems

Group

Searches

Help

#48 Wrapper 3.1 Timer Problems

Discussion