RE: [Wrapper-user] Occasional wrapper restart under light load

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi Leif-

Thanks again for your attention with this issue. I have sent the zipped
up logs in a separate email. Please find my answers to your questions
below, inline.

-Bill Littman

> -----Original Message-----
> From: Leif Mortenson [mailto:le...@ta...]=20
> Sent: Thursday, October 16, 2003 10:19 PM
> To: wra...@li...
> Subject: Re: [Wrapper-user] Occasional wrapper restart under=20
> light load
>=20
>=20
> Bill,
>     Could you zip up your full unmodified wrapper.log file=20
> and send it=20
> to me directly?
> Most likely too big for the list.  I want to compare the output when=20
> things are going
> smoothly to that just before it fails.
>=20
>     For some reason, the JVM stops replying to ping requests from the=20
> Wrapper.
> After the ping timeout expires, it gives up and restarts the JVM.
>     A normal ping cycle looks like the following:
> DEBUG  | wrapperp | 2003/10/14 10:43:19 | send a packet 103 : ping
> INFO   | jvm 2    | 2003/10/14 10:43:19 | Received a packet 103 : ping
> INFO   | jvm 2    | 2003/10/14 10:43:19 | Send a packet 103 : ok
> DEBUG  | wrapperp | 2003/10/14 10:43:19 | read a packet 103 : ok
> DEBUG  | wrapper  | 2003/10/14 10:43:19 | Got ping response from JVM
>=20
>     The problem is that before the JVM is restarted, there are no=20
> messages from
> the JVM about having received any packets.

I will go back through the logs and see when the wrapper behavior
changed and will see if it correlates with any events on the application
side.

>     You are using close to 500MB of memory.  I have seen the=20
> JVM take a very
> long time to do a single garbage collection sweep.   When this is=20
> happening, the
> %CPU in the task manager does not always show the system as=20
> being all that
> busy.   If this is the problem you might want to try using=20
> incremental=20
> garbage
> collection by adding the -Xincgc.   I was not sure what the
> -XX:+UseConcMarkSweepGC option does?

A couple of months ago, we had some major memory/garbage collection
issues. After investigation we have found that for our application:

1. When using the default garbage collector, if a major collection is
performed while some of the JVM is sitting in the paging file, the GC
times can increase up to 2 orders of magnitude. We were getting some 80
- 90 second garbage collections! Doubling the RAM solved this problem.

2. We made further improvements in our GC times by using a GC strategy
that is new to 1.4.2, the Concurrent Low Pause collector. There is lots
of information out there about the new GC strategies. One of the better
ones is here: http://java.sun.com/docs/hotspot/gc1.4.2/. From that web
page, it says to: "Use the concurrent low pause collector if your
application would benefit from shorter garbage collector pauses and can
afford to share processor resources with the garbage collector when the
application is running." I could be wrong, but I am pretty sure that
time in GC is not the issue here.

> Also try extending your wrapper.ping.timeout to around 300, 5=20
> minutes. =20
> If the
> problem is GC related, that will hopefully be long enough to make the=20
> problem
> go away.   If the problem is GC related, then your=20
> application would be
> unresponsive to its clients and not just the Wrapper during this time=20
> however,
> have you seen such problems?

I would rather not do that right now. It feels to me like there is some
problem between the wrapper and the application. The application is not
working hard and I don't think it is experiencing major GC pauses.
Because it happens so infrequently I would like to do that as a last
resort because I won't be comfortable that the issue is fixed for a
while.

>     I can't think of anything off hand that I have fixed=20
> since version=20
> 3.0.2 that would
> affect this, but there have been lots of improvements to the wrapper.=20
> You may want
> to consider upgrading to version 3.0.5

We can upgrade on a future version, however the application is part of a
medical device that has tight FDA constraints. We could change the
version, but it would be a lot of work. There would be documentation to
change, and even worse, we would have to rerun many tests. If we knew it
would fix the problem, then we would go ahead and do it. Otherwise, I
don't want to change.

> Cheers,
> Leif
>=20
> Bill Littman wrote:
>=20
> >Hi-
> >
> >I have an application running on Windows 2000. It has dual=20
> Xeons and 2
> >Gigs of memory. I am using Sun Java JRE 1.4.1_02 and Wrapper version
> >3.0.2.
> >
> >My application is a CORBA server serving data from a local=20
> DB2 database.
> >The ORB we use is OpenORB 1.3.0.
> >
> >Very occasionally the wrapper restarts the application while the
> >application is under very little load. There is no reason I=20
> can think of
> >why it would do this. I am pretty sure that the application is not
> >running out of resources. Memory usage appears to be fine=20
> and the number
> >of allocated threads appears bounded. The amount of time from
> >application start to when this problem occurs is not constant and can
> >vary wildly.
> >
> >I have attached my config files. A few notes on the config files:
> >
> >1. When this started, I added the lines: wrapper.ping.timeout=3D180 =
and
> >wrapper.cpu.timeout=3D30. This restart occurs so infrequently=20
> that it is
> >impossible to tell if this helps. As the last restart shows,=20
> it does not
> >solve the problem.
> >
> >2. Just recently, I changed the wrapper.logfile.loglevel to DEBUG and
> >added the line: =
wrapper.request_thread_dump_on_failed_jvm_exit=3DTRUE.
> >
> >3. One of the JVM flags, -Xloggc:C:\Tomo\Logs\GC.log, sends garbage
> >collection traces out to a log. There is a JVM bug where some of the
> >messages are sent to the console instead of the log file.=20
> Those messages
> >end up in the wrapper log produced by the application.
> >
> >So, from the attached wrapper log, here is the last two=20
> pings followed
> >by the error handling traces, with the garbage collection traces
> >removed.
> >
> >DEBUG  | wrapperp | 2003/10/14 10:42:51 | send a packet 103 : ping
> >DEBUG  | wrapperp | 2003/10/14 10:42:57 | send a packet 103 : ping
> >ERROR  | wrapper  | 2003/10/14 10:43:03 | JVM appears hung: Timed out
> >waiting for signal from JVM.
> >STATUS | wrapper  | 2003/10/14 10:43:03 | Dumping JVM state.
> >DEBUG  | wrapper  | 2003/10/14 10:43:03 | Sending BREAK=20
> event to process
> >group 2492.
> >ERROR  | wrapper  | 2003/10/14 10:43:03 | Unable to send=20
> BREAK event to
> >JVM process.  Err(6 : The handle is invalid. (0x6))
> >ERROR  | wrapper  | 2003/10/14 10:43:04 | Java Virtual=20
> Machine did not
> >exit on request, terminated
> >STATUS | wrapper  | 2003/10/14 10:43:10 | Launching a JVM...
> >
> >I am not sure what this means. Can someone help?
> >
> >Thank you.
> >
> >-Bill Littman
> > Lead Software Engineer
> > TomoTherapy, Inc.
> > 1240 Deming Way
> > Madison, WI  53717
> > Direct Phone: 608 824-2815
> > Phone: 608 824-2800
> > Fax: 608 824-2996
> > Web address: http://www.tomotherapy.com
> > Email: bli...@to...
> > =20
> >
>=20
>=20
>=20
>=20
> -------------------------------------------------------
> This SF.net email is sponsored by: SF.net Giveback Program.
> SourceForge.net hosts over 70,000 Open Source Projects.
> See the people who have HELPED US provide better services:
> Click here: http://sourceforge.net/supporters.php
> _______________________________________________
> Wrapper-user mailing list
> Wra...@li...
> https://lists.sourceforge.net/lists/listinfo/wrapper-user
>=20