From: Bill L. <bli...@to...> - 2003-10-18 01:34:02
Hi Leif-

As always, thanks for your help with this issue. See my comments below.

-Bill

> Bill,
> That is a big log (20MB) I asked for it though. :-) I found a
> single restart in the logs.
> You originally started the application at 2003/10/08 17:30:04 and it
> was running fine until 2003/10/14 10:43:03 when it was restarted due
> to a ping timeout. The service was then stopped manually at
> 2003/10/15 07:49:06. You are correct. That is a long time to
> reproduce the problem.

I have actually been monitoring the computer since approximately
2003/10/02, so it is even worse than that. Sometimes it can happen
within a day or two of starting the application, and sometimes it is
close to a month.

> Scanning through the logs, it looks like the highest frequency of
> garbage collection happened right before the JVM was restarted. Each
> of the individual GC sweeps was very short, but there were a lot of
> them.

You may be right but it may be impossible to tell from the logs. We are
using a flag that instructs the JVM to trace the GCs and to send them to
a file. Unfortunately, not all of the GC messages are sent to the file.
Some of them are sent to stdout (or it could be stderr, I don't know).
These are the messages that end up in your logs. Doubly (or triply)
unfortunate, the JVM overwrites the GC log on restart, so the GC
behavior captured in the file before the app stopped is now gone.
(Hmmm, maybe I should turn off the GC messages to the file and turn on
the flag to send them to stdout, but that may flood the wrapper logs.)

> Looking at the log, the last successful ping was at 2003/10/14
> 10:39:57, or 186 seconds before Wrapper timed out waiting for a ping.
> The previous pings had all been completing like clockwork once every
> 6 seconds.
>
> One thing I noticed is that there is no Java side output in the log
> except for immediately after the JVM is launched.
> Is your application redirecting this output? And if so would it be
> possible for you to send me that as well? It might give me some
> additional clues. Esp whether or not the JVM is receiving the final
> ping request.

Yes, our logs redirect stdout and stderr. I sent you our app logs a
little while after I sent you the Wrapper logs. Let me know if you did
not get them.

> From the log so far, I do not have a lot of ideas. Everything is
> running fine and then the JVM stops receiving or responding to pings.
> I have a Wrapper controlled app running on a Win2k at home that has
> been up for about 7 weeks, so I don't think it is a time issue.
>
> I'll try and think of other ideas.
>
> >> The problem is that before the JVM is restarted, there are no
> >>messages from the JVM about having received any packets.
> >
> >I will go back through the logs and see when the wrapper behavior
> >changed and will see if it correlates with any events on the
> >application side.
>
> Great, let me know what you find out.
>
> >>collection by adding the -Xincgc. I was not sure what the
> >>-XX:+UseConcMarkSweepGC option does?
> >
> >A couple of months ago, we had some major memory/garbage collection
> >issues. After investigation we have found that for our application:
> >
> >1. When using the default garbage collector, if a major collection is
> >performed while some of the JVM is sitting in the paging file, the GC
> >times can increase up to 2 orders of magnitude. We were getting some
> >80 - 90 second garbage collections! Doubling the RAM solved this
> >problem.
> >
> >2. We made further improvements in our GC times by using a GC
> >strategy that is new to 1.4.2, the Concurrent Low Pause collector.
> >There is lots of information out there about the new GC strategies.
> >One of the better ones is here:
> >http://java.sun.com/docs/hotspot/gc1.4.2/. From that web page, it
> >says to: "Use the concurrent low pause collector if your application
> >would benefit from shorter garbage collector pauses and can afford to
> >share processor resources with the garbage collector when the
> >application is running." I could be wrong, but I am pretty sure that
> >time in GC is not the issue here.
>
> Thanks always more things to study up on.... Thanks for the link.
>
> >>Also try extending your wrapper.ping.timeout to around 300, 5
> >>minutes. If the problem is GC related, that will hopefully be long
> >>enough to make the problem go away. If the problem is GC related,
> >>then your application would be unresponsive to its clients and not
> >>just the Wrapper during this time however, have you seen such
> >>problems?
> >
> >I would rather not do that right now. It feels to me like there is
> >some problem between the wrapper and the application. The application
> >is not working hard and I don't think it is experiencing major GC
> >pauses. Because it happens so infrequently I would like to do that as
> >a last resort because I won't be comfortable that the issue is fixed
> >for a while.
>
> Ok, I'll try to think of some other causes. That is the only thing in
> the logs right now so it is what first comes to mind...
>
> >> I can't think of anything off hand that I have fixed since version
> >>3.0.2 that would affect this, but there have been lots of
> >>improvements to the wrapper. You may want to consider upgrading to
> >>version 3.0.5
> >
> >We can upgrade on a future version, however the application is part
> >of a medical device that has tight FDA constraints. We could change
> >the version, but it would be a lot of work.
> >There would be documentation to change, and even worse, we would
> >have to rerun many tests. If we knew it would fix the problem, then
> >we would go ahead and do it. Otherwise, I don't want to change.
>
> Ok, go ahead and stick with 3.0.2 for now. I don't think there were
> any changes that would affect this anyway.
>
> You can play with the ping timeout in your version. But if you use
> the latest version, you can also change the actual ping interval. May
> be useful.
>
> Cheers,
> Leif

I am reproducing a question I had in one of my messages when I sent the
logs directly to you. Perhaps others would be interested also:

One idea I am kicking around is to turn off the JVM pinging from the
Wrapper. Before we used the Wrapper, the application ran continually
and without problem for up to 6 weeks. And that was on a Windows 2K
box! There have been changes since then but after we solved the
memory/GC issues, it still appears very stable, with the exception of
this problem.

If I do go ahead and turn off the Wrapper's JVM pinging, do you foresee
any possible problems? I am concerned that what is happening degrades
the communication between the Wrapper and my application. After
launching the application, what is the Wrapper actively doing if
pinging is disabled?
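
For anyone following the thread, the settings discussed above would look roughly like this in wrapper.conf. This is only a sketch and not the configuration actually in use: the index numbers on wrapper.java.additional and the gc.log file name are placeholders, -Xloggc is an assumption about which GC trace flag is meant, and whether a timeout of 0 disables the ping check should be confirmed against the documentation for the Wrapper version in use (3.0.2 here).

```properties
# JVM options passed through by the Wrapper.
# The index numbers (1, 2) are placeholders; use the next free ones.
wrapper.java.additional.1=-XX:+UseConcMarkSweepGC

# Assumed GC trace flag: writes GC events to a file.
# The file name is a placeholder; the JVM overwrites it on each restart.
wrapper.java.additional.2=-Xloggc:gc.log

# Seconds the Wrapper waits for a ping response before it treats the
# JVM as hung and restarts it. 300 is the value suggested in the thread.
wrapper.ping.timeout=300

# Unverified for 3.0.2: a value of 0 is documented in later versions as
# "never time out", which is the closest thing to turning pinging off.
#wrapper.ping.timeout=0

# Only available in newer Wrapper versions; the value is illustrative.
#wrapper.ping.interval=5
```

Raising the timeout only hides long pauses from the Wrapper. It does not show whether the JVM actually received the last ping, so the application-side logs are still needed to answer that.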