From: Leif M. <lei...@ta...> - 2019-01-10 16:27:05
Christoph
Yes, the following configuration is what causes the Wrapper to restart
based on the text in the console output:
---
wrapper.filter.trigger.1001=java.lang.OutOfMemoryError
wrapper.filter.action.1001=RESTART
wrapper.filter.message.1001=The JVM has run out of memory.
---
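The trigger text is matched against the JVM's console output, so any
OutOfMemoryError line fires it, including the "unable to create new native
thread" variant you saw. If you only want to restart on heap exhaustion, you
could narrow the trigger string, for example (just a sketch, check it against
your JVM's actual output):
---
wrapper.filter.trigger.1001=java.lang.OutOfMemoryError: Java heap space
wrapper.filter.action.1001=RESTART
wrapper.filter.message.1001=The JVM has run out of heap memory.
---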
It sounds like you are on the right track. We will see if we can reproduce
something here as well.
Cheers,
Leif
On Wed, Jan 9, 2019 at 10:16 PM Christoph SCHWAIGER <csc...@am...>
wrote:
> CONFIDENTIAL & RESTRICTED
>
>
>
> Hello Leif,
>
>
>
> Those were the very last entries in the wrapper log:
>
> ERROR | wrapper | 2019/01/07 15:02:45 | Shutdown failed: Timed out
> waiting for signal from JVM.
>
> ERROR | wrapper | 2019/01/07 15:02:46 | JVM did not exit on request,
> termination requested.
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM received a signal SIGKILL
> (9).
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM process is gone.
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM exited after being requested
> to terminate.
>
> STATUS | wrapper | 2019/01/07 15:02:50 | Reloading Wrapper
> configuration...
>
> STATUS | wrapper | 2019/01/07 15:02:50 | Launching a JVM...
>
> The time of the last entry matched the timestamp of the java.status file.
> I only noticed that one day later.
>
>
>
> I didn’t notice this log entry before:
>
> STATUS | wrapper | 2019/01/07 15:02:11 | The JVM has run out of memory.
> Restarting JVM.
>
> …to me it sounds like the wrapper used the exception (the complete log
> section in my first email) to conclude that the JVM had run out of memory
> and should best be restarted. Which, in this situation, was not the case. I
> don’t know why heap depletion is assumed to be the culprit by default when
> a thread cannot be spawned; there are definitely other reasons for that
> failure.
>
>
>
> The user was limited to 1k processes max, but from what I read, threads
> count against that limit. On Stack Overflow I found this command to count
> the number of threads for a user:
>
>
>
> ps -eo euser,nlwp | grep scheck | awk '{print $2}' | awk '{ num_threads
> += $1 } END { print num_threads }'
>
>
>
> Currently it shows 7579 - the JVMs are heavily multithreaded.
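>
> If it helps to see which JVMs contribute the most threads, something like
> this should also work (a sketch, assuming a GNU procps ps that accepts
> nlwp as a sort key):
>
> ps -o nlwp,pid,args -u scheck --sort=-nlwp | head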
>
>
>
> Unfortunately I don’t even have a test box to simulate this. It is about
> time to get one.
>
>
>
> Cheers,
>
> Christoph
>
> *From:* Leif Mortenson [mailto:lei...@ta...]
> *Sent:* 09 January 2019 10:50
> *To:* Wrapper User List <wra...@li...>
> *Subject:* Re: [Wrapper-user] [EXT] Re: no JVM running (state:
> DOWN_CLEAN) on linux after OOM
>
>
>
> Christoph
>
> Ok. Since you are using a newer version of the Wrapper, you can ignore the
> issue I mentioned about failing to kill the JVM. That was an old problem.
>
>
>
> Please send the debug output if you get it again.
>
>
>
> We will play around with the ulimits here as well and make sure the
> Wrapper behaves correctly.
>
>
>
> Maybe I am not understanding the exact problem.
>
> After you get the OOM and the Wrapper tries to restart, is the Wrapper
> just failing to start the next JVM and exiting? Or is it getting stuck?
>
> The latter would be bad, and something we will want to get to the bottom of.
>
>
>
> It does not sound like this is easily reproducible. If you are able to
> test, though, the following property will output detailed information about
> the Wrapper's state. It is a LOT of output, so it is not realistic unless
> you are testing.
>
> wrapper.state_output=TRUE
>
>
>
> Cheers,
>
> Leif
>
>
>
> On Wed, Jan 9, 2019 at 6:11 PM Christoph SCHWAIGER <csc...@am...>
> wrote:
>
> CONFIDENTIAL & RESTRICTED
>
>
>
> Hello Leif,
>
>
>
> Thanks for your response.
>
>
>
> Easy one first, the version we use:
>
> [scheck@muctxp5b scheck_unix4]$ ./wrapper --version
>
> Java Service Wrapper Community Edition 64-bit 3.5.30
>
> Copyright (C) 1999-2016 Tanuki Software, Ltd. All Rights Reserved.
>
> http://wrapper.tanukisoftware.com
>
>
>
> Concerning the forced kill: I think I have once seen something like “..JVM
> received a signal SIGKILL (9)..” in the wrapper log, on another instance at
> another time.
>
>
>
> In the case I looked at, the JVM process owned by the wrapper was gone,
> which fits the DOWN_CLEAN state as you explained it.
>
>
>
> I’ll turn on debug output on a few of them in case it happens again.
>
>
>
> As I interpret it, the configuration as such is OK, and so is the normal
> behaviour: when I kill the JVM manually, for example, the wrapper brings it
> back online. Due to the OOM situation (more precisely, wrapper and JVM were
> limited to 1024 processes max in ulimits) the wrapper was not able to fork
> a command, for example, and that could explain why the recovery stalled. It
> is likely that other wrapper/JVM tandems on the same machine (20-30
> tandems) faced the same trouble and tried to recover, which would mean the
> ceiling was sometimes reached and sometimes not (e.g. when yet another JVM
> with many threads was killed or died). Does this make sense to you?
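>
> For reference, the limit itself can be checked like this (<wrapper_pid> is
> a placeholder):
>
> ulimit -u                       # max user processes/threads in this shell
> cat /proc/<wrapper_pid>/limits  # effective limits of the running wrapper
>
> and raised for the user in /etc/security/limits.conf (16384 is only an
> illustrative value, not a recommendation):
>
> scheck soft nproc 16384
> scheck hard nproc 16384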
>
>
>
> Should I look into updating my script to interpret the output of “app.sh
> status” with regard to certain Java:__ states and kill the wrapper? (In
> such a case the Veritas cluster would consider the resource offline and
> start the wrapper again; see the rough sketch below.)
>
> Whether that is a good idea depends on the number of states to consider and
> on how long such a state can be tolerated. Maybe it is paranoid; since our
> box is very big, we should be fine concerning OOM unless we screw up the
> settings again. We are newbies on Linux, having used Windows for years.
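>
> Something along these lines is what I have in mind (an untested sketch,
> with the state names taken from the status output in my first email):
>
> STATUS_OUT=$(./scheck.sh status)
> case "$STATUS_OUT" in
>     *Wrapper:STARTED*Java:DOWN_CLEAN*)
>         # JVM is down while the wrapper still looks alive: report offline
>         # so the Veritas cluster restarts the resource.
>         exit 1 ;;
>     *)
>         exit 0 ;;
> esac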
>
>
>
> Cheers,
>
> Christoph
>
>
>
> *From:* Leif Mortenson [mailto:lei...@ta...]
> *Sent:* 09 January 2019 03:10
> *To:* Wrapper User List <wra...@li...>
> *Subject:* [EXT] Re: [Wrapper-user] no JVM running (state: DOWN_CLEAN) on
> linux after OOM
>
>
>
> Christoph
>
>
>
> 1) Could you please send me the wrapper.log file with debug output enabled
> (wrapper.debug=true) that shows what is happening when the Wrapper is
> failing to restart the JVM?
>
> Please include the part of the log showing the last few moments of the JVM
> that runs out of memory as well.
>
>
>
> 2) What version of the Wrapper are you running?
>
> The following issue was fixed in 3.5.16 and sounds like it might be what
> you are seeing.
>
> https://wrapper.tanukisoftware.com/doc/english/release-notes.html#3.5.16
>
> ---
>
> Fix a problem where a JVM process was not stopped completely on a UNIX
> platform and stayed defunct after a forced kill until the Wrapper process
> itself stopped. This was especially noticeable if the JVM is frozen and the
> JVM is being killed forcibly.
>
> ---
>
> Are you seeing a zombie Java process still running?
>
> This bug meant that the JVM was being left around in the background when
> the Wrapper thought it was gone.
>
> If you are out of memory then the next JVM would not have enough memory to
> launch.
>
> If the first JVM is not actually frozen, it would shut itself down after
> losing its backend connection to the Wrapper. But that might be happening
> too late and result in what you are seeing.
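>
> A quick way to rule this out is to look for defunct processes (a generic
> check, not specific to the Wrapper):
>
> ---
>
> # list zombie processes; a defunct JVM shows state "Z" and comm "java"
> ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
>
> ---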
>
>
>
> 3) The DOWN_CLEAN state means that the Wrapper has completely shutdown the
> JVM and cleaned up any associated resources.
>
> We will take a look at the documentation on the following page as you are
> correct that it is missing some information.
>
> https://wrapper.tanukisoftware.com/doc/english/prop-java-statusfile.html
>
>
>
> Cheers,
>
> Leif
>
>
>
> On Tue, Jan 8, 2019 at 8:32 PM Christoph SCHWAIGER <csc...@am...>
> wrote:
>
> CONFIDENTIAL & RESTRICTED
>
>
>
> Hello Leif,
>
>
>
> Thanks for the information about the subscription. I did so.
>
>
>
> We have been using the wrapper on Windows for many years, and for a couple
> of years now we have had a standard support version.
>
>
>
> Our problem is on Linux RH. *After an out-of-memory situation (the JVM
> exited), it is not restarted and remains down indefinitely, and the status
> script exits with status zero*, so everything looks up to the cluster (it
> is integrated into a Veritas cluster). The OOM was a bad one: it was not
> related to the JVM, but caused by overly optimistic ulimits for the user;
> that has been corrected.
>
>
>
> STATUS | wrapper | 2019/01/07 13:38:41 | Launching a JVM...
>
> INFO | jvm 1 | 2019/01/07 13:38:43 | WrapperManager: Initializing...
>
> INFO | jvm 1 | 2019/01/07 13:38:45 | S-Check version 3.0.4 Monte Rosa
> from 12-Sep-2018 08:02 by cschwaiger
>
> INFO | jvm 1 | 2019/01/07 13:38:45 | Scheck is starting on server
> MUCTXP5B
>
> INFO | jvm 1 | 2019/01/07 13:38:52 | parsed 1 xml files and created 0
> service records.
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | Exception in thread
> "InactivityMonitor WriteCheck" java.lang.OutOfMemoryError: unable to create
> new native thread
>
> STATUS | wrapper | 2019/01/07 15:02:11 | The JVM has run out of memory.
> Restarting JVM.
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.lang.Thread.start0(Native Method)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.lang.Thread.start(Thread.java:717)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> org.apache.activemq.transport.InactivityMonitor.writeCheck(InactivityMonitor.java:147)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> org.apache.activemq.transport.InactivityMonitor$2.run(InactivityMonitor.java:113)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> org.apache.activemq.thread.SchedulerTimerTask.run(SchedulerTimerTask.java:33)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.util.TimerThread.mainLoop(Timer.java:555)
>
> INFO | jvm 1 | 2019/01/07 15:02:11 | at
> java.util.TimerThread.run(Timer.java:505)
>
> ERROR | wrapper | 2019/01/07 15:02:45 | Shutdown failed: Timed out
> waiting for signal from JVM.
>
> ERROR | wrapper | 2019/01/07 15:02:46 | JVM did not exit on request,
> termination requested.
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM received a signal SIGKILL
> (9).
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM process is gone.
>
> STATUS | wrapper | 2019/01/07 15:02:46 | JVM exited after being requested
> to terminate.
>
> STATUS | wrapper | 2019/01/07 15:02:50 | Reloading Wrapper
> configuration...
>
> STATUS | wrapper | 2019/01/07 15:02:50 | Launching a JVM...
>
>
>
> [scheck@muctxp5b scheck_unix11]$ ./scheck.sh status
>
> *Service check monitoring instance (not installed) is running: PID:56766,
> Wrapper:STARTED, Java:DOWN_CLEAN*
>
>
>
> I could not find the DOWN_CLEAN state documented; I looked at:
> https://wrapper.tanukisoftware.com/doc/english/prop-java-statusfile.html
>
>
>
> “scheck.sh stop” fails: it waits indefinitely for the wrapper to stop. A
> simple kill <pid> terminates it.
>
>
>
> Any recommendations, i.e. measures to avoid hanging in the “looks good =
> status zero, but down” state?
>
>
>
> Below/attached is the information about the OS version and configuration.
>
>
>
> Thanks in advance,
>
> Christoph
>
>
>
> Linux muctxp5b 2.6.32-754.3.5.el6.x86_64 #1 SMP Thu Aug 9 11:56:22 EDT
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
>