Today we had a server crash in the Java Service Wrapper. We are using version 3.5.13 (double-checked against the MANIFEST.MF in wrapper.jar and the MD5 checksum of the shared library).
OS: RHEL 6.2
The Java stack frames in the hs_err file were:
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.tanukisoftware.wrapper.WrapperManager.nativeGetControlEvent()I+0
J org.tanukisoftware.wrapper.WrapperManager$3.run()V
v ~StubRoutines::call_stub
Our monitoring logs show nothing special around the time of the crash (no high heap or PermGen usage). The gc.log also shows no concurrent mode failures or full GCs (we are using the CMS garbage collector).
We still have the core dump file available so we could extract more information.
Please advise.
Immediately prior to the crash we see the following message in the logs a number of times:
2013-07-15 17:22:43,130 WARNING [sun.rmi.transport.tcp] (RMI TCP Accept-1098) RMI TCP Accept-1098: accept loop for ServerSocket[addr=/0.0.0.0,port=0,localport=1098] throws
java.net.SocketException: Socket closed
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)
This could be another symptom of whatever caused the crash.
Erik,
I am sorry for the trouble. We had a similar problem several years ago, but there have been no reports since the fix in 3.3.0. That was a well-understood synchronization issue, and we are confident it has been fixed.
Is this something that you have been able to reproduce?
Also, please let me know if you are running a 32 or 64-bit JVM, the exact JVM version, as well as the Linux architecture being used.
It is strange that you would be seeing the other socket issues prior to the crash. As I said, this is not anything we have seen recently.
As this was a crash in the Java process, please confirm that the Wrapper successfully restarted the JVM.
Could you please send your dump file to support@tanukisoftware.com and reference this issue? I would like to take a look at it and see if it provides any clues. Please also send your wrapper.log file.
Cheers,
Leif
<quote who="Leif Mortenson">
I have read about this issue.
I have not been able to reproduce it yet. I would need to try a test that
rapidly uses file descriptors to see what the behavior is. The crash has
only occurred once. Previously, we had issues with crashes in libzip.so, where
the crash would always occur in ZipFile.getEntry(). We eliminated that
problem by measuring and reducing access to classpath resources; the sketch
below shows the idea. In the end I think it is an application resource issue
that is somehow causing these crashes in native code.
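For reference, the classpath mitigation amounted to caching resources in memory. A minimal sketch of the idea; the class and resource handling here are illustrative, not our actual code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Caches classpath resources in memory so that the zip machinery behind
// getResourceAsStream (and thus ZipFile.getEntry) is hit only once per
// resource instead of on every lookup.
public final class ResourceCache {
    private static final Map<String, byte[]> CACHE =
            new ConcurrentHashMap<String, byte[]>();

    public static InputStream open(String path) throws IOException {
        byte[] data = CACHE.get(path);
        if (data == null) {
            InputStream in = ResourceCache.class.getResourceAsStream(path);
            if (in == null) {
                throw new IOException("resource not found: " + path);
            }
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                data = buf.toByteArray();
            } finally {
                in.close();
            }
            CACHE.put(path, data);
        }
        return new ByteArrayInputStream(data);
    }
}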
It is a 64-bit JVM:
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
$ uname -a
Linux server17 2.6.32-220.4.2.el6.x86_64 #1 SMP Mon Feb 6 16:39:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m
I suspect the original cause could be a rapid spike in file descriptor use
over a very short time. Our monitoring runs at one-minute intervals and
could have missed this. There are no messages in the logs about the file
descriptor limit being reached, and the monitoring shows moderate file
descriptor use (approx. 200) at the time of the crash (but again, only at
one-minute resolution).
We have increased the file descriptor limit for the Java process from 9999
to 50000 to hopefully get more information at the next crash, or perhaps
avoid it altogether.
The wrapper restarted the VM.
The dump file is too big to send. It would be possible to run some queries
on the core file, using gdb for instance. I have tried in the past to
extract a heap dump and stack dump from a core file, but this has never
worked.
I also cannot reproduce the bug by rapidly using up file descriptors. In
that case I get a 'Too many open files' exception, but the application
keeps running.
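The test was along these lines (a rough sketch, not the exact code; /dev/null is just a convenient file to open repeatedly):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Opens file descriptors as fast as possible without closing them, to see
// how the JVM behaves when the per-process limit is reached.
public class FdExhaustionTest {
    public static void main(String[] args) throws IOException {
        List<FileInputStream> open = new ArrayList<FileInputStream>();
        try {
            while (true) {
                open.add(new FileInputStream("/dev/null")); // one fd per open
            }
        } catch (IOException e) {
            // Typically a FileNotFoundException whose message includes
            // "Too many open files"; the JVM itself keeps running.
            System.err.println("Failed after " + open.size() + " descriptors: " + e);
        } finally {
            for (FileInputStream in : open) {
                in.close();
            }
        }
    }
}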
Erik,
Thank you for the update. The function that is crashing (nativeGetControlEvent) basically requests a lock on a semaphore, reads a value from a memory location, and then releases the semaphore. No new file descriptors should be getting used, as the semaphore itself already exists.
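In Java terms, the logic amounts to no more than the following. This is an illustration only; the real implementation is native C code operating on a pre-existing semaphore and a shared memory location:

import java.util.concurrent.Semaphore;

// Illustration only: models the logic of nativeGetControlEvent as
// described above, not the actual native implementation.
class ControlEventModel {
    private final Semaphore lock = new Semaphore(1);
    private int pendingControlEvent; // stands in for the shared memory slot

    int getControlEvent() throws InterruptedException {
        lock.acquire();                 // request the lock on the semaphore
        try {
            return pendingControlEvent; // read the stored value
        } finally {
            lock.release();             // release the semaphore
        }
    }
}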
What edition of the Wrapper are you using? There is a little more work done in this function in the Professional Edition, but only if you are also using child processes launched with WrapperManager.exec.
Cheers,
Leif
<quote who="Leif Mortenson">
That would probably rule out file descriptor resource limits.
It is the Community Edition.
Related Bugs: #280
We have most likely found the cause of these problems. After creating our own service wrapper, we saw freezes in the log files and, in some cases, 'Bad file descriptor' exceptions. This led us to investigate a file descriptor issue. It turns out that some code cloned a FileOutputStream (indirectly, through an object such as a non-static logger) using the objenesis library. The cloned object had a file descriptor with an incorrect reference count, causing an early close of the log file as soon as the cloned object became eligible for garbage collection.
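To make the failure mode concrete, the problematic pattern looks roughly like this. This is heavily simplified: in our code the clone was made indirectly through an enclosing object, and the field copy was performed by a cloning library rather than by hand:

import java.io.FileOutputStream;
import java.lang.reflect.Field;
import org.objenesis.ObjenesisStd;

// Simplified reproduction: objenesis instantiates a FileOutputStream
// without running any constructor, and the field copy then makes both
// streams share a single FileDescriptor with no reference counting.
public class StreamCloneBug {
    public static void main(String[] args) throws Exception {
        FileOutputStream original = new FileOutputStream("/tmp/app.log", true);

        FileOutputStream clone = (FileOutputStream)
                new ObjenesisStd().newInstance(FileOutputStream.class);
        Field fdField = FileOutputStream.class.getDeclaredField("fd");
        fdField.setAccessible(true);
        fdField.set(clone, fdField.get(original)); // shared descriptor

        // When the clone becomes unreachable, its finalizer closes the
        // shared descriptor, silently closing 'original' as well and
        // freeing the numeric fd for reuse by the next open() call.
        clone = null;
        System.gc();
        System.runFinalization();
    }
}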
According to the POSIX standard (see e.g. http://pubs.opengroup.org/onlinepubs/009696899/functions/open.html), open() must return the lowest unused file descriptor, so the descriptor just closed is quickly reused. For instance, the Tanuki Service Wrapper could end up with a new file descriptor whose numeric value is the same as that of the closed log file.
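This reuse is easy to observe. A quick demonstration, reading the private 'fd' field via reflection (which works on HotSpot but is not a supported API):

import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.lang.reflect.Field;

// Demonstrates POSIX descriptor reuse: after a close(), the next open()
// in the process receives the lowest unused number, usually the one
// that was just released.
public class FdReuseDemo {
    private static int fdOf(FileInputStream in) throws Exception {
        FileDescriptor fd = in.getFD();
        Field f = FileDescriptor.class.getDeclaredField("fd");
        f.setAccessible(true);
        return f.getInt(fd);
    }

    public static void main(String[] args) throws Exception {
        FileInputStream a = new FileInputStream("/dev/null");
        int first = fdOf(a);
        a.close();

        FileInputStream b = new FileInputStream("/dev/null");
        System.out.println("first=" + first + ", next=" + fdOf(b)); // same value
        b.close();
    }
}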
Now the same descriptor number is in use by, for instance, the Wrapper, but it is still referenced by the original FileOutputStream of the log file. As soon as that original FileOutputStream becomes eligible for garbage collection, it closes the descriptor out from under the Wrapper. This in turn leads to a problem in the Wrapper, since using a closed file descriptor in native code will likely cause a segmentation violation.