Today we had a server crash in the Java Service Wrapper. We are using version 3.5.13 (double-checked against the MANIFEST.MF in wrapper.jar and the MD5 checksum of the shared library).
OS: RHEL 6.2
The Java stack frames in the hs_err file were:
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.tanukisoftware.wrapper.WrapperManager.nativeGetControlEvent()I+0
J org.tanukisoftware.wrapper.WrapperManager$3.run()V
v ~StubRoutines::call_stub
Our monitoring logs show nothing special around the time of the crash (no high heap or PermGen usage). The gc.log also shows no concurrent mode failures or full GCs (we are using the CMS garbage collector).
We still have the core dump file available so we could extract more information.
Please advise.
Immediately prior to the crash we see the following message in the logs a number of times:
2013-07-15 17:22:43,130 WARNING [sun.rmi.transport.tcp] (RMI TCP Accept-1098) RMI TCP Accept-1098: accept loop for ServerSocket[addr=/0.0.0.0,port=0,localport=1098] throws
java.net.SocketException: Socket closed
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)
This could be another symptom of whatever caused the crash.
Erik,
I am sorry for the trouble. We had a similar problem several years ago, but there have been no reports since the fix in 3.3.0. That was a well-understood synchronization issue, and we are confident it has been fixed.
Is this something that you have been able to reproduce?
Also, please let me know if you are running a 32 or 64-bit JVM, the exact JVM version, as well as the Linux architecture being used.
It is strange that you would be seeing the other socket issues prior to the crash. As I said, this is not anything we have seen recently.
As this was a crash in the Java process, please confirm that the Wrapper successfully restarted the JVM.
Could you please send your dump file to support@tanukisoftware.com and reference this issue? I would like to take a look at it and see if it provides any clues. Please also send your wrapper.log file.
Cheers,
Leif
<quote who="Leif Mortenson">
I have read about this issue.
I have not been able to reproduce it yet. I would need to try a test that
rapidly uses file descriptors to see what the behavior is. The crash has
only occurred once. Previously, we had issues with crashes in libzip.so, where
the crash would always occur in ZipFile.getEntry(). We eliminated that
problem by measuring and reducing access to classpath resources; the sketch
below shows the idea. In the end I think it is an application resource issue
that is somehow causing these crashes in native code.
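For reference, the classpath mitigation amounted to caching resources in memory. A minimal sketch of the idea; the class and resource handling here are illustrative, not our actual code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Caches classpath resources in memory so that the zip machinery behind
// getResourceAsStream (and thus ZipFile.getEntry) is hit only once per
// resource instead of on every lookup.
public final class ResourceCache {
    private static final Map<String, byte[]> CACHE =
            new ConcurrentHashMap<String, byte[]>();

    public static InputStream open(String path) throws IOException {
        byte[] data = CACHE.get(path);
        if (data == null) {
            InputStream in = ResourceCache.class.getResourceAsStream(path);
            if (in == null) {
                throw new IOException("resource not found: " + path);
            }
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                data = buf.toByteArray();
            } finally {
                in.close();
            }
            CACHE.put(path, data);
        }
        return new ByteArrayInputStream(data);
    }
}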
It is a 64-bit JVM:
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
$ uname -a
Linux server17 2.6.32-220.4.2.el6.x86_64 #1 SMP Mon Feb 6 16:39:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m
I suspect the original cause could be a rapid spike in file descriptor use
over a very short time. Our monitoring runs at one-minute intervals and
could have missed this. There are no messages in the logs about the file
descriptor limit being reached, and the monitoring shows moderate file
descriptor use (approx. 200) at the time of the crash (but again, only at
one-minute resolution).
We have increased the file descriptor limit for the Java process from 9999
to 50000 to hopefully get more information at the next crash, or perhaps
avoid it altogether.
The wrapper restarted the VM.
The dump file is too big to send. It would be possible to run some queries
on the core file, using gdb for instance. I have tried in the past to
extract a heap dump and stack dump from a core file, but this has never
worked.
I also cannot reproduce the bug by rapidly using up file descriptors. In
that case I get a 'Too many open files' exception, but the application
keeps running.
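The test was along these lines (a rough sketch, not the exact code; /dev/null is just a convenient file to open repeatedly):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Opens file descriptors as fast as possible without closing them, to see
// how the JVM behaves when the per-process limit is reached.
public class FdExhaustionTest {
    public static void main(String[] args) throws IOException {
        List<FileInputStream> open = new ArrayList<FileInputStream>();
        try {
            while (true) {
                open.add(new FileInputStream("/dev/null")); // one fd per open
            }
        } catch (IOException e) {
            // Typically a FileNotFoundException whose message includes
            // "Too many open files"; the JVM itself keeps running.
            System.err.println("Failed after " + open.size() + " descriptors: " + e);
        } finally {
            for (FileInputStream in : open) {
                in.close();
            }
        }
    }
}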
Erik,
Thank you for the update. The function that is crashing (nativeGetControlEvent) basically requests a lock on a semaphore, reads a value from a memory location, and then releases the semaphore. No new file descriptors should be getting used, as the semaphore itself already exists.
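In Java terms, the logic amounts to no more than the following. This is an illustration only; the real implementation is native C code operating on a pre-existing semaphore and a shared memory location:

import java.util.concurrent.Semaphore;

// Illustration only: models the logic of nativeGetControlEvent as
// described above, not the actual native implementation.
class ControlEventModel {
    private final Semaphore lock = new Semaphore(1);
    private int pendingControlEvent; // stands in for the shared memory slot

    int getControlEvent() throws InterruptedException {
        lock.acquire();                 // request the lock on the semaphore
        try {
            return pendingControlEvent; // read the stored value
        } finally {
            lock.release();             // release the semaphore
        }
    }
}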
What edition of the Wrapper are you using? There is a little more work done in this function in the Professional Edition, but only if you are also using child processes launched with WrapperManager.exec.
Cheers,
Leif
<quote who="Leif Mortenson">
That would probably rule out file descriptor resource limits.
It is the Community Edition.
Related Bugs: #280
We have most likely found the cause of these problems. After creating our own service wrapper, we saw freezes in the log files and, in some cases, 'Bad file descriptor' exceptions. This led us to investigate a file descriptor issue. It turns out that some code cloned a FileOutputStream (indirectly, through an object such as a non-static logger) using the objenesis library. The cloned object had a file descriptor with an incorrect reference count, causing an early close of the log file as soon as the cloned object became eligible for garbage collection.
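To make the failure mode concrete, the problematic pattern looks roughly like this. This is heavily simplified: in our code the clone was made indirectly through an enclosing object, and the field copy was performed by a cloning library rather than by hand:

import java.io.FileOutputStream;
import java.lang.reflect.Field;
import org.objenesis.ObjenesisStd;

// Simplified reproduction: objenesis instantiates a FileOutputStream
// without running any constructor, and the field copy then makes both
// streams share a single FileDescriptor with no reference counting.
public class StreamCloneBug {
    public static void main(String[] args) throws Exception {
        FileOutputStream original = new FileOutputStream("/tmp/app.log", true);

        FileOutputStream clone = (FileOutputStream)
                new ObjenesisStd().newInstance(FileOutputStream.class);
        Field fdField = FileOutputStream.class.getDeclaredField("fd");
        fdField.setAccessible(true);
        fdField.set(clone, fdField.get(original)); // shared descriptor

        // When the clone becomes unreachable, its finalizer closes the
        // shared descriptor, silently closing 'original' as well and
        // freeing the numeric fd for reuse by the next open() call.
        clone = null;
        System.gc();
        System.runFinalization();
    }
}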
According to the POSIX standard (see e.g. http://pubs.opengroup.org/onlinepubs/009696899/functions/open.html), open() must return the lowest unused file descriptor, so the descriptor just closed is quickly reused. For instance, the Tanuki Service Wrapper could end up with a new file descriptor whose numeric value is the same as that of the closed log file.
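This reuse is easy to observe. A quick demonstration, reading the private 'fd' field via reflection (which works on HotSpot but is not a supported API):

import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.lang.reflect.Field;

// Demonstrates POSIX descriptor reuse: after a close(), the next open()
// in the process receives the lowest unused number, usually the one
// that was just released.
public class FdReuseDemo {
    private static int fdOf(FileInputStream in) throws Exception {
        FileDescriptor fd = in.getFD();
        Field f = FileDescriptor.class.getDeclaredField("fd");
        f.setAccessible(true);
        return f.getInt(fd);
    }

    public static void main(String[] args) throws Exception {
        FileInputStream a = new FileInputStream("/dev/null");
        int first = fdOf(a);
        a.close();

        FileInputStream b = new FileInputStream("/dev/null");
        System.out.println("first=" + first + ", next=" + fdOf(b)); // same value
        b.close();
    }
}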
Now the same descriptor number is in use by, for instance, the Wrapper, but it is still referenced by the original FileOutputStream of the log file. As soon as that original FileOutputStream becomes eligible for garbage collection, it closes the descriptor out from under the Wrapper. This in turn leads to a problem in the Wrapper, since using a closed file descriptor in native code will likely cause a segmentation violation.