#280 Crash in java service wrapper WrapperManager.nativeGetControlEvent()

v3.5.13
open
nobody
crash (1)
5
2014-04-03
2013-07-15
Erik Brakkee
No

Today we had a server crash in the java service wrapper. We are using version 3.5.13 (doublechecked with the wrapper.jar file MANIFEST.MF and the md5 checksum of the shared library).

OS: RHEL 6.2

The java stack frames in the hs_err file were:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.tanukisoftware.wrapper.WrapperManager.nativeGetControlEvent()I+0
J org.tanukisoftware.wrapper.WrapperManager$3.run()V
v ~StubRoutines::call_stub

The logs from monitoring show nothing special around the time of the crash (no high heap space or permgen space usage). Also, the gc.log did not indicate any concurrent mode failures or full gcs (we are using the CMS garbage collector).

We still have the core dump file available so we could extract more information.
Please advise.

Related

Bugs: #280

Discussion

  • Erik Brakkee
    Erik Brakkee
    2013-07-15

    Immediately prior to the crash we see the following message in the logs a number of times:

    2013-07-15 17:22:43,130 WARNING [sun.rmi.transport.tcp] (RMI TCP Accept-1098) RMI TCP Accept-1098: accept loop for ServerSocket[addr=/0.0.0.0,port=0,localp
    ort=1098]
    throws
    java.net.SocketException: Socket closed
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
    at java.net.ServerSocket.implAccept(ServerSocket.java:462)
    at java.net.ServerSocket.accept(ServerSocket.java:430)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
    at java.lang.Thread.run(Thread.java:662)

    This could be another symptom of the cause of the crash.

     
  • Leif Mortenson
    Leif Mortenson
    2013-07-16

    Erik,
    I am sorry for the trouble. We had a similar problem several years ago, but there have been no reports since the fix in 3.3.0. That had been a well understood synchronization issue and we are confident that it has been fixed.

    Is this something that you have been able to reproduce?
    Also, please let me know if you are running a 32 or 64-bit JVM, the exact JVM version, as well as the Linux architecture being used.

    It strange that you would be having the other socket issues prior to the crash. As I said this is not anything we have seen recently.

    As this was a crash in the Java process, please confirm that the Wrapper successfully restarted the JVM.

    Could you please send your dump file to support@tanukisoftware.com and reference this issue? I would like to take a look at it and see if it provides any clues. Please also send your wrapper.log file.

    Cheers,
    Leif

     
  • Leif Mortenson
    Leif Mortenson
    2013-07-16

    • Group: v3.5.9 --> v3.5.13
     
  • Erik Brakkee
    Erik Brakkee
    2013-07-16

    <quote who="Leif Mortenson">

    Erik,
    I am sorry for the trouble. We had a similar problem several years ago,
    but there have been no reports since the fix in 3.3.0. That had been a
    well understood synchronization issue and we are confident that it has
    been fixed.

    I have read about this issue.

    Is this something that you have been able to reproduce?
    Also, please let me know if you are running a 32 or 64-bit JVM, the exact
    JVM version, as well as the Linux architecture being used.

    Have not been able to reproduce it yet. Would need to try with a test that
    repidly uses file descriptors to see what the behavior is. The crash has
    only occurred once. Previously, we had issues with crashes libzip.so where
    the crash would always occur in ZipFile.getEntry(). We eliminated that
    problem by measuring and reducing the access to classpath resources. In
    the end I think it is an application resource issue that is somehow
    causing these crashes in native code.

    It is a 64 bit JVM
    $ java -version
    java version "1.6.0_45"
    Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
    Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

    $ uname -a
    Linux server17 2.6.32-220.4.2.el6.x86_64 #1 SMP Mon Feb 6 16:39:28 EST
    2012 x86_64 x86_64 x86_64 GNU/Linux

    $ cat /etc/issue
    Red Hat Enterprise Linux Server release 6.2 (Santiago)
    Kernel \r on an \m

    It strange that you would be having the other socket issues prior to the
    crash. As I said this is not anything we have seen recently.

    I suspect the original cause could be a rapid increase in the use of file
    descriptors over a very short time. Our monitoring is at 1 minute
    intervals and could have missed this. There are no messages in the logs
    about file descriptor limits being reached. The monitoring shows moderate
    file descriptor use (approx. 200) at the time of the crash (but again,
    this is at one minute intervals).

    We have increased the file descriptor limit for the java process to 50000
    from 9999 to hopefully get more information at the next crash or perhaps
    avoid it alltogether.

    As this was a crash in the Java process, please confirm that the Wrapper
    successfully restarted the JVM.

    The wrapper restarted the VM.

    Could you please send your dump file to support@tanukisoftware.com and
    reference this issue? I would like to take a look at it and see if it
    provides any clues. Please also send your wrapper.log file.

    The dump file is too big to transport. It would be possible to do some
    queries on the core file using gdb for instance. I have tried in the past
    to extract a heapdump and stack dump from a core file but this has never
    worked.

     
  • Erik Brakkee
    Erik Brakkee
    2013-07-16

    I can also not reproduce the bug by rapidly using up file descriptors. In
    that case I get an exception about 'Too many open file' but the
    application keeps on running.

     
  • Leif Mortenson
    Leif Mortenson
    2013-07-16

    Erik,
    Thank you for the update. The function that is crashing (nativeGetControlEvent) basically requests a lock on a semaphore, reads a value from a memory location and then releases the semaphore. There are no new file descriptors that should be getting used as the semaphore itself already exists.

    What Edition of the Wrapper are you using? There is a little more work being done in this function in the Professional edition but that is only if you are also using child processes launched with WrapperManager.exec.

    Cheers,
    Leif

     
  • Erik Brakkee
    Erik Brakkee
    2013-07-16

    <quote who="Leif Mortenson">

    Erik,
    Thank you for the update. The function that is crashing
    (nativeGetControlEvent) basically requests a lock on a semaphore, reads a
    value from a memory location and then releases the semaphore. There are
    no new file descriptors that should be getting used as the semaphore
    itself already exists.

    That would probably rule out resource limits of file descriptors.

    What Edition of the Wrapper are you using? There is a little more work
    being done in this function in the Professional edition but that is only
    if you are also using child processes launched with WrapperManager.exec.

    It is the community edition.

    Cheers,
    Leif


    [bugs:#280] Crash in java service wrapper
    WrapperManager.nativeGetControlEvent()

    Status: open
    Labels: crash
    Created: Mon Jul 15, 2013 08:39 PM UTC by Erik Brakkee
    Last Updated: Tue Jul 16, 2013 03:11 AM UTC
    Owner: nobody

    Today we had a server crash in the java service wrapper. We are using
    version 3.5.13 (doublechecked with the wrapper.jar file MANIFEST.MF and
    the md5 checksum of the shared library).

    OS: RHEL 6.2

    The java stack frames in the hs_err file were:

    Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
    j org.tanukisoftware.wrapper.WrapperManager.nativeGetControlEvent()I+0
    J org.tanukisoftware.wrapper.WrapperManager$3.run()V
    v ~StubRoutines::call_stub

    The logs from monitoring show nothing special around the time of the crash
    (no high heap space or permgen space usage). Also, the gc.log did not
    indicate any concurrent mode failures or full gcs (we are using the CMS
    garbage collector).

    We still have the core dump file available so we could extract more
    information.
    Please advise.


    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/wrapper/bugs/280/

    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/

     

    Related

    Bugs: #280

  • Erik Brakkee
    Erik Brakkee
    2014-04-03

    We have most likely found a cause for these problems. After creating our own service wrapper, we saw freezes in the log files and Bad file descriptor exceptions in some cases. This led us to investigate an issue with file descriptors. Tt turns out that there was some code that cloned a FileOutputStream (indirectly through an object such as a non-static logger for instance) using the objenesis library. The cloned object had a file descriptor with incorrect reference count, causing an early close of the log file as soon as the cloned object would become eligble for garbage collection.

    According to the POSIX standard (see e.g. http://pubs.opengroup.org/onlinepubs/009696899/functions/open.html), the next file descriptor used should be the lowest unused one. So in that case, the file descriptor just closed would be quickly reused. For instance, the Tanuki service wrapper could use the new file descriptor with the same numeric value as that of the closed log file.

    Now, the original file descriptor (thas is in use now by for instance Tanuki) is still in use by the log file and as soon as the original FileOutputStream become eligble for garbage collection it closes the original file descriptor. This in turn leads to a problem in Tanuki since using a closed file descriptor in native code will likely cause a segmentation violation.