#1338 failed STAX job termination can lead to loop spinning CPU

Milestone: Unix::Linux
Status: closed-fixed
Owner: Sharon Lucas
Priority: 5
Updated: 2010-03-25
Created: 2010-03-24
Creator: Nathan Parrish
Private: No

Sharon Lucas wrote:
>
> Nathan,
>
> It does appear that in certain cases (especially if a STAX job is
> terminated with running stafcmds and/or processes) a STAX job's
> STAFQueueMonitor thread does not terminate. Instead, it gets stuck in an
> infinite loop (using a lot of CPU), repeatedly submitting a local QUEUE
> GET WAIT request that fails immediately with RC 5 and the result shown
> below. The request fails because the STAX job's handle has already been
> unregistered in another thread, so the thread never receives the STAX
> job termination message on its queue (which normally causes the
> STAFQueueMonitor thread to stop running).
>
> RC: 5 (Handle Does Not Exist)
> Result: HandleManager::updateTimestamp() failed to update handle nn
>
> This appears to be exactly what your strace showed for these threads.
> Thanks for providing the thorough debug information.
>
> This should be easy to fix. Once I have the fix, I can provide you with
> a private STAX.jar file to verify the fix. And this fix will go into
> the next release of STAX (V3.4.1) planned to be released at the end of
> March 2010 (next week).
>

Please let me know if I need to provide more context from the email thread.
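
The failure mode described in the quoted email can be illustrated with a small, self-contained sketch. This is not the STAXJob code: the class name, the method name, and the `cap` parameter are hypothetical, and the QUEUE GET WAIT request is replaced by a stub that returns only a return code. It shows why a loop that retries on every non-OK result never exits once the request starts failing immediately with RC 5.

```java
import java.util.function.IntSupplier;

class SpinDemo {
    static final int OK = 0;
    static final int HANDLE_DOES_NOT_EXIST = 5; // RC 5 from the ticket

    /**
     * Models the pre-fix loop shape: any non-OK result is retried at once.
     * Returns the number of requests issued. The cap exists only so that
     * this demo terminates; the loop described in the ticket had no cap.
     */
    static int countRequestsOldLoop(IntSupplier queueGet, int cap) {
        int requests = 0;
        while (requests < cap) {
            requests++;
            int rc = queueGet.getAsInt();
            if (rc != OK) {
                continue; // retry immediately, with no exit condition
            }
            return requests; // simplification: the first message ends the job
        }
        return requests;
    }

    public static void main(String[] args) {
        // Stub for a request that always fails because the handle is gone.
        int n = countRequestsOldLoop(() -> HANDLE_DOES_NOT_EXIST, 1_000_000);
        System.out.println("requests issued before the demo cap: " + n);
    }
}
```

Because the failing request returns immediately instead of blocking in WAIT, each iteration costs almost nothing and the thread consumes a full CPU core.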

Discussion

  • Sharon Lucas
    2010-03-24

    • milestone: --> Unix::Linux
    • assigned_to: nobody --> slucas
     
  • Sharon Lucas
    2010-03-24

    I have been able to recreate this problem on my Windows machine by terminating a STAX job that has many stafcmds completing simultaneously, and I have a fix that resolves the problem. I'm doing some more testing with it now.

    Let me know if you would like this fix and I can send you a new STAX.jar file to use to verify that this resolves your problem.

     
  • Sharon Lucas
    2010-03-24

    Here's a cvs diff of the fix:

    Index: services/stax/service/STAXJob.java
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/services/stax/service/STAXJob.java,v
    retrieving revision 1.80
    diff -r1.80 STAXJob.java
    1173c1173,1177
    < // Check if the result was unsuccessful (except ignore RC 2)
    ---
    > // Check if the result was unsuccessful except ignore the following
    > // errors:
    > // - UnknownService (2) in case the LOG service is not registered
    > // - HandleDoesNotExist (5) in case the STAX job's handle has been
    > // unregistered indicating the job has completed
    1176c1180,1181
    < (result.rc != STAFResult.UnknownService))
    ---
    > (result.rc != STAFResult.UnknownService) &&
    > (result.rc != STAFResult.HandleDoesNotExist))
    1852a1858,1867
    > int numErrors = 0;
    >
    > // Maximum consecutive errors submitting a local QUEUE GET WAIT
    > // request before we decide to exit the infinite loop
    > int maxErrors = 5;
    >
    > // Process messages on the STAX job handle's queue until we get
    > // a "STAF/Service/STAX/End" message or until an error occurs 5
    > // consecutive times submitting a STAF local QUEUE GET request
    > // (so that we don't get stuck in an infinite loop eating CPU).
    1871a1887,1888
    > numErrors++;
    >
    1873,1876c1890,1907
    < "STAF local QUEUE " + request + " returned " +
    < "null and this may have been caused by running " +
    < " out of memory when creating the result.", null);
    < continue;
    ---
    > "STAF local QUEUE " + request +
    > " returned null. This may have been caused by " +
    > "running out of memory creating the result.",
    > null);
    >
    > if (numErrors < maxErrors)
    > {
    > continue;
    > }
    > else
    > {
    > logMessage(
    > "Exiting this thread after the QUEUE GET " +
    > "request failed " + maxErrors +
    > " consecutive times.", null);
    >
    > return; // Exit STAFQueueMonitor thread
    > }
    1879c1910
    < if (result.rc != STAFResult.Ok)
    ---
    > if (result.rc == STAFResult.Ok)
    1880a1912,1931
    > numErrors = 0;
    > }
    > else if (result.rc == STAFResult.HandleDoesNotExist)
    > {
    > // This means that the STAX job's handle has been
    > // unregistered which means that the STAX job is no
    > // longer running so we should exit this thread.
    > // We've seen this happen before this thread gets
    > // the message with type "STAF/Service/STAX/End" off
    > // the queue.
    >
    > System.out.println("Exiting STAXJob::STAFQueueMonitor" +
    > "thread because QUEUE GET failed " +
    > "with RC=5");
    > return; // Exit STAFQueueMonitor thread
    > }
    > else
    > {
    > numErrors++;
    >
    1882,1885c1933,1949
    < "STAF local QUEUE " + request + " failed with " +
    < "RC=" + result.rc + ", Result=" + result.result,
    < null);
    < continue;
    ---
    > "STAF local QUEUE " + request +
    > " failed with RC=" + result.rc + ", Result=" +
    > result.result, null);
    >
    > if (numErrors < maxErrors)
    > {
    > continue;
    > }
    > else
    > {
    > logMessage(
    > "Exiting this thread after the QUEUE GET " +
    > "request failed " + maxErrors +
    > " consecutive times", null);
    >
    > return; // Exit STAFQueueMonitor thread
    > }
    1897,1898c1961,1962
    < // Log an error message and continue
    <
    ---
    > numErrors++;
    >
    1901c1965,1978
    < continue;
    ---
    >
    > if (numErrors < maxErrors)
    > {
    > continue;
    > }
    > else
    > {
    > logMessage(
    > "Exiting this thread after the QUEUE GET " +
    > "request failed " + maxErrors +
    > " consecutive times", null);
    >
    > return; // Exit STAFQueueMonitor thread
    > }
    1952c2029
    < return;
    ---
    > return; // Exit STAFQueueMonitorThread

     
  • Sharon Lucas
    2010-03-24

    Attachment: a STAX job that can recreate the problem.

  • Sharon Lucas
    2010-03-25

    • status: open --> closed-fixed
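
The control flow introduced by the cvs diff in the discussion above can be sketched as a small, runnable model. This is an illustration under stated assumptions, not the code in STAXJob.java: `QueueMonitorSketch`, `Reply`, `Exit`, `scripted`, and the error code 99 are hypothetical names and values, and the QUEUE GET request is replaced by a scripted stub. The sketch keeps only the three exit paths from the diff: an End message, RC 5 (HandleDoesNotExist), and five consecutive errors. The null-result branch in the diff follows the same counting pattern and is not modeled here.

```java
import java.util.function.Supplier;

class QueueMonitorSketch {
    static final int OK = 0;
    static final int HANDLE_DOES_NOT_EXIST = 5; // RC 5 from the ticket
    static final int MAX_ERRORS = 5;            // maxErrors in the diff

    enum Exit { END_MESSAGE, HANDLE_GONE, TOO_MANY_ERRORS }

    /** Hypothetical stand-in for the result of one QUEUE GET request. */
    static final class Reply {
        final int rc;
        final boolean isEnd; // true for a "STAF/Service/STAX/End" message
        Reply(int rc, boolean isEnd) { this.rc = rc; this.isEnd = isEnd; }
    }

    /** Runs the monitor loop and reports why it exited. */
    static Exit monitor(Supplier<Reply> queueGet) {
        int numErrors = 0;
        while (true) {
            Reply reply = queueGet.get();
            if (reply.rc == OK) {
                numErrors = 0; // a success resets the consecutive-error count
                if (reply.isEnd) {
                    return Exit.END_MESSAGE;
                }
                continue; // an ordinary message; keep processing the queue
            }
            if (reply.rc == HANDLE_DOES_NOT_EXIST) {
                return Exit.HANDLE_GONE; // job handle unregistered; stop now
            }
            numErrors++;
            if (numErrors >= MAX_ERRORS) {
                return Exit.TOO_MANY_ERRORS; // give up instead of spinning
            }
        }
    }

    /** Test stub: returns the replies in order, then repeats the last one. */
    static Supplier<Reply> scripted(Reply... replies) {
        int[] next = {0};
        return () -> replies[Math.min(next[0]++, replies.length - 1)];
    }

    public static void main(String[] args) {
        Reply err = new Reply(99, false); // 99: arbitrary non-OK code
        Reply msg = new Reply(OK, false);
        Reply end = new Reply(OK, true);
        System.out.println(monitor(scripted(new Reply(HANDLE_DOES_NOT_EXIST, false))));
        System.out.println(monitor(scripted(err)));
        System.out.println(monitor(scripted(err, err, msg, end)));
    }
}
```

Exiting immediately on RC 5 handles the case reported here, while the consecutive-error limit is a general safeguard against any other error that repeats without blocking.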