Menu

#926 OutOfMemory issues on Linux creating a thread and elsewhere

Unix::Linux
closed-fixed
STAFProc (180)
8
2007-08-28
2007-03-26
No

Problem: STAFProc is currently either hanging or crashing with:

20070324-02:24:12;1801214864;00000100;Caught unknown exception in HandleRequest()

during usage.

[root@wplcauto staf]# ps -ef | grep -i staf
root 3012 1 0 Mar23 ? 00:03:45 /usr/local/staf/bin/STAFProc
root 3019 3012 0 Mar23 ? 00:00:40 java com.ibm.staf.service.STAFServiceHelper STAFJVM1
root 3071 3012 99 Mar23 ? 5-10:57:15 java -Xmx1024m com.ibm.staf.service.STAFServiceHelper STAFJVM
root 10452 10277 0 08:01 pts/0 00:00:00 grep -i staf
[root@wplcauto staf]# staf local ping ping
Error registering with STAF, RC: 21
[root@wplcauto staf]#

Requesting debugging support to resolve the issue.

Discussion

1 2 > >> (Page 1 of 2)
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26
    • priority: 5 --> 8
     
  • Sharon Lucas

    Sharon Lucas - 2007-03-26

    Logged In: YES
    user_id=285070
    Originator: NO

    Were there any errors in the JVM log(s)?

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    STAX JVM Log

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    STAF Version: 3.2.1
    STAX Version: 3.2.0

    STAX log posted as JVMLog.1.

    There was a lot of automation activity at the time of failure. The only way we're going to determine the failing service/command is by enabling trace and recreating, granted none of the logs help.
    File Added: JVMLog.1

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    STAFProc is currently in the error state, if we can force a dump and get any useful information. All of the sockets are still in their pre-hung state:

    [root@wplcauto bin]# netstat -an | grep 6500
    tcp 0 0 0.0.0.0:6500 0.0.0.0:* LISTEN
    tcp 0 0 9.42.125.105:6500 9.42.123.110:13839 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.222:34495 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.229:23453 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.229:23452 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.225:38560 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.146:16036 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.146:16035 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.235:14533 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.235:14532 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.209:16622 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.107.209:16621 TIME_WAIT
    tcp 0 0 9.42.125.105:49430 9.42.123.173:6500 ESTABLISHED
    tcp 0 0 9.42.125.105:52760 9.42.123.173:6500 ESTABLISHED
    tcp 0 0 9.42.125.105:6500 9.42.123.222:34496 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.33.133.76:7330 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.149:47513 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.101:10088 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.190:29198 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.33.133.63:11363 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.214:22357 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.33.132.147:44521 TIME_WAIT
    tcp 0 0 9.42.125.105:6500 9.42.123.239:16742 TIME_WAIT

     
  • Sharon Lucas

    Sharon Lucas - 2007-03-26

    Logged In: YES
    user_id=285070
    Originator: NO

    Please use gdb to try to gather some debugging information. We
    provided the information for how to run gdb and get a stacktrace via Bug
    #1548759. I'll also include it
    here:

    You can run "ps -ea" to get the PID for STAFProc, and then run "gdb
    STAFProc <PID>". Then at the (gdb)
    prompt you can run "bt" to get the stacktrace.

    Also, please verify the following while it's in the error state:
    1) Do all local requests to the machine fail with "Error registering with
    STAF, RC: 21"?
    2) Do all remote requests to the machine fail with "Error registering with
    STAF, RC: 21"? e.g. STAF wplcauto PING PING
    If not, please provide the results from:
    a) STAF wplcauto SERVICE LIST REQUESTS
    b) STAF wplcauto HANDLE LIST
    c) STAF wplcauto HANDLE LIST NOTIFICATIONS
    d) STAF wplcauto PROCESS LIST
    e) STAF wplcauto PROCESS LIST SETTINGS

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    By 11:00 EST, I have to restart STAF/STAX on this machine. Please gather any debugging information before this time.

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    Hostname: wplcauto.rtp.raleigh.ibm.com
    Password: same as before (if you don't remember, please sametime me)

     
  • Sharon Lucas

    Sharon Lucas - 2007-03-26

    Logged In: YES
    user_id=285070
    Originator: NO

    Please obtain the debugging information yourself per my previous posting. I cannot access your machine at this time. Thanks.

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Backtrace of STAFProc

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    Backtrace of STAFProc in current state uploaded as backtrace.txt
    File Added: backtrace.txt

     
  • Sharon Lucas

    Sharon Lucas - 2007-03-26

    Logged In: YES
    user_id=285070
    Originator: NO

    Charles used gdb to try to get more information about the problem while STAF was in this error state.
    Basically, any requests submitted from another machine to wplcauto work (e.g. STAF wplcauto PING PING, etc), but no local requests work (e.g. STAF local wplcauto PING PING). So, it would appear that the STAFLocalConnectionProvider is no longer responding. However, the STAFProc output did not show any trace errors logged by connproviders/localipc/unix/STAFLocalIPCConnProvider.cpp's STAFTCPRunThread() method.

    We told Bryan he can restart STAF. We'll think about what else can be done to debug the problem and get back to Bryan.

     
  • Bryan Osenbach

    Bryan Osenbach - 2007-03-26

    Logged In: YES
    user_id=1350649
    Originator: YES

    I'm handing this off to my teammate Brendan Furey (Brendan_Furey@ie.ibm.com). Moving forward, please work with him regarding this issue.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-04

    Debug information while in bad state

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-04

    Logged In: YES
    user_id=285070
    Originator: NO

    This problem occurred today on wplcauto after STAF/STAX had been running for 9 days (since March 26). I gathered some debug information from the machine (and attached it to this bug) before Brendan restarted the machine. However, we still don't know what's causing this problem.
    File Added: wecmDebug.txt

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-05
    • assigned_to: nobody --> slucas
     
  • Sharon Lucas

    Sharon Lucas - 2007-04-05

    Logged In: YES
    user_id=285070
    Originator: NO

    I'm working on a debug version of STAF to provide to you that we'll hopefully provide more information when this error occurs. There's also a slight chance that another fix in it could resolve this problem. I'll let you know when this debug version of STAF is ready.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-06

    Logged In: YES
    user_id=285070
    Originator: NO

    This problem happened again this morning. This time if I submit a STAF local ping ping request on machine wplcauto, the request hangs. But I can submit requests to wplcauto from other machines. Also, the "Caught unknown exception in HandleRequest()" error is in the STAFProc output, but it was logged at 20:42 EST last night. Brendan said STAF was running fine until this morning around 10:20am EST when he contacted me.

    I gave Brendan a new libSTAF.so file to try that contains a fix for Bug #1690067 "Problem handling multiple waiters on an event semaphore" that can occur on Linux SMP machines like wplcauto. There's a small chance that this bug fix could also fix your problem. I'm also working on a debug version of STAF to provide you this afternoon that will hopefully provide us with more information on what's causing your problem if this doesn't fix the problem.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-06

    Logged In: YES
    user_id=285070
    Originator: NO

    Sent some debug STAF files to Brendan and Bryan with instructions on how to install them. Let me know when the problem is reproduced so that I can check some things out on that machine while it's in the bad state.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-12

    Logged In: YES
    user_id=285070
    Originator: NO

    The problem occurred this morning. Charles and I used gdb to get more information. There are 299 threads. There seems to be a problem creating 300 threads. The following trace message was the last entry in /usr/local/staf/STAFProc.out:

    20070412-10:58:06;53332880;00000100;STAFThreadManager error in STAFThreadStart rc=10 osRC=12

    The "STAF local PING PING" (as well as any STAF request submitted on wplcauto) hangs when registering with STAF. STAF cannot seem to create another STAFHandle, probably because no more threads can be created.

    Brendan verified that STAFProc cannot start if it tries to create 500 initial threads (by adding a "SET INTIALTHREADS 500" line to the STAF.cfg file).

    So, it appears that the Linux machine has a limit on the number of threads that can be created and that might be the source of the problem. We need to investigate how this system limit can be increased.

    Your ulimit settings are:

    [root@wplcauto bin]# ulimit -a
    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    max nice (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 65536
    max locked memory (kbytes, -l) 32
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    max rt priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) 65536
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited

    I had Brendan add the following line to the script that starts STAFProc to increase the limit on the number of open files from 1024 to 2048: ulimit -HSn 2048
    I don't know if that's at all related to the number of threads limit -- it's just a guess.
    We'll see if this makes a difference. Again, I'm not overly optimistic. But, at least we got more debug information in that it seems to be that it's hitting a thread limit. We'll think about this some more.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-12

    Logged In: YES
    user_id=285070
    Originator: NO

    The default stack size on Linux is 10M. When a thread is created, it's created with the default stack size. So, when 300 threads are created, that requires 3G of memory. It appears you ran out of memory -- you have 4G of memory, but it's likely that 1G was being already being used.

    The following error was logged in your STAFProc.out file when the problem occurred (a trace message for this was added in the debug version of STAF that you are running). osRC 12 on Linux means OutOfMemory.

    20070412-10:58:06;53332880;00000100;STAFThreadManager error in STAFThreadStart rc=10 osRC=12

    So, our recommendation is the following:
    1) Remove "ulimit -HSn 2048" from your STAFProc startup script as it doesn't have any effect.
    2) Add the following environment variable setting to your STAFProc startup script to decrease the default stack size for each thread from 10M to 4M (the value for this variable is specified in K):
    export STAF_THREAD_STACK_SIZE=4096
    3) Add the following line near the top of your STAF.cfg file:
    SET INITIALTHREADS 500
    4) Start STAFProc and verify that 500 threads can now be created with the smaller default stack size.
    5) Shutdown STAF
    6) Remove the "SET INITIALTHREADS 500" line from the STAF.cfg file.
    7) Start STAFProc

    Hopefully, the problem will now no longer occur anymore during peak usage times.
    Try this and let us know the results.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-25

    Logged In: YES
    user_id=285070
    Originator: NO

    After restarting STAFProc with a decreased thread stack size set of 4096 (4M), there were no problems for 10 days. Then, there was an incident on 4/22 (late night) where STAF/STAX was not working correctly. However, this time, instead of running out of memory when trying to create a new STAF thread, the following error appears in the JVM Log for the STAX service (and no STAFThreadManager error was logged).

    terminate called after throwing an instance of 'std::bad_alloc'
    what(): St9bad_alloc

    This same problem occurred again last night (2007/04/24 at approx 23:00 EST), after running for only 2 days. This incident caused all running jobs to hang in an 'In progress' state and no new jobs could be submitted as they immediately failed with an execution error. The same 'std::bad_alloc' error was logged in the JVM log for the STAX service.

    This error also indicates an OutOfMemory issue. But this time, it's being thrown from STAF JNI code (that communicates with the JVM) or possibly by the JVM. This could be due, for example, to a FS GET of a large file and STAF didn't have the memory to allocate to it.

    So, the problem is still that you're running out of virtual memory, determined by:
    managed heap + native heap + (number of threads * thread stack size)

    I asked Brendan to decrease the thread stack size to 2048 (2M) by setting the following in the STAF startup script before starting STAFProc:

    export STAF_THREAD_STACK_SIZE=2048

    We'll see still if that helps the virtual memory usage.

     
  • Sharon Lucas

    Sharon Lucas - 2007-04-30

    Logged In: YES
    user_id=285070
    Originator: NO

    The problem happened again on Sunday 29th April. The STAX JVM log shows the following error (which is the same error that has occurred since the STAF_THREAD_STACK_SIZE was decreased):

    terminate called after throwing an instance of 'std::bad_alloc'
    what(): St9bad_alloc

    This still indicates that you're running out of memory.

    So, please try the following tuning updates on wplcauto:

    1) Remove the maximum heap size of 2048m when registering the STAX JVM (by changing the STAX service registration line in the STAF.cfg file). I think Bryan had increased the JVM max heap size to 2048m when the IBM Java 1.5.0 JVM for Linux was being used and it was not behaving correctly. When using Sun Java 6 server class JVM (as you currently are), the maximum heap size for a server class JVM is the smaller of either 1/4 physical memory or 1GB. Since your machine has 4GB of memory, the default maximum heap size should be 1024m (1G). A larger heap size takes away from memory for creating threads, etc. so let's try it without specifying the maximum heap size.

    2) Change to have the STAX service use it's default number of threads (10), by removing PARMS "NUMTHREADS 40" when registering the STAX service in the STAF.cfg file. Note that having more threads uses more memory. I think Bryan had increased the number of threads to 40 when we were debugging other problems which have already been resolved.

    So, the STAX registration line in your STAF.cfg file should look like:

    SERVICE STAX LIBRARY JSTAF \ EXECUTE {STAF/Config/STAFRoot}/services/stax/STAX.jar \ OPTION JVMName=STAFJVM

    Shutdown STAF and restart it to have it pick up the new STAX configuration options.

    Note: Keep export STAF_THREAD_STACK_SIZE=2048 in your STAF startup script.

    Also, we'd like to see how many threads are being used by the STAX JVM. Please check how many threads are being used at least once a day (while STAX jobs are running) and let me know this information. To see how many threads are being used by the STAX JVM, use gdb to break into the STAX JVM PID and run "info threads" to see how many threads are currently running in the STAX JVM and then exit. For example:

    1) Get the STAX JVM PID from the STAX JVM Log. For example, look at /usr/local/staf/data/STAF/lang/java/jvm/STAFJVM/JVMLog.1 and go to the bottom of the file and get the JVM PID from the startup information for the last time the JVM was started:

    ******************************************************************************
    *** 20070430-06:10:20 - Start of Log for JVMName: STAFJVM
    *** JVM Executable: java
    *** JVM Options : -Xmx2048m
    *** JVM PID : 20427
    ******************************************************************************

    2) Break into the STAX JVM using gdb:

    gdb java <STAX JVM Pid>

    run
    info threads

    3) Exit gdb.

     
  • Sharon Lucas

    Sharon Lucas - 2007-05-02

    Logged In: YES
    user_id=285070
    Originator: NO

    Added a new environment variable, STAF_RESULT_WARNING_SIZE, for debugging
    running out of memory issues. If this environment variable is set before
    STAFProc is started and the Warning tracepoint is enabled, trace warning
    messages will be logged about large results being requested/returned by
    service requests.

    Also, updated STAFProc.cpp to provide better error messages logged by
    catch blocks when errors occurs (especially std::bad_alloc exceptions)
    to get more information about when/why an OutOfMemory error may occur.

    Updated lang/java/JSTAF.cpp's createResultMethod() to check if
    env->NewStringUTF returned 0 which could happen if ran out of memory.

    Updated STAXJob's STAFQueueMonitor's run() method to catch exceptions
    and to better handle a result in case it is null due to running of of
    memory when trying to create the result.

    Here's a cvs diff of the changes:

    Index: stafif/STAF.cpp

    RCS file: /cvsroot/staf/src/staf/stafif/STAF.cpp,v
    retrieving revision 1.11
    diff -r1.11 STAF.cpp
    475a476
    > response = STAFString("Unknown Error");
    Index: lang/java/JSTAF.cpp
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/lang/java/JSTAF.cpp,v
    retrieving revision 1.15
    diff -r1.15 JSTAF.cpp
    843,845c843,847
    < // XXX: If javaString can not be created, I think we can create
    < // the exception object ok, but there may be a problem trying
    < // to print it out, as we will try to print out a null string.
    ---
    > if (javaString == 0)
    > {
    > // The Java VM threw an exception, so just return
    > return 0;
    > }
    Index: services/stax/service/STAXJob.java
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/services/stax/service/STAXJob.java,v
    retrieving revision 1.53
    diff -r1.53 STAXJob.java
    915c915,917
    < System.out.println("STAX Service Error: Invalid Logfile" +
    ---
    > STAXTimestamp currentTimestamp = new STAXTimestamp();
    > System.out.println(currentTimestamp.getTimestampString() +
    > " STAX Service Error: Invalid Logfile " +
    916a919
    >
    931,936c934,943
    < {
    < System.out.println("STAX Log failed with RC " +
    < result.rc + " and Result " + result.result +
    < " level: " + level +
    < " logRequest: " + logRequest);
    < }
    ---
    > {
    > STAXTimestamp currentTimestamp = new STAXTimestamp();
    >
    > System.out.println(
    > currentTimestamp.getTimestampString() +
    > " STAX Log failed with RC " + result.rc +
    > " and Result " + result.result +
    > " level: " + level +
    > " logRequest: " + logRequest);
    > }
    937a945
    >
    1007c1015,1020
    < System.out.println(e.toString());
    ---
    > STAXTimestamp currentTimestamp = new STAXTimestamp();
    >
    > System.out.println(
    > currentTimestamp.getTimestampString() +
    > " STAXJob::submitAsync - " + e.toString());
    >
    1248a1262,1286
    > if (result == null)
    > {
    > // Log a warning message in the JVM log and the STAX Job
    > // log and continue
    >
    > String warningMsg = "STAXJob$STAFQueueMonitor.run(): " +
    > "Result from a QUEUE GET WAIT request for STAX " +
    > "Job ID " + fJob.getJobNumber() + " is null. " +
    > "A possible cause is ran OutOfMemory when creating " +
    > "the result from submitting a request to a STAF " +
    > "service.";
    >
    > STAXTimestamp currentTimestamp = new STAXTimestamp();
    >
    > System.out.println(currentTimestamp.getTimestampString() +
    > " " + warningMsg);
    >
    > fJob.log(STAXJob.JOB_LOG, "warning", warningMsg);
    >
    > // XXX: Maybe we should do something to prevent an
    > // infinite loop if keep getting.
    >
    > continue;
    > }
    >
    1252a1291
    >
    1254a1294,1298
    >
    > try
    > {
    > // Debug:
    > if (false) System.out.println("Result: " + result.result);
    1256,1257c1300,1301
    < // Debug:
    < if (false) System.out.println("Result: " + result.result);
    ---
    > // Break up result
    > STAXSTAFMessage msg = new STAXSTAFMessage(result.result);
    1259,1260c1303
    < // Break up result
    < STAXSTAFMessage msg = new STAXSTAFMessage(result.result);
    ---
    > String type = msg.getType();
    1262c1305,1309
    < String type = msg.getType();
    ---
    > if (type != null &&
    > type.equalsIgnoreCase("STAF/Service/STAX/End"))
    > {
    > return;
    > }
    1264,1267c1311
    < if (type != null &&
    < type.equalsIgnoreCase("STAF/Service/STAX/End"))
    < {
    < return;
    ---
    > notifyListeners(msg);
    1268a1313,1333
    > catch (Throwable t)
    > {
    > // Log a warning message in the JVM log and the STAX Job
    > // log and continue
    >
    > String warningMsg = "STAXJob$STAFQueueMonitor.run(): " +
    > "Exception handling a queued message for STAX Job ID " +
    > fJob.getJobNumber() + ". Exception: " +
    > t + " Queued Message: ";
    >
    > // Only log up to 2000 bytes of the queue message
    >
    > if (result.result.length() < 2000)
    > warningMsg += result.result;
    > else
    > warningMsg += result.result.substring(0, 2000) + "...";
    >
    > STAXTimestamp currentTimestamp = new STAXTimestamp();
    >
    > System.out.println(currentTimestamp.getTimestampString() +
    > " Warning: " + warningMsg);
    1270c1335,1338
    < notifyListeners(msg);
    ---
    > fJob.log(STAXJob.JOB_LOG, "warning", warningMsg);
    >
    > continue;
    > }
    Index: services/stax/service/STAX.java
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/services/stax/service/STAX.java,v
    retrieving revision 1.107
    diff -r1.107 STAX.java
    35c35
    < static final String fVersion = "3.2.0";
    ---
    > static final String fVersion = "3.2.1 Beta 1";
    Index: stafproc/STAFFSService.cpp
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/stafproc/STAFFSService.cpp,v
    retrieving revision 1.81
    diff -r1.81 STAFFSService.cpp
    3026a3027,3048
    >
    > // Added for debugging memory issues
    >
    > if (gResultWarningSize)
    > {
    > if (fileLength > gResultWarningSize)
    > {
    > // Log a warning tracepoint message
    >
    > STAFString warningMsg = STAFString(
    > "WARNING: Submitting a FS GET FILE request for a large "
    > "file (") + fileLength + " bytes) uses a lot of memory. "
    > "STAFFSService::handleGet()";
    >
    > warningMsg += " - Client: " + requestInfo.fMachine +
    > ", Handle: " + STAFString(requestInfo.fHandle) +
    > ", Handle Name: " + requestInfo.fHandleName +
    > ", Request: " + requestInfo.fRequest;
    >
    > STAFTrace::trace(kSTAFTraceWarning, warningMsg);
    > }
    > }
    Index: stafproc/STAFProc.cpp
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/stafproc/STAFProc.cpp,v
    retrieving revision 1.188
    diff -r1.188 STAFProc.cpp
    187c187
    < STAFString gVersion("3.2.1");
    ---
    > STAFString gVersion("3.2.2 Alpha 2");
    220a221,224
    >
    > // Constant for debugging memory usage
    > unsigned int gResultWarningSize = 0;
    >
    735a740,768
    > // Check if STAF_RESULT_WARNING_SIZE environment variable is set.
    > // If it is, that indicates to log a trace warning message the
    > // length of the result from a service request is greater than the
    > // size specified by this environment variable.
    > // The warning size is specified in Megabytes and must be at least 1.
    >
    > if (getenv("STAF_RESULT_WARNING_SIZE") != NULL)
    > {
    > STAFString warningSizeString = getenv("STAF_RESULT_WARNING_SIZE");
    >
    > if (warningSizeString.isDigits())
    > {
    > unsigned int warningSize = warningSizeString.asUInt();
    >
    > if (warningSize > 0)
    > gResultWarningSize = warningSize * 1024000;
    > }
    >
    > if (gResultWarningSize == 0)
    > {
    > cout << endl
    > << "WARNING: Ignoring the STAF_RESULT_WARNING_SIZE "
    > << "environment variable because it's value is not valid."
    > << " It must be an integer > 0 which represents a size "
    > << "in megabytes. Invalid value: "
    > << warningSizeString << endl;
    > }
    > }
    >
    1684,1687c1717
    < STAFRC_t rc = gServiceManager.get(serviceRequest.fTargetService,
    < service, errorBuffer);
    <
    < if (rc)
    ---
    > try
    1689,1690c1719,1722
    < if ((serviceRequest.fSyncMode != kSTAFReqRetain) &&
    < (serviceRequest.fSyncMode != kSTAFReqQueueRetain))
    ---
    > STAFRC_t rc = gServiceManager.get(serviceRequest.fTargetService,
    > service, errorBuffer);
    >
    > if (rc)
    1692,1696c1724,1731
    < gRequestManager.requestCompleted(serviceRequest.fRequestNumber,
    < serviceResult);
    < gRequestManager.freeRequest(serviceRequest.fRequestNumber,
    < &requestResult);
    < }
    ---
    > if ((serviceRequest.fSyncMode != kSTAFReqRetain) &&
    > (serviceRequest.fSyncMode != kSTAFReqQueueRetain))
    > {
    > gRequestManager.requestCompleted(
    > serviceRequest.fRequestNumber, serviceResult);
    > gRequestManager.freeRequest(
    > serviceRequest.fRequestNumber, &requestResult);
    > }
    1698c1733
    < if (rc == kSTAFDoesNotExist) rc = kSTAFUnknownService;
    ---
    > if (rc == kSTAFDoesNotExist) rc = kSTAFUnknownService;
    1700,1702c1735,1737
    < if (serviceRequest.fSyncMode != kSTAFReqSync)
    < {
    < // pass back the request number
    ---
    > if (serviceRequest.fSyncMode != kSTAFReqSync)
    > {
    > // pass back the request number
    1704,1705c1739,1740
    < serviceResult = STAFServiceResult(kSTAFOk,
    < serviceRequest.fRequestNumber);
    ---
    > serviceResult = STAFServiceResult(
    > kSTAFOk, serviceRequest.fRequestNumber);
    1707,1708c1742,1743
    < connection->writeUInt(serviceResult.fRC);
    < connection->writeString(serviceResult.fResult);
    ---
    > connection->writeUInt(serviceResult.fRC);
    > connection->writeString(serviceResult.fResult);
    1710,1725c1745,1761
    < // send the rc/result to the requester's queue
    < if ((serviceRequest.fSyncMode == kSTAFReqQueue) ||
    < (serviceRequest.fSyncMode == kSTAFReqQueueRetain))
    < {
    < // Create the message to queue. The message is a marshalled
    < // map containing the request completion information.
    <
    < STAFObjectPtr mc = STAFObject::createMarshallingContext();
    < STAFObjectPtr messageMap = STAFObject::createMap();
    < messageMap->put("requestNumber",
    < STAFString(serviceRequest.fRequestNumber));
    < messageMap->put("rc", STAFString(rc));
    < messageMap->put("result", serviceRequest.fTargetService +
    < errorBuffer);
    < mc->setRootObject(messageMap);
    < STAFString resultMessage = mc->marshall();
    ---
    > // send the rc/result to the requester's queue
    > if ((serviceRequest.fSyncMode == kSTAFReqQueue) ||
    > (serviceRequest.fSyncMode == kSTAFReqQueueRetain))
    > {
    > // Create the message to queue. The message is a
    > // marshalled map containing the request completion
    > // information.
    >
    > STAFObjectPtr mc = STAFObject::createMarshallingContext();
    > STAFObjectPtr messageMap = STAFObject::createMap();
    > messageMap->put("requestNumber",
    > STAFString(serviceRequest.fRequestNumber));
    > messageMap->put("rc", STAFString(rc));
    > messageMap->put("result", serviceRequest.fTargetService +
    > errorBuffer);
    > mc->setRootObject(messageMap);
    > STAFString resultMessage = mc->marshall();
    1727c1763
    < // Submit the QUEUE request to the QUEUE service
    ---
    > // Submit the QUEUE request to the QUEUE service
    1729,1730c1765,1771
    < STAFString queueRequest(sQueueHandle + serviceRequest.fHandle);
    < queueRequest += sType + sRequestComplete;
    ---
    > STAFString queueRequest(sQueueHandle +
    > serviceRequest.fHandle);
    > queueRequest += sType + sRequestComplete;
    >
    > queueRequest += sMessage + sColon +
    > resultMessage.length(STAFString::kChar) + sColon +
    > resultMessage;
    1732,1734c1773,1775
    < queueRequest += sMessage + sColon +
    < resultMessage.length(STAFString::kChar) + sColon +
    < resultMessage;
    ---
    > gSTAFProcHandlePtr->submit(sLocal, sQueue, queueRequest,
    > kSTAFReqFireAndForget);
    > }
    1736,1737c1777
    < gSTAFProcHandlePtr->submit(sLocal, sQueue, queueRequest,
    < kSTAFReqFireAndForget);
    ---
    > return STAFServiceResult(kSTAFOk);
    1740c1780,1781
    < return STAFServiceResult(kSTAFOk);
    ---
    > return STAFServiceResult(
    > rc, serviceRequest.fTargetService + errorBuffer);
    1741a1783,1786
    > else
    > {
    > // Let the submitter continue by sending back the service result
    > // and then submit the actual request
    1743,1749c1788,1795
    < return STAFServiceResult(rc,
    < serviceRequest.fTargetService + errorBuffer);
    < }
    < else
    < {
    < // Let the submitter continue by sending back the service result
    < // and then submit the actual request
    ---
    > if (serviceRequest.fSyncMode != kSTAFReqSync)
    > {
    > // pass back the request number
    > serviceResult = STAFServiceResult(
    > kSTAFOk, serviceRequest.fRequestNumber);
    > connection->writeUInt(serviceResult.fRC);
    > connection->writeString(serviceResult.fResult);
    > }
    1751,1757c1797,1834
    < if (serviceRequest.fSyncMode != kSTAFReqSync)
    < {
    < // pass back the request number
    < serviceResult = STAFServiceResult(kSTAFOk,
    < serviceRequest.fRequestNumber);
    < connection->writeUInt(serviceResult.fRC);
    < connection->writeString(serviceResult.fResult);
    ---
    > // If the request originated from the local machine, assign trust
    > // based on the machine being local://local
    >
    > if (serviceRequest.fIsLocalRequest)
    > {
    > serviceRequest.fTrustLevel = gTrustManager.getTrustLevel(
    > "local", "local", "local", serviceRequest.fAuthenticator,
    > serviceRequest.fUserIdentifier);
    > }
    > else
    > {
    > serviceRequest.fTrustLevel = gTrustManager.getTrustLevel(
    > serviceRequest.fInterface, serviceRequest.fMachine,
    > serviceRequest.fPhysicalInterfaceID,
    > serviceRequest.fAuthenticator,
    > serviceRequest.fUserIdentifier);
    > }
    >
    > serviceResult = service->submitRequest(serviceRequest);
    >
    > if (gResultWarningSize &&
    > (serviceResult.fResult.length() > gResultWarningSize))
    > {
    > // Log a warning tracepoint message
    >
    > STAFString warningMsg = STAFString(
    > "WARNING: Result size is large (") +
    > serviceResult.fResult.length() +
    > " bytes). STAFProc::submitLocalRequest() - " +
    > serviceRequest.fTargetService +
    > " Service Request(" + serviceResult.fRC +
    > ") Client: " + serviceRequest.fMachine +
    > ", Handle: " + serviceRequest.fHandle +
    > ", Handle Name: " + serviceRequest.fHandleName +
    > ", Request: " + serviceRequest.fRequest;
    >
    > STAFTrace::trace(kSTAFTraceWarning, warningMsg);
    > }
    1760,1761c1837,1838
    < // If the request originated from the local machine, assign trust
    < // based on the machine being local://local
    ---
    > gRequestManager.requestCompleted(
    > serviceRequest.fRequestNumber, serviceResult);
    1763c1840,1841
    < if (serviceRequest.fIsLocalRequest)
    ---
    > if ((serviceRequest.fSyncMode == kSTAFReqQueue) ||
    > (serviceRequest.fSyncMode == kSTAFReqQueueRetain))
    1765,1767c1843,1863
    < serviceRequest.fTrustLevel = gTrustManager.getTrustLevel(
    < "local", "local", "local", serviceRequest.fAuthenticator,
    < serviceRequest.fUserIdentifier);
    ---
    > // Create the message to queue. The message is a marshalled map
    > // containing the request completion information.
    >
    > STAFObjectPtr mc = STAFObject::createMarshallingContext();
    > STAFObjectPtr messageMap = STAFObject::createMap();
    > messageMap->put("requestNumber",
    > STAFString(serviceRequest.fRequestNumber));
    > messageMap->put("rc", STAFString(serviceResult.fRC));
    > messageMap->put("result", serviceResult.fResult);
    > mc->setRootObject(messageMap);
    > STAFString resultMessage = mc->marshall();
    >
    > // Submit the QUEUE request to the QUEUE service
    >
    > STAFString queueRequest(sQueueHandle + serviceRequest.fHandle);
    > queueRequest += sType + sRequestComplete;
    > queueRequest += sMessage + sColon + resultMessage.length(
    > STAFString::kChar) + sColon + resultMessage;
    >
    > gSTAFProcHandlePtr->submit(sLocal, sQueue, queueRequest,
    > kSTAFReqFireAndForget);
    1769c1865,1867
    < else
    ---
    >
    > if ((serviceRequest.fSyncMode != kSTAFReqRetain) &&
    > (serviceRequest.fSyncMode != kSTAFReqQueueRetain))
    1771,1775c1869,1870
    < serviceRequest.fTrustLevel = gTrustManager.getTrustLevel(
    < serviceRequest.fInterface, serviceRequest.fMachine,
    < serviceRequest.fPhysicalInterfaceID,
    < serviceRequest.fAuthenticator,
    < serviceRequest.fUserIdentifier);
    ---
    > gRequestManager.freeRequest(serviceRequest.fRequestNumber,
    > &requestResult);
    1777,1778d1871
    <
    < serviceResult = service->submitRequest(serviceRequest);
    1780,1785c1873
    <
    < gRequestManager.requestCompleted(serviceRequest.fRequestNumber,
    < serviceResult);
    <
    < if ((serviceRequest.fSyncMode == kSTAFReqQueue) ||
    < (serviceRequest.fSyncMode == kSTAFReqQueueRetain))
    ---
    > catch (STAFException)
    1787,1788c1875,1880
    < // Create the message to queue. The message is a marshalled map
    < // containing the request completion information.
    ---
    > throw;
    > }
    > catch (std::bad_alloc)
    > {
    > // Doing a cout instead of STAFTrace::trace as may not have
    > // enough memory to create a STAFString trace message
    1790,1797c1882,1887
    < STAFObjectPtr mc = STAFObject::createMarshallingContext();
    < STAFObjectPtr messageMap = STAFObject::createMap();
    < messageMap->put("requestNumber",
    < STAFString(serviceRequest.fRequestNumber));
    < messageMap->put("rc", STAFString(serviceResult.fRC));
    < messageMap->put("result", serviceResult.fResult);
    < mc->setRootObject(messageMap);
    < STAFString resultMessage = mc->marshall();
    ---
    > cout << "ERROR: Ran out of memory. STAFProc::submitLocalRequest()"
    > << " - " << serviceRequest.fTargetService
    > << " Service Request - Client: " << serviceRequest.fMachine
    > << ", Handle: " << serviceRequest.fHandle
    > << ", Handle Name: " << serviceRequest.fHandleName
    > << ", Request: " << serviceRequest.fRequest << endl;
    1799c1889,1893
    < // Submit the QUEUE request to the QUEUE service
    ---
    > throw;
    > }
    > catch (...)
    > {
    > STAFString data("Caught unknown exception in submitLocalRequest(): ");
    1801,1804c1895,1899
    < STAFString queueRequest(sQueueHandle + serviceRequest.fHandle);
    < queueRequest += sType + sRequestComplete;
    < queueRequest += sMessage + sColon +
    < resultMessage.length(STAFString::kChar) + sColon + resultMessage;
    ---
    > data += serviceRequest.fTargetService + " Service Request - Client: " +
    > serviceRequest.fMachine +
    > ", Handle: " + STAFString(serviceRequest.fHandle) +
    > ", Handle Name: " + serviceRequest.fHandleName +
    > ", Request: " + serviceRequest.fRequest;
    1806,1808c1901
    < gSTAFProcHandlePtr->submit(sLocal, sQueue, queueRequest,
    < kSTAFReqFireAndForget);
    < }
    ---
    > STAFTrace::trace(kSTAFTraceError, data);
    1810,1814c1903
    < if ((serviceRequest.fSyncMode != kSTAFReqRetain) &&
    < (serviceRequest.fSyncMode != kSTAFReqQueueRetain))
    < {
    < gRequestManager.freeRequest(serviceRequest.fRequestNumber,
    < &requestResult);
    ---
    > throw;
    1964c2053,2121
    < result = connection->readString();
    ---
    >
    > if (!gResultWarningSize)
    > {
    > result = connection->readString();
    > }
    > else
    > {
    > // Added for debugging memory issues
    > // Copied code from STAFConnection::readString()
    >
    > // First read in the UTF-8 data
    >
    > unsigned int dataSize = connection->readUInt();
    >
    > if (dataSize > gResultWarningSize)
    > {
    > // Log a warning tracepoint message
    >
    > STAFString warningMsg = STAFString(
    > "WARNING: Result size is large (") + dataSize +
    > " bytes). STAFProc::submitRemoteRequest() - Remote " +
    > serviceRequest.fTargetService +
    > " Service Request(" + rc +
    > ") To: " + serviceRequest.fTargetMachine +
    > ", from Handle: " + serviceRequest.fHandle +
    > ", Handle Name: " + serviceRequest.fHandleName +
    > ", Request: " + serviceRequest.fRequest;
    >
    > STAFTrace::trace(kSTAFTraceWarning, warningMsg);
    > }
    >
    > char *inputData = 0;
    >
    > try
    > {
    > inputData = new char[dataSize];
    >
    > connection->read((void *)inputData, dataSize);
    >
    > // Indicate UTF-8 data since default is Current Code Page
    > result = STAFString(inputData, dataSize, STAFString::kUTF8);
    > delete [] inputData;
    > }
    > catch (std::bad_alloc)
    > {
    > if (inputData != 0)
    > delete [] inputData;
    >
    > // Doing a cout instead of STAFTrace::trace as may not have
    > // enough memory to create a STAFString trace message
    >
    > cout << "ERROR: Ran out of memory creating a STAFString "
    > << "for a result (" << dataSize << " bytes). "
    > << "STAFProc::submitRemoteRequest() - Remote "
    > << serviceRequest.fTargetService
    > << " Service Request(" << rc
    > << ") - To: " + serviceRequest.fTargetMachine
    > << ", from Handle: " << serviceRequest.fHandle
    > << ", Handle Name: " << serviceRequest.fHandleName
    > << ", Request: " << serviceRequest.fRequest << endl;
    > }
    > catch (...)
    > {
    > if (inputData != 0)
    > delete [] inputData;
    >
    > throw;
    > }
    > }
    1978a2136,2145
    > STAFString data("Caught STAFException: ");
    >
    > data += serviceRequest.fTargetService + " Service Request (" + rc +
    > ") - To: " + serviceRequest.fMachine +
    > ", from Handle: " + STAFString(serviceRequest.fHandle) +
    > ", Handle Name: " + serviceRequest.fHandleName +
    > ", Request: " + serviceRequest.fRequest;
    >
    > STAFTrace::trace(kSTAFTraceError, data);
    >
    1980a2148,2160
    > catch (std::bad_alloc)
    > {
    > // Doing a cout instead of STAFTrace::trace as may not have
    > // enough memory to create a STAFString trace message
    >
    > cout << "ERROR: Ran out of memory. STAFProc::submitRemoteRequest()"
    > << " - Remote " << serviceRequest.fTargetService
    > << " Service Request(" << rc
    > << ") - To: " + serviceRequest.fTargetMachine
    > << ", from Handle: " << serviceRequest.fHandle
    > << ", Handle Name: " << serviceRequest.fHandleName
    > << ", Request: " << serviceRequest.fRequest << endl;
    > }
    1983,1985c2163,2171
    < STAFTrace::trace(
    < kSTAFTraceError,
    < "Caught unknown exception in submitRemoteRequest()");
    ---
    > STAFString data("Caught unknown exception in submitRemoteRequest(): ");
    >
    > data += serviceRequest.fTargetService + " Service Request (" + rc +
    > ") - To: " + serviceRequest.fMachine +
    > ", from Handle: " + STAFString(serviceRequest.fHandle) +
    > ", Handle Name: " + serviceRequest.fHandleName +
    > ", Request: " + serviceRequest.fRequest;
    >
    > STAFTrace::trace(kSTAFTraceError, data);
    Index: stafproc/STAFProc.h
    ===================================================================
    RCS file: /cvsroot/staf/src/staf/stafproc/STAFProc.h,v
    retrieving revision 1.27
    diff -r1.27 STAFProc.h
    73a74
    > extern unsigned int gResultWarningSize;

     
  • Sharon Lucas

    Sharon Lucas - 2007-05-03

    Logged In: YES
    user_id=285070
    Originator: NO

    I sent you the following updated STAF/STAX files to try on wplcauto. Hopefully they will provide more information when you get an error about running out of memory (e.g. bad_alloc, etc.)

    /usr/local/staf/bin/STAFProc
    /usr/local/staf/lib/libSTAF.so
    /usr/local/staf/lib/libJSTAF.so
    /usr/local/staf/services/stax/STAX.jar

    1) Shutdown STAF and then backup and replace these files provided above.

    2) Also, in your shell script that starts STAFProc (/usr/local/staf/autostart.sh), before STAFProc is started (e.g. after line "export STAF_THREAD_STACK_SIZE=2048"), add the following line to set environment variable STAF_RESULT_WARNING_SIZE to 25

    export STAF_RESULT_WARNING_SIZE=25

    This will cause a trace warning message to be logged to your STAFProc output file whenever a STAF service request's result size is > 25M and when submitting a FS GET FILE for a file > 25M (if the Warning tracepoint is enabled).

    3) Update your STAF.cfg file to enable the Warning tracepoint:

    trace enable tracepoints "error deprecated info warning"

    4) Also, specify a maximum heap size of 1024m and don't specify the NUMTHREADS parameter when registering the STAX service in your STAF.cfg file. For example:

    SERVICE STAX LIBRARY JSTAF \ EXECUTE {STAF/Config/STAFRoot}/services/stax/STAX.jar \ OPTION JVMName=STAFJVM \ OPTION J2=-Xmx1024m

    If the above causes a problem when using the STAX service, then go back to specifying a maximum heap size of 2048m and don't specify the NUMTHREADS parameter. For example:

    SERVICE STAX LIBRARY JSTAF \ EXECUTE {STAF/Config/STAFRoot}/services/stax/STAX.jar \ OPTION JVMName=STAFJVM \ OPTION J2=-Xmx2048m

    5) Start up STAF. Verify that the STAF version is 3.2.2 Alpha 2 and the STAX version is 3.2.1 Beta 1 and verify that everything appears to be working correctly.

    # staf local misc version
    Response
    --------
    3.2.2 Alpha 2

    # staf local stax version
    Response
    --------
    3.2.1 Beta 1

    6) When you encounter another incident with STAF/STAX, be sure to save the /usr/local/staf/STAFProc.out file for us to look at before you restart STAF.

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.