Menu

#143 STAF crashes on Solaris 10

open
None
5
2014-08-27
2010-03-24
JoeVeilleux
No

We have a fully-automated test system based on STAF, STAX, and Lotus Automator which drives testcases
on target systems with a variety of archtectures. By and large the system works but we are having a
problem with certain functions on Solaris.

Problem description:

When using Solaris machine 'speedy' during automated test runs, the STAF process sometimes crashes.

The problem is intermittent, and seems to occur more frequently during runs in which we are also
upgrading the Domino software on the machine, although it has also occurred during runs in which we
were not upgrading Domino. We don't know of any logical relationship between upgrading Domino and the
failing operation (details below) so we think the observed connection may be a coincidence.

The crash seems to occur when the ZIP service is un-zipping a copy of a Domino server's data directory
from a ZIP file on the local Solaris machine, into a directory which is also local on the Solaris machine.

The ZIP file is ~1.2Gb zipped, and would be ~4.7Gb unzipped. When the crash occurred on the most
recent occasion, no files were un-zipped into the target directory (hmmm... so I wonder if the ZIP
operation couldn't process because STAF had already crashed? no... doesn't look like it... the
immediate-prior operation is executing a STAF process start... command to run the UNIX 'mv' command
to rename a directory, and it succeeded (as evidenced by RC and also by the fact that the directory
was in fact renamed on the target system)....

1463 Info 20100324-13:36:41 ActivateTestConfig_SaveOffProductDefaultDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Renaming the current default data directory '/local/notesdata' to '/local/notesdata_SAVED_ORIGINAL'...
1464 Debug 20100324-13:36:41 ActivateTestConfig_SaveOffProductDefaultDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Command to rename directory: 'mv "/local/notesdata" "/local/notesdata_SAVED_ORIGINAL"'
... (lines related to a different machine removed) ...
1466 Info 20100324-13:36:42 afclient7.swg.usma.ibm.com.runTestSuite cafRunCommand: Successfully executed: "mv "/local/notesdata" "/local/notesdata_SAVED_ORIGINAL" None". STAXResult:
1467 Info 20100324-13:36:42 ActivateTestConfig_SaveOffProductDefaultDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Notes/Domino data directory '/local/notesdata' was successfully renamed to '/local/notesdata_SAVED_ORIGINAL'. RC='0' Result=''
1468 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Un-zipping the zipfile containing this TestConfig's data directory...
1469 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Creating Notes/Domino data directory: '/local/notesdata'
... (lines related to a different machine removed) ...
1472 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Done creating Notes/Domino data directory '/local/notesdata'. RC=0 STAFResult=''
1473 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Un-ZIPing data directory...
1474 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): SOURCE: ZIP file: '/local/auto/BTN3_Server2.zip'
1475 Info 20100324-13:36:42 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): TARGET: Notes/Domino data directory: '/local/notesdata'
... (lines related to a different machine removed) ...
1484 Error 20100324-13:36:45 ActivateTestConfig_UnZIPTestConfigToDataDir: BTN3: Server2(speedy.swg.usma.ibm.com): Error un-ZIPing data directory from ZIP file '/local/auto/BTN3_Server2.zip' to Notes/Domino data directory '/local/notesdata'. Elapsed time=00:00:02.63 RC=22 STAFResult='STAFConnectionReadUInt: Error reading from socket: other side closed socket: 22'
1485 Error 20100324-13:36:45 SetupTestConfig: BTN3: Server2(speedy.swg.usma.ibm.com): Done initial TestConfig setup on this machine. RC=1 Result='killDomino=OK. DeployTestConfig=OK. ActivateTestConfig=FAIL. '.

When the crash occurs, from the perspective of the machine which requested the ZIP service (our Lotus Automator
server), the STAF RC is '22' and the message is 'STAFConnectionReadUInt: Error reading from socket: other side
closed socket: 22'.

There are no obvious symptoms on the Solaris machine. We have tried running STAFProc directly in a command
window so as to make sure we see any relevant console output, but there isn't any. We just see the normal
STAF startup messages, then no output at all while STAF is up (which is typically many hours or even days),
then the STAFProc process just goes away. There are no console messages when it dies.

The crash does create a core file, which is approximately 24Mb in size. On the most recent occurrence on
3/24/10 I used 'ptrace' to get a stack trace (saved as file core_ptrace.txt). It describes a total of 25
threads, each of which seem to have recognizable STAF-related functions in their stack. Beyong that, I don't
really know what I'm looking for in it so I don't know if there is anything interesting here. I also loaded
the core file using dbx (note to self: dbx is at: /opt/SUNWspro/prod/bin) and saved the initial output
as core_dbxoutput.txt but again, I don't really know what I'm looking for in it...

As a knee-jerk reaction, a colleague suggested that we upgrade STAF to the latest version to
see if the problem goes away. We requested this from our lab admin, but instead of putting the latest version
(3.4.0, as of this writing), he put the latest LA-provided version (3.3.4). Using 3.3.4, the problem still
occurs.

Here is some info about the machine:

#######################################
# uname -a #
#######################################
SunOS speedy.swg.usma.ibm.com 5.10 Generic_118833-03 sun4v sparc SUNW,Sun-Fire-T200

#######################################
# install.properties #
#######################################
version=3.3.4
platform=solaris-sparc
architecture=32-bit
installer=STAFInst
file=STAF334-solaris-sparc.tar
osname=SunOS
osversion=5.6+
osarch=sparc

...hmm... interestingly, when STAF starts up it says it's version 3.3.4.1:

speedy:root{607}# /usr/local/staf/bin/STAFProc

Machine : speedy.swg.usma.ibm.com
Machine nickname : speedy.swg.usma.ibm.com
Startup time : 20100324-15:51:27

STAFProc version 3.3.4.1 initialized

...and 'staf local misc version' agrees:
speedy:root{609}# staf local misc version
Response
--------
3.3.4.1

#######################################
# STAF.cfg #
#######################################

# Turn on tracing of internal errors and deprecated options
trace enable tracepoints "error deprecated"

# Enable TCP/IP connections
#interface ssl library STAFTCP option Secure=Yes option Port=6550
interface tcp library STAFTCP option Secure=No option Port=6500

# Set default local trust
trust machine local://local level 5

# Add default service loader
serviceloader library STAFDSLS
#This interface controls the STAF.cfg file that is used while installing the STAF peers via LA. The entries in this interface will be appended to the default STAF.cfg file that is installed by the STAF installation media. Please see the STAF user's guide at http://staf.sourceforge.net/current/STAFUG.htm for further details on STAF.cfg settings.

#Set the trust level for the LA server
trust machine tcp*://dev-la* level 5

trust machine tcp*://*.ibm.com level 5

#######################################
# isainfo #
#######################################
speedy:root{580}# isainfo
sparcv9 sparc
speedy:root{581}# isainfo -b
64

#######################################
# psrinfo #
#######################################
speedy:root{582}# psrinfo
0 on-line since 03/08/2010 13:57:44
1 on-line since 03/08/2010 13:57:45
2 on-line since 03/08/2010 13:57:45
3 on-line since 03/08/2010 13:57:45
4 on-line since 03/08/2010 13:57:45
5 on-line since 03/08/2010 13:57:45
6 on-line since 03/08/2010 13:57:45
7 on-line since 03/08/2010 13:57:45
8 on-line since 03/08/2010 13:57:45
9 on-line since 03/08/2010 13:57:45
10 on-line since 03/08/2010 13:57:45
11 on-line since 03/08/2010 13:57:45
12 on-line since 03/08/2010 13:57:45
13 on-line since 03/08/2010 13:57:45
14 on-line since 03/08/2010 13:57:45
15 on-line since 03/08/2010 13:57:45
16 on-line since 03/08/2010 13:57:45
17 on-line since 03/08/2010 13:57:45
18 on-line since 03/08/2010 13:57:45
19 on-line since 03/08/2010 13:57:45
20 on-line since 03/08/2010 13:57:45
21 on-line since 03/08/2010 13:57:45
22 on-line since 03/08/2010 13:57:45
23 on-line since 03/08/2010 13:57:45
24 on-line since 03/08/2010 13:57:45
25 on-line since 03/08/2010 13:57:45
26 on-line since 03/08/2010 13:57:45
27 on-line since 03/08/2010 13:57:45
28 on-line since 03/08/2010 13:57:45
29 on-line since 03/08/2010 13:57:45
30 on-line since 03/08/2010 13:57:45
31 on-line since 03/08/2010 13:57:45
speedy:root{583}# psrinfo -pv
The physical processor has 32 virtual processors (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)
UltraSPARC-T1 (cpuid 0 clock 1000 MHz)

Hmmmm.... I see that this is a 64-bit processor... but my STAF version is for 32-bit. Is that a problem?

I'm going to try putting the latest version of STAF (3.4.0) on the machine and see if it helps...
(want 32-bit -- which is on there now -- or 64-bit?)

Searched the BugZilla database on staf.sourceforge.net but didn't find anything that looks directly
applicable. Searched using 'Solaris crash sparc zip' which yields 159 results, but none of them seem
to relate directly to this problem... here are some relevant observations:

Note bug report:
STAFProc core dump - ID: 1746872
...but the failing function is different...

Note feature request:
ZIP fails on achives containing > 65.535 entries - ID: 1807608
-- Could we have that many entries in this ZIP? I suppose that's possible... but the documented behavior
is an error (incomplete directory listing, not all files unzipped) not a crash...

So -- two specific questions:
1. Does it look to you like I should be using the 64-bit version of STAF, or is my 32-bit version OK?
2. Do you have any insight into what's causing this crash and what I can do to fix it?

Discussion

  • JoeVeilleux

    JoeVeilleux - 2010-03-24

    ptrace output (stack trace)

     
  • JoeVeilleux

    JoeVeilleux - 2010-03-24

    dbx output when loading the core file

     
  • Sharon Lucas

    Sharon Lucas - 2010-03-24
    • assigned_to: nobody --> slucas
     
  • Sharon Lucas

    Sharon Lucas - 2010-03-24

    It would be interesting to see if the ZIP service's UNZIP request for this 1.2Gb file failed the same way when using the 64-bit version of STAF.

    Note that you can run either the 32-bit or 64-bit version of an application like STAF on a 64-bit machine. The 64-bit version should be faster though.

    Note that the 64-bit version of STAF provides support for a 64-bit JVM. So if you were registering a STAF java service or submitting STAF requests from a Java program, you would need to use a 64-bit JVM.

    Similarly, the 32-bit version of STAF provides support for a 32-bit JVM. So if you were registering a STAF java service or submitting STAF requests from a Java program, you would need to use a 32-bit JVM.

    Now, back to the problem where the UNZIP request submitted to the ZIP service is causing STAFProc to crash.

    You should be able to prevent STAFProc from crashing by using the STAFEXECPROXY when registering the ZIP service by adding the following line to the STAF.cfg file on your Solaris machine:

    SERVICE Zip LIBRARY STAFEXECPROXY EXECUTE STAFZip

    Note that running the ZIP service in a separate executable will ensure that if the service has a fatal error, the error will not kill STAFProc. In addition, this allows monitoring of the ZIP service's system resource utilization, since you can view the utilization for the new executable (otherwise, if the service was running within the STAFProc executable, then the service's resource utilization would be part of the STAFProc resource utilization).

    Registering the ZIP service using STAFEXECPROXY won't prevent the ZIP UNZIP request from failing, but it should prevent STAFProc from crashing.

    My guess is that there could be a bug in the STAF ZIP service. However, it's strange that the problem is intermittent. Are you saying that sometimes unzipping this same 1.2Gb zip file using the STAF ZIP on this Solaris machine works and sometimes it fails? How frequently would you say it is failing (50% of the time, etc). Perhaps it sometimes is running out of memory when other processes are running simultaneously? Are you upgrading the Domino software or running other processes simultaneously when you are unzipping this file?

    Can I access this Solaris machine and this zip file that is causing this problem to recreate this problem and do some debugging? I could do this with a separate instance of STAFProc running so that it shouldn't effect your version of STAF running on it? If so, let me know via a note. Don't post any confidential info on SourceForge.

     
  • Sharon Lucas

    Sharon Lucas - 2010-03-24

    Or, maybe if you give me access to your zip file, I might be able to recreate the UNZIP problem on our own Solaris Sparc 64-bit machine (so I don't have to use your machine).

     
  • Sharon Lucas

    Sharon Lucas - 2010-03-24

    Also, are you sure you are not running out of disk space when unzipping the file? (e.g. after you've upgraded the Domino software and this machine maybe there's not enough disk space left for unzipping this very large file)

    What utility is creating this zip file?

     
  • JoeVeilleux

    JoeVeilleux - 2010-03-24

    Yes I could provide the ZIP file to you as well as give you direct access to this machine, but I will do that via separate in-company mail.

    In response to some of your specific questions:

    > Are you saying that sometimes unzipping this same 1.2Gb zip file using the STAF ZIP on this Solaris machine works and sometimes it fails?

    JoeV: Yes -- that is exactly what I am saying.

    > How frequently would you say it is failing (50% of the time, etc).

    JoeV: Probably ~25% of the time... but don't quote me on a precise number.

    > Perhaps it sometimes is running out of memory when other processes are running simultaneously?

    JoeV: Of course that's possible but probably not that likely. There's nothing else running on this machine apart from this test. So unless something about this operation is causing a massive requirement for memory it's unlikely that the machine is running out (or unless something somehow hung around from some previous step...?)

    > Are you upgrading the Domino software or running other processes simultaneously when you are unzipping this file?

    JoeV: No, those are sequential processes -- our STAX code goes through upgrading software on each of the participating machines (one client and three servers in this case), then enters another phase where it's getting the servers and clients ready for a particular batch of tests and it needs to set up their Notes and Domino data directories. The software-install of Notes and Domino has finished by this point, and the Domino servers have not yet been started. So from the machine's perspective literally all it's doing is going through a sequential set of steps copying some files onto this machine and then putting them in various places on the machine in preparation for starting the server and running tests. In most cases "putting them in various places" means "using the STAF ZIP service to un-zip a ZIPfile from a location on the local machine into a directory which is also on the local machine" (in particular, that's what it means for the step that fails). Interestingly, we use the ZIP service to un-zip several other ZIP files during the course of this same macro-process and they never fail; it's just this one that fails occasionally. I imagine this is probably because of the size -- the other ZIP files are much smaller, this one is ~1.2Gb zipped.

    Elsewhere you asked:

    > Also, are you sure you are not running out of disk space when unzipping the
    file? (e.g. after you've upgraded the Domino software and this machine
    maybe there's not enough disk space left for unzipping this very large
    file)

    JoeV: That's always an issue but not likely the culprit. The partition in question is 67Gb, of which 45Gb is currently available. I didn't check it at the precise point of error but it wouldn't have been any more that ~5Gb different from the current state.

    > What utility is creating this zip file?

    JoeV: The STAF ZIP service is also creating this file, via a Windows BAT file that invokes the STAF ZIP service via command-line.

    I greatly appreciate your clarifications about the 32-bit versus 64-bit versions of STAF. For my purposes I guess I can run either, since I have no Java services registered on this machine nor am I making any service requests to STAF from any processes on this machine (this is a "target" of test, so really all it does is some stuff to set up -- as directed from STAX code executing on the LA server, then runs processes via the PROCESS service and returns STDOUT/STDERR and files -- there are no STAF services other than the basic/bundled ones running, nor are any of our test-target harness programs STAF-aware).

    So -- Tomorrow I will send you an email separately with details about the location of the ZIP file and the machine, and I will plan to probably put the 64-bit version of STAF 3.4.0 on that machine.

     

Log in to post a comment.