Menu

#1108 STAFProc creates lots of threads which never go away

Unix
closed-fixed
5
2008-04-10
2008-04-01
No

I have a perl script which runs in a loop forever (while (1)).
Its purpose is to check the status of the test machines (whether they are up, what type of machine they are, which revision, etc).

Sometimes my process needs to be stopped and restarted.
When stopping the process I notice that STAFProc gets a lot of New thread messages and the virtual memory size increases accordingly.

It might only happen on a stressed system where STAFProc starts getting lots of requests at once.

The main problem is that these threads never quit. They are always waiting for work and eventually cause STAFProc to go over 3G of virtual memory.

Discussion

  • Sharon Lucas

    Sharon Lucas - 2008-04-01

    Logged In: YES
    user_id=285070
    Originator: NO

    We need more information.

    What version of STAF are you running? Note that if it's not the current version of STAF (V3.2.5), we're going to ask you to upgrade to it to see if this problem still occurs. :-)

    What operating system is this happening on?

    Can you provide the output of the following command when run on this machine?
    STAF local VAR LIST

    Can you provide more information on what this Perl script is doing (in terms of what STAF service service requests it submits)?

     
  • prziborowski

    prziborowski - 2008-04-02

    Logged In: YES
    user_id=1535417
    Originator: YES

    Of course 3.2.5 :) I have observed the issue since at least 3.1.5 but think it has been around for all 3.x version.

    I have observed it on various flavors of Linux (RHEL3,4, Debian, Gentoo). I haven't run these scripts on Windows, so I don't know if it happens on Windows.

    This script runs PING, FS, PROCESS, VAR services. Either get file, copy file, run processes (with wait), VAR to get STAF/Config/OS/Name.

    I have an additional question though (which may workaround this issue for now). These workerThreads are waiting for an eventsem (timeout=4294967295). Is there a setting to change the timeout and if it times out, the workerThread will go away? In the end I have ~1500 of these threads and I know I've never had this many concurrent processes.

    # staf local var list system
    Response
    --------
    STAF/Config/BootDrive : /
    STAF/Config/CodePage : UTF-8
    STAF/Config/ConfigFile : /usr/local/staf/bin/STAF.cfg
    STAF/Config/DefaultAuthenticator: none
    STAF/Config/DefaultInterface : tcp
    STAF/Config/InstanceName : STAF
    STAF/Config/Machine : machine1.example.com
    STAF/Config/MachineNickname : machine1.example.com
    STAF/Config/Mem/Physical/Bytes : 3989016576
    STAF/Config/Mem/Physical/KB : 3895524
    STAF/Config/Mem/Physical/MB : 3804
    STAF/Config/OS/MajorVersion : 2.6.9-34.ELsmp
    STAF/Config/OS/MinorVersion : #1 SMP Fri Feb 24 16:54:53 EST 2006
    STAF/Config/OS/Name : Linux
    STAF/Config/OS/Revision : i686
    STAF/Config/Sep/Command : ;
    STAF/Config/Sep/File : /
    STAF/Config/Sep/Line :

    STAF/Config/Sep/Path : :
    STAF/Config/STAFRoot : /usr/local/staf
    STAF/Config/StartupTime : 20080331-16:24:18
    STAF/DataDir : /usr/local/staf/data/STAF
    STAF/Env/_ : ./STAFProc
    STAF/Env/CLASSPATH : /usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar:/usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar:/usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar
    STAF/Env/G_BROKEN_FILENAMES : 1
    STAF/Env/HISTSIZE : 1000
    STAF/Env/HOME : /root
    STAF/Env/HOSTNAME : machine1.example.com
    STAF/Env/INPUTRC : /etc/inputrc
    STAF/Env/KDEDIR : /usr
    STAF/Env/LANG : en_US.UTF-8
    STAF/Env/LD_LIBRARY_PATH : /usr/local/staf/lib:/usr/local/staf/lib
    STAF/Env/LESSOPEN : |/usr/bin/lesspipe.sh %s
    STAF/Env/LOGNAME : root
    STAF/Env/LS_COLORS : no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mng=01;35:*.xcf=01;35:*.pcx=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.avi=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.mov=01;35:*.qt=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.mp3=00;36:*.wav=00;36:*.mid=00;36:*.midi=00;36:*.au=00;36:*.ogg=00;36:*.flac=00;36:*.aac=00;36:
    STAF/Env/MAIL : /var/spool/mail/root
    STAF/Env/OLDPWD : /root/staf/src/staf
    STAF/Env/PATH : /usr/local/staf/bin:/usr/local/staf/bin:/usr/local/staf/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
    STAF/Env/PWD : /usr/local/staf/bin
    STAF/Env/SHELL : /bin/bash
    STAF/Env/SHLVL : 1
    STAF/Env/SSH_ASKPASS : /usr/libexec/openssh/gnome-ssh-askpass
    STAF/Env/SSH_CLIENT : ::ffff:10.20.122.162 1911 22
    STAF/Env/SSH_CONNECTION : ::ffff:10.20.122.162 1911 ::ffff:10.17.22.213 22
    STAF/Env/SSH_TTY : /dev/pts/3
    STAF/Env/STAFCONVDIR : /usr/local/staf/codepage
    STAF/Env/TCLLIBPATH : /usr/local/staf/bin:/usr/local/staf/bin
    STAF/Env/TERM : xterm
    STAF/Env/USER : root
    STAF/Version : 3.2.5
    testdb/property/file : /exit14/home/qa/staf/services/Java/esx310/vmqaDB.properties
    testdb/tmpdir : /tmp/staf_service
    vim/keyStorePath : /exit14/home/qa/staf/services/Java/esx310/client.keystore

     
  • prziborowski

    prziborowski - 2008-04-02

    STAF script used for reproducing thread creation

     
  • prziborowski

    prziborowski - 2008-04-02

    Logged In: YES
    user_id=1535417
    Originator: YES

    I am attaching a perl script (stripped down version).
    The hostnames will need to be changed to valid hosts.
    I'd let the script run for at least a few minutes before ^C it.

    It is faster to reproduce from a machine that just started STAFProc since there are worker threads.

    File Added: staf_bug1931665_hostAvail.pl

     
  • prziborowski

    prziborowski - 2008-04-04

    Logged In: YES
    user_id=1535417
    Originator: YES

    I just want to add that I believe this is the reason we are hitting bug 1688297 frequently.
    Even using 2M threads we still hit it about once a week.

    [ 1688297 ] OutOfMemory issues on Linux creating a thread and elsewhere

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-04

    Logged In: YES
    user_id=285070
    Originator: NO

    I haven't had a chance yet to try to recreate the problem using the script you provided.

    As a possible workaround (a way that would delay this problem from occurring for a longer
    period of time) is to set the STAF thread stack size environment variable to 1M (and
    restart STAFProc) instead of using the current maximum default thread size of 2M for Linux.

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-04

    Logged In: YES
    user_id=285070
    Originator: NO

    I ran your Perl program and did Ctl-C and saw that a bunch of new threads get created. I'll try to debug it more next week to see what's happening.

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-04

    Updated Perl script that only creates one STAF handle

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-04

    Logged In: YES
    user_id=285070
    Originator: NO

    Your Perl program is creating lots of new STAF handles instead of creating one
    STAF handle and using it to submit all STAF service requests. This is not how
    we intended/envisioned STAF handles to be used. Generally a process only needs
    to create one STAF handle and use it to submit STAF service requests. For
    example, I modified your program to only create one STAF handle and to use it
    to submit all STAF service requests. When I run it, there won't be lots of
    threads generated when I CTL-C the program. I attached this Perl program as
    staf_bug1931665_hostAvailX.pl. Do you really have a need to create more than
    one STAF handle in your program? If so, please explain why. Otherwise, if
    you change it to only create one STAF handle, you won't see this issue.

    Now, in the situation where many STAF handles are created, like what your
    original Perl script does, I'll investigate further to see if there is a bug
    in STAF regarding this.

    File Added: staf_bug1931665_hostAvailX.pl

     
  • prziborowski

    prziborowski - 2008-04-07

    Logged In: YES
    user_id=1535417
    Originator: YES

    For the hostAvail, it appears that we can use one STAF object and it should be fine.

    Our other script is multi-threaded and it appears the way queues work is that there is no way many threads can share one STAF object without getting the wrong data.
    Even a single threaded process seems difficult in this manner (send 2 requests with notify onend and when I get a response, I'd have to parse to find out which one returned).

    So, it seems that each thread would need one handle if using any async operations (or risk getting another threads messages).

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-07

    Logged In: YES
    user_id=285070
    Originator: NO

    Yes, you can have a multi-threaded program and use a single
    handle to submit STAF service requests via separate threads.
    You want to have a separate thread that handles all of the
    STAF/RequestComplete messages on the handle's queue.

    For example, we provide several STAF services (such as STAX)
    that do exactly that. The service submits STAF service requests
    asynchronously (on separate threads) using a single STAF handle
    that was created for the service. When you submit a request
    asynchronously, you get the request number returned in the
    result. If desired, the requset number can be stored in a map
    (along with any other information needed). In a separate thread,
    the service monitors the queue by submitting a QUEUE GET WAIT
    request in a loop. Each time it gets a message from the queue,
    it checks its type (e.g. STAF/RequestComplete, STAF/Process/End).
    If it's a STAF/RequestComplete message, it gets the requestNumber
    that's provided in the message and matches it with a requestNumber
    in the map the service maintains to find out which STAF service
    request completed and notifies the thread that submitted the
    request that it completed and provides it with the rc and result
    for the service request.

    Note that the format of a messaage of type "STAF/RequestComplete" is:

    Key Name Type
    ------------- --------
    requestNumber <String>
    rc <String>
    result <String>

    Note, you don't "parse" messages, you unmarshall them.

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-07

    Logged In: YES
    user_id=285070
    Originator: NO

    Let me explain a little more about why this problem is
    occurring. I'll start by discussing STAF's Thread Manager.
    STAF maintains a pool of reusable threads. If a new thread is
    required by something and all threads are currently in use,
    then a new thread is created. Threads are never deleted/
    removed from STAF's thread pool by design.

    You're creating lots of handles from a single process (a Perl
    program) and deleting each handle when you're done with it.
    Each time a handle is created by the process, a callback entry
    is added to the process callback notification list for the
    process. The problem is that this entry is not being removed
    when the handle is deleted. When your Perl process is
    terminated, the ProcessMonitorThread() method in
    stafif/STAFProcess.cpp iterates thorough the process
    notification callback list, and for each callback, STAF
    asynchronously submits a callback to run, and each
    callback is run on a separate thread. So, with one callback
    per handle created by your Perl program, this results in a
    flood of callback requests being submitted all at once and
    results in many new threads being created to run the callback
    requests, which in turn increases the memory used. Since
    threads are never removed from the STAF thread pool (though
    they are re-used), STAFProc eventually runs out of memory.
    To fix this problem, I added a 1 milli-second sleep in the
    loop after submitting each callback to prevent a flood of
    callbacks from being submitted all at once. I verified that
    this resolved the problem that you are seeing (as this prevents
    a whole bunch of threads being created when you Ctl-C your Perl
    program). Note that this problem only occurs on Unix machines,
    not Windows, because the implementation of the
    ProcessMonitorThread() method in unix/STAFProcess.cpp is
    different from its implementation in win32/STAFProcess.cpp.

    Another fix that we should make is to remove a callback entry
    from the process callback notification list when a handle is
    removed. This would prevent the process callback notification
    list from growing so large (as long as the handles are removed
    like you are already doing) which increases the memory used by
    STAFProc. However, this memory increase is not very large
    (e.g. it's more like a few bytes for each handle, instead of
    4M for each new thread created). I'll open a separate bug to
    fix this in the future it's going to take longer to code and
    test to make sure handle garbage collection is still working
    properly on all platforms. The change will involve adding a
    new STAFProcessUnRegisterEndCallback() method to
    STAFProcess.cpp and have STAFHandleManager's removeHandle()
    method call it when unregistering a handle to remove an entry
    from the process callback notification list.

    Here's a cvs diff of the change that I made to fix this problem:

    Index: stafif/unix/STAFProcess.cpp

    RCS file: /cvsroot/staf/src/staf/stafif/unix/STAFProcess.cpp,v
    retrieving revision 1.30
    diff -r1.30 STAFProcess.cpp
    356a357,363
    >
    > // Wait for 1 millisecond to prevent a flood of callback
    > // requests being submitted all at once, possibly
    > // resulting in new threads being created to run the
    > // callback requests
    >
    > STAFThreadSleepCurrentThread(1, 0);

    Let me know if you need me to build and send you a private
    fix for STAF on Linux (and to what mail id to send the fix).

    Note that if you changed the design of your Perl script to
    create a single handle (as I previously described), you
    wouldn't have this problem. :-)

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-07
    • labels: --> Service::Process
    • assigned_to: nobody --> slucas
     
  • Sharon Lucas

    Sharon Lucas - 2008-04-07
    • milestone: --> Unix
     
  • prziborowski

    prziborowski - 2008-04-08

    Logged In: YES
    user_id=1535417
    Originator: YES

    I am trying your change on our host.
    I am changing our script to register less handles but it will take a large design change to reduce the handles to 1 (it will be 1 per thread, which is much less than now).
    If the memory usage still goes up, should I increase the sleep by a few more milliseconds?

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-08

    Logged In: YES
    user_id=285070
    Originator: NO

    No. Increasing the sleep won't matter. The sleep is there so that
    the thread processing the process end callbacks gives up its
    timeslice so that other threads can run.

    I believe you're going to be very pleased with the results using
    this fix. And, with the changes you're making to your Perl script,
    to reduce the number of handles being created, you'll be even happier.
    Let me know the results.

    Thanks for providing a recreation script. That really helps a lot
    to debug the problem.

    Note that I opened Bug #1937126 "RemoveHandle should call
    ProcessUnregisterEndCallback method" (as I previously said I would).

     
  • Sharon Lucas

    Sharon Lucas - 2008-04-10
    • status: open --> closed-fixed
     

Log in to post a comment.