I have a Perl script that runs forever in a loop (while (1)).
Its purpose is to check the status of the test machines (whether they are up, what type of machine they are, which revision, etc.).
Sometimes my process needs to be stopped and restarted.
When stopping the process I notice that STAFProc gets a lot of New thread messages and the virtual memory size increases accordingly.
It might only happen on a stressed system where STAFProc starts getting lots of requests at once.
The main problem is that these threads never quit. They are always waiting for work and eventually cause STAFProc to go over 3G of virtual memory.
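For context, a stripped-down polling loop of this kind might look like the following sketch. It assumes the PLSTAF Perl bindings and a running STAFProc; the handle name, host list, and poll interval are illustrative placeholders, not taken from the actual script:

```perl
use strict;
use warnings;
use PLSTAF;   # STAF's Perl bindings

# Placeholder host list -- replace with real test machines
my @hosts = ('machine1.example.com', 'machine2.example.com');

my $handle = STAF::STAFHandle->new('hostAvail monitor');
die "Error registering with STAF, RC: $handle->{rc}"
    if $handle->{rc} != $STAF::kOk;

while (1) {
    for my $host (@hosts) {
        # Is the machine up?
        my $ping = $handle->submit($host, 'PING', 'PING');
        next if $ping->{rc} != $STAF::kOk;

        # Which OS is it running?
        my $var = $handle->submit($host, 'VAR',
                                  'RESOLVE STRING {STAF/Config/OS/Name}');
        print "$host is up, OS: $var->{result}\n";
    }
    sleep 60;   # poll interval
}
```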
Logged In: YES
user_id=285070
Originator: NO
We need more information.
What version of STAF are you running? Note that if it's not the current version of STAF (V3.2.5), we're going to ask you to upgrade to it to see if this problem still occurs. :-)
What operating system is this happening on?
Can you provide the output of the following command when run on this machine?
STAF local VAR LIST
Can you provide more information on what this Perl script is doing (in terms of what STAF service requests it submits)?
Logged In: YES
user_id=1535417
Originator: YES
Of course 3.2.5 :) I have observed the issue since at least 3.1.5, but I think it has been around for all 3.x versions.
I have observed it on various flavors of Linux (RHEL 3 and 4, Debian, Gentoo). I haven't run these scripts on Windows, so I don't know if it happens there.
This script uses the PING, FS, PROCESS, and VAR services: it gets files, copies files, runs processes (with wait), and uses VAR to get STAF/Config/OS/Name.
I have an additional question though (which may work around this issue for now). These workerThreads are waiting for an eventsem (timeout=4294967295). Is there a setting to change the timeout so that if it times out, the workerThread goes away? In the end I have ~1500 of these threads, and I know I've never had this many concurrent processes.
# staf local var list system
Response
--------
STAF/Config/BootDrive : /
STAF/Config/CodePage : UTF-8
STAF/Config/ConfigFile : /usr/local/staf/bin/STAF.cfg
STAF/Config/DefaultAuthenticator: none
STAF/Config/DefaultInterface : tcp
STAF/Config/InstanceName : STAF
STAF/Config/Machine : machine1.example.com
STAF/Config/MachineNickname : machine1.example.com
STAF/Config/Mem/Physical/Bytes : 3989016576
STAF/Config/Mem/Physical/KB : 3895524
STAF/Config/Mem/Physical/MB : 3804
STAF/Config/OS/MajorVersion : 2.6.9-34.ELsmp
STAF/Config/OS/MinorVersion : #1 SMP Fri Feb 24 16:54:53 EST 2006
STAF/Config/OS/Name : Linux
STAF/Config/OS/Revision : i686
STAF/Config/Sep/Command : ;
STAF/Config/Sep/File : /
STAF/Config/Sep/Line :
STAF/Config/Sep/Path : :
STAF/Config/STAFRoot : /usr/local/staf
STAF/Config/StartupTime : 20080331-16:24:18
STAF/DataDir : /usr/local/staf/data/STAF
STAF/Env/_ : ./STAFProc
STAF/Env/CLASSPATH : /usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar:/usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar:/usr/local/staf/lib/JSTAF.jar:/usr/local/staf/samples/demo/STAFDemo.jar
STAF/Env/G_BROKEN_FILENAMES : 1
STAF/Env/HISTSIZE : 1000
STAF/Env/HOME : /root
STAF/Env/HOSTNAME : machine1.example.com
STAF/Env/INPUTRC : /etc/inputrc
STAF/Env/KDEDIR : /usr
STAF/Env/LANG : en_US.UTF-8
STAF/Env/LD_LIBRARY_PATH : /usr/local/staf/lib:/usr/local/staf/lib
STAF/Env/LESSOPEN : |/usr/bin/lesspipe.sh %s
STAF/Env/LOGNAME : root
STAF/Env/LS_COLORS : no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mng=01;35:*.xcf=01;35:*.pcx=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.avi=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.mov=01;35:*.qt=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.mp3=00;36:*.wav=00;36:*.mid=00;36:*.midi=00;36:*.au=00;36:*.ogg=00;36:*.flac=00;36:*.aac=00;36:
STAF/Env/MAIL : /var/spool/mail/root
STAF/Env/OLDPWD : /root/staf/src/staf
STAF/Env/PATH : /usr/local/staf/bin:/usr/local/staf/bin:/usr/local/staf/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
STAF/Env/PWD : /usr/local/staf/bin
STAF/Env/SHELL : /bin/bash
STAF/Env/SHLVL : 1
STAF/Env/SSH_ASKPASS : /usr/libexec/openssh/gnome-ssh-askpass
STAF/Env/SSH_CLIENT : ::ffff:10.20.122.162 1911 22
STAF/Env/SSH_CONNECTION : ::ffff:10.20.122.162 1911 ::ffff:10.17.22.213 22
STAF/Env/SSH_TTY : /dev/pts/3
STAF/Env/STAFCONVDIR : /usr/local/staf/codepage
STAF/Env/TCLLIBPATH : /usr/local/staf/bin:/usr/local/staf/bin
STAF/Env/TERM : xterm
STAF/Env/USER : root
STAF/Version : 3.2.5
testdb/property/file : /exit14/home/qa/staf/services/Java/esx310/vmqaDB.properties
testdb/tmpdir : /tmp/staf_service
vim/keyStorePath : /exit14/home/qa/staf/services/Java/esx310/client.keystore
STAF script used for reproducing thread creation
Logged In: YES
user_id=1535417
Originator: YES
I am attaching a perl script (stripped down version).
The hostnames will need to be changed to valid hosts.
I'd let the script run for at least a few minutes before ^C-ing it.
It is faster to reproduce from a machine that has just started STAFProc, since few worker threads exist yet.
File Added: staf_bug1931665_hostAvail.pl
Logged In: YES
user_id=1535417
Originator: YES
I just want to add that I believe this is the reason we are hitting bug 1688297 frequently.
Even using a 2M thread stack size we still hit it about once a week.
[ 1688297 ] OutOfMemory issues on Linux creating a thread and elsewhere
Logged In: YES
user_id=285070
Originator: NO
I haven't had a chance yet to try to recreate the problem using the script you provided.
A possible workaround (a way to delay this problem from occurring for a longer
period of time) is to set the STAF thread stack size environment variable to 1M (and
restart STAFProc) instead of using the current default maximum thread stack size of 2M for Linux.
Logged In: YES
user_id=285070
Originator: NO
I ran your Perl program, pressed Ctrl-C, and saw that a bunch of new threads got created. I'll try to debug it more next week to see what's happening.
Updated Perl script that only creates one STAF handle
Logged In: YES
user_id=285070
Originator: NO
Your Perl program is creating lots of new STAF handles instead of creating one
STAF handle and using it to submit all STAF service requests. This is not how
we intended/envisioned STAF handles to be used. Generally a process only needs
to create one STAF handle and use it to submit STAF service requests. For
example, I modified your program to create only one STAF handle and use it
to submit all STAF service requests. When I run it, lots of threads are not
generated when I Ctrl-C the program. I attached this Perl program as
staf_bug1931665_hostAvailX.pl. Do you really have a need to create more than
one STAF handle in your program? If so, please explain why. Otherwise, if
you change it to create only one STAF handle, you won't see this issue.
Now, in the situation where many STAF handles are created, like what your
original Perl script does, I'll investigate further to see if there is a bug
in STAF regarding this.
File Added: staf_bug1931665_hostAvailX.pl
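The single-handle pattern described above might be sketched like this (assuming the PLSTAF Perl bindings and a running STAFProc; the handle name and the use of @ARGV for the host list are illustrative, not taken from the attached script):

```perl
use strict;
use warnings;
use PLSTAF;

# One handle for the whole process, created once at startup.
my $handle = STAF::STAFHandle->new('hostAvail');
die "STAF registration failed, RC: $handle->{rc}"
    if $handle->{rc} != $STAF::kOk;

# Reuse the same handle for every request...
for my $host (@ARGV) {
    my $result = $handle->submit($host, 'PING', 'PING');
    printf "%s: %s\n", $host,
           $result->{rc} == $STAF::kOk ? 'up' : "down (RC $result->{rc})";
}

# ...and unregister it exactly once, on exit.
$handle->unRegister();
```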
Logged In: YES
user_id=1535417
Originator: YES
For hostAvail, it appears that we can use one STAF object and it should be fine.
Our other script is multi-threaded, and it appears that, the way queues work, there is no way many threads can share one STAF object without getting the wrong data.
Even a single-threaded process seems difficult in this manner (if I send 2 requests with NOTIFY ONEND, then when I get a response I'd have to parse it to find out which request returned).
So it seems that each thread would need its own handle if using any async operations (or risk getting another thread's messages).
Logged In: YES
user_id=285070
Originator: NO
Yes, you can have a multi-threaded program and use a single
handle to submit STAF service requests via separate threads.
You want to have a separate thread that handles all of the
STAF/RequestComplete messages on the handle's queue.
For example, we provide several STAF services (such as STAX)
that do exactly that. The service submits STAF service requests
asynchronously (on separate threads) using a single STAF handle
that was created for the service. When you submit a request
asynchronously, you get the request number returned in the
result. If desired, the request number can be stored in a map
(along with any other information needed). In a separate thread,
the service monitors the queue by submitting a QUEUE GET WAIT
request in a loop. Each time it gets a message from the queue,
it checks its type (e.g. STAF/RequestComplete, STAF/Process/End).
If it's a STAF/RequestComplete message, it gets the requestNumber
provided in the message and matches it against a requestNumber in
the map the service maintains, to find out which STAF service
request completed. It then notifies the thread that submitted the
request that it completed, providing it with the rc and result
for the service request.
Note that the format of a message of type "STAF/RequestComplete" is:
Key Name        Type
-------------   --------
requestNumber   <String>
rc              <String>
result          <String>
Note, you don't "parse" messages, you unmarshall them.
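The dispatcher pattern described above might be sketched roughly as follows. This assumes the PLSTAF Perl bindings (submit2 with the kReqQueue sync option, and STAF::STAFUnmarshall), a running STAFProc, and placeholder host names; treat the exact accessor names as approximations of the PLSTAF API rather than verified code:

```perl
use strict;
use warnings;
use PLSTAF;

my $handle = STAF::STAFHandle->new('async demo');
die "STAF registration failed, RC: $handle->{rc}"
    if $handle->{rc} != $STAF::kOk;

# Submit two requests asynchronously. The result of an async submit is
# the request number; completion is posted to this handle's queue.
my %pending;
for my $host ('machine1.example.com', 'machine2.example.com') {
    my $res = $handle->submit2($STAF::STAFHandle::kReqQueue,
                               $host, 'PING', 'PING');
    $pending{$res->{result}} = $host if $res->{rc} == $STAF::kOk;
}

# Dispatcher loop: wait for STAF/RequestComplete messages and match
# them back to the submitted requests by request number.
while (%pending) {
    my $res = $handle->submit('local', 'QUEUE', 'GET WAIT');
    my $msg = STAF::STAFUnmarshall($res->{result})->getRootObject();
    next unless $msg->{type} eq 'STAF/RequestComplete';

    my $reqNum = $msg->{message}{requestNumber};
    my $host   = delete $pending{$reqNum} or next;
    print "$host: rc=$msg->{message}{rc} result=$msg->{message}{result}\n";
}

$handle->unRegister();
```

In a real multi-threaded script, the dispatcher loop would run on its own thread and signal the submitting threads when their request numbers complete.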
Logged In: YES
user_id=285070
Originator: NO
Let me explain a little more about why this problem is
occurring. I'll start by discussing STAF's Thread Manager.
STAF maintains a pool of reusable threads. If a new thread is
required by something and all threads are currently in use,
then a new thread is created. Threads are never deleted/
removed from STAF's thread pool by design.
You're creating lots of handles from a single process (a Perl
program) and deleting each handle when you're done with it.
Each time a handle is created by the process, a callback entry
is added to the process callback notification list for the
process. The problem is that this entry is not being removed
when the handle is deleted. When your Perl process is
terminated, the ProcessMonitorThread() method in
stafif/STAFProcess.cpp iterates through the process
notification callback list, and for each callback, STAF
asynchronously submits a callback to run, and each
callback is run on a separate thread. So, with one callback
per handle created by your Perl program, this results in a
flood of callback requests being submitted all at once and
results in many new threads being created to run the callback
requests, which in turn increases the memory used. Since
threads are never removed from the STAF thread pool (though
they are re-used), STAFProc eventually runs out of memory.
To fix this problem, I added a 1-millisecond sleep in the
loop after submitting each callback to prevent a flood of
callbacks from being submitted all at once. I verified that
this resolved the problem that you are seeing (as this prevents
a whole bunch of threads being created when you Ctl-C your Perl
program). Note that this problem only occurs on Unix machines,
not Windows, because the implementation of the
ProcessMonitorThread() method in unix/STAFProcess.cpp is
different from its implementation in win32/STAFProcess.cpp.
Another fix that we should make is to remove a callback entry
from the process callback notification list when a handle is
removed. This would prevent the process callback notification
list from growing so large (as long as the handles are removed
like you are already doing) which increases the memory used by
STAFProc. However, this memory increase is not very large
(e.g. it's more like a few bytes for each handle, instead of
4M for each new thread created). I'll open a separate bug to
fix this in the future, since it's going to take longer to code and
test to make sure handle garbage collection is still working
properly on all platforms. The change will involve adding a
new STAFProcessUnRegisterEndCallback() method to
STAFProcess.cpp and have STAFHandleManager's removeHandle()
method call it when unregistering a handle to remove an entry
from the process callback notification list.
Here's a cvs diff of the change that I made to fix this problem:
Index: stafif/unix/STAFProcess.cpp
RCS file: /cvsroot/staf/src/staf/stafif/unix/STAFProcess.cpp,v
retrieving revision 1.30
diff -r1.30 STAFProcess.cpp
356a357,363
>
> // Wait for 1 millisecond to prevent a flood of callback
> // requests being submitted all at once, possibly
> // resulting in new threads being created to run the
> // callback requests
>
> STAFThreadSleepCurrentThread(1, 0);
Let me know if you need me to build and send you a private
fix for STAF on Linux (and to what mail id to send the fix).
Note that if you changed the design of your Perl script to
create a single handle (as I previously described), you
wouldn't have this problem. :-)
Logged In: YES
user_id=1535417
Originator: YES
I am trying your change on our host.
I am changing our script to register fewer handles, but it will take a large design change to reduce the handles to 1 (it will be 1 per thread, which is far fewer than now).
If the memory usage still goes up, should I increase the sleep by a few more milliseconds?
Logged In: YES
user_id=285070
Originator: NO
No. Increasing the sleep won't matter. The sleep is there so that
the thread processing the process end callbacks gives up its
timeslice so that other threads can run.
I believe you're going to be very pleased with the results using
this fix. And, with the changes you're making to your Perl script,
to reduce the number of handles being created, you'll be even happier.
Let me know the results.
Thanks for providing a recreation script. That really helps a lot
to debug the problem.
Note that I opened Bug #1937126 "RemoveHandle should call
ProcessUnregisterEndCallback method" (as I previously said I would).