#76 Identify running SFCB processes in ps output

1.4.7
fixed
None
None
Function
2013-12-29
2013-09-23
No

"sfcbd is broken! Runs on 100% CPU! Consumes all memory! ..." you probably have heard these complains before. In 100% of these cases, its not sfcbd but a buggy providers causing the cimom run havoc.

This patch provides the actual provider name on the 'ps xf', controlled by the (boolean) 'argvDebug' parameter in sfcb.cfg.

Sample ps output without this patch (resp. with 'argvDebug: false'):
21168 ? Ssl 0:00 /usr/sbin/sfcbd
21172 ? S 0:00 _ /usr/sbin/sfcbd
21174 ? S 0:00 _ /usr/sbin/sfcbd
21175 ? S 0:00 _ /usr/sbin/sfcbd
21177 ? S 0:00 _ /usr/sbin/sfcbd
21179 ? Sl 0:00 _ /usr/sbin/sfcbd
21183 ? S 0:00 _ /usr/sbin/sfcbd
21186 ? Sl 0:00 _ /usr/sbin/sfcbd
21222 ? Sl 0:00 _ /usr/sbin/sfcbd
21306 ? Sl 0:00 _ /usr/sbin/sfcbd

Sample ps output with this patch and 'argvDebug: true':
21815 ? Ssl 0:00 sfcbd provider:ServerProvider -class:CIM_Namespace -location:sfcInteropServerProvider
21817 ? S 0:00 _ /usr/sbin/sfcbd
21821 ? S 0:00 _ /usr/sbin/sfcbd
21822 ? S 0:00 _ /usr/sbin/sfcbd
21824 ? S 0:00 _ sfcbd provider:ClassProvider -class:$ClassProvider$ -location:sfcClassProviderSf
21826 ? Sl 0:00 _ sfcbd provider:InteropProvider -class:$InterOpProvider$ -location:sfcInteropProvider
21830 ? S 0:00 _ sfcbd provider:ProfileProvider -class:$ProfileProvider$ -location:sfcProfileProvider
21833 ? Sl 0:00 _ sfcbd provider:InternalProvider -class:$DefaultProvider$ -location:sfcInternalProvider
21874 ? Sl 0:00 _ sfcbd provider:ServerProvider -class:SFCB_CIMXMLCommunicationMechanism -location:sfcInteropServerProvider

1 Attachments

Discussion

  • Dave Heller

    Dave Heller - 2013-10-03

    This is a really good idea, although not a new one. Chris pointed out to me: there is a patch in the ./extra directory, sfcb-proc-nametags.patch, which was never checked in, that does something similar. Your patch is an improvement in that it provides a bit more detail.

    The problem with the original patch is that it completely overwrites the process cmdline, and so may cause problems for folks doing ps-style lookups to identify sfcb processes, whatever the reason. Your patch adds "sfcbd" back to the cmdline, which mostly aleviates this problem. But I think it's best to preserve the original cmdline completely, with all original arguments, and append the new information to that. This is what my patch does.

    I made a few other changes I think are simplifications, and moved the function call from immediately after the return from getProcess() to within getProcess(), which I assume was your intent.

    If you like this, I will merge it with the older patch to add functionality for the other SFCB processes, Main, Httpd, etc. Maybe we should create another pseudo-parameter to indicate the overall process type, e.g. -proc:main, -proc:httpd, -proc:provider. What do you think?

     
  • Dave Heller

    Dave Heller - 2013-10-03

    Sample output:

    $ sfcbproc --raw
    PID PPID PGID STAT CMD
    12980 4215 12980 S /usr/local/sbin/sfcbd -k -t 8
    12981 12980 12980 S /usr/local/sbin/sfcbd -k -t 8
    12983 12980 12980 S /usr/local/sbin/sfcbd -k -t 8
    12984 12980 12980 S /usr/local/sbin/sfcbd -k -t 8 -provider:ClassProvider -class:$ClassProvider$ -location:sfcClassProviderSf
    12986 12980 12980 S /usr/local/sbin/sfcbd -k -t 8 -provider:InteropProvider -class:$InterOpProvider$ -location:sfcInteropProvider
    12989 12980 12980 S /usr/local/sbin/sfcbd -k -t 8 -provider:InternalProvider -class:$DefaultProvider$ -location:sfcInternalProvider
    12993 12980 12980 S /usr/local/sbin/sfcbd -k -t 8 -provider:ServerProvider -class:CIM_IndicationService -location:sfcInteropServerProvider
    13006 12980 12980 S /usr/local/sbin/sfcbd -k -t 8 -provider:IndCIMXMLHandler -class:CIM_ListenerDestinationCIMXML -location:sfcIndCIMXMLHandler

    $ sfcbproc
    PID PPID PGID STAT ROLE
    12980 4215 12980 S SFCB Main
    12981 12980 12980 S SFCB Logger
    12983 12980 12980 S Http Daemon
    12984 12980 12980 S Class Provider
    12986 12980 12980 S Interop Provider
    12989 12980 12980 S SFCB Internal Provider
    12993 12980 12980 S SFCB Interop Server Provider
    13006 12980 12980 S IndCIMXMLHandler

     
  • Dave Heller

    Dave Heller - 2013-10-04

    This patch needs a bit of cleanup but basically does everything I want.

    $ sfcbproc --raw
    PID PPID PGID STAT CMD
    5249 4215 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:Main
    5250 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:Logger
    5252 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:HttpDaemon -ip:[::]
    5253 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:ClassProvider -class:$ClassProvider$ -location:sfcClassProviderSf
    5255 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:InteropProvider -class:$InterOpProvider$ -location:sfcInteropProvider
    5258 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:InternalProvider -class:$DefaultProvider$ -location:sfcInternalProvider
    5262 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:ServerProvider -class:CIM_IndicationService -location:sfcInteropServerProvider
    5275 5249 5249 S /usr/local/sbin/sfcbd -k -t 8 -p:IndCIMXMLHandler -class:CIM_ListenerDestinationCIMXML -location:sfcIndCIMXMLHandler

     
  • Dave Heller

    Dave Heller - 2013-10-07

    This turned out to be a bit tricky, because the modification of a process' argv will cascade down to any child process. This may or may not be a problem depending on how the child uses argv. Http Daemon, for example, uses argv to print its banner message on startup. (As does main itself.)

    This can be worked around to a degree by waiting to modify argv until after the child is spawned. In the case of provider processes spawned by Main, and new Http Daemons spawned by Main (i.e. on Httpd restart), this is not possible. We want to label Main, but Main may spawn new children in the future.

    It turns out for provider processes, prior modification of argv is not problem. For Httpd it is, due to the banner issue. The same problem occurs on a full SFCB restart, since Main prints a banner based on argv as well. So it's really impossible to get around the requirement to for Main to restore argv to its original state prior to these operations.

    This patch includes a restoreOrigArgv() function for this purpose. It also takes a slightly different approach than my previous patch. Rather than storing the original cmdline in a fixed (or malloced) char[], it implements the append2Argv() function which modifies argv in-place. restoreOrigArgv() is essentially the inverse function, which works on the same principal.

    The code for append2Argv() and restoreOrigArgv() is a little tricky so I added more than the usual share of comments.

     
    • Chris Buccella

      Chris Buccella - 2013-10-07

      I think for safety's sake, append2Argv() should probably check the ultimate string length before appending. The max size is defined as _POSIX_ARG_MAX

       
  • Dave Heller

    Dave Heller - 2013-10-07

    The trouble with that is, there really is less space available, since that limit includes env data, too:

    / Maximum length of arguments to `execve', including environment. /

    define _POSIX_ARG_MAX 4096

    As far as I can tell there is no determinate way to find the argv length limit. I think it's one of those system parameters intended to be "big enough", although we know that can be violated, especially on embedded platforms.

    There is a similar comment under BUGS in "man sysconf(3)"

    When I ran a quick test I was able to pad my argv to a length of about 2.7k bytes before it caused a crash. That on a conventional x86_64 system. I assumme the delta is due to the enviroment data, I didn't investigate further.

    Maybe it's better to take something like _POSIX_ARG_MAX / 2 as the max? Since it's arbitrary, AFAICT.

     
    • Chris Buccella

      Chris Buccella - 2013-10-07

      Checking against some kind of limit would be good I think, even if it is just an internal one that sounds good. Aside from memory allocation, we also want to ensure that the line is short enough to be useful. More information in the ps output is good, but not if it's ugly (-:

       
  • Dave Heller

    Dave Heller - 2013-10-11

    Following Chris' suggestion I added a limit on the total number of chars appended, by changing "labelProcs" from a boolean to an integer representing the desired pad length. If the append data is longer than the value of labelProcs, it will be truncated.

    However, I see now there is a problem with this general approach: By appending to argv we are overwriting the environment data that immediately follows argv. This data is passed by the OS to the executable in the same manner as argv, and is actually accessible as envp. In practice, argv is one contiguous character array consisting of executable + cmdline args + exported enviroment data.

    We could avoid overwriting the env data by capturing all of envp, appending our ID info to argv, then appending back the captured env data: essentially shifting the entire envp to the right. That's not difficult, but I'm afraid by doing that we will really run the risk of overrunning the argv buffer. To be safe, I think the OS really needs to allocate the argv space we require.

    I can see three ways around this:

    1. Do some sort of trick with execv() where SFCB restarts immediately upon startup. Pad argv to the required length beforehand, with some known char string. Then on restart, look for that specific string, remove it and replace it with our ID data. The OS will have allocated the required for us through the call ot execv().

    2. Rather than a .cfg file option, have the user add a cmdline argument of the required padlength to the cmdline, in a manner similar to the original sfcb-proc-nametags.patch, but compatible with all the existing cmdline arguments. So it might be like --label xxxxxxxxxx. We remove this argument from argv and replace it with our ID data. (May require some shuffling of argv if the user-provided option is not last.) The space is allocated because the user provided the extra cmdline argument.

    3. Similar to 2. but have the user provide a an env variable rather than cmdline argument. Like SFCB_PROC_LABEL=xxxxxxxxxx. We capture the entire environment array (by mallocing space for it, which can be freed as soon as we are done) and extract our environment string, leaving the required space. We append our ID info to argv, then append the adjusted env data. The space is allocated because the user provided the extra env variable.

     
  • Dave Heller

    Dave Heller - 2013-10-22

    I tried my suggestion 1. above and it seems to work. It actually turned out to be an elegant approach in that it makes the OS do some of the work for us.

    In essence: when the feature is enabled we re-exec() SFCB immediately upon startup. Before doing so we create a "pad" argument of length "labelProcs" and add it to the end of the argument list. This essentially shifts the environment data (i.e: "env") to the right, allocating the space we need. Now we can overwrite the pad argument without fear of stepping on the env data. All this is invisible to the user.

    As before, the only real complication is in handling the cases where argv needs to be restored to its original form before doing some operation. Specifically these are: a) restarting HTTPD (via SIGUSR2) and b) restarting all of SFCB (via SIGHUP). These cases are similar, with a slight variation: case a) requires converting the inter-arg spaces back to null delimiters; case b) additionally requires nulling out the pad argument.

    However, using the execv() approach I was able to make some simplifications. For example, the storeOrigArgs() function is no longer required.

    I think this does what we want without clobbering the environment space! I will run a few more tests before check-in. Feedback welcome.

     
  • Dave Heller

    Dave Heller - 2013-10-22

    NOTE there is a KNOWN BUG here that will cause the "logger not started" message to appear at the very beginning of SFCB startup:

    logger not started--- Using /usr/local/etc/sfcb/sfcb.cfg

    This is because I am calling getControlUNum() much earlier in the process than any other getControlBlah() function, specifically: before the call to startLogging(). A call to a getControlBlah() function will always trigger a call to setupControl(), to ensure the config file has been read into cache. setupControl() uses a few mlogf() statements, including the one above when reading the config file. Since logging has not yet been started, mlogf() spits out the "logger not started" message.

    There's really no easy way around this, as I am intentionally calling getControlUNum() very early in the process, in order to read "labelProcs". I want to do this before the call to startLogging(). (Calling startLogging() first presents a number of problems, not least of which is that the already-started logger process will stay running across the execv(), which is not what we want.) So if setupControl() attempts to use mlogf() this early, the error message will be seen.

    There are probably some workarounds that could be coded into mlog.c. But I don't think it's worth the trouble. The error message is strictly correct, given the current usage, and I think we can just treat it as warning here. Thoughts?

     
  • Klaus Kämpf

    Klaus Kämpf - 2013-10-23

    Its awesome to see the progress of the argvDebug feature, thank you so much!

    Regarding the logger warning, it actually should be suppressed if there is no workaround. People very often are afraid of warnings. So if there's nothing to be done against this warning, simply don't show it.

     
  • Dave Heller

    Dave Heller - 2013-10-24

    Agreed. In fact, it's worse than I first thought: You will always see the warning, even when the feature is not enabled.

    I think I have a workaround. I considered adding a flag to control the behavior but did not want to add any processing to mlogf(), a frequently called function. Fortunately SFCB already has a flag for just about everything! In this case: sfcbUseSyslog.

    I defer setting sfcbUseSyslog=1 until after the execv() step, just before the call to startLogging(). Any calls to mlogf() prior to that point become a no-op. The trouble is, this not only shuts off the warning, but now we see no "Using sfcb.cfg..." message at all. This is fixed by calling setupControl() and sunsetControl() immediately before and after the execv() step. Calling sunsetControl() ensures that setupControl() will be run again, in it's usual spot, and the message is restored to where it was before!

    Strictly speaking it is not necessary to add the leading setupControl() as I did, since getControlBlah() will call it if it has not run already. But I think it's clearer in the code like this.

     
  • Dave Heller

    Dave Heller - 2013-11-01
    • summary: New option: argvDebug - show provider name in 'ps' output --> Identify running SFCB processes in ps output
    • status: open --> pending
    • assigned_to: Dave Heller
    • Release: backlog --> 1.4.7
     
  • Dave Heller

    Dave Heller - 2013-11-04

    How could I forget to label the Request Handler processes! This patch takes care of that by adding a label like "-reqhandler:n" where n is the request handler number (out of the maximum specified by httpProcs in sfcb.cfg).

    I could have chosen to discard the Http Daemon label before appending the -reqhandler label, but decided to keep it since it helps to identify the http daemon that is the req handler's parent. By default, SFCB only runs one httpd. But with SFT 3597806 there is now the option to run multiple httpd for different addresses. So with multiple adapters running you would get something like this:

    $ ps -ef | grep sfcbd
    root 9858 26931 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:Main
    root 9872 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:Logger
    root 9874 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1
    root 9875 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1
    root 9876 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:ClassProvider -class:$ClassProvider$ -location:sfcClassP
    root 9878 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:InteropProvider -class:$InterOpProvider$ -location:sfcIn
    root 9881 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:InternalProvider -class:$DefaultProvider$ -location:sfcI
    root 9885 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:ServerProvider -class:CIM_IndicationService -location:sf
    root 9898 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:IndCIMXMLHandler -class:CIM_ListenerDestinationCIMXML -l
    root 9925 9858 0 21:26 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:OSBase_OperatingSystemProvider -class:Linux_OperatingSys
    root 10470 9874 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1 -reqhandler: 5
    root 10479 9875 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1 -reqhandler: 5
    root 10488 9874 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1 -reqhandler: 1
    root 10497 9875 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1 -reqhandler: 1
    root 10506 9874 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1 -reqhandler: 2
    root 10515 9875 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1 -reqhandler: 2
    root 10524 9874 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1 -reqhandler: 3
    root 10533 9875 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1 -reqhandler: 3
    root 10542 9874 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:127.0.0.1 -reqhandler: 4
    root 10551 9875 0 21:27 pts/0 00:00:00 ./sfcbd -4 127.0.0.1,10.0.0.1 -proc:HttpDaemon -ip:10.0.0.1 -reqhandler: 4

     
  • Dave Heller

    Dave Heller - 2013-11-04

    One side-effect of this feature, when viewing the processes with the 'ps' command, is that the cmdine is padded with trailing spaces when the label size does not consume the entire labelProcs limit. This can cause blank lines to appear in the ps output for no apparent reason; really it is the trailing spaces wrapping to the next line.

    There is no way around this AFAICT, since it is the way ps is displaying it. The true argv data is padded with trailing nulls, not spaces, as you can see by examining /proc/pid/cmdline directly. ps is converting the nulls to spaces.

    My "sfcbproc" utility did the same thing, when displaying the cmdline in --raw mode. I have patched sfcbproc to correct this.

    You can pipe the output through "cat -e" to see the difference.

     
  • Dave Heller

    Dave Heller - 2013-11-04

    New patch is commit [4b3fed] for v1.4

     

    Related

    Commit: [4b3fed]

  • Dave Heller

    Dave Heller - 2013-11-27

    Upon further review I have determined that deferring the start of logging (i.e, the setting of the "sfcbUseSyslog" flag) has some unintended consequences. It suppresses all messages from setupControl() at this early stage, so if there is some error such as an unrecognized config property, SFCB will silently fail to start. Similarly, we won't see the message if SFCB aborts because there is already an instance running. This is not good.

    This could be mitigated through additional patches to the Logger or Control... but I have decided not to pursue this path any further. Instead I think it's better to use an environment variable to set the value of labelProcs, rather than attempting to read it from the config file. This removes the need to call setupControl() early, which alleviates the headaches discussed in the last several threads. While I would have preferred to keep the config property as an option for enabling the feature, using an env var serves the same purpose and is simpler.

    This patch removes the "labelProcs" config property and adds the SFCB_LABELPROCS environment variable. It also restores the logging startup sequence to its previous arrangement, prior to commit [2fd9b0]. All other functionality remains the same. The integer value set in SFCB_LABELPROCS works in the same way as it did for "labelProcs". The new env var is documented in the sfcbd man page.

    Note that we are still using the execv() trick to allocate the cmdline space, so this is still option 1 as I described above. The only difference is, we're getting our labelProcs value through an env var rather than the config file.

    New patch is commit [365658] for v1.4

     

    Related

    Commit: [2fd9b0]
    Commit: [365658]

  • Klaus Kämpf

    Klaus Kämpf - 2013-11-29

    Using an environment setting to enable this feature is probably better than changing a configuration file. Thanks, Dave !

    I already noticed the silent failure to start on an unrecognized config property but didn't dig further.

     
  • Dave Heller

    Dave Heller - 2013-12-29
    • Status: pending --> fixed
     

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks