Menu

STAF queue service drops messages

san
2012-08-08
2014-01-03
  • san

    san - 2012-08-08

    I have a perl program (which uses STAF perl API's like STAF::STAFHandle::submit) on CentOS 5.2 (launchhost) which starts iozone workload on 2 different rhel5 testhosts. Below is the command used for this:

    1) staf <testhost1> PROCESS START SHELL SAMECONSOLE RETURNSTDOUT RETURNSTDER
    R NOTIFY ONEND WORKDIR /  COMMAND ///tmp/l5RQdf_7fO/iozone  "-a -g 512M -i0"  -f /tmp/iozonetest_19501_1344316209

    2) staf <testhost2> PROCESS START SHELL SAMECONSOLE RETURNSTDOUT RETURNSTDER
    R NOTIFY ONEND WORKDIR /  COMMAND ///tmp/2fL0HPh_dt/iozone  "-a -g 512M -i0"  -f /tmp/iozonetest_19501_1344316209

    After starting the workloads the perl program waits for the handles in 1) & 2) for completion using:

    3) staf local queue get wait 1000

    So the perl program logic is as follows:

    1) Run for 16 iterations
    2) Start 2 iozone workloads from the launchhost using STAF process "NOTIFY ONEND" on 2 different testhosts.
    3) wait for workloads completion using "staf local queue get wait 1000"

    The problem is I am not getting completed handles (using staf local queue command) for both iozone workloads in all the iterations. Say for eg: iteration 1 -> iteration 2 -> iteration3 -> iteration 4 -> iteration5 -> iteration 6 gets both the handles. Iteration 7 gets only one handle.

    And this happens randomly i.e sometimes I get only one handle in iteration 7, sometimes in iteration 13 etc.

    Any ideas whats happening here.

    Some details about my setup:
    1) I have "SET MAXQUEUESIZE 200" on the launchhost.
    2) launchost has STAF version 3.4.1
    3) The 2 testshosts have 3.3.5

     
  • Sharon Lucas

    Sharon Lucas - 2012-08-08

    I doubt that the STAF QUEUE service is "dropping" any messages on its queue (if its maximum queue size has not been exceeded).  The more likely issue is that an error occurred when the PROCESS service on the testhost tried to send the process completion message to the launchhost that started the process,

    First, you should upgrade STAF to the latest version (3.4.10), especially on the test hosts where the processes are being run as they are using an older version of STAF (3.3.5) and there have been improvements in the PROCESS service when sending a process completion notification message (including retries and better handling of errors).

    Then you should enable some STAF traciing on the test hosts (where the the processes are being run).  I recommend enabling the Error, Warning, ServiceError, and RemoteRequests tracepoints and enable tracing of the QUEUE service.  This way, if any errors occur when the PROCESS service on the testhost submits a QUEUE request to the QUEUE service on the lauchhost to send the STAF/Process/End message, these errors will be logged in STAFProc's trace output which you can redirect to a file.  You can enable this STAF tracing by submitting the following TRACE service requests on each of your testhosts:

    STAF <testhost> TRACE ENABLE TRACEPOINT "Error Warning ServiceError RemoteRequests"
    STAF <testhost> TRACE DISABLE ALL SERVICES
    STAF <testhost> TRACE ENABLE SERVICE QUEUE
    STAF <testhost> TRACE SET DESTINATION TO FILE /tmp/stafproc.trc

    This will cause messages to be logged to file /tmp/stafproc.trc on the testhost when an error occurs when its PROCESS service submits a request to the QUEUE service on the launchhost to send a STAF/Process/End notification message.

    So, run your scenario that recreates the issue and then check the STAFProc trace output files (e.g. /tmp/stafproc.trc) on your testhosts to see if any error messages were logged.

     
  • san

    san - 2012-08-09

    Thanks Sharon.

    I upgraded both the testhosts STAF versions to 3.4.10 & enabled the above mentioned TRACE's in STAF.cfg on both testhosts. I ran the perl program & the program executed fine i.e both the handles were returned in all the iterations. I checked the stafproc.trc & below is the snip from the testhosts:

    <snip>
    20120809-01:46:50;3016670096;00000100;STAFProcess::processMonitorThread: Error opening /dev/tty, errno: 6
    20120809-01:54:31;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 460 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-01:54:34;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 288 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-01:54:35;3020868496;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 583 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: ssl://launchhostipaddr
    20120809-01:54:35;3020868496;00000400;STAFConnectionManager::makeConnection - Attempt #2 of 2 (Delay 0 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: ssl://launchhostipaddr
    20120809-02:01:50;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 808 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:01:52;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 904 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:09:23;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 227 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:09:26;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 869 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:16:43;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 209 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:16:45;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 210 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:23:47;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 919 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:23:50;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 989 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:30:51;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 892 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:37:56;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 639 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:37:58;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 530 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:59:02;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 347 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-02:59:04;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 73 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:06:03;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 102 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:06:05;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 458 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:13:09;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 978 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:20:05;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 79 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:20:07;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 786 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:27:12;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 63 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:27:15;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 981 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:34:19;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 990 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:34:22;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 989 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:41:23;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 579 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    20120809-03:41:25;3025066896;00000400;STAFConnectionManager::makeConnection - Attempt #1 of 2 (Delay 463 milliseconds), RC: 16, Result: STAFConnectionProviderConnect: Error performing test read on connected endpoint: recv() RC=111: 22, Endpoint: tcp://launchhost@6500
    </snip>

    both testhosts stafproc.trc have pretty much same entries.

    I had a look at DNS entry for the launchhost. Below is the snip:

    <snip>
    nslookup launchhost
    Server: dns1
    Address: server1

    Name: launchhost
    Address: x.y.a.b
    Name: launchhost
    Address: x.y.c.d
    </snip>

    There are 2 DNS entries for my launchhost!. So its evident why I am seeing the error messages in stafproc.trc.

    The problem was solved by upgrading to STAF 3.4.10.

     
  • Sharon Lucas

    Sharon Lucas - 2012-08-09

    I'm glad it's working now (though you should get the DNS issues resolved for better performance).  One of the improvements made in the STAF PROCESS service was if sending the STAF/Process/End message failed using the hostname of the system that submitted the PROCESS START NOTIFY ONEND request, to retry sending the message using its IP address.

     
  • san

    san - 2012-08-10

    I have resolved the DNS issue as well. Thanks again.

     
  • Flo K

    Flo K - 2013-12-20

    We see a similar issue as San did.
    Our Staf job starts, runs and finishes successful on a remote machine but then doesn't send the final STAF/Process/End message. It works for several other jobs but until it does not which makes the launchhost wait forever.

    We have enabled trace logging and found the same DNS issue which we resolved. It made things respond faster but it didn't help with the problem that the last message doesn't get sent. We use about 5 remote clients which all work fine except of one which has the described issue.

    Any help is appreciated.

     
  • Sharon Lucas

    Sharon Lucas - 2014-01-03

    Did you enable STAF tracing on the remote client that is encountering this problem? If so, you should see an error logged on a QUEUE request from the QUEUE service when sending a STAF/Process/End message. What is the RC and error message? Also, verify that the location (e.g. host name / IP address) where it is trying to send the message is correct. This will be logged as well in the trace data.

     

Log in to post a comment.