
#205 Too many purples

open
nobody
5
2005-07-05
2005-07-05
No

Running 0.99c3, many of the tcp checks seem to
transition from their true state to purple about 15
minutes later, then immediately back to the original
state.

One reason for this might be that the cumulative effect
of timeouts to services that are not responding
exceeds the interval at which uxmon-net sends to bbd.

However, Big Sister did not behave like this in previous
versions. My monitored network consists of a large
number of hosts.
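
The hypothesis above can be put into numbers. The following is only a rough Python sketch, assuming the checks run one after another and every unresponsive service costs one full timeout; the function names and the figures in the usage note are hypothetical and are not taken from uxmon-net.

```python
def sweep_seconds(down_checks: int, timeout_s: float,
                  up_checks: int = 0, up_cost_s: float = 0.0) -> float:
    """Worst-case duration of one sequential pass over all checks.

    Assumption: checks are not run in parallel, so the stalls add up.
    """
    return down_checks * timeout_s + up_checks * up_cost_s


def goes_stale(sweep_s: float, expiry_s: float) -> bool:
    """True when one pass takes longer than a reported status stays valid."""
    return sweep_s > expiry_s
```

For instance, four stalled checks at 300 s each give `sweep_seconds(4, 300)` = 1200 s, which is longer than an assumed 900 s (15 minute) status lifetime, so the previous report would expire before the next one arrives.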

Discussion

  • Thomas Aeby

    Thomas Aeby - 2005-08-03

    Logged In: YES
    user_id=77961

    May I ask which Big Sister version (which did not show this
    effect) you ran before?

     
  • Nobody/Anonymous

    Logged In: NO

    I was running v97 (I'm not sure of the subversion).
    Admittedly I haven't tested multiple hosts being down on v97.
    But even with just one host down on V98c3 the status
    changes from red to purple after 15 minutes and then
    immediately back to red. With "down" set to red and "up" set
    to yellow this results in a flapping condition, which is most
    annoying to the on-call person. I have tried setting the timeout to
    a low value but the problem still persists.

     
  • Nobody/Anonymous

    Logged In: NO

    This might be O/S (Solaris 9) or Perl (Solaris 5.6.1)
    dependent.
    The sequence of events is:
    alarm ('value');
    connect (, ...)

    The alarm does not appear to interrupt the connect.
    Instead connect exits with ETIMEDOUT after about 5 minutes.

    Perhaps under Solaris 9 alarm traps are turned off in connect.
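
The alarm-then-blocking-call pattern described in this comment can be sketched as follows. This is Python rather than Big Sister's Perl, and `call_with_alarm` and `AlarmTimeout` are hypothetical names chosen for illustration. It only shows how the pattern is meant to work on a Unix-like system and says nothing about whether connect() on Solaris 9 is actually interruptible.

```python
import signal
import socket


class AlarmTimeout(Exception):
    """Raised by the SIGALRM handler when the guarded call runs too long."""


def call_with_alarm(seconds: float, func, *args):
    """Run func(*args); abort with AlarmTimeout after `seconds`.

    Unix only, and only usable from the main thread, because it relies
    on SIGALRM interrupting the blocking system call.
    """
    def on_alarm(signum, frame):
        raise AlarmTimeout()

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)    # arm the timer
    try:
        return func(*args)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # disarm the timer
        signal.signal(signal.SIGALRM, previous)      # restore old handler
```

A blocking check would then be wrapped as `call_with_alarm(5, sock.connect, (host, port))`. If the signal never interrupts the call, the call simply runs until the operating system's own connect timeout expires, which matches the behaviour reported here.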


     
  • Thomas Aeby

    Thomas Aeby - 2005-10-13

    Logged In: YES
    user_id=77961

    Oh no, not some strange Solaris behaviour again.

    But hey, this time I have got the same Solaris and Perl
    versions as you, so I'll see if I can reproduce this.

     
  • Nobody/Anonymous

    Logged In: NO

    I added a DESCR statement, which activates the new tcp
    check code; that code sets the socket non-blocking and
    allows the alarms to work. But I still don't know why
    old-style tcp checks worked on older Solaris/Big Sister
    combinations. Maybe the built-in connect timeout is
    shorter or the socket is non-blocking by default.
    But what a way to force use of the DESCR statement. I'd like
    some defaults for this, as it doesn't really add any value to tcp
    checks. Anyway, use of the new-style TCP check has stopped the
    alternating purples, but there was an incident when the core
    network went down and everything went purple:
    CONFIGURATION ERROR. They should have stayed red.

     
  • Thomas Aeby

    Thomas Aeby - 2005-10-13

    Logged In: YES
    user_id=77961

    Just another question: is this reproducible for you with a
    few test hosts? I have just run uxmon against a few hosts,
    one of them switched off, with no problem.

    Since you are seeing alarm(): are you using the old tcp
    monitor, i.e. do you not have DESCRs for each of the target
    hosts? BTW: I assume you are using truss.

    Actually the tcp test should connect() in non-blocking mode
    and then wait for the connection in a select() rather than
    relying on alarm(). Anyway, you would see an alarm() being
    set for compatibility with systems that don't want to
    connect() non-blocking.

    Maybe there's some other issue ... can you have a look at
    netstat -a in order to see if uxmon leaves behind a high
    number of sockets in TIME_WAIT (or something worse :-)).

     
