Too many purples
Running 0.99c3, many of the tcp checks seem to
transition from their true state to a purple state (15
minutes later), followed immediately by a transition back
to the original state.
One reason for this might be that the cumulative timeouts
for services that are not responding exceed the interval
at which uxmon-net reports to bbd.
However, bigsister did not behave like this in previous
versions. My monitored network consists of a large
number of hosts.
Logged In: YES
user_id=77961
May I ask which Big Sister version (which did not show this
effect) you ran before?
Logged In: NO
I was running v97. (I'm not sure of the subversion.)
Admittedly I haven't tested multiple hosts being down on v97.
But even with just one host down on V98c3 the status
changes from red to purple after 15 minutes and then
immediately back to red. With "down" set to red and "up" set
to yellow this results in a flapping condition, which is most
annoying to the on-call person. I have tried setting the timeout
to a low value but the problem still persists.
Logged In: NO
This might be O/S (Solaris 9) or Perl (Solaris 5.6.1)
dependent.
The sequence of events is:
alarm ('value');
connect (, ...)
The alarm does not appear to trap.
Instead connect exits with ETIMEDOUT after about 5 minutes.
Perhaps under Solaris 9 alarm traps are turned off in connect.
Logged In: YES
user_id=77961
Oh no, not some strange Solaris behaviour again.
But hey, this time I have got the same Solaris and Perl
versions as you, so I'll see if I can reproduce this.
Logged In: NO
I added a DESCR statement and this causes the new tcp
check code to be activated, which sets the socket
non-blocking and allows the alarms to work. But I still don't
know why old style tcp checks worked on older
Solaris/Bigsister combinations. Maybe the built-in connect
timeout is shorter or the socket is non-blocking by default.
But what a way to force use of the DESCR statement. I'd like
some defaults for this as it doesn't really add any value to tcp
checks. Anyway, use of the new style TCP has stopped the
alternating purples, but there was an incident when the core
network went down and everything went purple -
CONFIGURATION ERROR. They should have stayed red.
Logged In: YES
user_id=77961
Just another question: is this reproducible for you with a
few test hosts? I have just run uxmon against a few hosts,
one of them switched off, with no problem.
Since you are seeing alarm(): are you using the old tcp
monitor, i.e. do you not have DESCRs for each of the target
hosts? BTW: I assume you are using truss.
Actually the tcp test should connect() in non-blocking mode
and then wait for the connection in a select() rather than
relying on alarm(). Anyway, you would see an alarm() being
set for compatibility with systems that don't want to
connect() non-blocking.
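
To make the non-blocking connect() plus select() approach described above concrete, here is a minimal sketch. It is written in Python purely for illustration and is not Big Sister's actual Perl code. The function name tcp_check and its parameters are hypothetical, and it assumes POSIX-style socket behaviour (a failed connect shows up as a writable socket with SO_ERROR set).

```python
import errno
import select
import socket


def tcp_check(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) completes
    within `timeout` seconds, False otherwise.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Non-blocking mode: connect returns immediately instead of
        # waiting for the kernel's own (often multi-minute) timeout.
        sock.setblocking(False)
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return True  # connected right away
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False  # immediate failure, e.g. connection refused

        # Wait until the socket becomes writable or the timeout expires.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return False  # our own timeout, no alarm() involved

        # Writable does not mean success: check the pending socket error.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```

With this pattern the check's duration is bounded by the timeout passed to select(), so it does not depend on whether alarm() can interrupt a blocking connect() on a given platform.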
Maybe there's some other issue ... can you have a look at
netstat -a in order to see if uxmon leaves behind a high
number of sockets in TIME_WAIT (or something worse :-)).
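
For the netstat check suggested above, a small script can do the counting. This is only a sketch: the function name count_state is hypothetical, it assumes the TCP state is the last column of each connection line in saved `netstat -a` output, and the sample text in the usage note is made up for illustration, not real output.

```python
def count_state(netstat_output, state="TIME_WAIT"):
    """Count lines of netstat output whose last column equals `state`.

    Hypothetical helper; assumes the state is the final
    whitespace-separated field on each connection line.
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == state:
            count += 1
    return count
```

Usage, with invented sample lines:

```python
sample = """\
host.32801  10.0.0.5.25  49640  0  49640  0  TIME_WAIT
host.32802  10.0.0.6.80  49640  0  49640  0  ESTABLISHED
host.32803  10.0.0.7.25  49640  0  49640  0  TIME_WAIT
"""
print(count_state(sample))
```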