[Nodebrain-announce] RE: Project interest
From: Trettevik, Ed A <ed....@bo...> - 2003-03-10 06:27:21
Hi,
It seems the nod...@li... list was not set up properly---I'm still learning how to admin a project on SourceForge. I did not receive a copy of your note via mail, but stumbled onto it in the archive. I've created a new list, nod...@li..., that you may use in the future.
From: Benoit DOLEZ <bdolez@an...>
Subject: Project interest
Date: 2003-03-06 15:16
Hi,

Your project is very interesting. We are looking for something like that for our usage.

The document was not synchronized with the source. For example, in the listener declaration 'type' wasn't recognized; we might have to use 'protocol' instead. The protocol 'LOG' doesn't work.

Could someone send me a sample of reading and analysing a file? For example, checking the number of lines per day.

Benoit
In response to your question, NodeBrain doesn't directly address your example problem; that is, NodeBrain will not efficiently count the lines in a file. (I'll describe an inefficient direct method later.) Depending on what you want to do with the line counts, NodeBrain may or may not be useful for monitoring them. Let's say you wanted to be notified if a particular log exceeded 1,000,000 lines in one day. Without NodeBrain, using your favorite scripting language, you could write a cron job to issue a "wc -l filename" on daily archives, or scan a file that is not archived daily and count the lines for the previous day. Your cron job could notify you via email. If you have no special requirements beyond that, NodeBrain would only complicate the situation.
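Just to make that concrete, here is a minimal sketch of that cron-job approach. The archive path, threshold, mail command, and recipient address are purely illustrative.
Cron Job Script: (example, not part of NodeBrain)
#!/usr/bin/perl
# Count yesterday's lines and mail a notice if the threshold is exceeded.
use strict;
use warnings;

my $log       = '/myap/myapp.log.yesterday';   # assumed daily archive name
my $threshold = 1000000;

my ($lines) = `wc -l $log` =~ /^\s*(\d+)/;
$lines = 0 unless defined $lines;

if ($lines > $threshold) {
    open(my $mail, '|-', 'mail -s "log exceeded 1000000 lines" admin@example.com')
        or die "mail: $!";
    print $mail "$log contains $lines lines\n";
    close($mail);
}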
However, there may be situations where you want to correlate the line count with other information before deciding notification is necessary. In that case, NodeBrain may be helpful. So I'll give an example, realizing this may not match your requirement.
Agent Script:
#!/usr/bin/nb
set log="/myap/myagent.log"; # this is your NodeBrain agent's log
portray default;             # don't use default except to experiment (insecure)
# the following listener only accepts connections from the local machine
define ear listener type="NBQ",interface="127.0.0.1",port=49001;
source /myap/logmon.nb; # include monitor for log lines
# source other monitors here ...
Monitor Rules: (/myap/logmon.nb)
# daily monitor of log file size
define logmon context; # context to monitor size of single log
# To do multiple logs, repeat these rules replacing logmon with logmon.'filename'
# schedule a probe (note the Perl script performs a very specific and simple task)
logmon define r1 on(~(hour(3))):-/myap/logmon.pl
# set a threshold and response (note again a Perl script performs a specific task)
logmon define r2 on(lines>1000000):-/myap/alert.pl "log exceeded 1000000 lines"
Probe Script: (/myap/logmon.pl)
#!/usr/bin/perl
$size=`wc -l /myap/myapp.log`;
if($size=~/\s*(\d+)\s/){$size=$1;}
else{$size="?";}
# Send the line count to my NodeBrain agent
system("/usr/bin/nb \":declare myagent brain default\@localhost:49001;\""
     . " \":>myagent assert logmon.lines=$size;\"");
We would normally declare the brain in our $HOME/.nodebrain/private.nb file so it would not be necessary to include the declare in the system() call. And we would use a secure identity instead of default.
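For example, if that declare statement ("declare myagent brain default@localhost:49001;") lives in $HOME/.nodebrain/private.nb, the probe's call reduces to the assertion alone (a sketch only; setting up a secure identity is not shown here):
system("/usr/bin/nb \":>myagent assert logmon.lines=$size;\"");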
The notification script, /myap/alert.pl, would do whatever you want. You need to change the last rule in the monitor to conform to the syntax for your notification script.
I should emphasize here that NodeBrain is not a procedural scripting language and is not a reasonable alternative to your favorite scripting language for solving most problems. Clearly this example is more complicated than just testing for the threshold in the Perl script, scheduling your script with cron, and leaving NodeBrain out of it. But if we change the problem a bit, NodeBrain may be quite helpful. Suppose we are monitoring log size on 50 servers and we want to be notified if any one exceeds 1,000,000 lines, AND when 5 or more exceed 700,000 lines. In that case, each server would report the line count to a NodeBrain agent on a central server and the new condition would be implemented there.
If you elected to only run a NodeBrain agent on the central server and use cron on the remote servers, you could modify the Perl script slightly to replace the localhost address with the central server name and include the remote server name in the variable identifier.
system("/usr/bin/nb \":declare master brain default\@centralservername:49001;\""
     . " \":>master logmon.'SERVER' assert lines=$size;\"");
Now the central server would have rules to monitor the log size on all 50 remote servers. To monitor for 5, 10, and 20 servers exceeding 700,000 lines we could add a cache.
define cLog7Server context cache({5,10,20}:server);
cLog7Server define r1 if(_rowState):-alert.pl "$${_rows} servers have logs exceeding 700,000 lines"
We might assert server names to this cache by including the following rules for each of the 50 servers.
logmon.'SERVER' define r3 on(lines>700000):cLog7Server assert ("SERVER");  # Is High
logmon.'SERVER' define r4 on(lines<700000):cLog7Server assert !("SERVER"); # Isn't High
Now let's clean this up a bit so we don't have to maintain 50 copies of these same rules. We can do better than that. Let's have the remote servers ALERT the central server instead of asserting a value to a specific variable for each host name. In the logmon.pl script we would make this change.
Replace:  >master logmon.'SERVER' assert lines=$size;
With:     >master logmon alert server="SERVER",lines=$size;
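As a sketch, the modified system() call in logmon.pl might look like this, using Perl's core Sys::Hostname module to supply the 'SERVER' placeholder:
use Sys::Hostname;
my $server = hostname();   # name of this remote server
system("/usr/bin/nb \":declare master brain default\@centralservername:49001;\""
     . " \":>master logmon alert server=\\\"$server\\\",lines=$size;\"");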
Now we can reduce the 50 sets of rules down to a single set on our central server. The complete rule set for monitoring the logs on the central server is shown here.
define cLog7Server context cache({5,10,20}:server);
cLog7Server define r1 if(_rowState):-alert.pl "$${_rows} servers have logs exceeding 700,000 lines"
define logmon context;
logmon define server cell;  # Name of remote server          [Not required but helps to document.]
logmon define lines cell;   # Number of lines in log file    [Not required but helps to document.]
logmon define r0 if(lines>1000000):$ -/myap/alert.pl "$${server} log at $${lines} lines"
logmon define r1 if(lines>700000):cLog7Server assert(logmon.server);
logmon define r2 if(lines<700000):cLog7Server assert !(logmon.server);
We also have the option of running a NodeBrain agent on each remote server and replicating the rules. We could go back to having the script report line counts to the local agent and then let the local agent only report to the central server when a threshold is exceeded. The command prefix ">master" would move from the Perl script to the rule action as shown below.
define logmon context;
logmon define server cell;  # Name of remote server
logmon define lines cell;   # Number of lines in log file
logmon define r0 if(lines>1000000):$ >master -/myap/alert.pl "$${server} log at $${lines} lines"
logmon define r1 if(lines>700000):>master cLog7Server assert(logmon.server);
logmon define r2 if(lines<700000):>master cLog7Server assert !(logmon.server);
In addition to distributing the monitoring task, this configuration would also enable the master agent to take corrective action via the remote agents. I should point out that there is no master/slave concept in NodeBrain; the agents are peers. However, there can be "management" server and "managed" server relationships in the rules we write.
Perhaps from this discussion you notice that NodeBrain is not designed as a monitor of anything more specific than state and events. That means, unless somebody else develops rules and scripts for your specific problem, you will need to write them yourself. For Unix system health monitoring, I have constructed a set of Perl scripts that, combined with NodeBrain, actually do something. :) My hope is that others find NodeBrain useful for constructing their own monitoring applications and share them with the rest of us.
Now, the LOG listener. You are correct, the document is out of sync with the code in this area. I'll release an update soon to correct this and other problems. I have been using NodeBrain's "pipe" command for monitoring log files myself, but the LOG listener will replace it. For this reason I don't want to give an example using "define file" and "pipe". Instead I'll give an example using a LOG listener, which works in 0.5.1, a release I'll make available soon.
NodeBrain is capable of tail'ing a log file and looking for regular expression matches. But you need to develop the rules to specify what to look for and how to respond. And again, you can do this easily with your favorite scripting language (I'm happy with Perl for this type of problem). So we would only be motivated to use NodeBrain if we want to correlate information from multiple sources and perhaps multiple servers. Even then we may have a better tool for monitoring a given log file. We can always send alarms from another tool into NodeBrain for correlation.
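For illustration, here is a minimal Perl sketch of that scripting approach: it tails a log, watches for a pattern, and forwards assertions to a local NodeBrain agent using the same command form as the probe script above. The log path, the /failed login/ pattern, and the logmon.failures cell are hypothetical, and the brain is assumed to be declared in private.nb.
Tail Script: (example only)
#!/usr/bin/perl
# Tail a log with plain Perl and forward matches to a NodeBrain agent.
use strict;
use warnings;

open(my $log, '<', '/myap/login.log') or die "open: $!";
seek($log, 0, 2);                    # start at end of file, like tail -f
my $failures = 0;
while (1) {
    while (my $line = <$log>) {
        next unless $line =~ /failed login/;
        $failures++;
        system("/usr/bin/nb \":>myagent assert logmon.failures=$failures;\"");
    }
    sleep 5;                         # wait for new lines
    seek($log, 0, 1);                # clear EOF so the next read sees appended data
}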
Having said that, let's look at an example using NodeBrain (0.5.1) without help from our favorite scripting language. We'll use a NodeBrain translator and some correlation rules. Let's say our requirement is to alarm on user login failures when a given user fails login on a given system more than 5 times in 3 minutes without ultimate success within 10 minutes. (I see some deficiencies in the documentation here---will update.)
# Cache to support our 10 minute delay for success
define cFailedLoginWait context cache(!~(10m):server,user);
# Rule to establish response to row expiration
cFailedLoginWait define r1 if(_action="expire"):$ -alert.pl "5 failed logins by $${user} on $${server}"
# Cache to support our 5 in 3 minute requirement
define cFailedLogin context cache(~(3m):server,user(5));
# Rule to establish response to our threshold condition (must be on one line even if it wraps here)
cFailedLogin define r1 if(user._hitState and not cFailedLoginWait(server,user)):$ cFailedLoginWait assert ("$${server}","$${user}");
These rules solve part of the problem, but we still need a way to send events to the cache. Independent of how we detect the events, we need to do something like this.
User U1 failed login on server S1:
cFailedLogin assert ("S1","U1");       # assert server and user to failed login cache
User U1 successfully logged in on server S1:
cFailedLogin assert !("S1","U1");      # remove server and user from failed login cache
cFailedLoginWait assert !("S1","U1");  # remove server and user from 10 min wait cache
Now we need a way to detect the actual events so we can report them to NodeBrain in this way. It could (and probably should) be done with your choice of scripting languages, but I promised we'll do it with NodeBrain here. So let's define a LOG listener, assuming the information we need is written to a log we'll call login.log.
define logmatch translator /myap/logmatch.nbx;
define logwatch listener type="LOG",file="login.log",schedule==~(20s),translator="logmatch";
Let's assume the entries in this log identify failed and successful logins as follows.
... user USERNAME failed login to SERVER ...
... user USERNAME successful login to SERVER ...
Now we can write our NodeBrain translator, /myap/logmatch.nbx. We use extended regular expressions to match on lines in the log file and emit NodeBrain commands based on matched conditions. This requires familiarity with regular expressions, NodeBrain translator syntax, and NodeBrain command syntax.
# Example watching for failed and successful logins
(user ([^ ]*) failed login to ([^ ]*)){
: cFailedLogin assert ("$[2]","$[1]"); # emit NodeBrain command
}
(user ([^ ]*) successful login to ([^ ]*)){
: cFailedLogin assert !("$[2]","$[1]");
: cFailedLoginWait assert !("$[2]","$[1]");
}
From this, you may have figured out that it is possible to have NodeBrain monitor the number of lines written to a log file over some sliding interval and report when thresholds are reached. I'll give an example here, but I would not recommend this solution for high volume logs. We'll translate every line that appears in the log into an "event" by making an assertion to a NodeBrain event cache.
# Translator - match on anything and just assert the name of the log for every line.
(.*){
: cLog assert ("LOGNAME");
}
# Rules - Alarm on 100, 300, and 1000 lines within a 4 hour period.
# If it drops to 50 in 4 hours, we consider it back to normal, so we reset to enable
# the cache to alarm again on the next episode of abnormal volume
define cLog context cache(~(4h):log(^50,100,300,1000));
cLog define r1 if(log._hitState):-alert.pl "$${log._hits} lines added to $${log} in $${_interval}"
Here's an alternate method that would alarm on a single threshold in fixed (not sliding) intervals.
# Translator - set a cell named "count" to the current value of a cell named "lines" for every line.
(.*){
: assert count={lines}; # add 1 to lines - see rules below
}
# Rules - funny way to count
assert lines==count+1,count=-1;
define r1 on(lines>100):-alert.pl "100 lines added to log in 4 hours"
define r2 on(~(4h)):assert count=-1; # reset lines to zero every 4 hours
Our method of counting in this example may require an illustration. The following was pasted from an interactive session. It illustrates how the value of lines is modified by changing the value of count. Because we define lines to be a function of count, the value of lines changes every time count changes.
@> assert lines==count+1,count=-1;
@> show -cells
lines = 0 == (count+1)
count = -1
@> assert count={lines};
@> show -cells
lines = 1 == (count+1)
count = 0
@> assert count={lines};
@> show -cells
lines = 2 == (count+1)
count = 1
@>
Again, I don't recommend this approach for simply counting lines in a high volume log file (more than 1 per second) because it is not the most efficient way to do it. It would be better to export this problem to a procedural scripting language, and use NodeBrain to monitor the results and correlate them with other information if there is such a requirement. It might, however, be appropriate to use NodeBrain on a high volume log if we are looking for specific strings and correlating events derived from the matching conditions.
Hopefully this addresses your question. I'm working on getting 0.5.1 released on SourceForge to resolve the identified defects. If your primary interest is in scanning log files with NodeBrain, it would be best to wait for the update. If you can obtain the counts with a script, and can solve a correlation requirement as described previously, then the 0.5.0 release should work as well.
Thanks for your interest.
Ed Trettevik <ea...@no...>