[Nodebrain-announce] RE: Project interest
From: Trettevik, Ed A <ed....@bo...> - 2003-03-10 06:27:21
Hi,

It seems the nod...@li... list was not set up properly---I'm still learning how to admin a project on SourceForge. I did not receive a copy of your note via mail, but stumbled onto it in the archive. I've created a new list, nod...@li..., that you may use in the future.

From: Benoit DOLEZ <bdolez@an...>
Project interest
2003-03-06 15:16

Hi,

Your project is very interesting. We are looking for something like that for our usage.

The document was not synchronized with the source, e.g. in the listener declaration 'type' wasn't recognized; we might use 'protocol'. The protocol 'LOG' doesn't work.

Could someone send me a sample of file reading and analysing? For example, checking the number of lines per day.

Benoit

In response to your question, NodeBrain doesn't directly address your example problem; that is, NodeBrain will not efficiently count the lines in a file. (I'll describe an inefficient direct method later.) Depending on what you want to do with the line counts, NodeBrain may, or may not, be useful for monitoring it. Let's say you wanted to be notified if a particular log exceeded 1,000,000 lines in one day. Without NodeBrain, using your favorite scripting language, you could write a cron job to issue a "wc -l filename" on daily archives, or scan a file that is not archived daily, counting the lines for the previous day. Your cron job could notify you via email. If you have no special requirements beyond that, NodeBrain would only complicate the situation.

However, there may be situations where you want to correlate the line count with other information before deciding notification is necessary. In that case, NodeBrain may be helpful. So I'll give an example, realizing this may not match your requirement.

Agent Script:

#!/usr/bin/nb
set log="/myap/myagent.log";    # this is your NodeBrain agent's log
portray default;                # don't use default except to experiment (insecure)
# the following listener only accepts connections from the local machine
define ear listener type="NBQ",interface="127.0.0.1",port=49001;
source /myap/logmon.nb;         # include monitor for log lines
# source other monitors here ...

Monitor Rules: (/myap/logmon.nb)

# daily monitor of log file size
define logmon context;   # context to monitor size of a single log
# To do multiple logs, repeat these rules replacing logmon with logmon.'filename'
# schedule a probe (note the Perl script performs a very specific and simple task)
logmon define r1 on(~(hour(3))):-/myap/logmon.pl
# set a threshold and response (note again a Perl script performs a specific task)
logmon define r2 on(lines>1000000):-/myap/alert.pl "log exceeded 1000000 lines"

Probe Script: (/myap/logmon.pl)

#!/usr/bin/perl
$size=`wc -l /myap/myapp.log`;
if($size=~/\s*(\d*)\s/){$size=$1;}
else{$size="?";}
# Send the line count to my NodeBrain agent
system("/usr/bin/nb \":declare myagent brain default\@localhost:49001;\" \":>myagent assert logmon.lines=$size;\"");

We would normally declare the brain in our $HOME/.nodebrain/private.nb file so it would not be necessary to include the declare in the system() call. And we would use a secure identity instead of default.

The notification script, /myap/alert.pl, would do whatever you want. You need to change the last rule in the monitor to conform to the syntax for your notification script.
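Just to make the example complete, here is a rough sketch of what /myap/alert.pl might look like if a simple email is all you want. The recipient address and the use of the system "mail" command are only placeholders; your notification script can do whatever you like.

#!/usr/bin/perl
# Hypothetical /myap/alert.pl: mail the alert text passed on the command
# line to an administrator.  The recipient address and the use of the
# system "mail" command are assumptions; substitute whatever you prefer.
my $message = join(" ", @ARGV);
open(MAIL, "| mail -s 'NodeBrain alert' admin\@example.com")
    or die "cannot run mail: $!";
print MAIL "$message\n";
close(MAIL);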
I should emphasize here that NodeBrain is not a procedural scripting language and is not a reasonable alternative to your favorite scripting language for solving most problems. Clearly this example is more complicated than just testing for the threshold in the Perl script, scheduling your script with cron, and leaving NodeBrain out of it. But if we change the problem a bit, NodeBrain may be quite helpful. Suppose we are monitoring log size on 50 servers and we want to be notified if any one exceeds 1,000,000 lines, AND when 5 or more exceed 700,000 lines. In that case, each server would report the line count to a NodeBrain agent on a central server and the new condition would be implemented there.

If you elected to only run a NodeBrain agent on the central server and use cron on the remote servers, you could modify the Perl script slightly to replace the localhost address with the central server name and include the remote server name in the variable identifier.

system("/usr/bin/nb \":declare master brain default\@centralservername:49001;\" \":>master logmon.'SERVER' assert lines=$size;\"");

Now the central server would have rules to monitor the log size on all 50 remote servers. To monitor for 5, 10, and 20 servers exceeding 700,000 lines we could add a cache.

define cLog7Server context cache({5,10,20}:server);
cLog7Server define r1 if(_rowState):-alert.pl "$${_rows} servers have logs exceeding 700,000 lines"

We might assert server names to this cache by including the following rules for each of the 50 servers.

logmon.'SERVER' define r3 on(lines>700000):cLog7Server assert ("SERVER");    # Is High
logmon.'SERVER' define r4 on(lines<700000):cLog7Server assert !("SERVER");   # Isn't High

Now let's clean this up a bit so we don't have to maintain 50 copies of these same rules. We can do better than that. Let's have the remote servers ALERT the central server instead of asserting a value to a specific variable for each host name. In the logmon.pl script we would make this change.

Replace:

>master logmon.'SERVER' assert lines=$size;

With:

>master logmon alert server="SERVER",lines=$size;

Now we can reduce the 50 sets of rules down to a single set on our central server. The complete rule set for monitoring the logs on the central server is shown here.

define cLog7Server context cache({5,10,20}:server);
cLog7Server define r1 if(_rowState):-alert.pl "$${_rows} servers have logs exceeding 700,000 lines"
define logmon context;
logmon define server cell;   # Name of remote server        [Not required but helps to document.]
logmon define lines cell;    # Number of lines in log file  [Not required but helps to document.]
logmon define r0 if(lines>1000000):$ -/myap/alert.pl "$${server} log at $${lines} lines"
logmon define r1 if(lines>700000):cLog7Server assert(logmon.server);
logmon define r2 if(lines<700000):cLog7Server assert !(logmon.server);
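To make the remote side concrete, the cron-driven probe on each of the 50 servers might look roughly like the sketch below. The central server name, port, log path, and the use of Sys::Hostname to fill in the server name are assumptions for illustration; as noted above, the declare would normally live in $HOME/.nodebrain/private.nb with a secure identity rather than default.

#!/usr/bin/perl
# Hypothetical remote probe (one per monitored server): count the lines in
# the local log and alert the central agent.  Central server name, port,
# and log path are assumptions.
use Sys::Hostname;
my $server = hostname();
my $size = `wc -l /myap/myapp.log`;
if($size =~ /\s*(\d*)\s/){ $size = $1; } else { $size = "?"; }
system("/usr/bin/nb " .
       "\":declare master brain default\@centralservername:49001;\" " .
       "\":>master logmon alert server=\\\"$server\\\",lines=$size;\"");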
We also have the option of running a NodeBrain agent on each remote server and replicating the rules. We could go back to having the script report line counts to the local agent and then let the local agent only report to the central server when a threshold is exceeded. The command prefix ">master" would move from the Perl script to the rule action as shown below.

define logmon context;
logmon define server cell;   # Name of remote server
logmon define lines cell;    # Number of lines in log file
logmon define r0 if(lines>1000000):$ >master -/myap/alert.pl "$${server} log at $${lines} lines"
logmon define r1 if(lines>700000):>master cLog7Server assert(logmon.server);
logmon define r2 if(lines<700000):>master cLog7Server assert !(logmon.server);

In addition to distributing the monitoring task, this configuration would also enable the master agent to take corrective action via the remote agents. I should point out that there is no master/slave concept in NodeBrain; the agents are peers. However, there can be "management" server and "managed" server relationships in the rules we write.

Perhaps from this discussion you notice that NodeBrain is not designed as a monitor of anything more specific than state and events. That means, unless somebody else develops rules and scripts for your specific problem, you will need to write them yourself. For Unix system health monitoring, I have constructed a set of Perl scripts that, combined with NodeBrain, actually do something. :) My hope is that others find NodeBrain useful for constructing their own monitoring applications and share them with the rest of us.

Now, the LOG listener. You are correct, the document is out of sync with the code in this area. I'll release an update soon to correct this and other problems. I have been using NodeBrain's "pipe" command for monitoring log files myself, but the LOG listener will replace it. For this reason I don't want to give an example using "define file" and "pipe". Instead I'll give an example using a LOG listener, which is now working in 0.5.1, which I'll release soon.

NodeBrain is capable of tailing a log file and looking for regular expression matches. But you need to develop the rules to specify what to look for and how to respond. And again, you can do this easily with your favorite scripting language (I'm happy with Perl for this type of problem). So we would only be motivated to use NodeBrain if we want to correlate information from multiple sources and perhaps multiple servers. Even then we may have a better tool for monitoring a given log file. We can always send alarms from another tool into NodeBrain for correlation.

Having said that, let's look at an example using NodeBrain (0.5.1) without help from our favorite scripting language. We'll use a NodeBrain translator and some correlation rules. Let's say our requirement is to alarm on user login failures when a given user fails login on a given system more than 5 times in 3 minutes without ultimate success within 10 minutes. (I see some deficiencies in the documentation here---will update.)

# Cache to support our 10 minute delay for success
define cFailedLoginWait context cache(!~(10m):server,user);
# Rule to establish response to row expiration
cFailedLoginWait define r1 if(_action="expire"):$ -alert.pl "5 failed logins by $${user} on $${server}"
# Cache to support our 5 in 3 minute requirement
define cFailedLogin context cache(~(3m):server,user(5));
# Rule to establish response to our threshold condition (must be on one line even if it wraps here)
cFailedLogin define r1 if(user._hitState and not cFailedLoginWait(server,user)):$ cFailedLoginWait assert ("$${server}","$${user}");

These rules solve part of the problem, but we still need a way to send events to the cache. Independent of how we detect the events, we need to do something like this.

User U1 failed login on server S1:

cFailedLogin assert ("S1","U1");        # assert server and user to failed login cache

User U1 successfully logged in on server S1:

cFailedLogin assert !("S1","U1");       # remove server and user from failed login cache
cFailedLoginWait assert !("S1","U1");   # remove server and user from 10 min wait cache
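Any external program can deliver these assertions with the nb client, in the same style as the earlier probe script. For example, a hypothetical Perl fragment, reusing the localhost brain declaration and the insecure default identity assumed above, might look like this.

# Hypothetical delivery of a failed-login event to the local agent from an
# external script.  Brain name, port, and identity are the same assumptions
# used in the earlier probe script.
system("/usr/bin/nb " .
       "\":declare myagent brain default\@localhost:49001;\" " .
       "\":>myagent cFailedLogin assert (\\\"S1\\\",\\\"U1\\\");\"");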
Now we need a way to detect the actual events so we can report them to NodeBrain in this way. It could (and probably should) be done with your choice of scripting languages, but I promised we'll do it with NodeBrain here. So let's define a LOG listener, assuming the information we need is written to a log we'll call login.log.

define logmatch translator /myap/logmatch.nbx;
define logwatch listener type="LOG",file="login.log",schedule==~(20s),translator="logmatch";

Let's assume the entries in this log identify failed and successful logins as follows.

... user USERNAME failed login to SERVER ...
... user USERNAME successful login to SERVER ...

Now we can write our NodeBrain translator, /myap/logmatch.nbx. We use extended regular expressions to match on lines in the log file and emit NodeBrain commands based on matched conditions. This requires familiarity with regular expressions, NodeBrain translator syntax, and NodeBrain command syntax.

# Example watching for failed and successful logins
(user ([^ ]*) failed login to ([^ ]*)){
: cFailedLogin assert ("$[2]","$[1]");     # emit NodeBrain command
}
(user ([^ ]*) successful login to ([^ ]*)){
: cFailedLogin assert !("$[2]","$[1]");
: cFailedLoginWait assert !("$[2]","$[1]");
}

From this, you may have figured out that it is possible to have NodeBrain monitor the number of lines written to a log file over some sliding interval and report when thresholds are reached. I'll give an example here, but I would not recommend this solution for high volume logs. We'll translate every line that appears in the log into an "event" by making an assertion to a NodeBrain event cache.

# Translator - match on anything and just assert the name of the log for every line.
(.*){
: cLog assert ("LOGNAME");
}

# Rules - Alarm on 100, 300, and 1000 lines within a 4 hour period.
# If it drops to 50 in 4 hours, we consider it back to normal, so we reset to enable
# the cache to alarm again on the next episode of abnormal volume
define cLog context cache(~(4h):log(^50,100,300,1000));
cLog define r1 if(log._hitState):-alert.pl "$${log._hits} lines added to $${log} in $${_interval}"

Here's an alternate method that would alarm on a single threshold in fixed (not sliding) intervals.

# Translator - set a cell named "count" to the current value of a cell named "lines" for every line.
(.*){
: assert count={lines};   # add 1 to lines - see rules below
}

# Rules - funny way to count
assert lines==count+1,count=-1;
define r1 on(lines>100):-alert.pl "100 lines added to log in 4 hours"
define r2 on(~(4h)):assert count=-1;   # reset lines to zero every 4 hours

Our method of counting in this example may require an illustration. The following was pasted from an interactive session. It illustrates how the value of lines is modified by changing the value of count. Because we define lines to be a function of count, the value of lines changes every time count changes.
@> assert lines==count+1,count=-1;
@> show -cells
lines = 0 == (count+1)
count = -1
@> assert count={lines};
@> show -cells
lines = 1 == (count+1)
count = 0
@> assert count={lines};
@> show -cells
lines = 2 == (count+1)
count = 1
@>

Again, I don't recommend this approach for simply counting lines in a high volume log file (more than 1 per second) because it is not the most efficient way to do it. It would be better to export this problem to a procedural scripting language, and use NodeBrain to monitor the results and correlate them with other information if there is such a requirement. It might, however, be appropriate to use NodeBrain on a high volume log if we are looking for specific strings and correlating events derived from the matching conditions.

Hopefully this addresses your question. I'm working on getting 0.5.1 released on SourceForge to resolve the identified defects. If your primary interest is in scanning log files with NodeBrain, it would be best to wait for the update. If you can obtain the counts with a script, and can solve a correlation requirement as described previously, then the 0.5.0 release should work as well.

Thanks for your interest.

Ed Trettevik <ea...@no...>