From: Buchan M. <bg...@st...> - 2010-07-27 13:45:38
On Friday, 16 July 2010 22:07:37 Stephan Buys wrote:
> Hi All
>
> Hope this email is not too long and boring, just want to share some
> stuff and ask some too...

Hi Stephan. Sorry for the delay, but I took a trip back to SA at short notice leaving on 15 July ...

> Current Infrastructure:
> - 3x locations in different countries est. 200 servers each and network
>   equipment to match
> - Currently running 1 x hobbit/xymon per site (with hobbit clients)
>
> Main goal for Xymon/Devmon implementation
> - Single central monitoring station (Lightweight)
> - Need to report on historical infrastructure data (availability,
>   resource utilization etc.)
> - Snmp functionality to enable monitoring of network equip, storage and
>   servers
> - Low as possible network load
>
> Work done:
> - Base install of Centos 5.5, 64bit on Dell Poweredge 1955, 8x2ghz, 8gb
>   ram
> - xymon 4.2.3 installation
> - Devmon 0.3.1-beta1 (MULTINODE)
>   CYCLETIME=60
>   NUMFORKS=14 (tried up to 20)
>   MAXPOLLTIME=4

This should be higher based on your other settings.

>   SNMPTIMEOUT=50 (because currently polling across borders)
>   SNMPTRIES=4
>   Rest Default
> - Transferred existing bb-hosts file (+/- 200 devices)
> - Customized tests - Linux 1955 - Poweredge template removed fans
>   and power
> - Added Disk,
>   DiskIO, IFLoad and Processes from linux-netsnmp template

I would like to make it easier to support combinations of tests ...

> - Win 1955 - Poweredge template
>   removed fans, power and cpu
> - Added Disk,
>   IFLoad, cpus and Services from Server2003 template
> - Win/Lin 1950/2950 - Same as above
>   except left fans and power in there
>
> Problems experienced:
> - Default UDP buffer too small causing, packet receive errors when
>   "netstat -su" (SOLVED)
> - Set in /etc/sysctl.conf
>   net.core.rmem_max = 16777216
>   net.core.rmem_default = 8388608
>   net.core.wmem_max = 16777216
>   net.core.wmem_default = 8388608

Cool, I will add this to the devmon wiki (or you could).

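One way to confirm that the larger UDP buffers actually stopped the receive errors is to compare the kernel's UDP counters before and after a few polling cycles. The following is a minimal sketch, assuming a Linux host where `/proc/net/snmp` carries a pair of `Udp:` lines (field names, then values) and where the `InErrors` counter corresponds to what `netstat -su` reports as packet receive errors. The function names and the sample numbers are made up for illustration and are not part of devmon.

```python
def udp_counters(snmp_text):
    """Parse the two 'Udp:' lines of /proc/net/snmp-style text into a dict."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        return {}
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def receive_errors_grew(before, after):
    """True if the InErrors counter increased between two snapshots."""
    return after.get("InErrors", 0) > before.get("InErrors", 0)


# Made-up sample snapshots; on a real host, read open("/proc/net/snmp").read()
# once before and once after a few devmon polling cycles.
SAMPLE_BEFORE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 15 900 15 0\n"
)
SAMPLE_AFTER = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 5000 2 15 4800 15 0\n"
)

before = udp_counters(SAMPLE_BEFORE)
after = udp_counters(SAMPLE_AFTER)
print("receive errors still growing:", receive_errors_grew(before, after))
```

If the counter keeps growing after the sysctl change, the buffers may still be too small for the number of concurrent pollers, or the new values may not have been loaded yet.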
> - Regularly getting devices that stop monitoring with the following
>   error (white hobbit):
> - This problem is intermittent, across various servers on LAN and
>   WAN
>   Missing repeater data for primary OID XXXXXXX

This can occur when a timeout occurs and not all data is received: devmon gives up on the device, but is left with only some of the data.

> - Recently started getting the following problem (purple hobbit):
> - Also intermittent, started happening since adding some new hosts
>   to monitor
> - Possible cause, think i have reached my limit for this server,
>   please confirm, data from devmon not reaching hobbit in time

Do all devmon tests go purple (including the dm test on the devmon servers)? If so, please try the version in svn; it has better recovery and logging of problems, and more reporting on the dm test. It may not fix the problem, but it may allow us to track it down further.

> - Database reports 750 tests on this main node
>
> Problems with templates:
> Hashed these lines out
> [10-07-16@22:16:20] Bad SWITCH case type (1.3.6.1.2.1.25.2.1.4)
> at /usr/local/devmon/templates/microsoft-win2k3server/disk/transforms,
> line 6
> [10-07-16@22:16:20] Bad SWITCH case type (1.3.6.1.2.1.25.2.1.2)
> at /usr/local/devmon/templates/microsoft-win2k3server/disk/transforms,
> line 6
> [10-07-16@22:16:20] Bad SWITCH case type (1.3.6.1.2.1.25.2.1.3)
> at /usr/local/devmon/templates/microsoft-win2k3server/disk/transforms,
> line 6
> [10-07-16@22:16:20] Bad SWITCH case type (1.3.6.1.2.1.25.2.1.7)
> at /usr/local/devmon/templates/microsoft-win2k3server/disk/transforms,
> line 6
> Don't need this templates, just thought i'd share

I think I did this one on-site in a hurry, I'll test it again soon.

> [10-07-16@22:16:20] Missing 'oids' file
> in /usr/local/devmon/templates/cisco-mds9500/experimental, skipping this
> test.

Hmm, I have this somewhere ...

> [10-07-16@22:16:20] Missing 'message' file
> in /usr/local/devmon/templates/netscreen-5gt/memory, skipping this test.

This was an "Anonymous" user-contributed test which was incomplete; on the tracker I asked for the rest of the files ...

> Dell-poweredge template on Windows server cpu don't report, please let
> me know the correct oid.

I created this test for 1750s running Linux ... I may shortly be able to test on 1950s running Linux and Windows.

> Dell-poweredge template on Windows server memory don't report correctly,
> problems vary on 2003/2008/2008r2 (Probably OID's too)

This affects other templates, I am working on a solution.

> Requests:
> - Need to know if anybody can help me out with SNMP templates for the
>   following:
> - Dell Chassis Switches PowerConnect 5316M

Basics should be easy enough: copy the if_* tests from linux-openwrt or any other generic (non-Cisco) template.

> - Dell 1955 chassis DRAC
> - EMC Clariion CX3 series storage

CX3 doesn't provide much useful info over SNMP, mostly just the network interfaces ... and I don't have any with iSCSI, so almost useless ...

> - CISCO ACE

Not sure exactly what this is; you may want to look at creating your own template with extras/templatebuilder.pl in svn.

> - Fortigate (Fortinet) Firewall

URL? MIB?

> - Anything MS SQL server

Via SNMP? How?

> - Got the Brocade on thanks!!!!

Can someone make the effort of filing a tracker item (ideally, as a logged-in user) and attaching the current best template?

> - Any poweredge templates specific to Windows Versions and Linux
> The ones above i do not find i will attempt to create and share as soon
> as this is done.
>
> Questions:
> - Want to manually assign owner-node of a device to be polled in db or
>   otherwise, and it not be overwritten by auto assignment by devmon
> - Reason, each node must only poll on local LAN and send update to
>   display server (Multi location)

You should be able to set the NET tag in bb-hosts, and devmon should respect it, but I don't know how this will work with the multi-node DB stuff. You might need to use a separate DB per BBLOCATION.

> - What is the relative maximum tests per node as per experience (got 750
>   on display server)
> - What is the relative maximum devices to be monitored on hobbit keeping
>   RRD in mind

I have 14000 RRDs on an HP DL380G4.

> - If i disable conn test on hobbit, will this affect polling of devices
>   if so how can i disable.

I need to test this some more ... but you can prevent devmon from checking the conn tests by telling it that it is reporting to BB (and not Hobbit).

> - If devmon fails to connect to display server to send polling results,
>   is the results buffered and resent when display server becomes avail or
>   does it get lost

At present results will get lost.

> More hobbit related:
> - Want to pull custom reports relating to test data results, was
>   thinking to import all rrd and hobbit history data into MS SQL db via
>   rrdtool xport function, any ideas would help.

I have a local change here which I am playing with, which introduces a "DBTABLE" command that works mostly like the "TABLE" command, except that it updates values in the MySQL DB used by devmon. It is *very* basic at present, mostly because I haven't decided how to use it.
This is what it looks like at present:

addition to messages file:

DBTABLE: Dest|Iterface|Metric1|Next Hop|Type|Proto|Mask
{ipRouteDest}|{ipRouteIfName}|{ipRouteMetric1}|{ipRouteNextHop}|{ipRouteTypeName}|{ipRouteProtoName}|{ipRouteMask}

result:

mysql> select * from test_data where ( test='ip_route' and val='virbr0');
+-------------+---------------+----------+------------+----------+--------+
| host        | instance      | test     | time       | attr     | val    |
+-------------+---------------+----------+------------+----------+--------+
| localdevmon | 192.168.122.0 | ip_route | 1269293660 | Iterface | virbr0 |
+-------------+---------------+----------+------------+----------+--------+
1 row in set (0.00 sec)

addition to messages file:

DBTABLE: If_name|Address|Netmask|Broadcast
{ipAdEntIfName}|{ipAdEntAddr}|{ipAdEntNetMask}|{ipAdEntBcastAddr}

result:

mysql> select * from test_data where ( test='if_ipv4' and instance='virbr0');
+-------------+----------+---------+------------+-----------+---------------+
| host        | instance | test    | time       | attr      | val           |
+-------------+----------+---------+------------+-----------+---------------+
| localdevmon | virbr0   | if_ipv4 | 1256884620 | Address   | 192.168.122.1 |
| localdevmon | virbr0   | if_ipv4 | 1256884620 | Netmask   | 255.255.255.0 |
| localdevmon | virbr0   | if_ipv4 | 1256884620 | Broadcast | 1             |
+-------------+----------+---------+------------+-----------+---------------+
3 rows in set (0.00 sec)

At present, it just runs a single SQL query per repeater:

insert into test_data (host,test,instance,attr,val,time)
values ('$device','$test','$dbvals[0]','$dbkeys[$i]','$dbvals[$i]',$time)
on duplicate key update val=values(val),time=values(time)

If this is useful to you, I can give you the patch to play with, and you can give some feedback on what changes you think are necessary to make it more useful.

> And lastly i love what you guys are doing with devmon keep the good work
> up!!

Thanks :-).

Regards,
Buchan
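For readers who want to experiment with the DBTABLE idea without the patch, the per-row insert described above can be sketched in Python. This is not the devmon code: it uses SQLite (3.24 or newer) in place of MySQL, the unique key on (host, test, instance, attr) is an assumption because the email does not show the table definition, and `store_row` is a hypothetical helper name. The first column is treated as the instance and skipped as an attribute, which is inferred from the example output above.

```python
import sqlite3

# Column names follow the query output in the email; the primary key is an
# assumption, needed so that repeated polls update rows instead of adding them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_data (
    host TEXT, instance TEXT, test TEXT, time INTEGER, attr TEXT, val TEXT,
    PRIMARY KEY (host, test, instance, attr)
)
"""

# SQLite spelling of MySQL's "on duplicate key update".
UPSERT = """
INSERT INTO test_data (host, test, instance, attr, val, time)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (host, test, instance, attr)
DO UPDATE SET val = excluded.val, time = excluded.time
"""


def store_row(conn, device, test, header, row, now):
    """Store one pipe-separated table row as one (attr, val) record per column."""
    keys = header.split("|")
    vals = row.split("|")
    instance = vals[0]  # first column identifies the repeater instance
    for key, val in zip(keys[1:], vals[1:]):
        conn.execute(UPSERT, (device, test, instance, key, val, now))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
header = "If_name|Address|Netmask|Broadcast"
# Two polls of the same interface: the second should overwrite, not duplicate.
store_row(conn, "localdevmon", "if_ipv4", header,
          "virbr0|192.168.122.1|255.255.255.0|1", 1256884620)
store_row(conn, "localdevmon", "if_ipv4", header,
          "virbr0|192.168.122.1|255.255.255.0|1", 1256884680)
rows = conn.execute(
    "SELECT attr, val, time FROM test_data ORDER BY attr").fetchall()
for r in rows:
    print(r)
```

The sketch passes values as query parameters rather than interpolating them into the SQL string, which avoids quoting problems when an SNMP value contains a quote character.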