Munin / News: Recent posts

munin 2.0.24-1 MIGRATED to testing (Britney)

<a href="https://packages.qa.debian.org/m/munin/news/20141106T163914Z.html">[2014-11-06] munin 2.0.24-1 MIGRATED to testing (Britney)</a> link

Posted by SourceForge Robot 2014-11-06

Accepted 2.0.24-1 in unstable (medium) (Stig Sandbeck Mathisen)

<a href="https://packages.qa.debian.org/m/munin/news/20141026T155740Z.html">[2014-10-26] Accepted 2.0.24-1 in unstable (medium) (Stig Sandbeck Mathisen)</a> link

Posted by SourceForge Robot 2014-10-26

munin 2.0.23-1 MIGRATED to testing (Britney)

<a href="https://packages.qa.debian.org/m/munin/news/20141020T163917Z.html">[2014-10-20] munin 2.0.23-1 MIGRATED to testing (Britney)</a> link

Posted by SourceForge Robot 2014-10-20

Accepted 2.0.23-1 in unstable (medium) (Holger Levsen)

<a href="https://packages.qa.debian.org/m/munin/news/20141017T094145Z.html">[2014-10-17] Accepted 2.0.23-1 in unstable (medium) (Holger Levsen)</a> link

Posted by SourceForge Robot 2014-10-17

Accepted 2.0.22-1 in unstable (low) (Holger Levsen)

<a href="https://packages.qa.debian.org/m/munin/news/20141016T230540Z.html">[2014-10-16] Accepted 2.0.22-1 in unstable (low) (Holger Levsen)</a> link

Posted by SourceForge Robot 2014-10-16

Milestone Munin 2.0.22 completed

<p>
The next stable maintenance release
</p>

Posted by SourceForge Robot 2014-10-16

Monitoring HP servers

<p>Sometimes this blog has something like “columns” for long-term topics that keep re-emerging (no pun intended) from time to time. Since I came back to the US last July you can see that one of the big issues I fight with daily is HP servers.</p>
<p>Why is the company I’m working for using HP servers? Mostly because they didn’t have a resident system administrator before I came on board, and just recently they hired an external consultant to set up new servers … the same one who set up my nightmare, <a href="https://blog.flameeyes.eu/2012/11/apple-biggest-screwup">Apple OS X Server</a>, so I’m not sure which of the two options I prefer.</p>
<p>Anyway, as you probably know if you follow my blog, I’ve been busy setting up Munin and Icinga to monitor the status of services and servers — and that helped quite a bit over time. Unfortunately, monitoring HP servers is not easy. You probably remember I <a href="https://blog.flameeyes.eu/2012/07/munin-snmp-and-ipmi">wrote a plugin</a> so I could monitor them through <span class="caps">IPMI</span> — it worked nicely until I actually got Albert to expose the thresholds in the <code>ipmi-sensors</code> output; then it broke because HP’s default thresholds are totally messed up and unusable, and it’s not possible to commit new thresholds.</p>
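<p>To see what I mean, FreeIPMI can print the thresholds next to the readings, something along these lines (the exact option set and column layout depend on the FreeIPMI version, so take it as a sketch):</p>
<div class="CodeRay"><pre># show the sensor readings together with their thresholds,
# then narrow it down to the fans
ipmi-sensors --output-sensor-thresholds | grep -i fan</pre></div>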
<p>After spending quite some time playing with this, I ended up with write access to Munin’s repositories (thanks, Steve!) and I can now <del>gloat</del> be worried about having authored quite a few new Munin plugins (the second generation FreeIPMI multigraph plugin is an example, but I also have <a href="https://blog.flameeyes.eu/2012/10/munin-sensors-and-ipmi">a sysfs-based hwmon plugin</a> that can get all the sensors in your system in one sweep, a new multigraph-capable Apache plugin, and a couple of <span class="caps">SNMP</span> plugins to add to the list). These actually make my work much easier, as they send me warnings when a problem happens without having to worry about it too much, but of course are not enough.</p>
<p>After finally being able to replace the RHEL5 (without a current subscription) with CentOS 5, I’ve started looking into what tools HP makes available to us — and found out that there are mainly two that I care about: one is hpacucli, which is also available in Gentoo’s tree, and the other is called <code>hp-health</code> and is basically a custom interface to the <span class="caps">IPMI</span> features of the server. The latter actually has a working, albeit not really polished, plugin in the Munin contrib repository – which I guess I’ll soon look to transform into a multigraph-capable one; I really like multigraph – and that’s how I ended up finding it.</p>
<p>At any rate, at that point I realized that I had not added one of the most important checks: the <span class="caps">SMART</span> status of the harddrives — originally because I couldn’t get smartctl installed. So I went and checked for it — the older servers almost all run their drives as <span class="caps">IDE</span> (because that’s the firmware’s default… don’t ask), so those are a different story altogether; the newer servers running CentOS use an HP controller with <span class="caps">SAS</span> drives through the <span class="caps">CCISS</span> (block-layer) driver from the kernel, while the one running Gentoo Linux uses the newer, <span class="caps">SCSI</span>-layer driver. None of them can use <code>smartctl</code> directly; they have to use a special option, <code>smartctl -d cciss,0</code>, and then point it at either <code>/dev/cciss/c0d0</code> or <code>/dev/sda</code> depending on which of the two kernel drivers you’re using. The drives don’t provide all the data that <span class="caps">SATA</span> drives do, but they provide enough for Munin’s <code>hddtemp_smartctl</code> and they do report a health status…</p>
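<p>In practice the invocations look something like this (drive indexes and device nodes are examples to adapt to your own controller):</p>
<div class="CodeRay"><pre># block-layer CCISS driver (the CentOS 5 boxes): both physical drives
# sit behind the same logical device node, selected by the cciss,N index
smartctl -d cciss,0 -a /dev/cciss/c0d0
smartctl -d cciss,1 -a /dev/cciss/c0d0

# newer SCSI-layer driver (the Gentoo box): same option, different node
smartctl -d cciss,0 -H /dev/sda</pre></div>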
<p>As far as Munin is concerned, your configuration would then look something like this in <code>/etc/munin/plugin-conf.d/hddtemp_smartctl</code>:</p>
<div class="CodeRay"><pre>[hddtemp_smartctl]
user root
env.drives hd1 hd2
env.type_hd1 cciss,0
env.type_hd2 cciss,1
env.dev_hd1 cciss/c0d0
env.dev_hd2 cciss/c0d0</pre></div>
<p>You will of course have to adjust it depending on how many drives you have and which driver you’re using.</p>
<p>But when I tried to use the default <code>check_smart.pl</code> script from the nagios-plugins package I had two bad surprises: the first is that it tries to validate the device-type parameter that is passed on to smartctl, refusing to work with a <code>cciss</code> type, and the other is that it doesn’t recognize the status message printed for this particular driver. I was so pissed that, instead of trying to fix that plugin – which still comes with special handling for <span class="caps">IDE</span>-based harddrives! – I decided to write my own, using the Nagios::Plugin Perl module, and to release it under the <span class="caps">MIT</span> license.</p>
<p>You can find my new plugin in <a href="https://github.com/Flameeyes/nagios-plugins-flameeyes">my github repository</a> where I think you’ll soon find more plugins — as I’ve had a few things to keep under control anyway. The next step is probably using the <code>hp-health</code> status to get a good/bad report, hopefully for something that I don’t get already through standard <span class="caps">IPMI</span>.</p>
<p>The funny thing about HP’s utilities is that they for the most part just have to present data that is already available from the <span class="caps">IPMI</span> interface, but there are a few differences. For instance, the fan speed reported by <span class="caps">IPMI</span> is exposed in RPMs — which is the usual way to expose the speed of fans. But on the HP utility, fan speed is actually exposed as a <em>percentage</em> of the maximum fan speed. And that’s how their thresholds are exposed as well (as I said, the thresholds for fan speed are completely messed up on my HP servers).</p>
<p>Oh well, whatever else may come up lately, this will be enough for now.</p>

Posted by SourceForge Robot 2014-09-09

Updating HP iLO 2.x

<p>As I <a href="http://blog.flameeyes.eu/2012/08/i-m-in-my-network-monitoring">wrote yesterday</a> I’ve been doing system and network administration work here in LA as well, and I’ve set up Munin and Icinga to warn me when something required maintenance.</p>
<p>Now, some of the first alerts that Munin forwarded to Icinga were about things we already knew (in <a href="http://blog.flameeyes.eu/2012/08/munin-hp-servers-and-apc-powerstrips">another post</a> I wrote of how the <span class="caps">CMOS</span> battery ran out on two of the servers), but one was something that had bothered me before as well: one of the boxes only has one <span class="caps">CPU</span> on board, and it reports a value of 0 instead of N/A.</p>
<p>So I decided to look into updating the firmware of the DL140 G3 and see if it would help us at all; the original firmware on the <span class="caps">IPMI</span> device was 2.10, while the latest one available is 2.21. Neither supports firmware updates through the web interface. The firmware download, even when selecting the RedHat Enterprise Linux option, is a Windows <span class="caps">EXE</span> file (not an auto-extract archive, which you could extract from Linux, but their usual full-fledged setup software to extract into <code>C:\SWSetup</code>). When you extract it, you’re presented with instructions on how to build a <span class="caps">USB</span> key which you can then use to update the firmware via FreeDOS…</p>
<p>You can guess I wasn’t amused.</p>
<p>After searching around a bit more I found out that there <em>is</em> a way to update this over the network. It’s described in HP’s advanced iLO usage guide, and seems to work fine, <em>but</em> it also requires another step to be taken in Windows (or FreeDOS): you have to use the <code>ROMPAQ.EXE</code> utility to decompress the compressed firmware image.</p>
<p><em>I wonder, why does HP provide you with <strong>two copies</strong> of the compressed firmware image, for a grand total of 3MB, instead of only one of the uncompressed one (2MB)? I suppose the origin of the compressed image is to be found in the 1.44MB floppy disk size limitation, but nowadays you don’t use floppies… oh well.</em></p>
<p>After you have the uncompressed image, you have to set up a <span class="caps">TFTP</span> server… which luckily I already had lying around from when I updated the firmware of the <span class="caps">APC</span> powerstrips discussed in one of the posts linked above. So I just added the <span class="caps">IPMI</span> firmware image, and moved on to the next step.</p>
<p>The next step consists of connecting to the box via telnet and issuing two commands: <code>cd map1/firmware1</code> followed by <code>load -source //$serverip/$filename -oemhpfiletype csr</code> … the file is downloaded via <span class="caps">TFTP</span> and the <span class="caps">BMC</span> rebooted. Afterwards you have to clear out FreeIPMI’s <span class="caps">SDR</span> cache, as <code>ipmi-sensors</code> wouldn’t work otherwise.</p>
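<p>Put together, the whole sequence is roughly this (paths and file names are placeholders, and your management processor’s prompt will look different):</p>
<div class="CodeRay"><pre># copy the uncompressed image (produced by ROMPAQ.EXE) into the TFTP root
cp firmware.bin /var/lib/tftpboot/

# then, on the management processor, over telnet:
#   cd map1/firmware1
#   load -source //$serverip/firmware.bin -oemhpfiletype csr

# once the BMC is back up, flush FreeIPMI's SDR cache so ipmi-sensors
# re-reads the new sensor records
ipmi-sensors --flush-cache</pre></div>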
<p>The update did fix the critical notification I was receiving… to a point. First of all, the fan speed still has bogus thresholds (and I’m not sure whether it’s a bug in FreeIPMI or in the firmware at this point), as it reports the upper limits instead of the lower ones. Second of all, the way it fixed the misreported <span class="caps">CPU</span> thermal sensor is by … not reporting any temperature off either thermal sensor! Now both <span class="caps">CPU</span> temperatures are gone and only the ambient temperature is available. D’oh!</p>
<p>Another funky issue is that I’m still fighting to get Munin to tell Icinga that “everything’s okay” — the way Munin contacts <code>send_nsca</code> is tied to the limits, so if no limits are present it seems to simply not report anything at all. This is something else I have to fix this week.</p>
<p>Now back to doing the firmware updates on the remaining boxes…</p>
<p><strong>Update:</strong> turns out HP’s updates are worse than the original firmware in some ways. Not only are the <span class="caps">CPU</span> thermal diodes no longer reported, but the voltages lost their thresholds altogether! The end result is that it now says everything is a-ok, even if the 3V battery is reported at 0.04V! Which basically means that I have to set my own limits on things, but at least it should work as intended afterwards.</p>
<p>Oh and the DL160 G6? First of all, this time the firmware update <em>has</em> a web interface… to tell it which file to request from which <span class="caps">TFTP</span> server. Too bad that all the firmware updates that I can run on my systems require the bootcode to be updated as well, which means we’ll have to schedule some maintenance time when I come back from VDDs.</p>

Posted by SourceForge Robot 2014-09-09

Book review — Instant Munin Plugin Starter

<p>This is going to be a bit of a different review than usual, if anything because I actually already reviewed the book, in the work-in-progress sense. So bear with me.</p>
<p>Today, <a href="http://www.packtpub.com/">Packt</a> published Bart ten Brinkle’s <a href="http://www.packtpub.com/munin-plugin-starter/book">Instant Munin Plugin Starter</a>, which I reviewed early this year. Bart has done an outstanding job in expanding the sparsely-available documentation into a comprehensive and, especially, coherent book.</p>
<p>If you happen to use Munin, or are interested in using it, I would say it’s a read well worth the $8 it’s priced at!</p>

Posted by SourceForge Robot 2014-09-09

Munin, sensors and IPMI

<p>In my <a href="http://blog.flameeyes.eu/2012/10/asynchronous-munin">previous post</a> about Munin I said that I was still working on making sure that the async support would reach Gentoo in a way that actually worked. Now with version 2.0.7-r5 this is vastly possible, and it’s documented <a href="https://wiki.gentoo.org/wiki/Munin">on the Wiki</a> for you all to use.</p>
<p>Unfortunately, while testing it, I found out that one of the boxes I’m monitoring, the office’s firewall, was going crazy if I used the async spooled node, reporting fan speeds way too low (87 RPMs) or way too high (300K), and with similar effects on the temperatures as well. This also seems to have caused the fans to go out of control and run constantly at their 4KRPM instead of their usual 2KRPM. The kernel log showed that there was something going wrong with the i2c access, which is what the <code>sensors</code> program uses.</p>
<p>I started looking into the <code>sensors_</code> plugin that comes with Munin, which I knew already a bit as I fixed it to match some of my systems before… and the problem is that for each box I was monitoring, it would have to execute <code>sensors</code> six times: twice for each graph (fan speed, temperature, voltages), once for config and once for fetching the data. And since there is no way to tell it to just fetch <em>some</em> of the data instead of all of it, it meant many transactions had to go over the i2c bus, all at the same time (when using munin async, the plugins are fetched in parallel). Understanding that the situation was next to unsolvable with the original code, and having one day “half off” at work, I decided to write a new plugin.</p>
<p>This time, instead of using the <code>sensors</code> program, I decided to just access <code>/sys</code> directly. This is quite a bit faster and lets you pinpoint what data you need to fetch. In particular, during the config step there is no reason to fetch the actual values, which saves many i2c transactions just there. While at it, I also made it a multigraph plugin, instead of the old wildcard one, so that you only need to call it once, and it’ll prepare, serially, all the available graphs: in addition to those that were supported before, which included power – as it’s exposed by the CPUs on Excelsior – I added a few that I haven’t been able to try but are documented by the hwmon sysfs interface, namely current and humidity.</p>
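<p>For reference, this is the kind of data the plugin reads straight from the kernel (hwmon numbering and the exact attribute paths depend on your hardware and kernel version; on older kernels the attributes live under a <code>device/</code> subdirectory):</p>
<div class="CodeRay"><pre>cat /sys/class/hwmon/hwmon0/name          # driver name, e.g. "coretemp"
cat /sys/class/hwmon/hwmon0/temp1_input   # temperature, millidegrees Celsius
cat /sys/class/hwmon/hwmon0/temp1_label   # only present if the driver provides it
cat /sys/class/hwmon/hwmon0/fan1_input    # fan speed, RPM
cat /sys/class/hwmon/hwmon0/in0_input     # voltage, millivolts</pre></div>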
<p>The new plugin is available on <a href="https://github.com/munin-monitoring/contrib">the contrib repository</a> – which I haven’t found a decent way to package yet – as <code>sensors/hwmon</code> and is still written in Perl. It’s definitely faster, has fewer dependencies, and is definitely more reliable, at least on my firewall. Unfortunately, there is one feature that is missing: <code>sensors</code> would sometimes report an explicit label for temperature data… but that’s entirely handled in userland. Since we’re reading the data straight from the kernel, most of those labels are lost. For drivers that do expose those labels, such as <code>coretemp</code>, they are used, though.</p>
<p>We also lose the ability to ignore values from the get-go, like I <a href="http://blog.flameeyes.eu/2011/08/munin-and-lm_sensors">described before</a>, but you can’t always win. You’ll have to ignore the graph data from the master instead. Otherwise you might want to find a way to tell the kernel not to report that data. The same is probably true for the names, although unfortunately…</p>
<blockquote><p>
[temp*_label] Should only be created if the driver has hints about what this temperature channel is being used for, and user-space doesn’t. In all other cases, the label is provided by user-space.</p></blockquote>
<p>But I wouldn’t be surprised if it were possible to change that a teensy bit. Also, while it does forfeit some of the labeling that the <code>sensors</code> program does, I was able to make it nicer when anonymous data is present — it wasn’t so rare to have more than one <strong>temp1</strong> value, as it was the first temperature channel for each of the (multiple) controllers, such as the Super I/O, <span class="caps">ACPI</span> Thermal Zone, and video card. My plugin outputs the controller <em>and</em> the channel name, instead of just the channel name.</p>
<p>After I completed and tested my <code>hwmon</code> plugin I moved on to re-rewriting the <span class="caps">IPMI</span> plugin. If you remember <a href="http://blog.flameeyes.eu/2012/07/munin-snmp-and-ipmi">the saga</a>, I first rewrote the original <code>ipmi_</code> wildcard plugin as <code>freeipmi_</code>, including support for the same wildcards as <code>ipmisensor_</code>, so that instead of using OpenIPMI (and gawk), it would use FreeIPMI (and awk). The reason was that FreeIPMI can cache <span class="caps">SDR</span> information automatically, whereas OpenIPMI does have support for that, but you have to handle it manually. The new plugin was also designed to work for virtual nodes, akin to the various <span class="caps">SNMP</span> plugins, so that I could monitor some of the servers we have in production, where I can’t install Munin, or I can’t install FreeIPMI. I have replaced the original <span class="caps">IPMI</span> plugin, which I was never able to get working on any of my servers, with my version in Gentoo for Munin 2.0. I expect Munin 2.1 to come with the FreeIPMI-based plugin by default.</p>
<p>Unfortunately, like the <code>sensors_</code> plugin, my plugin was calling the command six times per host — although it does allow you to filter the type of sensors you want to receive data for. And that became even worse when you have to monitor foreign virtual nodes. How did I solve that? I decided to rewrite it to be multigraph as well… but that was difficult to handle in shell script, which means that it’s now <em>also</em> written in Perl. The new <code>freeipmi</code>, non-wildcard, virtual-node-capable plugin is available in the same repository and directory as <code>hwmon</code>. My network switch thanks me for that.</p>
<p>Of course, unfortunately, the async node still does not support multiple hosts; that’s something for later on. In the meantime, though, it does spare me lots of grief, and I’m happy I took the time to work on these two plugins.</p>

Posted by SourceForge Robot 2014-09-09

Munin, HP servers and APC powerstrips

<p>Yes, I know, I’m starting to get boring.</p>
<p>Today I spent at least half of my work day working on Munin plugins to monitor effectively some of the equipment we currently have at our co-location. This boils down to two <em>metered</em> <span class="caps">APC</span> <del>powerstrips</del> PDUs (let’s use their term, silly as it might sound). I think it’s worth noting the difference: <span class="caps">APC</span> provides switched and metered PDUs; the former should allow per-plug load data and powering single plugs on and off; the latter (what we have here) are much cheaper, do not allow you to turn plugs on and off, and simply give you a reading of the load per phase. Given that our co-location has only single-phase power, we only get one reading per strip, which is still okay; at least it gives us more information than we had before.</p>
<p>Now, there are a few funny things with these strips: they have a network interface, which is cool, but <em>they don’t use <span class="caps">DHCP</span> by default</em>! You either have to set them up over the serial interface (which obviously is still <em>very</em> serial, not a <span class="caps">USB</span> adapter — and my laptop doesn’t have any serial port), or use a Windows application (which is actually written in Java and spends 98% of the install time copying an extra install of the <span class="caps">JRE</span> to the drive), or finally note down the <span class="caps">MAC</span> address when you install them, then poison a system’s <span class="caps">ARP</span> table to “fake” an IP for the strip, sealing the deal by sending a 113-byte <span class="caps">ICMP</span> packet to the strip via <code>ping</code> … no, there is no use for a watermelon or a chimp, sorry Luca.</p>
<p>After finally completing the IP settings, I had to find my way to get the data out; the strips support either SNMPv1 or SNMPv3 — I discarded the former simply because it’s extremely insecure and I’d rather not even have it around, so I set up a user for munin. Next problem? <code>snmpwalk</code> did not report <em>any</em> useful data. The reason is actually quite simple: it doesn’t know which OIDs to probe for. Download the <span class="caps">MIB</span> data from <span class="caps">APC</span> and install it on the system, and it’s much happier.</p>
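<p>With the MIB in place, a quick SNMPv3 walk of the load table looks more or less like this (user name, authentication settings and even the exact object name are assumptions to adapt to your own setup):</p>
<div class="CodeRay"><pre># the PDU reports the per-phase load, typically in tenths of amperes
snmpwalk -v3 -l authNoPriv -u munin -a SHA -A 'the-passphrase' \
    pdu1.example.com PowerNet-MIB::rPDULoadStatusLoad</pre></div>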
<p>Then I had to write a plugin for it, which wasn’t too bad; the data is simple. Too bad I couldn’t find a way to get, through <span class="caps">SNMP</span>, the high limit of current drain on the strip — it did report the configured (default) limits for near-overload and overload, which makes it very easy to set them up in Munin. Unfortunately, only <em>after</em> writing the plugin did I find out that the <a href="https://github.com/Flameeyes/munin-contrib/">Munin contrib repository</a> already had not one but <em>two</em> plugins trying to do the same. Neither is very good at it though: neither supported Munin’s <span class="caps">SNMP</span> framework, one had a very unclear licensing situation (which is unfortunately common in the contrib repository), and used sh and net-snmp’s command-line utilities to access the strip.</p>
<p>So after adding my plugin, and removing the two bad ones, I also looked into cleaning up the contrib tree a little bit. It’s far from perfect: there are still miscategorized plugins and duplicates, and others (such as one of the net-p2p/transmission monitors) which rely on external script files instead of being written in a single one. But at least I was able to remove and recategorize enough of them that it starts to make some sense. If you’re a Munin user and would like Gentoo to provide more, better plugins, then please take the time to see which of the plugins currently in the contrib tree are trying to reimplement something and failing at it (lots of them, I’m afraid, especially those related to <span class="caps">APC</span> UPSes), and get rid of them. There is also work to be done to bring even just the documentation of the plugins up to speed with the format used by Munin proper, and that’s without talking about improving them to follow the right code style or anything.</p>
<p>I also spent some time improving my <span class="caps">IPMI</span> plugin (which you can now find in the contrib repository if you’re not a Gentoo user – if you’re a Gentoo user it takes the place of the original <span class="caps">IPMI</span> plugins shipped with Munin – after I removed all the others that were trying to do the same thing, sometimes with twice as many lines of code as mine), and now it can monitor foreign hosts as well. How is this useful? Well, among other things it lets you monitor Windows boxes and other boxes where you either lack access or you can’t install any <span class="caps">IPMI</span> tool (I have a couple of systems that are running RHEL4 to monitor, okay?).</p>
<p>One interesting thing I learnt from this experience is that it makes total sense to monitor voltages, at least on HP servers. Besides catching a <span class="caps">PSU</span> gone wrong, HP has one probe on the <span class="caps">CMOS</span> battery, which is a <a href="http://en.wikipedia.org/wiki/CR2032">3V CR2032 lithium battery</a> whose voltage decreases over time, and thus will show up in the list when it has to be replaced — unfortunately it also seems like their newest servers don’t have a probe there, which is <em>bad</em> (Excelsior has a <span class="caps">VBAT</span> probe which seems to be just the same thing).</p>
<p>This is all for today!</p>

Posted by SourceForge Robot 2014-09-09

I'm in my network, monitoring!

<p>While I originally came here to Los Angeles to work as a firmware <del>developer</del> engineer, I’ve ended up doing a bit more than I was called for… in particular, it seems like I’ve been enlisted to work as a system/network administrator as well, which is not that bad to be honest, even though it still means that I have to deal with a number of old RedHat and derivative systems.</p>
<p>As I said before, this is good because it means that I can work on open-source projects, and Gentoo maintenance, during work hours, as the monitoring is done with Munin, Gentoo and, lately, Icinga. The main issue is of course having to deal with so many different versions of RedHat (there is at least one RHEL3, a few RHEL4, a couple of RHEL5 – and almost all of them don’t have subscriptions – some CentOS 5, plus the new servers that are Gentoo, luckily), but there are others.</p>
<p>Last week I started looking into <a href="https://www.icinga.org/">Icinga</a> to monitor the status of services: while Munin is good for knowing how things move over time and for having an idea of “what happened at that point”, it’s still not extremely good if you just want to know “is everything okay now or not?”. I also find most Munin plugins simpler to handle than Nagios’s (which are what Icinga uses), and since I already want the data available in graphs, I might just as well forward the notifications. This of course does not apply to boolean checks, which are pretty silly in Munin.</p>
<p>There is <a href="http://munin-monitoring.org/wiki/HowToContactNagios">some documentation</a> on the Munin website on how to set up Nagios notifications, and it mostly works flawlessly for Icinga. The one difference is that you have to change the <span class="caps">NSCA</span> configuration, as Icinga uses a different command file path and a different user, which means you have to set up</p>
<div class="CodeRay"><pre>nsca_user=icinga
nsca_group=icinga... read more

Posted by SourceForge Robot 2014-09-09

Why my Munin plugins are now written in Perl

<p><em>This post is an interlude between Gentoo-related posts. The reason is that I have one in drafts that requires me to produce some results that I don’t have yet, so it’ll have to wait for the weekend or so.</em></p>
<p>You might remember that <a href="http://blog.flameeyes.eu/2012/08/munin-again-sorry">my original <span class="caps">IPMI</span> plugin</a> was written in <span class="caps">POSIX</span> sh and awk, rather than bash and gawk like the original one. Since then, the new plugin (which, as it turns out, might become part of the 2.1 series, but not as a replacement for both of the old ones, since <span class="caps">RHEL</span> and Fedora don’t package a new enough version of FreeIPMI) has been rewritten in Perl, using neither sh nor awk. Similarly, I’ve written a <a href="http://blog.flameeyes.eu/2012/10/munin-sensors-and-ipmi">new plugin for sensors</a> which I also wrote in Perl (although in this case the original one used it as well).</p>
<p>So why did I learn a new language (I had never programmed in Perl before six months ago) just to get these plugins running? Well, as I said in the other post, the problem was calling the same command so many times, which is why I wanted to go multigraph — but when dealing with variables, sticking to <span class="caps">POSIX</span> sh is a huge headache. One of the common ways to handle this is to save the output of a command to a temporary directory and parse it multiple times, but that’s quite a pain, as it might require I/O to disk, and it also means that you have to execute more and more commands. Doing the processing in Perl means that you can save things in variables, or even just parse the output once and split it into multiple objects, to be used later for output, which is what I’ve been doing for parsing FreeIPMI’s output.</p>
<p>But why Perl? Well, Munin itself is written in Perl, so while my usual language of choice is Ruby, the plugins are much more usable if doing it in Perl. Yes, there are some alternative nodes written in C and shell, but in general it’s a safe bet that these plugins will be executed on a system that at least supports Perl — the only system I can think of that wouldn’t be able to do so would be OpenWRT, but that’s a whole different story.</p>
<p>There are a number of plugins written in Python and Ruby, some in the official package, but most in the <a href="https://github.com/munin-monitoring/contrib">contrib repository</a> and they could use some rewriting. Especially those that use <code>net-snmp</code> or other <span class="caps">SNMP</span> libraries, instead of Munin’s Net::<span class="caps">SNMP</span> wrapper.</p>
<p>But while the language is of slight concern, some of the plugins could use some rewriting simply to improve their behaviour. As I’ve said, using <a href="http://munin-monitoring.org/wiki/MultigraphSampleOutput">multigraphs</a> it’s possible to reduce the number of times the plugin is executed, and thus the number of calls to the backend, whatever that is (a program, or access to <code>/sys</code>), so in many cases plugins that support multiple “modes” or targets through wildcarding can be improved by making them a single plugin. In some cases, it’s even possible to collapse multiple plugins into one, as I did with the various <code>apache_*</code> plugins shipping with Munin itself, replaced on my system by <code>apache_status</code> as provided by the contrib repository, which fetches the server status page only once and then parses it to produce the three graphs that were previously created by three different plugins with three different fetches.</p>
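<p>To give an idea of what that looks like in practice, a multigraph plugin simply emits several graph sections in a single run, separated by <code>multigraph</code> lines; trimmed-down, made-up output along these lines:</p>
<div class="CodeRay"><pre>$ munin-run apache_status config
multigraph apache_accesses
graph_title Apache accesses
accesses.label accesses
multigraph apache_volume
graph_title Apache volume
volume.label bytes
multigraph apache_processes
graph_title Apache processes
busy.label busy servers</pre></div>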
<p>Another important trick up our sleeves while working on Munin plugins is <a href="http://munin-monitoring.org/wiki/protocol-dirty-fetch">dirty config</a>, which basically means that (when the node indicates support for it) you can make the plugin output the values as well as the configuration during the config run — this saves you one full trip to the node (to fetch the data), and usually also saves one more call to the backend. In particular, with these changes my <span class="caps">IPMI</span> plugin went from requiring six calls to <code>ipmi-sensors</code> per update, for the three graphs, to just one. And since it’s either <span class="caps">IPMI</span> on the local bus (which might require some time to access) or over <span class="caps">LAN</span> (which takes more time), the difference is definitely visible both in timing and in traffic — in particular, one of the servers at my day job monitors another seven servers (which can’t be monitored through the plugin locally), which means that we went from 42 to 7 calls per update cycle.</p>
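<p>A sketch of what dirty config looks like from the plugin’s side: as far as I can tell the capability is signalled through the <code>MUNIN_CAP_DIRTYCONFIG</code> environment variable, and when it is set the config run is allowed to emit the values too, so no separate fetch is needed (field names here are made up):</p>
<div class="CodeRay"><pre>$ MUNIN_CAP_DIRTYCONFIG=1 munin-run freeipmi config
multigraph ipmi_temp
graph_title IPMI temperatures
ambient.label Ambient
ambient.value 24.000</pre></div>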
<p>So if you use Munin, and either have had timeout issues (in the past or recently) or have some time at hand to improve some plugins, you might want to follow what I’ve been doing, and start improving or rewriting plugins to support multigraph or dirtyconfig, and thus improve their performance.</p>

Posted by SourceForge Robot 2014-09-09

Asynchronous Munin

<p>If you’re a Munin user in Gentoo and you look at ChangeLogs, you probably noticed that yesterday I committed quite a few changes to the latest ~arch ebuild of it. The main topic for these changes was async support, which unfortunately I think is still not ready, but let’s take a step back. Munin 2.0 brought one feature that was clamored for, and one that was simply extremely interesting: the former is the native <span class="caps">SSH</span> transport, the latter is what is called “Asynchronous Nodes”.</p>
<p>With a classic node, whenever you run the update you actually have to connect to each monitored node (real or virtual), get the list of plugins, get the config of each plugin (which is not cached by the node), and then get the data for said plugin. For things that are easy to get because they only require reading data out of a file, this is okay, but when you actually have to contact services that take time to respond, it’s a huge pain in the neck. This gets even worse when <span class="caps">SNMP</span> is involved, because then you have to make multiple requests (for multiple values) both to get the configuration and to get the values.</p>
<p>Add to the mix that the default timeout on the node, for various reasons, is 10 seconds, which, <a href="http://blog.flameeyes.eu/2012/07/munin-snmp-and-ipmi">as I wrote before</a>, makes it impossible to use the original <span class="caps">IPMI</span> plugin on most of the servers available out there (my plugin instead seems to work just fine, thanks to FreeIPMI). You can increase the timeout, even though this is not really documented to begin with (unfortunately, like most things about Munin), but that does not help in many cases.</p>
<p>So here’s how the Asynchronous node should solve this issue: on a standard node, the requests to a single node are serialized, so you’re actually waiting for each to complete before the next one is fetched, as I said; this can make the connection to the node take, all in all, a few minutes, and if the connection is severed in the meantime, you lose your data. The Asynchronous node, instead, has a separate service polling the actual node on the same host and saving the data in its spool file. The master in this case connects via <span class="caps">SSH</span> (it could theoretically work using xinetd but neither Steve nor I care about that), launches the asynchronous client, and then requests all the data that was fetched since the last request.</p>
<p>This has two side effects: the first is that your foreign network connection is much faster (there is no waiting for the plugins to config and fetch the data), which in turn means that the overall <code>munin-update</code> transaction is faster; moreover, if for whatever reason the connection fails at one point (a <span class="caps">VPN</span> connection crashes, a network cable is unplugged, …), the spooled data will cover the time that the network was unreachable as well, removing the “holes” in the monitoring that I’ve been seeing way too often lately. The second side effect is that you can spool data every five minutes, but only request it every, let’s say, 15, for hosts that do not require constant monitoring, even though you want to keep granularity.</p>
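<p>On the master’s side, pointing at an async node boils down to an <code>address</code> line roughly like the following (a sketch only: the path to <code>munin-async</code> and the user name vary per distribution and per how you set things up):</p>
<div class="CodeRay"><pre>[mybox.example.com]
    address ssh://munin-async@mybox.example.com /usr/libexec/munin/munin-async --spoolfetch
    use_node_name yes</pre></div>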
<p>Unfortunately, the async support is not as tested as it should be and there are quite a few things that are not ironed out yet, which is why the support for it in the ebuild has been in this much flux up to this point. Some things have been changed upstream as well: before, you had only one user, and that was used both for the <span class="caps">SSH</span> connections and for the plugins to fetch data — unfortunately one of the side effects of this is that you might have given your munin user more access (usually read-only, but oftentimes there’s no way to ensure that’s the case!) to devices, configurations or things like that… and you definitely don’t want to allow direct access to said user. Now we have two users, munin and munin-async, and the latter needs to have an actual shell.</p>
<p>I toyed with the idea of using the munin-async client as a shell, but the problem is that there is no way to pass options to it that way, so you can’t use <code>--spoolfetch</code>, which makes it vastly useless. On the other hand, I was able to make the <span class="caps">SSH</span> support a bit more reliable without having to handle configuration files on the Gentoo side (so that it works for other distributions as well — I need that because I have a few CentOS servers at this point), including the ability to use this without requiring netcat on the other side of the <span class="caps">SSH</span> connection (using <a href="http://blog.flameeyes.eu/2011/01/mostly-unknown-openssh-tricks">one old trick</a> with OpenSSH). But this is not ready yet; it’ll have to wait a little longer.</p>
<p>Anyway, as usual, you can expect updates to <a href="https://wiki.gentoo.org/wiki/Munin">the Munin page on the Gentoo Wiki</a> when the new code is fully deployed. The big problem I’m having right now is making sure I don’t screw up work’s monitoring while I’m playing with improving and fixing Munin itself.</p>

Posted by SourceForge Robot 2014-09-09

The unsolved problem of the init scripts

<p>Probably one of the biggest problems with maintaining software in Gentoo where a daemon is involved is dealing with init scripts. And it’s not really a problem just for Gentoo, as almost every distribution or operating system has its own way of handling init scripts. I guess this is one of the nice ideas behind systemd: having a single standard way for daemons to start, stop and reload is definitely a positive goal.</p>
<p>Even if I’m not sure myself whether I want the whole init system to be collapsed into a single one for every operating system out there, there is at least a chance that upstream developers will provide a standard command line for daemons, so that init scripts no longer need a hundred lines of pre-start setup commands. Unfortunately I don’t have much faith that this is going to change any time soon.</p>
<p>Anyway, let’s leave the daemons themselves alone, as that’s a topic for a post of its own and I don’t care to write it now. What remains is the init script itself. Now, while it seems quite a few people didn’t know about this before, OpenRC has supported almost forever a more declarative approach to init scripts: you set just a few variables, such as <code>command</code>, <code>pidfile</code> and similar, and the script works, as long as the daemon follows the most generic approach. Full documentation for this kind of script is present in the <code>runscript</code> man page and I won’t bore you with the details of it here.</p>
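<p>For the sake of example, such a declarative script can be as small as this (daemon name and paths are made up, and this assumes the daemon backgrounds itself and writes its own pidfile):</p>
<div class="CodeRay"><pre>#!/sbin/runscript
# declarative OpenRC init script: no custom start()/stop() needed

command="/usr/sbin/mydaemon"
command_args="--config /etc/mydaemon.conf"
pidfile="/run/mydaemon.pid"

depend() {
    need net
}</pre></div>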
<p>Besides the declaration of what to start, there are a few more issues that are currently handled to different degrees depending on the init script, rather than in a more comprehensive and seamless fashion. Unfortunately, I’m afraid that this is likely to stay the same way for a long time, as I’m sure that some of my fellow developers won’t care to implement the trickiest parts that can be implemented, but at least I can try to give a few ideas of what I found out while spending time on said init scripts.</p>
<p>So the number one issue is of course the need to create the directories the daemon will use beforehand, if they are to be stored on temporary filesystems. What happened is that one of the first changes that came with the whole systemd movement was to create <code>/run</code> and use that to store pidfiles, locks and other runtime stateless files, mounting it as tmpfs at runtime. This was something I was very interested in to begin with, because I was doing something similar before, on the router with a CF card (through an <span class="caps">EIDE</span> adapter) as harddisk, to avoid writing to it at runtime. Unfortunately, more than a year later, we still have lots of ebuilds out there that expect <code>/var/run</code> paths to be maintained from the merge to the start of the daemon. At least now there’s enough consensus about it that I can easily open bugs for them instead of just ignoring them.</p>
<p>For daemons that need <code>/var/run</code> it’s relatively easy to deal with the missing path; while a few scripts do use <code>mkdir</code>, <code>chown</code> and <code>chmod</code> to handle the creation of the missing directories, there is a really neat helper to take care of it, <code>checkpath</code> — which is also documented in the aforementioned man page for <code>runscript</code>, and which you’d use along the lines sketched below. But there are many other places where the two directories are used which are not initiated by an init script at all. One of these happens to be my dear <a href="http://blog.flameeyes.eu/tag/munin">Munin’s</a> cron script, used by the master — what to do then?</p>
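<p>For the init-script case, the typical use of <code>checkpath</code> is a <code>start_pre</code> like this (using Munin’s own runtime directory as the example):</p>
<div class="CodeRay"><pre>start_pre() {
    # create the runtime directory instead of mkdir/chown/chmod
    checkpath --directory --mode 0755 --owner munin:munin /run/munin
}</pre></div>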
<p>This last case has actually been among the biggest issues regarding the transition. It was the original reason why <code>screen</code> was changed to save its sockets in the users’ home directories instead of the previous <code>/var/run/screen</code> path — with relatively bad results all over, including me deciding to just move to <code>tmux</code>. In Munin, I decided to solve the issue by installing a script in <code>/etc/local.d</code> so that on start the <code>/var/run/munin</code> directory would be created … but this is far from a decent, standard way to handle things. Luckily, there actually is a way to solve this that has been standardised, to some extent — it’s called <code>tmpfiles.d</code> and was also introduced by systemd. While OpenRC implements the same basics, because of the differences in the two init systems not all of the features are implemented, in particular the automatic cleanup of the files on a running system — on the other hand, that feature is not fundamental for the needs of either Munin or screen.</p>
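<p>The tmpfiles.d fragment for Munin’s case is a one-liner along these lines (the file name is illustrative):</p>
<div class="CodeRay"><pre># /usr/lib/tmpfiles.d/munin.conf (columns: type path mode user group age)
d /run/munin 0755 munin munin -</pre></div>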
<p><em>There is an issue with the way these files should be installed, though. For most packages, the correct path to install to would be <code>/usr/lib/tmpfiles.d</code>, but the problem with this is that on a multilib system you’d end up easily with having both <code>/usr/lib</code> and <code>/usr/lib64</code> as directories, causing Portage’s symlink protection to kick in. I’d like to have a good solution to this, but honestly, right now I don’t.</em></p>
<p>So we have the tools at our disposal; what remains to be done, then? Well, there’s still one issue: which path should we use? Should we keep <code>/var/run</code> to be compatible, or should we just decide that <code>/run</code> is a good idea and run with it? My gut says the latter at this point, but it means that we have to migrate quite a few things over time. I have actually now started porting my packages to use <code>/run</code> directly, starting from pcsc-lite (since I had to bump it to 1.8.8 yesterday anyway) — Munin will come with support for tmpfiles.d in 2.0.11 (unfortunately, it’s unlikely I’ll be able to add support for it upstream in that release, but in Gentoo it’ll be there). Some more of my daemons will be updated as I bump them, as I have already spent quite a lot of time on those init scripts ironing out some more issues that I’ll delineate in a moment.</p>
<p>For some – but not all! – of the daemons it’s actually possible to set the pidfile location on the command line — for those, handling the move to the new path is dead easy, as you just make sure to pass something equivalent to <code>-p ${pidfile}</code> in the script, then change the <code>pidfile</code> variable, and done. Unfortunately that’s not always an option, as the pidfile can be either hardcoded into the compiled program, or read from a configuration file (the latter is the case for Munin). In the first case, no big deal: you change the configuration of the package, or worst case you patch the software, to make it use the new path, update the init script and you’re done… in the latter case, though, we have trouble at hand.</p>
<p>If the location of the pidfile is to be found in a configuration file, even if you change the configuration file that gets installed, you can’t count on the user actually updating it, which means your init script can easily get out of sync with the configuration file. Of course there’s a way to work around this, and that is to actually get the pidfile path from the configuration file itself, which is what I do in the <code>munin-node</code> script. To do so, you need to see what the syntax of the configuration file is. In the case of Munin, the file is just a set of key-value pairs separated by whitespace, which means a simple <code>awk</code> call can give you the data you need. In some other cases, the configuration file syntax is so messed up that getting the data out of it is impossible without writing a full-blown parser (which is not worth it). In that case you have to rely on the user to actually tell you where the pidfile is stored, and that’s <em>quite</em> unreliable, but okay.</p>
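<p>In the munin-node case, the whole trick is more or less a one-liner in the init script (assuming the <code>pid_file</code> key in <code>munin-node.conf</code>, with a fallback if it’s missing):</p>
<div class="CodeRay"><pre>pidfile="$(awk '$1 == "pid_file" { print $2 }' /etc/munin/munin-node.conf)"
: "${pidfile:=/run/munin-node.pid}"</pre></div>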
<p>There is of course one more thing that needs to be said: what happens when the pidfile changes in the configuration between a start and the following stop? If you’re reading the pidfile out of a configuration file, it is possible that the user, or the ebuild, changed it in between, causing quite big headaches when trying to restart the service. Unfortunately my users experienced this when I changed Munin’s default from <code>/var/run/munin/munin-node.pid</code> to <code>/var/run/munin-node.pid</code> — the change was possible because the node itself runs as root, and then drops privileges when running the plugins, so there is no reason to wait for the subdirectory, and since most nodes will not have the master running, <code>/var/run/munin</code> wouldn’t be useful there at all. As I said, though, it would cause the running node to use one pidfile path and the init script another, failing to stop the service before starting it anew.</p>
<p>Luckily, William corrected it, although it’s still not out — the next OpenRC release will save some of the variables used at start time, allowing this kind of problem to be nipped in the bud without having to add tons of workarounds to the init scripts. It will require some changes in the functions for graceful reloading, but that’s, in retrospect, a minor detail.</p>
<p>There are a few more niceties that you could add to init scripts in Gentoo to make them more foolproof and more reliable, but I suppose this covers the main points that we’re hitting nowadays. For me, I suppose, it’s just going to be time to list and review all the init scripts I maintain, which are quite a few.</p>

Posted by SourceForge Robot 2014-09-09

Munin and IPv6

<p>Okay, here comes another post about <a href="http://munin-monitoring.org/">Munin</a> for those who are using this awesome monitoring solution (okay, I think I’ve been involved in upstream development more than I expected when Jeremy pointed me at it). While the main topic of this post is going to be IPv6 support, I’d like to first spend a few words of context on what’s going on.</p>
<p>Munin in Gentoo has been slightly patched in the 2.0 series — most of the patches were sent upstream the moment they were introduced, and most of them have been merged for the following release. Some of them, though, including the one bringing <a href="http://blog.flameeyes.eu/2012/07/munin-snmp-and-ipmi">my FreeIPMI plugin</a> (or at least its first version) in to replace the OpenIPMI plugins, and those dealing with changes that wouldn’t have been kosher for other distributions (namely, Debian) at this point, were not merged into the 2.0 branch upstream.</p>
<p>But now Steve has opened a new branch for 2.0, which means that the development branch (Munin does not use the master branch, for the simple logistic reason of having a <code>master/</code> directory in <span class="caps">GIT</span>, I suppose) is directed toward the 2.1 series instead. This means not only that I can finally push some of my recent <a href="http://blog.flameeyes.eu/2012/12/why-my-munin-plugins-are-now-written-in-perl">plugin rewrites</a> but also that I can make some deeper changes to it, including rewriting the <strong>seven</strong> Asterisk plugins into a single one, and working hard on the <span class="caps">HTTP</span>-based plugins (for web servers and web services) so that they use a shared backend, like <span class="caps">SNMP</span>. This actually completely solved an issue that, in Gentoo, we had solved only partially before — my <a href="http://www.flameeyes.eu/projects/modsec">ModSecurity ruleset</a> blacklists the default libwww-perl user agent, so with both the partial and the complete fix, Munin advertises itself in the request; with the new code it also includes the plugin that is currently making the request, so that it’s possible to know which requests belong to what.</p>
<p>Speaking of Asterisk, by the way, I have to thank <a href="https://sysadminman.net/">Sysadminman</a> for lending me a test server for working on said plugins — this not only got us the current new Asterisk plugin (7-in-1!) but also let me modify said seven plugins just a tad, so that instead of using Net::Telnet, I could just use IO::Socket::<span class="caps">INET</span>. This has been merged for 2.0, which in turn means that the next ebuild will have one less dependency, and one less <span class="caps">USE</span> flag — the asterisk flag for said ebuild only added the Net::Telnet dependency.</p>
<p>To the main topic — how did I get to IPv6 in Munin? Well, I was looking at which other plugins needed to be converted to “modernity” – which to me means re-using as much code as possible, collapsing multiple plugins into one through multigraph, and supporting virtual nodes – and I found the squid plugins. This was interesting to me because I actually have one squid instance running, on the tinderbox host, to avoid direct connections to the network from the tinderboxes themselves. These plugins do not use libwww-perl like the other <span class="caps">HTTP</span> plugins, I suppose (but I can’t be sure, for reasons I’m going to explain in a moment), because the <code>cache://objects</code> request that has to be made might or might not work with the noted library. Since, as I said, I have a squid instance, and these (multiple) plugins look exactly like the kind of target that I was looking to rewrite, I started looking into them.</p>
<p>But once I started, I had a nasty surprise: my Squid instance only replies over IPv6, and that’s intended (the tinderboxes are only assigned IPv6 addresses, which makes it easier for me to access them, and have no <span class="caps">NAT</span> to the outside, as I want to make sure that all network access is filtered through said proxy). Unfortunately, by default, libwww-perl does not support IPv6. And indeed, neither do most of the other plugins, including the Asterisk one I just rewrote, since they use IO::Socket::<span class="caps">INET</span> (instead of IO::Socket::INET6). A quick search around, and <a href="http://eintr.blogspot.it/2009/03/bad-state-of-ipv6-in-perl.html">this article</a> turned up — although then <a href="http://www.perl.org/about/whitepapers/perl-ipv6.html">this also turned up</a>, which relates to IPv6 support in the Perl core itself.</p>
<p>Unfortunately, even with the core itself supporting IPv6, libwww-perl seems to have different ideas, and that is a showstopper for me, I’m afraid. At the least, I need to find a way to get libwww-perl to play nicely if I want to use it over IPv6 (yes, I’m going to work around this for the moment and just write the new squid plugins against IPv4). On the other hand, using IO::Socket::IP would probably solve the issue for the remaining parts of the node, and that will for sure give us at least somewhat better support. Even better, it might be possible to abstract this and have a Munin::Plugin::Socket that falls back to whatever we need. As it is, right now it’s a big question mark what we can do there.</p>
<p>So what can be said about the current status of IPv6 support in Munin? Well, the Node uses Net::Server, and that in turn does not use IO::Socket::IP, but rather IO::Socket::<span class="caps">INET</span>, or INET6 if installed — that basically means that <strong>the node itself</strong> will support IPv6 as long as INET6 is installed, and would call for using it as well, instead of IO::Socket::IP — but the latter is the future and, for most people, will be part of the system anyway… The async support, in 2.0, will always use IPv4 to connect to the local node. This is not much of a problem, as Steve is working on merging the node and the async daemon into a single entity, which makes the most sense. Basically it means that in 2.1, all nodes will be spooled, instead of what we have right now.</p>
<p>The master, of course, also uses IPv6 — via IO::Socket::INET6 – yet another nail in the coffin of IO::Socket::IP? Maybe – and this covers all the communication between the two main components of Munin, which could be enough to declare it fully IPv6 compatible — and that’s what 2.0 is saying. But alas, this is not the case yet. On an interesting note, the fact that right now Munin supports arbitrary commands as transports, as long as they provide an I/O interface to the socket, makes the fact that it supports IPv6 quite moot. Not only do you just need an IPv6-capable <span class="caps">SSH</span> to handle it, but you could probably use <span class="caps">SCTP</span> instead of <span class="caps">TCP</span> simply by using a hacked-up netcat! I’m not sure if monitoring would gain any improvement from using <span class="caps">SCTP</span>, although I guess it might overcome some of the overhead related to establishing the connection, but… well, that’s a different story.</p>
<p>Of course, Munin’s own framework is only half of what has to support IPv6 for it to be properly supported; the heart of Munin is the plugins, which means that if they don’t support IPv6, we’re dead in the water. Perl plugins, as noted above, have quite a few issues with finding the right combination of modules for supporting IPv6. Bash plugins, and indeed those in any other language that could be used, would support IPv6 as well as the underlying tools do — indeed, even though libwww-perl does not work with IPv6, plugins written with <code>wget</code> would work out of the box on an IPv6-capable wget… but of course, the gains we have by using Perl are major enough that you don’t want to go that route.</p>
<p>All in all, I think what’s going to happen is that as soon as I’m done with the weekend’s work (which is quite a bit, since Friday was filled with a couple of server failures, and me finding out that one of my backups was <strong>not</strong> working as intended) I’ll prepare a branch and see how much of IO::Socket::IP we can leverage, and whether wrapping around it would help us with the new plugins. So we’ll see where this is going to lead us; maybe 2.1 will really be 100% IPv6 compatible…</p>

Posted by SourceForge Robot 2014-09-09

Accepted 1.4.5-3+deb6u1 in squeeze-lts (low) (Holger Levsen)

<a href="https://packages.qa.debian.org/m/munin/news/20140807T155433Z.html">[2014-08-07] Accepted 1.4.5-3+deb6u1 in squeeze-lts (low) (Holger Levsen)</a> link

Posted by SourceForge Robot 2014-08-07

Accepted 2.1.9-1 in experimental (medium) (Stig Sandbeck Mathisen)

<a href="https://packages.qa.debian.org/m/munin/news/20140730T153901Z.html">[2014-07-30] Accepted 2.1.9-1 in experimental (medium) (Stig Sandbeck Mathisen)</a> link

Posted by SourceForge Robot 2014-07-30

munin 2.0.21-2 MIGRATED to testing (Britney)

<a href="http://packages.qa.debian.org/m/munin/news/20140517T163913Z.html">[2014-05-17] munin 2.0.21-2 MIGRATED to testing (Britney)</a> link

Posted by SourceForge Robot 2014-05-17

Accepted 2.0.21-2 in unstable (medium) (Stig Sandbeck Mathisen)

<a href="http://packages.qa.debian.org/m/munin/news/20140511T131848Z.html">[2014-05-11] Accepted 2.0.21-2 in unstable (medium) (Stig Sandbeck Mathisen)</a> link

Posted by SourceForge Robot 2014-05-11

Accepted 2.0.21-1 in unstable (low) (Stig Sandbeck Mathisen)

<a href="http://packages.qa.debian.org/m/munin/news/20140510T183502Z.html">[2014-05-10] Accepted 2.0.21-1 in unstable (low) (Stig Sandbeck Mathisen)</a> link

Posted by SourceForge Robot 2014-05-10

Accepted 2.1.6.1-1 in experimental (medium) (Stig Sandbeck Mathisen)

<a href="http://packages.qa.debian.org/m/munin/news/20140501T215014Z.html">[2014-05-01] Accepted 2.1.6.1-1 in experimental (medium) (Stig Sandbeck Mathisen)</a> link

Posted by SourceForge Robot 2014-05-01

Munin helps showing performance changes

<p>Using btrfs on a networked backup server looked like a good idea, what with the
data integrity checksumming and all.</p>

<p>Reformatting it to ext4 gave a decent increase in write performance, and will
hopefully give fewer server crashes per week (from "many" to "none" is the goal).
Just before this wipe-and-reinstall, "umount" had been hanging for a few hours,
and the admin got a tad annoyed.</p>... read more

Posted by SourceForge Robot 2014-02-21

Accepted 2.1.5-1 in experimental (medium) (Holger Levsen)

<a href="http://packages.qa.debian.org/m/munin/news/20140216T112041Z.html">[2014-02-16] Accepted 2.1.5-1 in experimental (medium) (Holger Levsen)</a> link

Posted by SourceForge Robot 2014-02-16

munin 2.0.19-3 MIGRATED to testing (Britney)

<a href="http://packages.qa.debian.org/m/munin/news/20140131T163916Z.html">[2014-01-31] munin 2.0.19-3 MIGRATED to testing (Britney)</a> link

Posted by SourceForge Robot 2014-01-31