#531 Inconsistent results from check_disk

open
nobody
5
2012-05-14
2012-05-14
Anonymous
No

check_disk can return a status that is inconsistent with the threshold values in its own performance data: it can report CRITICAL even though, according to the performance data, the result should only be a WARNING.

For example:

[root@ui-142 root]# df /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 3652680 3180864 286268 92% /

[root@ui-142 root]# /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p / -vvvvvvv -u kB
Thresholds(pct) for / warn: 20.000000 crit 10.000000
calling stat on /
For /, total=913170, available=71567, available_to_root=117954, used=795216, fsp.fsu_files=232000, fsp.fsu_ffree=180847
For /, used_pct=92 free_pct=8 used_units=3.18086e+06 free_units=286268 total_units=3.65268e+06 used_inodes_pct=23 free_inodes_pct=77 fsp.fsu_blocksize=4096 mult=1024
Freespace_units result=0
Freespace% result=2
Usedspace_units result=0
Usedspace_percent result=0
Usedinodes_percent result=0
Freeinodes_percent result=0
DISK CRITICAL - free space: / 286268 kB (8% inode=77%);| /=3180864kB;2922144;3287412;0;3652680

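What is happening is that, like df, check_disk calculates the %free that determines the CRITICAL status from the space available to a non-root user (used + available).
For the performance data, however, it converts the % levels into thresholds in blocks using the total disk space available to root (3652680 kB in the example above). Compared against those thresholds, the used space of 3180864 kB lies between the warning level (2922144) and the critical level (3287412), so the performance data indicates it should only be a warning.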
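
To make the mismatch concrete, here is a rough sketch that redoes both calculations with the numbers from the verbose output above. It is not the plugin's code: the variable names are made up for illustration, and the comparison rules (an inclusive test on %free, a strict test on used kB) are assumptions.

```python
# Values from the verbose output above; blocks are 4096 bytes, so 4 kB each.
BLOCK_KB = 4
total_kb = 913170 * BLOCK_KB      # 3652680, the size df reports
used_kb = 795216 * BLOCK_KB       # 3180864
available_kb = 71567 * BLOCK_KB   # 286268, available to a non-root user

warn_free_pct, crit_free_pct = 20, 10   # from -w 20% -c 10%

# Status path: %free relative to the space a non-root user can use.
non_root_total_kb = used_kb + available_kb
free_pct = 100 * available_kb / non_root_total_kb   # about 8.3 (plugin prints 8)

# Assumption: the alert fires when %free is at or below the level.
if free_pct <= crit_free_pct:
    status = "CRITICAL"
elif free_pct <= warn_free_pct:
    status = "WARNING"
else:
    status = "OK"

# Performance-data path: thresholds on used kB derived from the full size.
perf_warn_kb = total_kb * (100 - warn_free_pct) // 100   # 2922144
perf_crit_kb = total_kb * (100 - crit_free_pct) // 100   # 3287412

# Assumption: a reader of the perfdata compares used kB with these levels.
if used_kb > perf_crit_kb:
    perf_status = "CRITICAL"
elif used_kb > perf_warn_kb:
    perf_status = "WARNING"
else:
    perf_status = "OK"

print(status, perf_status)
```

The two paths use different denominators (3467132 kB versus 3652680 kB), which is why the same usage falls on different sides of the 10% level.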

I attach a patch that calculates the thresholds in the performance data from the total available to a non-root user, so that they are consistent with the status calculation. I have left the total size in the performance data as the total available to root, as this is consistent with df and will be the value the user expects to see.

It does mean that if someone tries to check the threshold calculation they might think it wrong, but this will not be very obvious if they are just looking at it in a graphical display based on the performance data. Whatever one does will involve some compromise; my view is that this is the most useful for most people.

There is still a small band where the status and the performance data will disagree. This seems to be because the status calculation computes the %free as an integer and compares that, rather than converting the % value passed in as an option to blocks and comparing against that (as is done for the performance data). There are various ways to resolve this, but they need bigger changes to the code.
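
A sketch of what the patched calculation would give for the example above, plus one made-up filesystem that falls in the remaining band. This is not the patch itself. The helper names are hypothetical, the rounding (used % rounded up, %free = 100 - used %) is inferred from the used_pct=92 / free_pct=8 line in the verbose output, and the inclusive comparison on %free is an assumption.

```python
import math

used_kb = 3180864
available_kb = 286268
non_root_total_kb = used_kb + available_kb   # 3467132


def used_thresholds(total_kb, warn_free_pct, crit_free_pct):
    """Convert %free levels into thresholds on used kB (hypothetical helper)."""
    warn = total_kb * (100 - warn_free_pct) // 100
    crit = total_kb * (100 - crit_free_pct) // 100
    return warn, crit


# Patched behaviour as described: thresholds from the non-root total.
patched_warn_kb, patched_crit_kb = used_thresholds(non_root_total_kb, 20, 10)
# used_kb is now above patched_crit_kb, in line with the CRITICAL status.


def integer_free_pct(used, available):
    """Assumed rounding: used % rounded up, %free taken as the remainder."""
    used_pct = math.ceil(100 * used / (used + available))
    return 100 - used_pct


# Made-up filesystem: 8950 kB used, 1050 kB available (10.5% free exactly).
band_used, band_available = 8950, 1050
band_free_pct = integer_free_pct(band_used, band_available)   # rounds to 10
_, band_crit = used_thresholds(band_used + band_available, 20, 10)

status_is_critical = band_free_pct <= 10       # assumed inclusive comparison
perfdata_is_critical = band_used > band_crit   # exact comparison in kB

print(patched_warn_kb, patched_crit_kb, status_is_critical, perfdata_is_critical)
```

Under these assumptions the made-up filesystem is CRITICAL by status but below the critical threshold in the performance data, which is the kind of disagreement that comparing in blocks on both paths would remove.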

Discussion


  • Anonymous
    2012-05-14

    Patch to change threshold calculation