This morning one of my VPS servers crashed - kernel panic, out of memory - because hundreds of Perl interpreter processes spawned by Webmin never exited.
CentOS 6.4 in a Xen virtual environment, with the Virtualmin repository enabled and a recent update applied.
Got lots of email to root.
Subject: Cron <root@webserver> /etc/webmin/status/monitor.pl
Message:
Error: Failed to lock file /etc/webmin/status/oldstatus after 5 minutes. Last error was :
Error
-----
Failed to lock file /etc/webmin/status/oldstatus after 5 minutes. Last error was :
-----
I haven't looked at the code, but I suggest that if the lock still can't be acquired after five minutes, the script should kill itself, perhaps with some way of resetting the lock after a couple of hours. This would reduce the build-up of Perl processes waiting on a lock that is blocked for some reason.
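To make the suggestion concrete, here is a minimal sketch of a bounded lock wait. It is written in Python purely for illustration and is not Webmin's actual locking code; the function names `acquire_lock` and `release_lock`, the `.lock` file convention, and the polling interval are all assumptions.

```python
import os
import time


def acquire_lock(lock_path, timeout=300.0, poll=1.0):
    """Try to create lock_path exclusively.

    Returns True once the lock is held, or False if `timeout` seconds
    pass without getting it. Hypothetical helper, not Webmin's API.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation atomic: it fails if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
            continue
        # Record our PID so a later run can see who holds the lock.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        return True


def release_lock(lock_path):
    """Remove the lock file; a missing file is not treated as an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

A caller would exit straight away when `acquire_lock` returns False (for example with `sys.exit(1)`), so a stuck holder leaves at most one waiting process per cron run rather than an ever-growing pile.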
Cheers,
Chris
How often do you have scheduled monitoring set up to run? If it is more often than once every 5 minutes, I can see how this kind of process build-up could happen.
The real question is why /etc/webmin/status/oldstatus could not be locked. The file /etc/webmin/status/oldstatus.lock should contain the PID of the process that was holding the lock, although if your system has been rebooted that won't be very useful anymore.
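For anyone who wants to check this on a live system, a rough sketch of the test follows: read the PID from the lock file and see whether that process still exists. It assumes the lock file holds nothing but a decimal PID and that the host is POSIX; the helper names are made up for this example and are not part of Webmin.

```python
import os


def read_lock_pid(lock_path):
    """Return the PID stored in lock_path, or None if it is missing or unparsable."""
    try:
        with open(lock_path) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_is_alive(pid):
    """Report whether a process with this PID currently exists (POSIX only)."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 only performs the existence and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

If `pid_is_alive(read_lock_pid("/etc/webmin/status/oldstatus.lock"))` is false, the lock is stale; if it is true, that PID is the process worth inspecting with `ps`.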
I had a look through the cron log ... lots of
INFO (Job execution of per-minute job scheduled for 00:27 delayed into subsequent minute 00:28. Skipping job run.)
ERROR (setreuid failed): Resource temporarily unavailable
messages there.
The scheduled monitoring period is set to 5 minutes on my system, which I'm assuming is the installation default as I can't recall ever changing it. I'll up it to twenty minutes, so at least if the error occurs again I should have more time to catch the behaviour and find out what's holding the lock. Because the kernel panicked and we couldn't get a console session, I couldn't investigate without a reboot when we found the issue.
I had the same error starting last night at 00:40 (Webmin 1.650)
oldstatus.lock contains 10577 - it's the PID of
/usr/bin/perl /usr/libexec/webmin-1.650/status/monitor.pl running under root.
The status monitor appears to be working OK; however, the process at 10577 (started at 00:30) is basically stuck.
When the status monitor runs there are 2 processes, hence the warning.
I have seen this before but can't remember what cleared it, perhaps restarting Webmin.
Any further diagnosis you need?
Richard - did you also see the same problem of monitor.pl using up all the memory on your system?
No, only the one stuck process, all the others terminate OK, so no significant memory use.
FWIW, one significant thing I was doing at the time the monitor got stuck was taking my Internet connection up and down (I have a ping monitor for the WAN address, which would have been changing - it's an ADSL dynamic IP).
Update - the same happened again this evening when the broadband was going up and down by itself. It seems even more likely that it's the ping monitor that gets stuck.
Did you see the same OOM errors again when your broadband went down? Or did you just see locking errors from monitor.pl?
No, just a stuck monitor process as before.
OK, it looks like the issue is that in some cases ping never terminates, even when run with the -w flag. I will add an additional timeout enforced by Webmin in the next release to handle this case.
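As an illustration of what such an externally enforced timeout can look like (a Python sketch under stated assumptions, not the code that will ship in Webmin), the parent can put its own hard limit on the child and kill it if `ping` ignores its `-w` deadline. The names `run_with_deadline` and `ping_once` and the default limits are hypothetical; the `-c` and `-w` flags are those of Linux iputils `ping`.

```python
import subprocess


def run_with_deadline(argv, deadline_seconds):
    """Run argv and return its exit code, or None if it had to be killed."""
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so no stuck process is left behind.
        return None
    return completed.returncode


def ping_once(host, ping_wait=5, hard_limit=10):
    """Send one probe; hard_limit is the parent's backstop if -w is ignored."""
    # -c 1: single probe; -w: ping's own deadline in seconds (Linux iputils).
    argv = ["ping", "-c", "1", "-w", str(ping_wait), host]
    return run_with_deadline(argv, hard_limit)
```

Keeping `hard_limit` larger than `ping_wait` means the backstop only fires when `ping` has already failed to honour its own deadline.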