I have Webmin 1.970 running on 6 CentOS servers. On one of 3 CentOS8 servers, Webmin crashes with OOM, rendering the server unusable for several seconds.
The server has 10GB RAM and typically runs at about 2GB "Used" (from free -m).
Sequence:
1. Open Webmin on the server from a browser; it tries to load the dashboard.
2. The menu loads, but the main dashboard does not.
3. On the server, RAM usage climbs rapidly until both real RAM and the 3GB swap are 100% full.
4. 'watch free -m' stops responding, as do the services provided by the server, until the OOM kill completes and the server responds again. The problem is that this server provides LDAP, DNS and DHCP to my network, so pretty much everything stops for a few seconds.
Once Webmin reloads, all is OK: I can reload the Webmin dashboard with no problem, but when I come back the next day the same issue repeats.
Line 35 of /usr/libexec/bind8/records-lib.pl is where I see named-compilezone getting called:
    if (&is_raw_format_records($rootfile)) {
        # Convert from raw format first
        &has_command("named-compilezone") ||
            &error("Zone file $rootfile is in raw format, but the ".
                   "named-compilezone command is not installed");
        open($FILE, "named-compilezone -f raw -F text -o - $origin $rootfile |");
    }
    else {
        # Can read text format records directly
        open($FILE, "<", $rootfile);
    }
Commenting out the raw conversion "if" causes an (expected) error when loading webmin - but no OOM cycle initiates, and the rest of Webmin works as expected. So - it is this "named-compilezone" being called causing my issue when it hits my rpz-malicious zone, which is a 170MB raw file.
The perl script's command when called from a command line takes about 10 seconds to complete:
    [root@emp81 slaves]# named-compilezone -f raw -F text -o ~/test rpz-malicious-domain rpz-malicious-domain.zone.raw
    zone rpz-malicious-domain/IN: loaded serial 2021031101
    dump zone to /root/test...done
    OK
That writes the raw file out to a 163MB text file.
So I guess the question is what is Webmin doing while it waits for records-lib.pl to complete its run through zone files?
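One way to size what that pipe hands back to the caller is to count the records in the text conversion while streaming it, so nothing has to be held in memory. This is only a rough sketch: `count_zone_records` is a made-up helper name, and it assumes the text output has one record per line with the class `IN` in the third field, which I believe is what the default output style produces.

```shell
# Hypothetical helper: count resource records in text-format zone data on stdin.
# Assumes one record per line, laid out as "owner TTL IN TYPE rdata".
count_zone_records() {
    awk '$3 == "IN" { n++ } END { print n + 0 }'
}

# Possible usage on the server (same flags as the Webmin call above, not run here):
#   named-compilezone -f raw -F text -o - rpz-malicious-domain \
#       rpz-malicious-domain.zone.raw | count_zone_records
```

If the count comes back in the hundreds of thousands, any caller that keeps every record in memory while reading the pipe would be expected to grow accordingly, though whether Webmin does that is not confirmed here.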
Hi,
There was a bug that we've fixed for Webmin 1.971.
Meanwhile, if you disable real-time monitoring in theme configuration and reload all opened Webmin tabs, does it solve the problem?
By the way, how many Webmin tabs did you have open?
One webmin tab active for that server.
Thanks, when will 1.971 be available?
I'll check disabling real-time monitoring.
Last edit: Simon Wilson 2021-01-30
Disabling real-time monitoring in theme configuration does not stop the problem.
OK, the problem is a large BIND zone.
With Webmin in debug logging mode, the last line in the log as memory usage goes through the roof is this:
303955 [30/Jan/2021 14:48:33.000000] root 192.168.1.1 bind8 CMD "cmd=rndc -c /etc/rndc.conf sync rpz-malicious-domain 2>&1
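For anyone repeating this, a small sketch of pulling the last logged command out of the debug log rather than eyeballing it. `last_webmin_cmd` is a hypothetical helper, and the line layout is assumed from the single log line quoted above, so treat it as a starting point only.

```shell
# Hypothetical helper: print the last command found in Webmin debug log lines on stdin.
# Assumes lines shaped like:  <pid> [<timestamp>] <user> <ip> <module> CMD "cmd=<command>
last_webmin_cmd() {
    grep ' CMD "cmd=' | tail -n 1 | sed -e 's/^.* CMD "cmd=//' -e 's/"[[:space:]]*$//'
}

# Possible usage (the log path is an assumption; use the one set in your Webmin
# debugging configuration):
#   last_webmin_cmd < /var/webmin/webmin.debug
```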
OK, fixed it...
My BIND setup does not (did not) use a separate rndc.conf file, only using a controls statement in named.conf. The sync cmd being called by webmin was therefore failing to run. I have generated an rndc.conf file and the dashboard loads ok without memory going up.
So - definitely an issue to be caught and fixed, but work-around in place for now.
Last edit: Simon Wilson 2021-01-30
Doing some more digging...
BIND has a valid configuration wherein RNDC is managed solely with configuration settings in named.conf and an rndc.key file, which is how mine was set up. Note that I have had this config for several years, including while using Webmin, with no issues. Recently adding the large RPZ zone appears to have exposed the issue.
I have now moved the RNDC config into an rndc.conf file, and Webmin dashboard now loads OK.
Still not sure why it is running a sync - but at least it completes.
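For reference, a minimal sketch of the kind of rndc.conf this describes, reusing the existing key file instead of generating a new key. The helper name `write_rndc_conf`, the key name `rndc-key`, the key path `/etc/rndc.key`, and the localhost address and port 953 are all assumptions for illustration; they need to match whatever the controls statement in named.conf actually uses.

```shell
# Hypothetical helper: write a small rndc.conf that points at an existing key file.
#   $1 = output path, $2 = key file (assumed /etc/rndc.key), $3 = key name (assumed rndc-key)
write_rndc_conf() {
    out=$1
    keyfile=${2:-/etc/rndc.key}
    keyname=${3:-rndc-key}
    cat > "$out" <<EOF
# Reuse the key that the controls statement in named.conf already trusts
include "$keyfile";

options {
    default-key "$keyname";
    default-server 127.0.0.1;
    default-port 953;
};
EOF
}

# Possible usage (as root, after checking the key name inside the key file):
#   write_rndc_conf /etc/rndc.conf /etc/rndc.key rndc-key
#   rndc -c /etc/rndc.conf status
```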
Not fixed. I came back after a couple of days and Webmin again hogged memory until it crashed. I have debug logs, but they don't seem to indicate much that is useful.
I've installed the 1.971 devel build, will see if that stops it.
Still not fixed in 1.971
Because Webmin 1.971 doesn't include this theme fix. You would need to manually update the theme to the latest dev version using Theme Configuration page.
I've pulled down the dev version from the theme configuration page, and the issue is still there.
installed version 19.71-beta2
What makes you think then that it's a Webmin issue?
Because it's only when I load the Webmin dashboard page that the issue occurs. The server otherwise runs fine, with no other occurrences of oom. It ONLY happens when that page loads. systemctl restart webmin immediately releases all of the memory and stops the race to oom.
And this line in the system log kinda tells a story:
kernel: Out of memory: Killed process 94511 (/usr/libexec/we) total-vm:6233108kB, anon-rss:5678960kB, file-rss:0kB, shmem-rss:0kB, UID:0
Either it's a Webmin issue or it's not gracefully catching an externally caused issue which is resulting in Webmin generating a memory race to oom. Either way, I'd have thought it would be of interest to resolve... :)
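For scale, the anon-rss figure in that kernel line works out to roughly 5.4 GiB held by the one killed process. A minimal sketch of extracting it from such a line follows; `oom_anon_rss_mib` is a hypothetical helper name and the field layout is assumed to match the line quoted above.

```shell
# Hypothetical helper: print the anon-rss value of a kernel OOM "Killed process" line in MiB.
#   $1 = the log line; returns non-zero if no anon-rss field is found.
oom_anon_rss_mib() {
    kb=$(printf '%s\n' "$1" | sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p')
    [ -n "$kb" ] || return 1
    # Integer division: kB to MiB, rounded down
    echo $(( kb / 1024 ))
}

# Possible usage against the system log (not run here):
#   journalctl -k | grep 'Out of memory' | while IFS= read -r line; do
#       oom_anon_rss_mib "$line"
#   done
```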
I'm still leaning towards the BIND module (see comment at https://sourceforge.net/p/webadmin/discussion/600155/thread/8d78d189e1/#cccf). I will try to disable that module and see if it stops it.
You can try switching themes and see what it does - however, I don't think it will change anything.
It works for me on dozens of servers. We cannot help without knowing real details about what's happening. Though, do not get me wrong, we are interested in fixing issues, if we can figure out what we're doing wrong (where the bug is).
Noted...
And I know - I've run Webmin on dozens of servers with no issues. I've reset the debug log, will see what happens. What is the best way for me to take the BIND module out of the picture for testing? Export, Delete? I tried changing the name of the named.conf config file it was pointing to in module config, but the debug log shows it is still running rndc syncs on my zones, including the very large 500k records rpz zone - which is a 200MB "raw" DNS file. I think Webmin was running OK before I added this zone to BIND.
"What is the best way for me to take the BIND module out of the picture for testing? Export, Delete?"
It's an interaction with the BIND module.
When I launch Webmin after not having been opened for a while, it fires off RNDC commands to sync BIND zones. One of the BIND zones is a large raw file, and Webmin appears to trigger a 'compile' of this zone (see attached screenshot with named-compile processes running). Until that compile process is completed or killed, Webmin memory usage continues to climb until we hit OOM. If I kill webmin before the named-compile process finishes, relaunching webmin re-initiates a named-compile process.
This is probably a fringe use case, but I would not have thought it was that rare.
Should I log a bug with more info?
Hi folks, I understand my fringe case issue is not a priority but I am genuinely trying to provide information needed to track this down, with very little response... I've been able to narrow it down to a specific set of circumstances which should not be difficult to reproduce - BIND module and a large raw DNS zone which is having a compile triggered when Webmin is opened.
I can open a bug report if that is preferred - point me in the preferred direction if there is a documented way to provide required information.
I should clarify one thing - the memory runaway happens when first loading the Webmin Dashboard on the server. Once the dashboard has loaded (e.g. by seeing out the OOM issues, killing the named-compilezone process, etc.) the Dashboard then works OK for what appears to be up to a day or so.
After commenting out the named-compilezone line in records-lib.pl the dashboard presents with:
bind8::list_system_info failed : Undefined subroutine &bind8::read_zone_file called at /usr/libexec/webmin/bind8/bind8-lib.pl line 4227.
...but as noted, the rest of Webmin now functions fine.
Hi again, Simon. Sorry for not replying sooner. I was busy preparing a new theme release at first, and later doing tons of other min related work.
By the way, if you haven't installed new Webmin 1.973, please do and see if that changes anything in regard to your problem.
Instead, could you provide the output of the top -c command, so there would be more details provided? Besides, what is the output of the findmnt and df -H commands?
Try going to the Theme Configuration > Dashboard and real-time monitoring page and first disable Enable stats history to see if it changes anything. If not, go back and disable Enable for disks and see if that changes anything, and if it still doesn't, go back and disable Enable real-time monitoring completely. Please share details on this process.
Finally, I must say that I am not very good at discussing issues at length (which is very time consuming) and prefer to interact with the live system, so if you could provide me with login details (in case none of the above helps) and share quick steps to reproduce the issue, it would help and save time.
Thanks Ilia. I have already tried 1.973 - no change. And I previously turned off real-time monitoring (per suggestion made earlier in the thread). No change.
Not sure if you saw my comments about the BIND module issues.
I can give you access, but you'll take down the server and (as it does DNS, DHCP, LDAP for the network) you will take down my entire network when you launch Webmin and it runs named-compilezone.
I have proved this issue only happens when the server is running BIND with a large raw BIND zone - this should be very easy to replicate.
Hi,
Look, I would suggest increasing RAM for starters and seeing how it goes.
I am not really certain it is a Webmin issue.
It's got 10GB and runs using 2... you're kidding on RAM, right?
I can run named-compilezone on a command line and it takes 10 seconds to write the 200MB zone, without RAM usage altering perceptibly. BIND is scripted to update and recompile the raw zone and reload it every day, which it does with no issues.
In fact everything I do on this server using BIND is flawless, until Webmin launches and triggers an unneeded named-compilezone run, at which point the Webmin process goes rogue until it hits OOM.
Have you attempted to replicate? I can share my zone file if you don't have a large enough raw zone. Run BIND, set up an RPZ zone definition with a suitably big zone file, and have BIND update it daily in the background so that its serial is incremented (I can share the script). Then use Webmin over a couple of days after updates.
I can happily manage around this by just not using Webmin. I'm not reporting here for the good of my health... I'm trying to help you track down an issue. At least try and replicate it.
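If sharing the real zone is not practical, a synthetic one of similar size should do for a replication attempt. A minimal sketch, assuming a plain RPZ layout where every entry is a `CNAME .` (NXDOMAIN) record; `gen_rpz_zone`, the `blockedN.example` names, and the SOA/NS values are all made up for illustration and are not taken from my real zone.

```shell
# Hypothetical helper: print a synthetic RPZ zone in text format.
#   $1 = number of blocked names, $2 = SOA serial (defaults to 1)
gen_rpz_zone() {
    count=$1
    serial=${2:-1}
    printf '$TTL 300\n'
    printf '@ IN SOA localhost. root.localhost. ( %s 3600 600 86400 300 )\n' "$serial"
    printf '@ IN NS localhost.\n'
    # One policy record per blocked name, relative to the zone origin
    awk -v n="$count" 'BEGIN { for (i = 0; i < n; i++) printf "blocked%d.example CNAME .\n", i }'
}

# Possible usage (file and zone names are placeholders, not run here):
#   gen_rpz_zone 500000 2021031101 > rpz-test.zone
#   named-compilezone -f text -F raw -o rpz-test.zone.raw rpz-test rpz-test.zone
```

Regenerating with a higher serial each day and reloading the zone should approximate the daily update cycle described above.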