Webmin / Bugs / #874 Cluster Webmin servers gets confused

Jamie Cameron - 2002-12-10

Does this always happen to a particular host or type of host?
It sounds like Webmin is having trouble fetching the list of
users or groups from some of the remote servers.
One way of debugging it is to add the line rpcdebug=1 to
/etc/webmin/config on the remote server, and see what is written
to the remote /var/webmin/miniserv.error file when you try to do
a refresh.
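
The config change described above can be sketched as a small script. This is only an illustration, assuming /etc/webmin/config is a plain key=value text file; the helper name enable_rpcdebug is hypothetical and is not part of Webmin.

```python
from pathlib import Path


def enable_rpcdebug(config_path: str) -> bool:
    """Ensure the key=value file at config_path contains rpcdebug=1.

    Hypothetical helper, not part of Webmin.
    Returns True if the file was changed, False if it was already set.
    """
    path = Path(config_path)
    lines = path.read_text().splitlines()
    if "rpcdebug=1" in lines:
        return False
    # Drop any other rpcdebug setting, then append the new one.
    lines = [ln for ln in lines if not ln.startswith("rpcdebug=")]
    lines.append("rpcdebug=1")
    path.write_text("\n".join(lines) + "\n")
    return True
```

On the remote server this would be called as enable_rpcdebug("/etc/webmin/config") by a user allowed to write that file. Calling it a second time leaves the file untouched.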

Roland Pope - 2002-12-11

Could all this be a timing hole? The server I am running the
administrative copy of webmin on is not very grunty and could
do with more memory. Webmin spawns a separate miniserv.pl for
each machine it has to connect to (17 + 1 fastrpc in this case),
and this sudden deluge drives the load average up on the box
quite quickly when I do a refresh. I changed all my servers to
use fastrpc, and although some bits seemed to work better, doing
a refresh started dropping some of the faster hosts from the
list (i.e. I get 'Removed 1028737120 from server list' for some
of the better performing hosts).
So what would happen if the remote server started responding
before the local miniserv.pl was ready? Is this possible?

Jamie Cameron - 2002-12-11

The forking of multiple processes is intentional, in order
to parallelise the refresh, and I don't think it could be
the cause of a problem.
Try turning on the rpcdebug=1 option on one of the
troublesome remote servers and see what is logged.

Roland Pope - 2002-12-11

When I turn rpcdebug on for one of the servers that is being
dropped from the list after a refresh, the last entry I see in
the miniserv.error log is
"fastrpc: call webmin::get_all_module_infos done = HASH(0x87ca490), (More HASH references here)....".
This seems to happen almost exclusively with the faster
servers. I haven't yet had the problem with a couple of Pentium
133s, but some of the faster servers are being dropped more
often than they get refreshed properly. This may be a red
herring, but it seems a little suspicious.
Maybe the forked processes are dying for some reason after
they get back the response to the remote 'get_all_module_infos'
call and are not reporting back to the master miniserv.pl?
Is there any way to increase the debugging level of these
forked processes on the main webmin server to try and
diagnose this possibility? Also, it would be handy to be able
to restrict the maximum number of child processes forked. This
would allow me to reduce the load on the main webmin server
and see if that improved things. I also imagine that the
number of webmin servers I will be looking after will grow
beyond the 17 I have now, and I don't want loads of miniserv
processes being forked simultaneously and bogging the master
webmin server down.
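
The idea of capping the number of simultaneous workers can be sketched as follows. This is not how Webmin does its refresh (the thread says it forks one process per host); it is a thread-based analogue, and the names refresh_all, refresh_one and max_workers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def refresh_all(hosts, refresh_one, max_workers=4):
    """Call refresh_one(host) for every host, at most max_workers at a time.

    Hypothetical sketch, not Webmin code. Returns a dict mapping each
    host to its result, or to the exception raised while refreshing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(refresh_one, host) for host in hosts}
        for host, fut in futures.items():
            try:
                results[host] = fut.result()
            except Exception as exc:
                # Keep the failure visible instead of silently dropping the host.
                results[host] = exc
    return results
```

With 17 hosts and max_workers=4, no more than four refreshes run at once, which would make it possible to test whether load on the master is what causes hosts to be dropped.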

Roland Pope - 2002-12-11

I have been running strace a few times to try and diagnose what
might be going on. On the main server I saw the following:

<snip master webmin server strace>
[pid 17053] write(13, "127 98f4dc01c01e82f74edfd0be8940"..., 37 <unfinished ...>
[pid 17053] <... write resumed> ) = 37
[pid 17053] write(13, "HASH,VAL%2Csession,UNDEF,VAL%2Ca"..., 127 <unfinished ...>
[pid 17053] <... write resumed> ) = -1 EPIPE (Broken pipe)
[pid 17053] --- SIGPIPE (Broken pipe) ---
</snip master webmin server strace>

On the remote server I could see that the write did get
through, and that the server sent the appropriate response
and then closed the TCP connection:
<snip remote webmin server strace>
[pid 9204] write(2, "fastrpc: call webmin::get_all_mo"..., 50) = 50
[pid 9204] write(2, "HASH(0x87cdf00),HASH(0x87cdee8),"..., 1263) = 1263
[pid 9204] write(2, "\n", 1) = 1
[pid 9204] write(9, "143445\n", 7) = 7
[pid 9204] write(9, "HASH,VAL%2Cstatus,VAL%2C1,VAL%2C"..., 4096) = 4096
[pid 9204] write(9, "5252CUsu%252525E1rios%25252520do"..., 139264) = 139264
[pid 9204] write(9, "%25252Cversion%252CVAL%25252C1%2"..., 85) = 85
[pid 9204] munmap(0x4023a000, 147456) = 0
[pid 9204] select(16, [9], NULL, NULL, {30, 0} <unfinished ...>
[pid 9204] <... select resumed> ) = 0 (Timeout)
[pid 9204] time(NULL) = 1039641207
[pid 9204] fstat64(4, {st_mode=S_IFREG|0600, st_size=30807, ...}) = 0
[pid 9204] _llseek(4, 30807, [30807], SEEK_SET) = 0
[pid 9204] write(4, "cnwchcm5.cnw.co.nz - admin [12/D"..., 92) = 92
[pid 9204] close(4) = 0
</snip remote webmin server strace>

Is it possible that the connection was closed by the remote
server before the master could read the response?
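
One possible reading of the two traces is that the remote side stopped waiting (its select() on the connection returns 0 after the 30-second timeout) and a later write from the master then failed with EPIPE. The sketch below reproduces only that general pattern on a local socket pair. It is not Webmin's fastrpc code, the function name is hypothetical, and the EPIPE result assumes a Linux-like system.

```python
import errno
import select
import socket


def write_after_idle_close(idle_timeout=0.05):
    """Return the errno seen by a client that writes after the server gave up waiting.

    Hypothetical demonstration, not Webmin code.
    """
    client, server = socket.socketpair()
    try:
        # Server side: wait briefly for a request and close on timeout,
        # similar in shape to a select() call that returns 0 (Timeout).
        readable, _, _ = select.select([server], [], [], idle_timeout)
        if not readable:
            server.close()
        # Client side: a late write now hits a peer that has already closed.
        try:
            client.sendall(b"late request\n")
        except OSError as exc:
            return exc.errno
        return None
    finally:
        client.close()
        server.close()
```

Python ignores SIGPIPE by default, so the failed write surfaces as an exception carrying errno.EPIPE rather than killing the process, whereas the strace output above shows the master process receiving SIGPIPE.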

Roland Pope - 2002-12-11

If I removed 4 of the 17 servers from the managed list and did
a refresh, things seemed to work fine (i.e. 5 out of 5 refreshes
in a row worked), regardless of which servers I removed.
When I added these servers back in, I was back to having
servers dropped on a refresh (4 out of 5 refreshes dropped
servers).
When I loaded the main webmin box up with extra work
before doing a refresh, the number of servers dropped from
the list increased further, leaving only my two slow old
Pentium 133s and a box on a slow-ish WAN link.
This suggests to me that the ability of the forked miniserv.pl
processes to respond in a timely fashion to the remote
servers is a factor in all of this. Maybe the remote servers
shouldn't close their TCP connections until the requestor has
read the data from them?
