#874 Cluster Webmin servers gets confused

1.030
open
5
2002-12-09
2002-12-09
Roland Pope
No

I have several issues with the 'Cluster Webmin Servers'
module, but as yet am unable to completely diagnose
what is going on.
The first relates to doing an update across all servers of
say, a webmin group. I select the group I want to edit
and click edit group. I get taken to a second screen
where I can make changes to the group. At the bottom
is a list of all the servers which have this group. I make
my changes and click 'Save'. I am taken to a screen
where I am given a list of the hosts where the updates
have taken place. This list is incomplete and appears to
have been interrupted, but if I look at one of the hosts
which should have been updated but was not on the list
returned, I see that the update has in fact been done on
that host?
Also in this same area, the Cluster Webmin module
seems to get confused sometimes and only shows one
or two hosts as having a particular user and/or group,
even though that item appears on a number of the
servers. When I look at the host directories
in /etc/webmin/cluster-webmin/hosts, I see that webmin
no longer has all the appropriate files for a given host (i.e.
missing .user and .group files). A 'Refresh Servers'
doesn't sort this problem out. I have to delete all the
hosts and add them back in again?

Discussion

  • Jamie Cameron

    Jamie Cameron - 2002-12-10

    Does this always happen to a particular host or type of host?
    It sounds like webmin is having trouble fetching the list of
    users or groups from some of the remote servers.
    One way of debugging it is to add the line rpcdebug=1 to
    /etc/webmin/config on the remote server, and see what is
    written to the remote /var/webmin/miniserv.error file when
    you try to do a refresh.
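
As a rough illustration of the rpcdebug=1 suggestion above, the following Python sketch adds that line to a key=value config file if it is not already set. The helper name enable_rpcdebug is hypothetical and not part of Webmin; only the option name and the /etc/webmin/config path come from the comment above.

```python
def enable_rpcdebug(config_path):
    """Make sure config_path contains the line rpcdebug=1.

    Returns True if the file was modified, False if the line was already there.
    (Hypothetical helper for illustration; not a Webmin API.)
    """
    with open(config_path, "r", encoding="utf-8") as fh:
        lines = fh.read().splitlines()

    if "rpcdebug=1" in lines:
        return False  # already enabled, leave the file untouched

    # Drop any other rpcdebug setting (for example rpcdebug=0), then append ours.
    kept = [ln for ln in lines if not ln.startswith("rpcdebug=")]
    kept.append("rpcdebug=1")

    with open(config_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(kept) + "\n")
    return True
```

On a remote server this would be called as enable_rpcdebug("/etc/webmin/config"), after which a refresh can be retried while watching /var/webmin/miniserv.error.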

     
  • Roland Pope

    Roland Pope - 2002-12-11

    Could all this be a timing hole? The server I am running the
    administrative copy of webmin on is not very grunty and could
    do with more memory. Webmin spawns a separate
    miniserv.pl for each machine it has to connect to (17 + 1
    fastrpc in this case) and this sudden deluge drives the
    load average up on the box quite quickly when I do a refresh. I
    changed all my servers to use fastrpc and although some bits
    seemed to work better, doing a refresh started dropping some
    of the faster hosts from the list (i.e. I get 'Removed
    1028737120 from server list' for some of the better performing
    hosts).
    So what would happen if the remote server started responding
    before the local miniserv.pl was ready? Is this possible?

     
  • Jamie Cameron

    Jamie Cameron - 2002-12-11

    The forking of multiple processes is intentional, in order
    to parallelise the refresh, and I don't think it could be
    the cause of a problem.
    Try turning on the rpcdebug=1 option on one of the
    troublesome remote servers and see what is logged.

     
  • Roland Pope

    Roland Pope - 2002-12-11

    When I turn rpcdebug on on one of the servers that is being
    dropped from the list after a refresh, the last entry I see in the
    miniserv.error log is
    "fastrpc: call webmin::get_all_module_infos done = HASH
    (0x87ca490), (More HASH references here)....".
    This seems to happen almost exclusively with the faster
    servers. I haven't yet had the problem with a couple of Pentium
    133's, but with some of the faster servers, they are being
    dropped more often than they get refreshed properly. This
    may be a 'Red Herring', but it seems a little suspicious.
    Maybe the forked processes are dying for some reason after
    they get back the response to the
    remote 'get_all_module_infos' call and are not reporting back
    to the master miniserv.pl?
    Is there any way to increase the debugging level of these
    forked processes on the main webmin server to try and
    diagnose this possibility? Also, it would be handy to be able
    to restrict the maximum number of child processes forked. This
    would allow me to reduce the load on the main webmin server
    and see if that improved things? Also, I imagine that the
    number of webmin servers I will be looking after will grow
    beyond the 17 I have now and I don't want loads of miniserv
    processes being forked simultaneously and bogging the master
    webmin server down.
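
The request above for a cap on simultaneously forked children could look something like the sketch below. It is only an illustration of bounded parallelism in Python using a thread pool; it is not how Webmin (which is written in Perl and forks processes) implements the refresh, and refresh_all, refresh_one and max_workers are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def refresh_all(hosts, refresh_one, max_workers=4):
    """Call refresh_one(host) for every host, at most max_workers at a time.

    Returns (results, failures): results maps host -> return value,
    failures maps host -> the exception raised while refreshing that host.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every host; the pool itself enforces the concurrency limit.
        futures = {pool.submit(refresh_one, host): host for host in hosts}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except Exception as exc:
                # Record the failure instead of silently dropping the host.
                failures[host] = exc
    return results, failures
```

Keeping a per-host failure record, rather than just omitting the host from the result, would also make it easier to tell a dropped host from one that was never contacted.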

     
  • Roland Pope

    Roland Pope - 2002-12-11

    I have been doing a few strace runs to try and diagnose what
    might be going on. On the main server I saw the following:

    <snip master webmin server strace>
    [pid 17053] write(13, "127 98f4dc01c01e82f74edfd0be8940"..., 37 <unfinished ...>
    [pid 17053] <... write resumed> ) = 37
    [pid 17053] write(13, "HASH,VAL%2Csession,UNDEF,VAL%2Ca"..., 127 <unfinished ...>
    [pid 17053] <... write resumed> ) = -1 EPIPE (Broken pipe)
    [pid 17053] --- SIGPIPE (Broken pipe) ---
    </snip master webmin server strace>

    On the remote server I could see that the write did get
    through and that the server responded with the appropriate
    response and then closed the tcp connection:
    <snip remote webmin server strace>
    [pid 9204] write(2, "fastrpc: call webmin::get_all_mo"..., 50) = 50
    [pid 9204] write(2, "HASH(0x87cdf00),HASH(0x87cdee8),"..., 1263) = 1263
    [pid 9204] write(2, "\n", 1) = 1
    [pid 9204] write(9, "143445\n", 7) = 7
    [pid 9204] write(9, "HASH,VAL%2Cstatus,VAL%2C1,VAL%2C"..., 4096) = 4096
    [pid 9204] write(9, "5252CUsu%252525E1rios%25252520do"..., 139264) = 139264
    [pid 9204] write(9, "%25252Cversion%252CVAL%25252C1%2"..., 85) = 85
    [pid 9204] munmap(0x4023a000, 147456) = 0
    [pid 9204] select(16, [9], NULL, NULL, {30, 0} <unfinished ...>
    [pid 9204] <... select resumed> ) = 0 (Timeout)
    [pid 9204] time(NULL) = 1039641207
    [pid 9204] fstat64(4, {st_mode=S_IFREG|0600, st_size=30807, ...}) = 0
    [pid 9204] _llseek(4, 30807, [30807], SEEK_SET) = 0
    [pid 9204] write(4, "cnwchcm5.cnw.co.nz - admin [12/D"..., 92) = 92
    [pid 9204] close(4) = 0
    </snip remote webmin server strace>

    Is it possible that the connection was closed by the remote
    server before the master could read the response?
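
In the remote trace above, the reply appears to be sent as a decimal length line ("143445\n") followed by exactly that many bytes of payload (4096 + 139264 + 85 = 143445). Assuming that framing, which is inferred from the strace output and not from the Webmin source, a reader has to loop until the full payload has arrived, and a connection closed part-way shows up as a short read. The Python sketch below illustrates the idea; recv_exact, recv_framed and send_framed are hypothetical names, and the request direction (which also carries a session key on its header line) is not modelled.

```python
import socket


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError(
                "peer closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_framed(sock):
    """Read one '<decimal length>\\n<payload>' message (assumed framing)."""
    header = bytearray()
    while not header.endswith(b"\n"):
        byte = sock.recv(1)
        if not byte:
            raise ConnectionError("peer closed before the length line ended")
        header += byte
    return recv_exact(sock, int(header.decode("ascii").strip()))


def send_framed(sock, payload):
    """Write the length line followed by the payload."""
    sock.sendall(str(len(payload)).encode("ascii") + b"\n" + payload)
```

With a reader like this, a reply cut short by an early close is reported as an explicit error, which is easier to diagnose than a host quietly disappearing from the list.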

     
  • Roland Pope

    Roland Pope - 2002-12-11

    If I removed 4 of the 17 servers from the managed list and did
    a refresh, things seemed to work fine (ie 5 out of 5 refreshes
    in a row worked), regardless of which servers I removed.
    When I added these servers back in, I was back to having
    servers dropped on a refresh (4 out of 5 refreshes dropped
    servers).
    When I loaded the main webmin box up with extra work
    before doing a refresh, the number of servers dropped from
    the list increased further, leaving only my two slow old
    Pentium 133's and a box on a slow-ish WAN link.
    This suggests to me that the ability of the forked miniserv.pl
    processes to respond in a timely fashion to the remote
    servers is a factor in all of this. Maybe the remote servers
    shouldn't close their TCP connections until the requestor has
    read the data from them?
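
One common way to express the idea in the last sentence is a half-close: the sender finishes writing, shuts down only its write side, and then waits for the peer to close before closing the socket itself. The Python sketch below shows that pattern in general terms. It is an assumption about what such a change could look like, not a description of what miniserv.pl or fastrpc actually do, and send_then_linger and read_until_eof are hypothetical names.

```python
import socket


def send_then_linger(sock, payload, timeout=5.0):
    """Send payload, signal end-of-data, then wait for the peer to close."""
    sock.sendall(payload)
    sock.shutdown(socket.SHUT_WR)   # peer sees EOF once it has read everything
    sock.settimeout(timeout)
    try:
        while sock.recv(4096):      # drain until the peer closes its side
            pass
    except socket.timeout:
        pass                        # peer took too long; close anyway
    finally:
        sock.close()


def read_until_eof(sock):
    """Read everything the peer sends until it signals end-of-data."""
    chunks = []
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        chunks.append(chunk)
    sock.close()
    return b"".join(chunks)
```

The timeout keeps a slow or stuck reader from holding the sender's connection open indefinitely, at the cost of reintroducing the early close if the reader is slower than the timeout.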

     
