
#7 server freeze with dmucs 0.5 - 0.6.1

Status: open
Owner: nobody
Priority: 5
Updated: 2015-08-10
Created: 2006-08-07
Private: No

After a random amount of time, the dmucs server hangs
and stops accepting incoming connections. The process
is still alive and still logging, but it no longer
responds to load reports from nodes or host requests
from clients. The only noticeable thing in the logs
is that it reports 0 nodes available. Enabling debug
logs shows nothing unusual, except that the output
stops in the middle of a request:

------- Server: calling select ---------
select returned 1
New request from 172.16.110.104:6120
host 172.16.110.104: ldAvg1 3.07, ldAvg5 3.15, ldAvg10 3.01
Got load average mesg: 172.16.110.104:6120

------- Server: calling select ---------
select returned 1
New request from 172.16.110.119:46271
host 172.16.110.119: ldAvg1 4.68, ldAvg5 3.52, ldAvg10 2.70
Got load average mesg: 172.16.110.119:46271

------- Server: calling select ---------
select returned 1
New request from 172.16.110.56:50595
[Mon Jun 26 15:02:06 2006] Hosts Served: 0 Max/Avail: 0/50
[Mon Jun 26 15:03:08 2006] Hosts Served: 0 Max/Avail: 0/0
[Mon Jun 26 15:04:10 2006] Hosts Served: 0 Max/Avail: 0/0
[Mon Jun 26 15:05:11 2006] Hosts Served: 0 Max/Avail: 0/0
[Mon Jun 26 15:06:12 2006] Hosts Served: 0 Max/Avail: 0/0
[Mon Jun 26 15:07:12 2006] Hosts Served: 0 Max/Avail: 0/0
[Mon Jun 26 15:08:12 2006] Hosts Served: 0 Max/Avail: 0/0

These logs are from dmucs 0.5, but dmucs 0.6.1 behaves
the same way.

I think the problem is that the set of machines in the
compilation farm and the set of clients are the same:
we would like to use dmucs to share load across all our
development boxes. To me, it looks like a race condition
that occurs when a node sends its load average and asks
for a host through gethost at the same time, or with
some particular timing.

Discussion

  • Victor Norman - 2006-08-07

    Logged In: YES
    user_id=1399635

    Gilles,

    Thanks for reporting this to me. I'll have to think about this.

    Your setup with the set of compilation hosts being the same
    as the clients is something I haven't tried, but certainly
    should be OK.

    Can we email directly to talk more about this? Please
    respond to vic.norman@gmail.com.

    Thanks.

    Vic

     
  • Greg Bertin - 2015-08-10

    Is anyone looking at this issue? It is easy for me to reproduce: a misbehaving client can tie up the dmucs server indefinitely. You can simulate this by telnetting to the dmucs server port and sending no data. The server then blocks indefinitely in a select() call that has no timeout:

    (gdb) bt
    #0  0xffffe410 in __kernel_vsyscall ()
    #1  0x4a3314b1 in select () from /lib/libc.so.6
    #2  0xb7fe3f9e in Swait (skt=0x80cec30) at Swait.c:60
    #3  0xb7fe1446 in Sgets (buf=0xbf9e9000 "", maxbuf=1024, skt=0x80cec30) at Sgets.c:91
    #4  0x08058add in handleReq (sock_req=0x80cec30, db=0x80c6e18) at main.cc:287
    #5  0x08058869 in main (argc=2, argv=0xbf9e9504) at main.cc:193

    Arguably the root cause is in the client, which fails to send the dmucs server a request. But a server should be robust enough to handle misbehaving clients.

    This happens about once a week in our build farm. We are running dmucs 0.6.1.
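
    One way to bound the blocking read described above is to wait on the socket with a timeout before reading from it. The following is only a minimal sketch using plain POSIX select(); wait_readable is a hypothetical helper, not part of dmucs or of the socket library that provides Swait and Sgets, and it assumes the caller can obtain the raw file descriptor for the client connection.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not part of dmucs or its socket library).
 * Waits up to timeout_sec seconds for fd to become readable.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    do {
        /* select() may modify both the set and the timeout,
         * so rebuild them on every attempt. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    } while (rc < 0 && errno == EINTR); /* retry if interrupted by a signal */

    return rc > 0 ? 1 : rc;
}
```

    Under these assumptions, a request handler could call wait_readable() before reading the request and close the connection when it returns 0, so that a silent client costs at most timeout_sec seconds instead of blocking the server forever. Note that this sketch restarts the full timeout after a signal interruption, which is acceptable for a coarse guard but not for precise timing.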

     
