From: Jon M. <jon...@er...> - 2004-06-09 17:56:54
|
You can check it in. I have still not merged with my code, so send me a note when it is ready. Thanks /jon Mark Haverkamp wrote: On Tue, 2004-06-08 at 13:21, Mark Haverkamp wrote: On Tue, 2004-06-08 at 11:45, Jon Maloy wrote: I took a little closer look at the recvbcast code, and notice a couple of things: First, the code does consistently use buf_safe_discard() when it seems to be sufficient with buf_discard(). This function is more expensive to use, but should not cause any problems if it were correctly implemented. Unfortunately, it is not. I have forgotten to protect the quarantine queue with a lock, and this may quite well cause havoc in the both this buffer queue and elsewhere. My guess is that the very strange messages we see in the dump in reality are invalid, -maybe a mix of different messages. Otherwise I can not explain the destination port number zero in the messages, which seems impossible if one follows the call chain bcast_port_recv_msg()->nameseq_deliver_msg()-> port_recv_msg()->net_route_msg()->net_route_named_msg(). An extra lock for the quarantine queue is needed, and this will hopefully fix the problem, but buf_safe_discard() should anyway be changed to buf_discard() if there is no particular reason for using it. The code that I was testing had a lock on the quarantine queue. One thing that may be the cause of problems in this case was that I did have page alloc debug turned on after all. It uses a whole page regardless of the allocation size as a debug tool. We may have just run out of pages. I am trying out the test once again without the page alloc debug compiled into the kernel. I still had two of the machines panic about 4 AM today even with page alloc debug compiled out. Here is the code I have been using with the lock for the quarantine list. It should probably be checked in anyway since it does seem to fix the hang that I was seeing. cvs diff -u buf.c Index: buf.c =================================================================== RCS file: /cvsroot/tipc/source/unstable/net/tipc/buf.c,v retrieving revision 1.12 diff -u -r1.12 buf.c --- buf.c 5 May 2004 19:09:03 -0000 1.12 +++ buf.c 9 Jun 2004 17:18:46 -0000 @@ -106,10 +106,13 @@ * queue instead. */ +static spinlock_t qb_lock = SPIN_LOCK_UNLOCKED; void buf_safe_discard(struct sk_buff *buf) { - struct sk_buff *qbuf = quarantine_head; + struct sk_buff *qbuf; + spin_lock_bh(&qb_lock); + qbuf = quarantine_head; while (qbuf) { struct sk_buff *next = buf_next(qbuf); if (buf_busy(qbuf)) @@ -118,6 +121,7 @@ qbuf = next; } quarantine_head = qbuf; + spin_unlock_bh(&qb_lock); if (!buf) return; @@ -126,12 +130,14 @@ return; } buf_set_next(buf, 0); + spin_lock_bh(&qb_lock); if (!quarantine_head) { quarantine_head = quarantine_tail = buf; } else { buf_set_next(quarantine_tail, buf); quarantine_tail = buf; } + spin_unlock_bh(&qb_lock); } void buf_stop(void) ------------------------------------------------------------------------ ---------------------------- Also, how do I interpret the output from tipc when the trouble happens. e.g.: net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): |