From: Mark H. <ma...@os...> - 2004-06-09 17:25:01
|
On Tue, 2004-06-08 at 13:21, Mark Haverkamp wrote: > On Tue, 2004-06-08 at 11:45, Jon Maloy wrote: > > I took a little closer look at the recvbcast code, and notice > > a couple of things: > > First, the code does consistently use buf_safe_discard() when it seems > > to be sufficient with buf_discard(). This function is more > > expensive to use, but should not cause any problems if it > > were correctly implemented. > > Unfortunately, it is not. I have forgotten to protect the quarantine > > queue with a lock, and this may quite well cause havoc in the > > both this buffer queue and elsewhere. My guess is that the very > > strange messages we see in the dump in reality are invalid, > > -maybe a mix of different messages. Otherwise I can not > > explain the destination port number zero in the messages, which > > seems impossible if one follows the call chain > > bcast_port_recv_msg()->nameseq_deliver_msg()-> > > port_recv_msg()->net_route_msg()->net_route_named_msg(). > > > > An extra lock for the quarantine queue is needed, and this will hopefully > > fix the problem, but buf_safe_discard() should anyway be changed to > > buf_discard() if there is no particular reason for using it. > > The code that I was testing had a lock on the quarantine queue. One > thing that may be the cause of problems in this case was that I did have > page alloc debug turned on after all. It uses a whole page regardless > of the allocation size as a debug tool. We may have just run out of > pages. I am trying out the test once again without the page alloc debug > compiled into the kernel. I still had two of the machines panic about 4 AM today even with page alloc debug compiled out. Here is the code I have been using with the lock for the quarantine list. It should probably be checked in anyway since it does seem to fix the hang that I was seeing. cvs diff -u buf.c Index: buf.c =================================================================== RCS file: /cvsroot/tipc/source/unstable/net/tipc/buf.c,v retrieving revision 1.12 diff -u -r1.12 buf.c --- buf.c 5 May 2004 19:09:03 -0000 1.12 +++ buf.c 9 Jun 2004 17:18:46 -0000 @@ -106,10 +106,13 @@ * queue instead. */ +static spinlock_t qb_lock = SPIN_LOCK_UNLOCKED; void buf_safe_discard(struct sk_buff *buf) { - struct sk_buff *qbuf = quarantine_head; + struct sk_buff *qbuf; + spin_lock_bh(&qb_lock); + qbuf = quarantine_head; while (qbuf) { struct sk_buff *next = buf_next(qbuf); if (buf_busy(qbuf)) @@ -118,6 +121,7 @@ qbuf = next; } quarantine_head = qbuf; + spin_unlock_bh(&qb_lock); if (!buf) return; @@ -126,12 +130,14 @@ return; } buf_set_next(buf, 0); + spin_lock_bh(&qb_lock); if (!quarantine_head) { quarantine_head = quarantine_tail = buf; } else { buf_set_next(quarantine_tail, buf); quarantine_tail = buf; } + spin_unlock_bh(&qb_lock); } void buf_stop(void) ---------------------------------------------------------------------------------------------------- Also, how do I interpret the output from tipc when the trouble happens. e.g.: net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): -- Mark Haverkamp <ma...@os...> |