From: Wilton W. <ww...@ha...> - 2002-10-10 01:47:45
We have been trying to integrate this for some time without much success. There seems to be a deadlock in the kernel: somewhere, someone is taking the wrong lock, or taking a lock in the wrong context, or something similar. When we run more than one process per node (in this case "bpsh <node> yes"), within a matter of seconds the kernel becomes too busy to handle any requests, such as responding to the bproc heartbeat.

A forced kernel stack dump using lcrash reveals:

....
dc61c000    0  1462  1418  0x01  0x00000000    0:0    yes
dc64c000    0  1463  1415  0x00  0x00000040  402:127  yes
dc5e6000    0  1464  1418  0x02  0x00000000   25:6    yes
dc5c2000    0  1465  1415  0x00  0x00000040  402:127  yes
>> trace dc5e6000
================================================================
STACK TRACE FOR TASK: 0xdc5e6000(yes)

 0 schedule+901 [0xc01197d5]
 1 schedule_timeout+18 [0xc0126582]
 2 [bproc]bproc_response_wait+115 [0xe08c3697]
 3 [bproc]send_process+163 [0xe08c20e3]
 4 [bproc]do_execmove+126 [0xe08c6eee]
 5 [bproc]do_bproc+980 [0xe08c7744]
 6 system_call+44 [0xc0108f94]
   ebx: 00000000   ecx: 00000000   edx: 00000000   esi: 00000000
   edi: 00000000   ebp: 00000000   eax: 00000000   ds:  002b
   es:  002b       eip: 40000b50   cs:  0023       eflags: 00000216
   esp: bffffb50   ss:  002b
================================================================

And of course, if we remove the O(1) scheduler, everything works fine. Any help on where to look for this problem would be appreciated.

Thanks
- Wilton

----[ Wilton William Wong ]---------------------------------------------
11060-166 Avenue    Ph : 01-780-456-9771    High Performance UNIX
Edmonton, Alberta   FAX: 01-780-456-9772    and Linux Solutions
T5X 1Y3, Canada     URL: http://www.harddata.com
-------------------------------------------------------[ Hard Data Ltd. ]----