Re: [Dpcl-develop] Using 'diag' test
From: Steve C. <sl...@sg...> - 2004-06-02 20:36:38
DaveW suggested looking at the shared memory changes developed at SGI. Yes,
we are running with these changes. DaveW also suggested looking at the ia64
specific locking code developed at SGI. This code has been stress tested quite
a bit. However, my (SteveC's) opinion of the locking code and shared memory
changes may be just a <tad> bit biased. So I am going back to look first at the
shared memory changes, since the SEGV is occurring right in this area. Per BillH
here at SGI (who developed the shared memory changes), the change from
'0' to page->object_size fixes an <immutable> bug, i.e. he's pretty confident
about it. The 'loop' change in routine shmFObjectAllocV is encountered less
frequently per BillH, but he is pretty confident about it as well.
So, I started playing with things. I went back
to the original 'ShmManager.C' and the only change I made was to change '0' to
'page->object_size' in both shmObjectFreeV and shmFObjectAlloc. Without this
change, thousands of messages are simply thrown away (the original bug) and the
SEGV does not occur because there is no stress on the message queueing. But with
just this change in two places ( 0 -> page->object_size ), I was able to get the
SEGV. This implies at the very least that Bill's additional 'loop' change in
shmFObjectAllocV is not culpable. Again, the '0' to page->object_size change fixes
an obvious bug, and without it there is no message stressing because messages are
being discarded by the billions (per BillH).
DaveW asked about various values printed by gdb at the site of the SEGV.
The problem is that p_free_object is non-NULL but a dereference of it IS NULL,
and thus (I think) the SEGV, to wit:
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
826 *p_free_object =
(gdb) where
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
#1 0x2000000000e11340 in shm_processObjectAllocV (shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManagerAPI_app.C:50
#2 0x2000000000e11cb0 in Ais_send (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:334
#3 0x2000000000e11950 in Ais_send_int (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:89
#4 0x2000000001f37ae0 in ?? ()
Previous frame identical to this frame (corrupt stack?)
(gdb) p p_free_object
$1 = (freeFObjectH **) 0x2000000000e2a758
(gdb) p *p_free_object
$2 = (freeFObjectH *) 0x0
(gdb)
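To illustrate the state gdb reports, here is a minimal sketch of popping the head of a free list through a pointer-to-pointer. It is not the code from ShmManager.C: the right-hand side of line 826 is not visible in the trace, so the assumption here is that it reads through the list head (something like `(*p_free_object)->next`). The `next` field and the function name `popFreeObject` are hypothetical; only `p_free_object` and `freeFObjectH` come from the session above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout: the real freeFObjectH in ShmManager.C may differ.
struct freeFObjectH {
    freeFObjectH *next;   // assumed link to the next free object
};

// Pops the first free object from the list whose head pointer is stored at
// *p_free_object. Returns nullptr instead of faulting when the list is empty.
freeFObjectH *popFreeObject(freeFObjectH **p_free_object) {
    if (p_free_object == nullptr || *p_free_object == nullptr) {
        // The state seen in gdb: the slot address is valid, the head is NULL.
        return nullptr;
    }
    freeFObjectH *obj = *p_free_object;
    // Without the guard above, this read through the head would fault
    // whenever the head is NULL.
    *p_free_object = obj->next;
    obj->next = nullptr;
    return obj;
}
```

A guard like this would only turn the crash into an allocation failure. If the head is NULL because another thread emptied the list in between, the underlying problem is still the missing lock.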
And so it goes. Some time ago BillH developed a couple of unit tests to stress his
shared memory changes as well as the ia64 locking mechanism. However, his stress
test for the shared memory changes was single-threaded (the locking unit test was,
of course, multi-threaded). So BillH is going to enhance his unit test for
shared memory to make it multi-threaded. We'll see what results.
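For readers wondering what a multi-threaded stress test of this kind might look like, here is a rough sketch. It is not BillH's unit test and does not use the DPCL shared memory API: `FreeList`, `Node` and `stressFreeList` are hypothetical stand-ins, and a `std::mutex` plays the role of the page lock. Several threads allocate and release objects in a tight loop, and the test then checks that no object was lost.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Node { Node *next; };

// Hypothetical stand-in for the free list of one shared-memory page.
class FreeList {
public:
    explicit FreeList(std::size_t n) : storage_(n), head_(nullptr) {
        // Thread every node onto the free list.
        for (Node &node : storage_) { node.next = head_; head_ = &node; }
    }
    Node *alloc() {
        std::lock_guard<std::mutex> guard(lock_);  // stands in for the page lock
        Node *n = head_;
        if (n != nullptr) head_ = n->next;
        return n;
    }
    void release(Node *n) {
        std::lock_guard<std::mutex> guard(lock_);
        n->next = head_;
        head_ = n;
    }
    std::size_t countFree() {
        std::lock_guard<std::mutex> guard(lock_);
        std::size_t c = 0;
        for (Node *n = head_; n != nullptr; n = n->next) ++c;
        return c;
    }
private:
    std::vector<Node> storage_;
    Node *head_;
    std::mutex lock_;
};

// Runs `threads` workers that each alloc/release `iters` times and returns
// the number of free objects left once all workers have finished.
std::size_t stressFreeList(std::size_t objects, unsigned threads, unsigned iters) {
    FreeList list(objects);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&list, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                Node *n = list.alloc();
                if (n != nullptr) list.release(n);
            }
        });
    }
    for (std::thread &w : workers) w.join();
    return list.countFree();
}
```

With the lock in place, `stressFreeList(8, 4, 10000)` should hand back all 8 objects. Taking the lock out of `alloc` and `release` makes the list a data race, which is the kind of failure a single-threaded test cannot expose.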
Here is the BEST GUESS that Bill and I have at this time:
It is a DPCL locking problem (an original problem, not the new ia64 locking code).
Bill's fixes are ok, and using them causes the message queuing to become stressed,
thus exposing the lack of a page lock somewhere, somehow. But this is at best random
speculation at this point. Something I'm (SteveC) perfectly capable of doing, heh-heh.
Thanks, Dave & BillH
SteveC