Re: [Dpcl-develop] Using 'diag' test

    DaveW suggested looking at the shared memory changes developed at SGI. Yes,
  we are running with these changes. DaveW also suggested looking at the ia64
  specific locking code developed at SGI. This code has been stress tested quite
  a bit. However, my (SteveC's) opinion of the locking code and shared memory
  changes might be just a <tad> bit biased. So I am going back to first look at the
  shared memory changes, since the SEGV is occurring right in this area. Per BillH
  here at SGI (who developed the shared memory changes), the change from using
  '0' to using page->object_size fixes an <immutable> bug, i.e. he's pretty
  confident about it. The 'loop' change in routine shmFObjectAllocV is encountered
  less frequently per BillH, but he is pretty confident about it as well. So, I
  started playing with things. I went back to the original 'ShmManager.C' and the
  only change I made was to change '0' to 'page->object_size' in both
  shmObjectFreeV and shmFObjectAlloc. Without this change, thousands of messages
  are simply thrown away (the original bug) and the SEGV does not occur because
  there is no stress on the message queueing. But with just this change in two
  places ( 0 -> page->object_size ), I was able to get the SEGV. This implies at
  the very least that Bill's additional 'loop' change in shmFObjectAllocV is not
  culpable. Again, the '0' to page->object_size change fixes an obvious bug, and
  without it there is no message stressing because messages are being discarded
  by the billions (per BillH).

    DaveW asked about various values printed by gdb at the site of the SEGV.
  The problem is that p_free_object is non-NULL but a dereference of it IS NULL,
  and thus (I think) the SEGV, to wit:

#0  shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
      {daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000}, 
    object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
    at ../src/os/linux/ShmManager.C:826
826	      *p_free_object = 
(gdb) where
#0  shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
      {daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000}, 
    object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
    at ../src/os/linux/ShmManager.C:826
#1  0x2000000000e11340 in shm_processObjectAllocV (shm_key=
      {daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000}, 
    object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
    at ../src/os/linux/ShmManagerAPI_app.C:50
#2  0x2000000000e11cb0 in Ais_send (msg_handle_id=0x2000000000f225a4 "", 
    message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:334
#3  0x2000000000e11950 in Ais_send_int (msg_handle_id=0x2000000000f225a4 "", 
    message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:89
#4  0x2000000001f37ae0 in ?? ()
Previous frame identical to this frame (corrupt stack?)
(gdb) p p_free_object
$1 = (freeFObjectH **) 0x2000000000e2a758
(gdb) p *p_free_object
$2 = (freeFObjectH *) 0x0
(gdb) 
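
    To make the gdb values above concrete, here is a minimal sketch of a
  free-list pop that goes through a pointer to the list head. It is NOT the
  code at ShmManager.C:826 (that statement is cut off in the listing); the
  names FreeNode and pop_free are hypothetical stand-ins, and the only thing
  assumed about the real freeFObjectH is that it links to a next free object.
  The point is just that a valid head pointer with a NULL head has to be
  checked before the head is dereferenced.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for freeFObjectH: only a 'next' link is assumed.
struct FreeNode {
    FreeNode *next;
};

// Pop the first node of a singly linked free list.
// p_head plays the role of p_free_object: it must itself be valid,
// but the head it points at (*p_head) may legitimately be NULL.
FreeNode *pop_free(FreeNode **p_head) {
    assert(p_head != nullptr);     // the pointer to the head must exist
    FreeNode *node = *p_head;      // corresponds to 'p *p_free_object' in gdb
    if (node == nullptr) {
        return nullptr;            // empty list: do not touch node->next
    }
    *p_head = node->next;          // advance the head past the popped node
    node->next = nullptr;          // detach the node before handing it out
    return node;
}
```

    With a check like this, an empty list comes back as a NULL return that
  the caller can handle. It does not explain how the list got emptied in the
  first place, which is the real question here.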

    And so it goes. Some time ago, BillH developed a couple of unit tests to
  stress his shared memory changes as well as the ia64 locking mechanism. However,
  his stress test for the shared memory changes was single-threaded (the locking
  unit test was, of course, multi-threaded). So BillH is going to enhance his
  shared memory unit test to make it multi-threaded. We'll see what results.
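
    For anyone following along, a multi-threaded stress test of this kind
  could look roughly like the sketch below. This is not BillH's unit test and
  it does not use the DPCL shared memory API; LockedFreeList, Node and stress
  are hypothetical names, and a std::mutex stands in for whatever lock the
  real code takes. Several threads pop and push on one free list, and the
  test checks that no node was lost or duplicated.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical free-list node; only a 'next' link is assumed.
struct Node {
    Node *next;
};

// A free list whose every operation holds one mutex (a stand-in for a page lock).
class LockedFreeList {
public:
    void push(Node *n) {
        std::lock_guard<std::mutex> guard(lock_);
        n->next = head_;
        head_ = n;
    }
    Node *pop() {
        std::lock_guard<std::mutex> guard(lock_);
        Node *n = head_;
        if (n != nullptr) {
            head_ = n->next;
        }
        return n;
    }
    std::size_t size() {
        std::lock_guard<std::mutex> guard(lock_);
        std::size_t count = 0;
        for (Node *n = head_; n != nullptr; n = n->next) {
            ++count;
        }
        return count;
    }
private:
    std::mutex lock_;
    Node *head_ = nullptr;
};

// Start 'threads' workers, each doing 'iterations' pop/push pairs on a list
// of 'nodes' entries, and return the list length once all workers finish.
std::size_t stress(std::size_t nodes, unsigned threads, unsigned iterations) {
    std::vector<Node> storage(nodes);
    LockedFreeList list;
    for (Node &n : storage) {
        list.push(&n);
    }
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&list, iterations] {
            for (unsigned i = 0; i < iterations; ++i) {
                Node *n = list.pop();
                if (n != nullptr) {
                    list.push(n);   // give the node straight back
                }
            }
        });
    }
    for (std::thread &w : workers) {
        w.join();
    }
    return list.size();
}
```

    With the lock in place the list should end at its starting length, e.g.
  stress(64, 4, 10000) should return 64. Taking the lock_guard out of pop and
  push is the interesting experiment, since that is the kind of unprotected
  access a single-threaded test can never exercise.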

    Here is the BEST GUESS that Bill and I have at this time:

   It is a DPCL locking problem (the original problem, not the new ia64 locking code).
  Bill's fixes are ok, and using them causes the message queuing to become stressed,
  thus exposing the lack of a page lock somewhere, somehow. But this is at best random
  speculation at this point. Something I'm (SteveC) perfectly capable of doing, heh-heh.

                                              Thanks, Dave & BillH 

                                              SteveC