dpcl-develop Mailing List for Dynamic Probe Class Library (Page 3)
Brought to you by:
dpcl-admin,
dwootton
|
From: Dave W. <dwo...@us...> - 2004-06-08 16:21:08
|
Steve
Do these new locking functions help with your problem? Do the symptoms
change when using them? If so, how?
I have had limited time to look at your info from last week, but may have
some time tomorrow if the problem persists.
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-06-08 16:11:48
|
The ia64 locking primitives 'safe_fetch' and 'clear_lock', it turns out,
also require 'memory fence' protection similar to that already in the
'check_lock' routine. It was thought some months ago that simple loads and
stores would work for 'safe_fetch' and 'clear_lock'. Not so. Sigh. So
courtesy of Bill Hachfeld (SGI), here are more accurate versions of
'safe_fetch' and 'clear_lock'. Bill has also developed a macro version of
the previous ia64 assembler version of 'check_lock' which requires less
maintenance but provides no more functionality than the previous ia64
assembler version.
SteveC - SGI Compilers/Tools
New 'locking primitives' for shared memory:
#include <asm/system.h>
/** Pointer to an atomically-accessed integer. */
typedef int* atomic_p;
/**
* Conditionally updates a single word variable atomically.
*
* The _check_lock subroutine performs an atomic (uninterruptible) sequence of
* operations. The compare_and_swap subroutine is similar, but does not issue
* synchronization instructions and therefore is inappropriate for updating
* lock words.
*
* @note The word variable must be aligned on a full word boundary.
*
* @sa http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/libs/
* basetrf1/_check_lock.htm
*
* @param word_addr Specifies the address of the single word variable.
* @param old_val Specifies the old value to be checked against the value
* of the single word variable.
* @param new_val Specifies the new value to be conditionally assigned to
* the single word variable.
* @return "FALSE" indicates that the single word variable was equal
* to the old value and has been set to the new value.
* "TRUE" indicates that the single word variable was not
* equal to the old value and has been left unchanged.
*/
int _check_lock(atomic_p word_addr, int old_val, int new_val)
{
    volatile int* addr = (volatile int*)word_addr;
    int prev_val = cmpxchg_acq(addr, old_val, new_val);
    return prev_val != old_val;
}
/**
* Reads the value of a single word variable protected by a lock.
*
* The _safe_fetch subroutine safely reads and returns a single word value that
* is protected by a lock. This subroutine is used to read protected data before
* releasing the lock word with the _clear_lock subroutine. If _safe_fetch is
* not used, instructions that access data just before a lock release could
* actually be performed after the lock release.
*
* @note The word variable must be aligned on a full word boundary.
*
* @sa http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/libs/
* basetrf2/_safe_fetch.htm
*
* @param word_addr Specifies the address of the single word variable.
* @return This subroutine returns the value of the single word
* variable.
*/
int _safe_fetch(atomic_p word_addr)
{
    volatile int* addr = (volatile int*)word_addr;
    return *addr;
}
/**
* Stores a value in a single word variable atomically.
*
* The _clear_lock subroutine performs an atomic (uninterruptible) sequence of
* operations.
*
* @note The word variable must be aligned on a full word boundary.
*
* @sa http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/libs/
* basetrf1/_clear_lock.htm
*
* @param word_addr Specifies the address of the single word variable.
* @param val Specifies the value to store in the single word variable.
*/
void _clear_lock(atomic_p word_addr, int val)
{
    volatile int* addr = (volatile int*)word_addr;
    *addr = val;
}
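For readers who want to see how the three primitives fit together, here is a minimal sketch of a spin lock protecting one counter. It does not use the ia64 routines posted above: the `*_sketch` functions are stand-ins built on the GCC/Clang `__atomic` builtins (an assumption, chosen so the example compiles off ia64), and `struct shared_counter` and `locked_increment` are hypothetical names that do not come from DPCL.

```c
#include <stddef.h>

typedef int *atomic_p;

/* Stand-ins for the posted routines, built on GCC/Clang __atomic builtins.
 * This is an assumption for illustration; the posted ia64 versions use
 * cmpxchg_acq and volatile accesses instead. */
static int check_lock_sketch(atomic_p word_addr, int old_val, int new_val)
{
    int expected = old_val;
    /* 0 (FALSE) when the word matched and was swapped, nonzero (TRUE) otherwise. */
    return !__atomic_compare_exchange_n(word_addr, &expected, new_val, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static int safe_fetch_sketch(atomic_p word_addr)
{
    return __atomic_load_n(word_addr, __ATOMIC_ACQUIRE);
}

static void clear_lock_sketch(atomic_p word_addr, int val)
{
    __atomic_store_n(word_addr, val, __ATOMIC_RELEASE);
}

enum { UNLOCKED = 0, LOCKED = 1 };

/* Hypothetical shared record: a lock word plus one protected counter. */
struct shared_counter {
    int lock;
    int value;
};

/* Increment the counter under the lock and return the value read back. */
static int locked_increment(struct shared_counter *c)
{
    int seen;

    while (check_lock_sketch(&c->lock, UNLOCKED, LOCKED)) {
        /* spin: TRUE means the lock word was not UNLOCKED */
    }
    c->value += 1;
    seen = safe_fetch_sketch(&c->value);   /* read protected data before release */
    clear_lock_sketch(&c->lock, UNLOCKED); /* release the lock word */
    return seen;
}
```

As in the AIX convention quoted in the comments above, a return of 0 (FALSE) from the check-lock call means the lock was taken, so the loop spins while the call returns TRUE.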
|
|
From: Steve C. <sl...@sg...> - 2004-06-02 20:46:53
|
Some additional debug info provided by gdb (do I dare trust it??) at the
site of the SEGV per the corefile:
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
826 *p_free_object =
(gdb) where
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
#1 0x2000000000e11340 in shm_processObjectAllocV (shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManagerAPI_app.C:50
#2 0x2000000000e11cb0 in Ais_send (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:334
#3 0x2000000000e11950 in Ais_send_int (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:89
#4 0x2000000001f37ae0 in ?? ()
Previous frame identical to this frame (corrupt stack?)
(gdb) p p_free_object
$1 = (freeFObjectH **) 0x2000000000e2a758
(gdb) p *p_free_object
$2 = (freeFObjectH *) 0x0
(gdb) p object_holder
$3 = (void **) 0x2000000000e2a750
(gdb) p *object_holder
$4 = (void *) 0x8956726d00f01000
(gdb) p object_counter
$5 = 0
(gdb) p object_number
$6 = 1
(gdb) p memory_buffer
$7 = (memBuffer *) 0x2000000000f01000
(gdb) p page
No symbol "page" in current context.
(gdb) p page->free_space
No symbol "page" in current context.
|
|
From: Steve C. <sl...@sg...> - 2004-06-02 20:36:38
|
DaveW suggested looking at the shared memory changes developed at SGI. Yes,
we are running with these changes. DaveW also suggested looking at the ia64
specific locking code developed at SGI. This code has been stress tested quite
a bit. However, my (SteveC's) opinion of the locking code and shared memory
changes may be just a <tad> bit biased. So I am going back to first look at the
shared memory changes since the SEGV is occurring right in this area. Per BillH
here at SGI (who developed the shared memory changes), using '0' instead of
page->object_size is an <immutable> bug, i.e. he's pretty confident about
it. The 'loop' change in routine shmFObjectAllocV is encountered less frequently
per BillH, but he is pretty confident about it as well. So, I started playing
with things. I went back to the original 'ShmManager.C' and the only change I
made was to change '0' to 'page->object_size' in both shmObjectFreeV and
shmFObjectAlloc. Without this change, thousands of messages are simply thrown
away (the original bug) and the SEGV does not occur because there is no stress
on the message queueing. But with just this change in two places
( 0 -> page->object_size ), I was able to get the SEGV. This implies at the very
least that Bill's additional 'loop' change in shmFObjectAllocV is not culpable.
Again, the '0' -> page->object_size change fixes an obvious bug, and without it
there is no message stressing because messages are being discarded by the
billions (per BillH).
DaveW asked about various values printed by gdb at the site of the SEGV.
The problem is that p_free_object is non-NULL but a dereference of it IS NULL
and thus (I think) the SEGV, to wit:
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
826 *p_free_object =
(gdb) where
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManager.C:826
#1 0x2000000000e11340 in shm_processObjectAllocV (shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e2a750, rc=0x60000fffffff8180)
at ../src/os/linux/ShmManagerAPI_app.C:50
#2 0x2000000000e11cb0 in Ais_send (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:334
#3 0x2000000000e11950 in Ais_send_int (msg_handle_id=0x2000000000f225a4 "",
message=0x2000000001f37a20, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:89
#4 0x2000000001f37ae0 in ?? ()
Previous frame identical to this frame (corrupt stack?)
(gdb) p p_free_object
$1 = (freeFObjectH **) 0x2000000000e2a758
(gdb) p *p_free_object
$2 = (freeFObjectH *) 0x0
(gdb)
And so it goes. BillH developed, some time ago, a couple unit tests to stress his
shared memory changes as well as the ia64 locking mechanism. However, his stress
test for the shared memory changes was single-threaded (the locking unit test was
however, clearly, multi-threaded). So BillH is going to enhance his unit test for
shared memory to become multi-threaded. We'll see what results.
Here is the BEST GUESS that Bill and I have at this time:
It is a DPCL locking problem (original problem - not the new ia64 locking code).
Bill's fixes are ok, and using them causes the message queuing to become stressed,
thus exposing the lack of a page lock somewhere, somehow. But this is at best random
speculation at this point. Something I'm (SteveC) perfectly capable of doing, heh-heh.
Thanks, Dave & BillH
SteveC
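The gdb session above shows p_free_object non-NULL while *p_free_object is NULL. If line 826 unlinks the head of a free list by following the head's next pointer (an assumption: the statement is cut off in the gdb listing), a NULL head would fault exactly there. Below is a minimal sketch of a free-list pop that tolerates an empty list. `struct free_node`, `free_list_pop` and `free_list_push` are hypothetical names; the real freeFObjectH layout is not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical free-list node, for illustration only. */
struct free_node {
    struct free_node *next;
};

/* Pop the head of a singly linked free list.
 * Returns NULL when the list is empty instead of dereferencing a NULL head.
 * The caller is assumed to hold whatever lock protects the list; without it,
 * another process could empty the list between the check and the unlink. */
static struct free_node *free_list_pop(struct free_node **p_head)
{
    struct free_node *head = *p_head;   /* read the head exactly once */

    if (head == NULL) {
        return NULL;                    /* empty list: nothing to hand out */
    }
    *p_head = head->next;               /* unlink the old head */
    head->next = NULL;
    return head;
}

/* Push a node back onto the front of the list (same locking assumption). */
static void free_list_push(struct free_node **p_head, struct free_node *node)
{
    node->next = *p_head;
    *p_head = node;
}
```

The NULL check only helps if the list is consistently locked. If the best guess above is right and a page lock is missing, the check narrows the window but does not close it.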
|
|
From: Dave W. <dwo...@us...> - 2004-05-28 23:27:23
|
Steve
If you could send me the code for the benchmark, that would save me the
time trying to track down a copy. Also, if there's anything special about
building or running it I need to know that.
Also, what happens with a target (such as the hello sample) that has delays
between loops? You may need to let it run for a long time.
Replying to the mailing list and optionally copying me is the best
approach for questions. It's probably best to just send the benchmark code
directly to me.
Dave
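A stand-in target for the delay experiment asked about above might look like the sketch below. It is not the DPCL hello sample: `work` and `run_loop` are hypothetical names, and the idea is only to have one small function whose entry and exit probes fire either back to back (delay of 0) or with a pause between calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() */
#endif
#include <time.h>

static unsigned long calls;       /* total number of work() calls so far */

/* Small function whose entry and exit points a tool would instrument. */
static void work(void)
{
    calls++;
}

/* Call work() `iterations` times, sleeping `delay_ms` milliseconds after
 * each call. A delay of 0 gives a tight loop. Returns the running total. */
static unsigned long run_loop(unsigned long iterations, long delay_ms)
{
    struct timespec delay;
    unsigned long i;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < iterations; i++) {
        work();
        if (delay_ms > 0) {
            nanosleep(&delay, NULL);
        }
    }
    return calls;
}
```

Running the same instrumentation against a tight loop and against a delayed loop would help separate a message-rate problem from a problem in the instrumentation itself.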
|
|
From: Steve C. <sl...@sg...> - 2004-05-28 23:18:41
|
Dave -
Thanks for the reply. I feel better already!! I'll answer
your questions below and copy the newsgroup. Is that ok? If
not, I'll go back thru the newsgroup next time.
Thanks again - SteveC
>
> Steve
> This looks like the target program is what is crashing and you are ending
> up with a core file for that process. Is that correct?
Yes. The 'mutatee' in Dyninst parlance is getting the SEGV. And again,
the walkback changes ever so slightly every couple of attempts at running
./eut_diag, and sometimes the 'mutatee' does not SEGV at all. Very flaky.
> If not, can you
> clarify what is crashing and how you are getting the traceback?
I am in the directory where I have run the mutatee, and I do:
gdb ../build/eon core.nnnnn
where 'eon' is the SPEC eon benchmark. It is run out of ../run/run.csh
and built out of ../build/build.csh. There is nothing company private
about 'eon' AFAIK. I <think> I could just ship it to you. Can you get
a copy out of IBM's benchmark suite?
> Where are
> you placing print statements in eut_diag.C?
Actually, I was putting 'fd writes' into the shared memory routines in
~dpcl/src/daemon_RT/src/os/linux/ShmManager.C, etc. These slow things down
just enough to prevent the SEGV.
> The callback functions
> dcall_back and dcall_back2 or somewhere else?
From what I can tell the 'SOT_data' code is not executed because there is
no DATA (I guess) in DEFAULT_MODULE. Just functions are processed via ./eut_diag.
>
> Assuming this is the target, nothing that eut_diag.C does should be causing
> timing problems with the possible exception of the dcall_back and
> dcall_back2 functions, and even then I have a hard time seeing a problem
> with the printf slowing down things enough to influence target execution.
> The path from the target program thru the daemon back to the client is
> pretty long anyway, and buffering of messages which happen to arrive so
> fast they cause Ais_send to slow down should be buffered in the daemon
> anyway.
>
Yeah, that was my thinking too. But I have been able to just #if(0) the
body of 'func_cb' in eut_diag.C and get the thing to run sometimes. Or
adjust the INTERVAL to 20 or so and things go.
I know it dies upon doing the 'resume_cb' every time. I suppose it could
be a case of bad instrumentation and the 'libunwind' on ia64 is somewhat
marginal (thus, bad walkbacks?).
> Basically, what eut_diag does is loads or attaches to the application,
> suspends it, finds all the function entry and exit points, instruments
> them, then resumes target execution.
Oh yeah. I've been thru eut_diag.C and know it almost like a friend. :^)
>
> The traceback you included does appear to be from within a call to
> Ais_send within the target after the target has been resumed. What is the
> target program logic? Are there functions being called in tight loops,
> such that the entry and exit probes are being called rapidly,
Oh yeah the "USER_CB" messages just fly by for a second or two before the
SEGV. And since the target is the SPEC benchmark, I suspect there are tight
loops everywhere - such is the nature of benchmark codes.
> or are there
> things like sleep() calls slowing down execution. Does that make a
> difference?
I've played with the 'sleep' in eut_diag.C and even the 'alarm', and
sometimes it makes a difference, sometimes it doesn't. Sorry, I'm not
trying to be evasive!!
> Are you loading or connecting to the application?
Connecting: ./eut_diag d pid 'nnnnnnn'
> Is the
> target application single threaded?
>
Yes, single threaded.
> Those questions will help get us started. I will find some time next week
> to look at this further.
That would be terrific, Dave. I'll try my best to support you with information.
I have spent a lot of cpu/human time trying to finger this one.
Thanks Again - SteveC
>
>
> Hello, DPCL'ers. I am using DPCL/Dyninst on ia64 and, for some
> time now, have been trying to get the 'diag' (eut_diag) sample DPCL
> test running. It seems to (non-deterministically) be dying in the
> daemon_RT shared memory code, to wit:
>
> #0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
> {daemon_address = 0x2000000000e15000, process_address
> = 0x2000000000f15000},
> object_number=1, object_holder=0x2000000000e263c8,
> rc=0x60000fffffff8100)
> at ../src/os/linux/ShmManager.C:687
> 687 *p_free_object =
> (gdb) where
> #0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
> {daemon_address = 0x2000000000e15000, process_address
> = 0x2000000000f15000},
> object_number=1, object_holder=0x2000000000e263c8,
> rc=0x60000fffffff8100)
> at ../src/os/linux/ShmManager.C:687
> #1 0x2000000000e11090 in shm_processObjectAllocV (shm_key=
> {daemon_address = 0x2000000000e15000, process_address
> = 0x2000000000f15000},
> object_number=0, object_holder=0x2000000000e263c8,
> rc=0x60000fffffff8100)
> at ../src/os/linux/ShmManagerAPI_app.C:48
> #2 0x2000000000e119d0 in Ais_send
> (msg_handle_id=0x2000000000f225a4 "",
> message=0x200000000083e320, message_size=30) at
> ../src/os/linux/ShmMessageAPI_app.C:319
> #3 0x2000000000e11670 in Ais_send_int
> (msg_handle_id=0x2000000000f225a4 "",
> message=0x200000000083e320, message_size=30) at
> ../src/os/linux/ShmMessageAPI_app.C:85
> #4 0x200000000083e3e0 in DYNINSTstaticHeap_4M_anyHeap_1 ()
> from /scratch/slc/dpcl-install/lib/libdyninstAPI_RT.so.1
>
> This walkback from gdb is, itself, suspicious, but about all I have
> available.
> Again, the failure doesn't happen every time. If I tweak 'diag'
> (eut_diag.C)
> in a small, seemingly innocuous, way, then things work. If I put any
> sort of
> debug prints in place, then the SEGV doesn't appear. It really seems
> like a
> timing thing, but I can't finger the culprit.
>
> This is admittedly a fishing expedition. Does 'diag' have some sort of
> known
> race condition, or flaky behaviour? Just fishing....
>
> SteveC - SGI
|
|
From: Dave W. <dwo...@us...> - 2004-05-28 22:31:41
|
Steve
This looks like the target program is what is crashing and you are ending
up with a core file for that process. Is that correct? If not, can you
clarify what is crashing and how you are getting the traceback? Where are
you placing print statements in eut_diag.C? The callback functions
dcall_back and dcall_back2 or somewhere else?
Assuming this is the target, nothing that eut_diag.C does should be causing
timing problems, with the possible exception of the dcall_back and
dcall_back2 functions, and even then I have a hard time seeing a problem
with the printf slowing down things enough to influence target execution.
The path from the target program thru the daemon back to the client is
pretty long anyway, and messages which happen to arrive so fast that they
cause Ais_send to slow down should be buffered in the daemon anyway.
Basically, what eut_diag does is load or attach to the application,
suspend it, find all the function entry and exit points, instrument
them, then resume target execution.
The traceback you included does appear to be from within a call to
Ais_send within the target after the target has been resumed. What is the
target program logic? Are there functions being called in tight loops,
such that the entry and exit probes are being called rapidly, or are there
things like sleep() calls slowing down execution? Does that make a
difference? Are you loading or connecting to the application? Is the
target application single threaded?
Those questions will help get us started. I will find some time next week
to look at this further.
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-05-28 21:08:55
|
Hello, DPCL'ers. I am using DPCL/Dyninst on ia64 and, for some
time now, have been trying to get the 'diag' (eut_diag) sample DPCL
test running. It seems to (non-deterministically) be dying in the
daemon_RT shared memory code, to wit:
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e263c8, rc=0x60000fffffff8100)
at ../src/os/linux/ShmManager.C:687
687 *p_free_object =
(gdb) where
#0 shmFObjectAllocV (buffer=0x2000000000f15000, shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=1, object_holder=0x2000000000e263c8, rc=0x60000fffffff8100)
at ../src/os/linux/ShmManager.C:687
#1 0x2000000000e11090 in shm_processObjectAllocV (shm_key=
{daemon_address = 0x2000000000e15000, process_address = 0x2000000000f15000},
object_number=0, object_holder=0x2000000000e263c8, rc=0x60000fffffff8100)
at ../src/os/linux/ShmManagerAPI_app.C:48
#2 0x2000000000e119d0 in Ais_send (msg_handle_id=0x2000000000f225a4 "",
message=0x200000000083e320, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:319
#3 0x2000000000e11670 in Ais_send_int (msg_handle_id=0x2000000000f225a4 "",
message=0x200000000083e320, message_size=30) at ../src/os/linux/ShmMessageAPI_app.C:85
#4 0x200000000083e3e0 in DYNINSTstaticHeap_4M_anyHeap_1 ()
from /scratch/slc/dpcl-install/lib/libdyninstAPI_RT.so.1
This walkback from gdb is, itself, suspicious, but about all I have available.
Again, the failure doesn't happen every time. If I tweak 'diag' (eut_diag.C)
in a small, seemingly innocuous, way, then things work. If I put any sort of
debug prints in place, then the SEGV doesn't appear. It really seems like a
timing thing, but I can't finger the culprit.
This is admittedly a fishing expedition. Does 'diag' have some sort of known
race condition, or flaky behaviour? Just fishing....
SteveC - SGI Tools/Compilers
|
|
From: Dave W. <dwo...@us...> - 2004-04-02 00:18:08
|
DPCL version 3.3.4 is the current production release of DPCL for AIX and Linux as of 4/1/2004. This release is available for download from the DPCL open source website at http://oss.software.ibm.com/dpcl
This version contains the following fixes and enhancements:
1) Implement inclusive instrumentation points (sorted by instruction address)
2) Control type checking in probe expressions
3) Unwanted files appearing when calling expand() function
4) Function call sites missing from source block objects
5) Add message size information to debug log
6) Fix runaway daemons in Load Leveler environment
7) Fix problem where closing stdin causes application disconnect
8) Fix problem with applications linked with -bmaxdata option
9) Block object missing when target application function has no local variables
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-03-15 22:43:32
|
It occurred to me that I had sent my 64-bit changes to JamesW
but had not posted them here. Just so DaveW is in the loop, I'll
reiterate them here.
*************
DPCL MODS:
*************
NOTE: diffs are generated from the hybrid_121503 delivered by
JamesW.
1.
~dpcl/src/config.sub:
343a344,346
> ia64-*)
> basic_machine=ia64-pc
> ;;
2.
~dpcl/src/lib/include/ExpTree.h:
209a210
> # ifndef __int8_t_defined
210a212
> #endif
212a215,217
> # if __WORDSIZE == 64
> typedef long int64_t;
> #else
214c219
<
---
> #endif
218a224,226
> # if __WORDSIZE == 64
> typedef unsigned long uint64_t;
> #else
219a228
> #endif
3.
~dpcl/src/lib/include/ModuleObj.h:
71c71
< static void ModuleObj::expand_cb(GCBSysType, GCBTagType, GCBObjType,
---
> static void expand_cb(GCBSysType, GCBTagType, GCBObjType,
4.
~dpcl/src/daemon_RT/include/os/linux/ShmUsage_RT.h:
40a41,44
> #if __WORDSIZE == 64
> typedef unsigned long uint64_t;
> typedef signed long int64_t;
> #else
42a47
> #endif
5.
~dpcl/src/daemon/src/main.C:
880c880
< if ((rc = getrlimit (which_limit, rlimit_p)) != -1) {
---
> if ((rc = getrlimit ((__rlimit_resource)which_limit, rlimit_p)) != -1) {
888c888
< rc = setrlimit (which_limit, rlimit_p);
---
> rc = setrlimit ((__rlimit_resource)which_limit, rlimit_p);
6.
~dpcl/src/daemon/src/os/linux/PModEntry.C:
332c332
< unsigned int adr = point->getPointAddress();
---
> unsigned int adr = (unsigned int)point->getPointAddress();
7.
~dpcl/src/daemon_RT/src/os/linux/lock.c:
96a97,98
> #else
> #include <stdio.h>
97a100,155
> typedef int * atomic_p;
>
> int
> _check_lock (atomic_p addr, int old_val, int new_val);
>
> int
> _safe_fetch (atomic_p addr);
>
> void
> _clear_lock (atomic_p addr, int val);
>
> /******************************************************************/
>
> #define __atomic_fool_gcc(x) (*(volatile struct { int a[100]; } *)x)
>
> int
> _check_lock
> (atomic_p addr, int old_val, int new_val)
> {
> long interim_val = 777;
> int result = 0; /* default is success */
> __asm__ __volatile__ ( "mf");
> __asm__ __volatile__ (
> "mov ar.ccv = %[old]"
> :
> : [old] "r"(old_val));
> __asm__ __volatile__ (
> "cmpxchg4.rel %[interim] =%[mem],%[new],ar.ccv"
> : [interim] "=r" (interim_val)
> : [mem] "m" (__atomic_fool_gcc(addr)), [new] "r" (new_val));
> __asm__ __volatile__ ( "mf");
> if( interim_val == old_val) {
> /* lock memory value was old_val */
> /* exchange to new_val done - new lock in place */
> result = 0; /* false */
> }
> else {
> /* lock memory value was NOT old_val */
> result = 1; /* true */
> }
> return result;
> }
>
> int _safe_fetch
> (atomic_p addr)
> {
> /* 64-bit */
> return *addr;
> }
>
> void _clear_lock
> (atomic_p addr, int val)
> {
> /* 64-bit */
> *addr = val;
> }
8.
~dpcl/src/SD/src/SdServCli.C:
289c289
< if ( (clifd = accept( listenfd, (struct sockaddr *) &unix_addr, (size_t*)&len )) > ZERO )
---
> if ( (clifd = accept( listenfd, (struct sockaddr *) &unix_addr, (socklen_t *)&len )) > ZERO )
9.
~dpcl/src/lib/src/Process.C:
3425c3425
< unsigned data_key;
---
> unsigned long data_key;
3439,3440c3439,3440
< memcpy( (void*)poffset, &data_key, sizeof (int));
< poffset += sizeof (int);
---
> memcpy( (void*)poffset, &data_key, sizeof (unsigned long));
> poffset += sizeof (unsigned long);
10.
~dpcl/src/daemon/src/PEtoBP.C:
925,926c925,926
< v=new BPatch_constExpr((int) msghdl);
< log_write(LGL_severe, "Doit: msghdl=0x%08x, BP_DEBUG %s",msghdl,
---
> v=new BPatch_constExpr((long) msghdl);
> log_write(LGL_severe, "Doit: msghdl=0x%Lx, BP_DEBUG %s",msghdl,
11.
~dpcl/src/lib/src/os/linux/ProbeModuleInt.C:
82a83,86
> #if __WORDSIZE == 64
> Elf64_Ehdr *ehdr;
> Elf64_Shdr *sectionHdr;
> #else
84a89
> #endif
89a95,97
> #if __WORDSIZE == 64
> Elf64_Sym *symbolData;
> #else
90a99
> #endif
104a114,116
> #if __WORDSIZE == 64
> ehdr = elf64_getehdr(elfHandle);
> #else
105a118
> #endif
115a129,131
> #if __WORDSIZE == 64
> sectionHdr = elf64_getshdr(sectionRef);
> #else
116a133
> #endif
125a143,145
> #if __WORDSIZE == 64
> symbolData = new Elf64_Sym[elfData->d_size / sizeof(Elf64_Sym)];
> #else
126a147
> #endif
131a153,155
> #if __WORDSIZE == 64
> symSize = elfData->d_size / sizeof(Elf64_Sym);
> #else
132a157
> #endif
152a178,181
> #if __WORDSIZE == 64
> if ((ELF64_ST_BIND(symbolData[i].st_info) == STB_GLOBAL) &&
> (ELF64_ST_TYPE(symbolData[i].st_info) == STT_FUNC) &&
> #else
154a184
> #endif
164a195,198
> #if __WORDSIZE == 64
> if ((ELF64_ST_BIND(symbolData[i].st_info) == STB_GLOBAL) &&
> (ELF64_ST_TYPE(symbolData[i].st_info) == STT_FUNC) &&
> #else
166a201
> #endif
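For anyone reading diff 7 without an ia64 box at hand, the return convention of '_check_lock' (0 when the swap happened, 1 when the lock word held something else) can be sketched portably. This is only an illustration, assuming GCC's __sync builtins are available; the names check_lock_sketch and clear_lock_sketch are hypothetical and are not part of DPCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable stand-in for _check_lock (not DPCL code).
 * Atomically: if *addr == old_val, store new_val.
 * Returns 0 if the swap happened, 1 if *addr held something else,
 * which mirrors the convention used in the diff. */
static int check_lock_sketch(int *addr, int old_val, int new_val)
{
    assert(addr != NULL);
    /* GCC builtin: acts as a full barrier and returns the value it saw. */
    int seen = __sync_val_compare_and_swap(addr, old_val, new_val);
    return (seen == old_val) ? 0 : 1;
}

/* Hypothetical counterpart to _clear_lock: barrier, then store val. */
static void clear_lock_sketch(int *addr, int val)
{
    assert(addr != NULL);
    __sync_synchronize();   /* make earlier writes visible before release */
    *addr = val;
}
```

A caller would typically spin on check_lock_sketch(&lock, 0, 1) until it returns 0 and release with clear_lock_sketch(&lock, 0). Note that the sketch puts a barrier before the releasing store, which the plain '*addr = val' in the diff does not do.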
***************************************************************************************
***************************************************************************************
Analysis for DPCL changes:
-----------------------------
1. Must change 'config.sub' to allow for configuration
`ia64-pc-linux-gnuoldld': machine `ia64-pc.
2. The file /usr/include/inttypes.h includes <stdint.h>, which has:
# if __WORDSIZE == 64
typedef long int int64_t;
# else
__extension__ typedef long long int int64_t;
Without this change, the following compile-time error occurs:
../include/ExpTree.h:214: conflicting types for `typedef long long int
int64_t'
/usr/include/sys/types.h:193: previous declaration as `typedef long int
int64_t'
3. Using the 3.3.2 g++, the following error occurs without this change :
In file included from BlockObj.C:40:
../include/ModuleObj.h:72: error: extra qualification `ModuleObj::' on member `
expand_cb' ignored
BlockObj.C: In constructor `BlockObj::BlockObj()':
BlockObj.C:55: warning: passing NULL used for non-pointer argument 5 of `
SourceObjABC::SourceObjABC(SourceObjABC*, SourceType, int, int, long long
unsigned int, long long unsigned int)'
BlockObj.C:55: warning: passing NULL used for non-pointer argument 6 of `
SourceObjABC::SourceObjABC(SourceObjABC*, SourceType, int, int, long long
unsigned int, long long unsigned int
4. The file /usr/include/inttypes.h includes <stdint.h>, which has:
# if __WORDSIZE == 64
typedef long int int64_t;
# else
__extension__ typedef long long int int64_t;
Without defining __WORDSIZE=64 in rules.mk.linux, the following
compile-time error occurs:
In file included from ../include/os/linux/ShmMessage.h:40,
from ../src/os/linux/ShmAttach.C:95:
../include/os/linux/ShmUsage_RT.h:41: error: conflicting types for `typedef
long long unsigned int uint64_t'
/usr/include/stdint.h:56: error: previous declaration as `typedef long unsigned
int uint64_t'
../include/os/linux/ShmUsage_RT.h:42: error: conflicting types for `typedef
long long int int64_t'
/usr/include/sys/types.h:193: error: previous declaration as `typedef long int
int64_t'
5. Without this change, the following compile-time error occurs:
from ../include/ProcessD.h:65,
from main.C:114:
../include/SourceObjDABC.h: In member function `virtual AisAddress
SourceObjDABC::start_address() const':
../include/SourceObjDABC.h:242: warning: return to non-pointer type `AisAddress
' from NULL
../include/SourceObjDABC.h:242: warning: argument to non-pointer type `long
long unsigned int' from NULL
../include/SourceObjDABC.h: In member function `virtual AisAddress
SourceObjDABC::end_address() const':
../include/SourceObjDABC.h:252: warning: return to non-pointer type `AisAddress
' from NULL
../include/SourceObjDABC.h:252: warning: argument to non-pointer type `long
long unsigned int' from NULL
main.C: In function `int set_max_limit(int)':
main.C:878: error: invalid conversion from `int' to `__rlimit_resource'
main.C:886: error: invalid conversion from `int' to `__rlimit_resource'
6. Without this change, the following compile-time error occurs:
os/linux/PModEntry.C: In method `PModEntry::~PModEntry ()':
os/linux/PModEntry.C:114: warning: NULL used in arithmetic
os/linux/PModEntry.C: In function `BPatch_function *findHaifaFunction
(BPatch_image *, char *, unsigned int, unsigned int, unsigned int,
unsigned int)':
os/linux/PModEntry.C:332: cannot convert `void *' to `unsigned int' in
initialization
7. The routines '_safe_fetch', '_check_lock', and '_clear_lock' need ia64
versions. The ia64 version of '_check_lock' requires gcc 3.3.2 because it
employs named operands.
8. Without this change, the following compile-time error occurs:
SdServCli.C: In function `int serv_accept (int, int *)':
SdServCli.C:289: cannot convert `size_t *' to `socklen_t *' for
argument `3' to `accept (int, sockaddr *, socklen_t *)'
9. From the daemon log, I got the following with a simple mutator testcase:
ProgramObjD::get_inst_point() unique_str libc.so.6.1 module str .so.6.1
ProgramObjD::get_inst_point(): ERROR, the specified module's unique string
did not match. cannot find a specified BPatch_point
The missing 4 characters (i.e. 'libc.so.6.1' vs. '.so.6.1') occur for this
reason: GCBObjType is typedef'd to (void *), which is 8 bytes on ia64!!! The
daemon routine 'install_probe_cb' uses sizeof(GCBObjType) to walk up on
the 'unique string' pointer. This means 20 bytes!! Unfortunately the message
is packed in the DPCL library routine 'Process::install_probe' using 'unsigned'
and NOT 'unsigned long' for the type of 'data key'. This yields 16 bytes!!
Thus we are 4 bytes off, i.e. 'xxxx.so.6' vs 'libc.so.6'.
10. In ~daemon_RT/src/os/linux/ShmMessageAPI_app.C: Ais_send the 'message_handle_id'
coming in is 32 bits and is supposed to be a char *. Its use will cause SEGV.
11. Changes required for processing 64-bit ELF executables.
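The failure in item 9 (writer and reader disagreeing on the width of a packed field) can be reproduced in miniature. The sketch below is not DPCL code: pack_key_and_name and unpack_name are hypothetical helpers, and the widths are passed in explicitly so the effect shows up on any host, not just ia64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration (not DPCL code): the writer stores a key of
 * key_width bytes followed by a NUL-terminated name. Returns bytes used. */
static size_t pack_key_and_name(char *buf, size_t key_width, const char *name)
{
    unsigned char key[8] = {0};      /* key contents do not matter here */
    assert(key_width <= sizeof key);
    memcpy(buf, key, key_width);
    strcpy(buf + key_width, name);
    return key_width + strlen(name) + 1;
}

/* The reader skips skip_width bytes and treats what follows as the name. */
static const char *unpack_name(const char *buf, size_t skip_width)
{
    return buf + skip_width;
}
```

Packing with a 4-byte key and unpacking with an 8-byte skip turns "libc.so.6.1" into ".so.6.1", the same symptom as in the daemon log; with both widths equal, the name comes back intact.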
|
|
From: Steve C. <sl...@sg...> - 2004-03-05 23:13:49
|
Many thanks again to DaveW for his list of helpful hints
for analyzing the 'limit of 45 callbacks' problem we've been
seeing on our ia64 box. These hints and some sleuthing by
myself and Bill Hachfeld-SGI (mostly Bill!!) combined to
come up with what we think are some <potential> bug fixes
for the current DPCL. These fixes do NOT seem to be 64-bit
specific. Next week I will send along a list of 64-bit
specific <potential> changes we would probably need to run
DPCL on our ia64 box. I'm sure Dave has already found most
of these 64-bit specific problems, but I'll send them along
for his comment, just in case. Again, the 64-bit specific
changes I have accumulated in the past few months will be
coming NEXT week. The following is Bill Hachfeld's analysis
and suggested fixes for the 'shared memory' problems we were
seeing (aka the 'limit of 45 callbacks').
Thanks again to DaveW and JamesW for their continued
support!
SteveC - SGI Compilers/Tools
***************************************************************************
==>
==> Proposed Fixes: diff is from JamesW's hybrid source of 121503.
==>
File: ~dpcl/src/daemon_RT/src/os/linux/ShmManager.C
***************************************************************************
237,238c237,238
< unsigned int * obj_tail =
< (unsigned int *) ((unsigned long) obj_header + true_obj_size -
---
> int * obj_tail =
> (int *) ((unsigned long) obj_header + true_obj_size -
552c552
< page->object_size,
---
> 0,
580,581c580,581
< unsigned int * free_object_tail =
< (unsigned int *) ((unsigned long) free_object + page->object_size +
---
> int * free_object_tail =
> (int *) ((unsigned long) free_object + page->object_size +
583d582
<
643,644c642
<
< freeFObjectH ** p_free_object = (freeFObjectH **) object_holder;
---
>
660c658
< page->object_size,
---
> 0,
685a684
> freeFObjectH ** p_free_object = (freeFObjectH **) object_holder;
698c697
< p_free_object = (freeFObjectH **) object_holder;
---
> freeFObjectH ** p_free_object = (freeFObjectH **) object_holder;
701,702c700,701
< unsigned int * free_object_tail =
< (unsigned int *) ((unsigned long) p_free_object [i] + page->object_size +
---
> int * free_object_tail =
> (int *) ((unsigned long) p_free_object [i] + page->object_size +
705d703
<
758,759c756,757
< unsigned int * free_object_tail =
< (unsigned int *) ((unsigned long) free_object + page->object_size +
---
> int * free_object_tail =
> (int *) ((unsigned long) free_object + page->object_size +
814,815c812,813
< unsigned int * free_object_tail =
< (unsigned int *) ((unsigned long) free_object + page->object_size +
---
> int * free_object_tail =
> (int *) ((unsigned long) free_object + page->object_size +
***************************************************************************
--------------------------------------------------------------
Change #1: Change int* --> unsigned int*
--------------------------------------------------------------
==>
==> Analysis (courtesy of Bill Hachfeld, SGI):
==>
ShmManager.C, Line #237-8
The "mask" field is defined as an "unsigned int" in the structures
freeFObjectH, freeVObjectH, freeVObjectT, and allocVObject. While I
don't believe this change made a material difference, it should be
changed for the sake of understandability and consistency.
--------------------------------------------------------------
Change #2: Change 0 --> page->object_size
--------------------------------------------------------------
==>
==> Analysis (courtesy of Bill Hachfeld, SGI):
==>
The first page allocated for message queue headers and buffers is
properly requested as a page containing fixed-sized objects. This is
done in Ais_msgInit() by calling shm_blockAlloc(). Subsequent calls
to shm_processObjectAlloc() and shm_processObjectAllocV(), however,
extend the page list by also calling shm_blockAlloc() with pages
containing variable-sized objects.
At both source locations we are requesting the allocation of a new
page to be added to a list of previously allocated pages holding
fixed-sized objects. By passing "0" as the object size, we ask the
allocator to give us a new page containing variable-sized objects.
Subsequent code in the *ObjectAlloc*() functions treats this page as
if it were another fixed-size object page, looking for 0xDEADBEEF
magic symbols at object header/tailer locations where none exist.
My change simply ensures that subsequent pages are allocated as
fixed-size object pages with the object size being the same as the
previous page.
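To make the fixed-size versus variable-size distinction concrete, here is a minimal sketch of a page whose slots are stamped with a magic word at both ends, plus the check that fails on a page that was never stamped that way. The slot layout and all names are assumptions made for illustration; only the 0xDEADBEEF value comes from the analysis, and this is not the actual ShmManager.C layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAGIC_SKETCH 0xDEADBEEFu   /* value quoted in the analysis */

/* Hypothetical layout (not DPCL's): each slot is
 * [unsigned mask][object_size payload bytes][unsigned tail]. */
static size_t slot_size(size_t object_size)
{
    return sizeof(unsigned) + object_size + sizeof(unsigned);
}

/* Stamp every whole slot of a fixed-size page with header and tail magic. */
static void init_fixed_page(unsigned char *page, size_t page_size,
                            size_t object_size)
{
    size_t step = slot_size(object_size);
    unsigned magic = MAGIC_SKETCH;
    assert(page != NULL);
    for (size_t off = 0; off + step <= page_size; off += step) {
        memcpy(page + off, &magic, sizeof magic);
        memcpy(page + off + sizeof magic + object_size, &magic, sizeof magic);
    }
}

/* Return 1 if slot n carries the magic word at both ends, else 0. */
static int slot_looks_free(const unsigned char *page, size_t object_size,
                           size_t n)
{
    unsigned head, tail;
    const unsigned char *slot = page + n * slot_size(object_size);
    memcpy(&head, slot, sizeof head);
    memcpy(&tail, slot + sizeof head + object_size, sizeof tail);
    return head == MAGIC_SKETCH && tail == MAGIC_SKETCH;
}
```

A page set up by init_fixed_page passes slot_looks_free for every slot, while a page initialized any other way (for example, zero-filled) fails it, which is the kind of mismatch that would surface as MEM_BAD_FREE_LIST.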
--------------------------------------------------------------
Change #3: Move declaration of loop variable
--------------------------------------------------------------
==>
==> Analysis (courtesy of Bill Hachfeld, SGI):
==>
This bug is even more insidious than the one above. It occurs only
when attempting to allocate variable-length arrays of fixed-sized
objects, where the objects are allocated from more than one page.
In shm_processObjectAllocV() we are allocating multiple fixed-size
objects and placing pointers to them into an array passed in by the
caller. We start by allocating as many as possible from the current
page in the loop at line #684. When we run out of objects within that
page, we break out to the outer loop that begins at line #644. Here
we allocate a new page and begin filling in objects again at line
#684. Unfortunately we reset our array pointer back to the beginning
of the user-provided array. Previously allocated objects are
overwritten and the remaining objects are allocated but never
properly returned to the user.
My change simply moves the declaration of the "current" pointer into
the user-passed array outside the outer loop beginning at line #644,
thus ensuring we fill in the entire array properly.
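The cursor-reset bug described in Change #3 is easy to model in a few lines. The following is a toy model, not the DPCL allocator: alloc_many is a hypothetical function, pages are reduced to a per-page capacity, and "objects" are just sequential ids. A flag selects the buggy behavior so both outcomes can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model (not DPCL code): each page holds per_page objects and
 * object ids are handed out sequentially starting at 1.
 * Fills out[0..want-1]; reset_cursor_per_page = 1 reproduces the bug. */
static void alloc_many(int *out, size_t want, size_t per_page,
                       int reset_cursor_per_page)
{
    int next_id = 1;
    size_t done = 0;
    int *cursor = out;          /* fix: set once, outside the page loop */

    assert(per_page > 0);
    while (done < want) {       /* outer loop: one iteration per page */
        if (reset_cursor_per_page)
            cursor = out;       /* bug: rewinds to the start of the array */
        /* inner loop: take objects from the current page */
        for (size_t i = 0; i < per_page && done < want; i++, done++)
            *cursor++ = next_id++;
    }
}
```

With want = 5 and per_page = 2, the fixed variant yields {1, 2, 3, 4, 5}; the buggy variant overwrites the first two entries on every new page and leaves the tail of the array untouched, matching the description of objects that are allocated but never returned to the caller.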
-------------------------------------------------------------------
William Hachfeld EMail: wd...@sg...
SGI Debugger, Object, and Performance Tools Phone: 651-683-3103
|
|
From: Dave W. <dwo...@us...> - 2004-03-03 17:32:16
|
Steve
I looked at the source file and I don't see anything obviously wrong.
Since you are using printf statements to try and track this down, I'm
guessing this is the target application side which you are looking at. Is
this correct?
If so, then I suspect some sort of sign extension bug somewhere when
casting from int to pointer types, unsigned long or long. It looks like
the person who wrote this code was pretty careful about casting to
unsigned long, so the problem is not obvious. This code does work with 64
bit target applications on AIX, so this is rather puzzling.
I would suggest a couple things
First, recompile the dpcl/src/daemon_RT directory with the -Wall compiler
option set. You can modify the dpcl/src/rules.mk.aix file, adding this
flag to the GLOBAL_CFLAGS and GLOBAL_CXX_FLAGS definitions. Once compiled,
look at the compiler diagnostics to see if there are any hints about
problems with improper sign conversions or truncations.
You will need to add more printf statements to the code to try to identify
where things are going wrong. I would start with parameters passed to all
of the functions in this file, paying particular attention to anything
that is a pointer or an integer with a negative value. If you see pointers
which suddenly have zeroes or 0xffffffff in the upper 4 bytes, then that
is an indicator of a truncation or sign extension problem.
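As a reference for what those two patterns look like, here is a small sketch that pushes a 64-bit address through a 32-bit unsigned and a 32-bit signed intermediate. The helper names are hypothetical and nothing here is taken from the DPCL source; the narrowing conversion to int32_t is implementation-defined in C, and the comments assume the usual two's-complement wraparound that gcc provides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not DPCL code) showing the two failure patterns. */

/* Address squeezed through a 32-bit unsigned: upper 4 bytes become 0. */
static uint64_t through_u32(uint64_t addr)
{
    uint32_t narrow = (uint32_t)addr;
    return (uint64_t)narrow;
}

/* Address squeezed through a signed 32-bit int: if bit 31 of the low word
 * is set, widening sign-extends and the upper 4 bytes become 0xffffffff
 * (assumes two's-complement wraparound on the narrowing cast). */
static uint64_t through_i32(uint64_t addr)
{
    int32_t narrow = (int32_t)(uint32_t)addr;
    return (uint64_t)(int64_t)narrow;
}
```

For example, 0x2000000000f15000 comes back as 0x0000000000f15000 from either helper, while 0x60000fffffff8100 comes back from through_i32 as 0xffffffffffff8100.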
If looking at parameters does not help, then you need to start putting
printf statements in the code at significant points to try to track this
down further.
An alternative to printf statements is to attach to the target (assuming
the target is the problem) with a debugger after DPCL has inserted probes
and started the application. Some debuggers allow you to use a 'force
attach' option to force the debugger to steal ptrace control away from the
other program, in this case DPCL, which already has ptrace control. Once
attached, set breakpoints and trace thru execution of the application.
As an aside, it looks like the problem here is that the storage at
*free_object_tail is zero, which is the basis for returning the error
status. Note that you have coded your printf statement with what looks
like intent to print out a 64-bit pointer, by use of the 0x%16x, but that
is not what you are really getting. In order to print a 64 bit pointer you
need to use either 0x%16llx or 0x%016llx, where the 0 following the % pads
the displayed value with leading zeroes. In this case, since what you are dereferencing by
*free_object_tail is an integer, you are actually printing the value of
the integer, although with 16 hex digits instead of 8.
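A short sketch of the formatting point above, using the <inttypes.h> macros so the width of the argument and the conversion always agree. The helper names are hypothetical; snprintf is used instead of printf only so the result can be inspected.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not DPCL code): format a 64-bit address as 16
 * zero-padded hex digits. To print a pointer p, pass
 * (uint64_t)(uintptr_t)p. Returns the snprintf result. */
static int format_addr64(char *out, size_t out_len, uint64_t addr)
{
    return snprintf(out, out_len, "0x%016" PRIx64, addr);
}

/* A 32-bit value only ever carries 8 hex digits, whatever the field width. */
static int format_u32(char *out, size_t out_len, unsigned int v)
{
    return snprintf(out, out_len, "0x%08x", v);
}
```

Passing an int-sized argument with a 64-bit conversion, or the reverse, is undefined behavior, so matching the cast to the format macro is the safer habit on a port like this one.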
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-03-02 17:49:05
|
Once again many thanks to DaveW for his continued support. He
has solved the 'soft .vs. hard' external puzzle and I have a
rather hefty clue as to why the supposed 'limit of 45' problem
exists. See below.
Thanks to all - SteveC
Original 3 DPCL questions:
1. Why does there seem to be a 'limit of 45 callbacks' in my
'sleep mutator' testcase?
Analysis:
The following code in ~dpcl/src/daemon_RT/src/os/linux/ShmManager.C,
routine shmFObjectAllocV, is shutting things down after 45 callbacks:
if ((*free_object_tail != FREE_OBJECT_MAGIC_PATTERN) ||
(p_free_object [i]->mask != FREE_OBJECT_MAGIC_PATTERN)) {
.....
*rc = MEM_BAD_FREE_LIST;
return NULL;
}
I inserted a printf as follows:
if ((*free_object_tail != FREE_OBJECT_MAGIC_PATTERN) ) {
printf("MEM_BAD_FREE_object tail bad 0x%16x 0x%16x\n",
*free_object_tail, FREE_OBJECT_MAGIC_PATTERN);
}
and got the following result:
MEM_BAD_FREE_object tail bad 0x 0 0x deadbeaf
It is not clear if the preceding pointer arithmetic is bad or the mask
is bad or just what is going wrong.
2. Code in the DPCL Library is causing 'unaligned access' errors.
Analysis:
An example of the code occurs in ~dpcl/src/lib/src/ModuleId.C
in routine 'ModuleId unpack_ModuleId(char **buffer)', to wit:
ModuleId
unpack_ModuleId(char **buffer)
{
char *data = *buffer;
char *uniqstr = data;
data = data + 1 + strlen(uniqstr); // don't forget the NULL character
int *uint_p = (int *) data;
data = data + sizeof(int);
ModuleId new_mid = ModuleId(uniqstr, *uint_p);
.....
This last statement which dereferences from an 'int' alignment (*uint_p)
seems to upset the ia64 hardware and I get something like this:
mutator(16567): unaligned access to 0x600000000000aa36, ip=0x2000000000277bb0
Resolution of this problem appears to require finding all the dubious code
which might cause such 'unaligned access' errors and rewriting it. Future
project for now since this is just a performance issue, not a functionality
issue.
3. Why can Dyninst find 'sleep' (the soft external) but DPCL cannot?
Answer: (DaveW's analysis)
The Hybrid uses a version of Dyninst that can find symbols in
dynamically shared/linked objects. DPCL has no such ability.
Mystery solved. Extending DPCL to handle dynamic/shared objects
has been on the 'list' from the beginning. Future project I
guess.
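On question 2, one common way to avoid the 'unaligned access' trap is to copy the bytes into a properly aligned local with memcpy instead of dereferencing a cast pointer. The sketch below follows the shape of the unpack routine quoted above but is not the DPCL source: unpack_name_and_int is a hypothetical name and it returns the raw int rather than building a ModuleId.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical rewrite (not the DPCL source): read the int that follows a
 * NUL-terminated string without dereferencing a misaligned int pointer.
 * Advances *buffer past both fields and hands back the string via *name. */
static int unpack_name_and_int(char **buffer, const char **name)
{
    char *data;
    int value;

    assert(buffer != NULL && *buffer != NULL && name != NULL);
    data = *buffer;
    *name = data;
    data += strlen(data) + 1;            /* skip the string and its NUL */
    memcpy(&value, data, sizeof value);  /* byte copy: no alignment needed */
    data += sizeof value;

    *buffer = data;
    return value;
}
```

Compilers generally turn a fixed-size memcpy like this into a plain load on hardware that tolerates misalignment, so the change should cost little on other platforms.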
|
|
From: Dave W. <dwo...@us...> - 2004-03-02 16:22:12
|
Steve
I looked at your test case and am still not totally sure what you are
trying to do. If I compile it the way you sent it, with the code to find
function 'mysleep' enabled, it works fine, although I did put a call to
sleep() in the mysleep function in the mutatee to slow it down so I could
connect to it.
If I change the mutator to look for __sleep, then it fails, stating it
can't find the symbol. I would expect it to fail, since the symbol __sleep
is defined only in libc, which is normally linked as a shared (dynamic)
library. Our version of BPatch/DPCL does not have the ability to find a
symbol defined anywhere other than in the target application's main
executable. We have no knowledge of any shared libraries that are loaded
into the mutatee process.
I did verify that DPCL can find the symbol __sleep by building a static
copy of libdpclRT.a and linking the mutatee with that library, using the
-static option and invoking the linker thru g++ rather than gcc (since
libdpclRT.a is C++ code). However my test caused a daemon crash, which may
be due to the setup on my system. I haven't looked into why.
As James noted, the hybrid DPCL which uses his Dyninst code is aware of
symbols in shared libraries, so it can find __sleep in the unmodified
testcase.
I'm also not getting any problems with the callback not running after 45
invocations, so we will need a log for that. One thing I did notice is
that if the dpcl daemon was left over from a previous run, then the
testcase would not run. Can you see if this occurs only when there is an
old dpcld daemon still running?
Dave
Steve Collins <sl...@sg...>
Sent by: dpc...@ww...
02/27/2004 12:37 PM
To: dpc...@ww...
cc: sl...@sg...
Subject: [Dpcl-develop] RE: Some DPCL questions
I'm really sorry about the e-mail mess. I think all our latest rif's
in the computer center are catching up with us. Oh well.
Good to hear from DaveW because I really didn't want to feel too all
alone with DPCL (heh,heh). I'll give my best responses to Dave's
questions and I guess we'll go from there.
1) DAVEW: There's no reason why callbacks should stop after 45 executions
of the probe expression. What happens here?
SLC: Everything runs to completion. No crashes. Nothing unusual. It's
just that the callback which is supposed to fire whenever I enter the
'sleep' program (actually 'mysleep', since 'sleep' has its own
problems as mentioned by Dave below) simply doesn't happen after the
45th trip thru 'mysleep'. Now I put a print in 'mysleep' to verify
that I AM in fact going thru 'mysleep' say, e.g., 5000 times, and
I am going thru 'mysleep' 5000 times. But the callback does not
fire off. BTW, the choice of 5000 is arbitrary. Any number > 46
will result in just 46 callbacks. No more, no less. Any number < 46
works just fine. If I call 'mysleep', e.g. 17 times, then I get
17 callbacks and life is good. NOTE: 'mysleep' is just a simple dummy
routine that spins in a 'for' loop for a bit and then returns. Gave
up on actually calling 'sleep' since we surmised that 'sleep' is
signal-driven and that requires more DPCL expertise than I have at
this time.
DAVEW: Does DPCL just hang, crash, or execute normally?
SLC: I think I answered this one above, i.e. no crashes, everything normal.
Just no callbacks after 46 trips thru 'mysleep'.
DAVEW: Probably the best thing to do is put a call to Ais_blog_on just before
you set up the probe and see what is going on. We should see probe exp
activity in the DPCL log.
SLC: Thanks for the tip. I'll definitely give this a try. Of all my known
gotchas right now, this 'callback' limit, or whatever, is the most
problematic because it kinda gets to the heart of DPCL. At least as I
understand it.
2) DAVEW: What happens with the unaligned references on ia64? Do you just
get a warning and DPCL continues, or does DPCL crash?
SLC: This is just a warning on ia64. A warning that performance is taking
a hit and you should redo the code. Functionality is not affected.
DAVEW: If we track these down, we are going to have to do this on a case
by case basis with most of them in the various message pack and unpack
functions where we are building messages between client and daemon.
I understand the reason ia64 is complaining about unaligned references,
but this is the first hardware/software I have seen lately that seems to
complain. If there are no serious side effects to the unaligned
references, is there possibly a function call that could be made to turn
off this warning?
SLC: The minute I saw these hideous messages I sent mail to our ia64apps
newsgroup and asked how I could turn the darn things off. I can't,
or so I'm told. Rewrite the code is about the only response I got.
Sigh.
3) DAVEW: I don't think you can do what you want, i.e. calling sleep()
with DPCL. The problem is that DPCL only knows about functions within
the target executable. It has no knowledge of functions in any shared
libraries. We get around this limitation in AIX by taking advantage of
how library calls are made in AIX applications. On AIX, a call to a
library function is made by calling a small stub module which gets linked
with the application, and the stub then makes the call to the real
library function. Since the stubs are linked with the application, we
can find the stub and put our probes there. Even with this solution,
DPCL can only call functions which are referenced by the application
already. If the application never calls sleep(), then DPCL cannot build
a probe expression to call it since the stub doesn't exist.
On ia32 Linux we can't take advantage of this solution since the generated
assembler code calls the library function directly. I suspect Linux on
ia64 works the same way.
SLC: My bad on this one. I didn't explain clearly what the dynamic probe
is trying to do. The 'mutatee' IS calling 'sleep()' (or, for now, the
dummy 'mysleep'-see NOTE above) and we are just trying to use DPCL
(across a cluster. BTW I've got things working across a partitioned
Altix -not 'sleep' but the 'mysleep' mutator I describe above, even
though it stops at 45 whether across a cluster or not - congrats to
DPCL designers/developers - awesome stuff!!).. trying to use DPCL to
insert a simple little callback at the entry to 'sleep' (or, for now
'mysleep') and this callback in the mutator just increments a global
Count variable which is printed when 'Ais_end_main_loop' is called
upon clean termination. I suspect the 'signal-driven' nature of
'sleep' is problematic.
Anyway, my original query was essentially this: if my mutatee
calls 'sleep', why does the mutator have to look for '__sleep'
in the 'bexpand' phase when I am searching for the entry point
in the symbols? As best we can tell, '__sleep' is the hard external
and 'sleep' is a soft external, which is not found. Dyninst can
find 'sleep' but DPCL can only find '__sleep' and not the soft
version of 'sleep'.
Sorry to be so wordy, Dave. Like you have said many times before,
debugging by email is decidedly non-optimal. But it's all I've been
able to arrange for thus far. Sigh.
SteveC - SGI Tools
Steve Collins <sl...@sg...>
Sent by: dpc...@ww...
02/26/2004 02:15 PM
To: dpc...@ww...
cc: sl...@sg..., per...@co...
Subject: [Dpcl-develop] Some DPCL questions
I've been having e-mail problems recently so I suspect my
postings here have not been received by everyone. So I am going
to re-post and hope for better results. I am currently trying to
address 3 DPCL issues as described below. Thanks as always to
DaveW, JamesW and others who might relieve my current clueless
state regarding any or all of these DPCL issues.
SteveC
SGI Tools
1.
I have a DPCL testcase (aka Dyninst mutator) that simply attempts
to insert a call to 'sleep' into the 'mutatee'. This seems to work
great until I ask for more than 45 calls to 'sleep'. Even if I ask
for 4000 calls or just 47 calls to sleep, I only get 45 'callbacks'
to occur to my mutator (or client or tool, etc.). It's like the
callback mechanism works for some limit of approximately 45 times.
I KNOW that I am entering the 'sleep' function 5000 times but only
45 callbacks from the instrumentation are occurring. Only 45 callbacks
occur if I just specify 47 calls to sleep. There <seems> to be a
barrier at '45' when it comes to the number of callbacks that will
or can occur. I have no clue on this one.
2.
Some of the DPCL code (example provided below) is causing the
IA64 hardware to emit 'unaligned access' errors to the screen.
An example of the code occurs in ~dpcl/src/lib/src/ModuleId.C
in routine 'ModuleId unpack_ModuleId(char **buffer)', to wit:
ModuleId
unpack_ModuleId(char **buffer)
{
    char *data = *buffer;
    char *uniqstr = data;
    data = data + 1 + strlen(uniqstr); // don't forget the NULL character
    int *uint_p = (int *) data;
    data = data + sizeof(int);
    ModuleId new_mid = ModuleId(uniqstr, *uint_p);
    .....
This last statement, which dereferences an 'int' pointer that is not
int-aligned (*uint_p), seems to upset the ia64 hardware and I get
something like this:
mutator(16567): unaligned access to 0x600000000000aa36, ip=0x2000000000277bb0
Now this code can be rewritten to avoid this hardware complaint, but it
is probably a change that needs to be made in a number of places. Getting
changes accepted on a voluminous scale would seem to be problematic. Is
that true, or am I just being a little paranoid?
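One common way to rewrite this kind of unpack code is to copy the bytes out with memcpy instead of casting the buffer pointer to 'int *'. The following is only a minimal sketch under assumptions: 'unpack_string_and_int' is a hypothetical stand-in (the real unpack_ModuleId and the ModuleId class are not reproduced here), and advancing the caller's cursor is assumed rather than taken from the elided part of the DPCL source.

```cpp
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for unpack_ModuleId: reads a NUL-terminated string
// followed by an int that may start at any byte offset in the buffer.
std::pair<std::string, int> unpack_string_and_int(const char **buffer)
{
    const char *data = *buffer;
    std::string uniqstr(data);      // copies up to (not including) the NUL
    data += uniqstr.size() + 1;     // skip the string and its NUL terminator

    int value;
    std::memcpy(&value, data, sizeof value);  // byte-wise copy, no aligned load needed
    data += sizeof value;

    *buffer = data;                 // assumption: caller's cursor is advanced
    return {uniqstr, value};
}
```

The same pattern would have to be applied to each pack/unpack routine that casts into the message buffer, which is why this is a change in a number of places rather than a single fix.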
3.
I have a DPCL testcase (aka Dyninst mutator) that simply attempts
to insert a call to 'sleep' into the 'mutatee'. We have discovered that
'weak' symbols such as 'sleep' are not found by DPCL but they are found
by Dyninst. To wit:
The problem with sleep() rather than __sleep() is that the former is a
weak symbol:
[hope] /scratch/wdh/Test/DPCL: objdump -t /lib/libc.so.6.1 | grep "sleep"
...
0000000000160890 l F .text 00000000000003a0 __sleep
...
0000000000160890 w F .text 00000000000003a0 sleep
...
Now I'm not sure why Dyninst works fine with the weak sleep(), but DPCL
doesn't.
(Note: the above analysis by Bill Hachfeld at SGI)
DaveW/James - have you seen any problem of this sort in the past involving
'weak' symbols? For reminders, we are running on an ia64 machine.
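For reference, the 'w' flag in the objdump output corresponds to the STB_WEAK binding stored in each ELF symbol's st_info field. The sketch below only illustrates how a symbol lookup can either skip or accept weak entries; it is not DPCL's or Dyninst's actual lookup code, and 'NamedSym', 'make_sym' and 'find_function' are hypothetical names made up for the example.

```cpp
#include <elf.h>
#include <cstring>
#include <vector>

// Hypothetical in-memory record; a real tool would read these entries from
// the .symtab/.dynsym sections of the executable or shared library.
struct NamedSym {
    const char *name;
    Elf64_Sym   sym;
};

// Builds a function symbol with the given binding (STB_GLOBAL, STB_WEAK, ...).
NamedSym make_sym(const char *name, unsigned char bind,
                  Elf64_Addr value, Elf64_Xword size)
{
    NamedSym s{};
    s.name = name;
    s.sym.st_info  = ELF64_ST_INFO(bind, STT_FUNC);
    s.sym.st_value = value;
    s.sym.st_size  = size;
    return s;
}

// Looks up a function by name. With accept_weak == false only STB_GLOBAL
// symbols match; with accept_weak == true STB_WEAK symbols match as well.
const Elf64_Sym *find_function(const std::vector<NamedSym> &table,
                               const char *name, bool accept_weak)
{
    for (const NamedSym &entry : table) {
        unsigned bind = ELF64_ST_BIND(entry.sym.st_info);
        unsigned type = ELF64_ST_TYPE(entry.sym.st_info);
        bool binding_ok = (bind == STB_GLOBAL) || (accept_weak && bind == STB_WEAK);
        if (binding_ok && type == STT_FUNC && std::strcmp(entry.name, name) == 0)
            return &entry.sym;
    }
    return nullptr;
}
```

A lookup that filters on STB_GLOBAL alone would miss a weak 'sleep' entry like the one shown above, which is one possible (unconfirmed) explanation for the difference between the two tools.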
_______________________________________________
Dpcl-develop mailing list
Dpc...@ww...
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/dpcl-develop
|
|
From: James W. <ja...@cs...> - 2004-03-01 21:59:30
|
> 3) I don't think you can do what you want, i.e. calling sleep() with DPCL.
> The problem is that DPCL only knows about functions within the target
> executable. It has no knowledge of functions in any shared libraries. We
This is one area where the hybrid should be different from pure IBM-DPCL in
that MD-Dyninst treats all shared libraries as modules and allows constituent
functions to be called just as any other function in the executable.
One thing to try would be to insert a call to nanosleep() instead of sleep().
At least on x86-linux, this promises to not affect any signals. There is
unfortunately some additional complexity since the arg to nanosleep() is
(struct timespec *). The easiest way to insert such a call would probably be
to have the call to nanosleep() in a probe module, which can be inserted into
the target program and called via ProbeExp.
Not sure yet about the weak symbol problems, but, for the hybrid's purposes,
any function that can be seen by MD-Dyninst should be exported through the
DPCL API.
-J
|
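The nanosleep() suggestion could look roughly like the wrapper below, which hides the (struct timespec *) argument behind a plain integer so that a probe expression only has to pass one int. This is a sketch under assumptions: 'probe_sleep_ms' is a hypothetical name, and how the function gets built into a probe module and loaded into the target is not shown.

```cpp
#include <cerrno>
#include <time.h>

// Hypothetical probe-module entry point: sleeps for the given number of
// milliseconds. Returns 0 on success and -1 on a bad argument or error.
extern "C" int probe_sleep_ms(int milliseconds)
{
    if (milliseconds < 0)
        return -1;

    struct timespec req;
    req.tv_sec  = milliseconds / 1000;
    req.tv_nsec = (milliseconds % 1000) * 1000000L;

    struct timespec rem;
    // nanosleep() returns -1 with errno == EINTR when a signal handler
    // interrupts it; in that case resume with the time still remaining.
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;
    }
    return 0;
}
```

Taking an integer argument keeps the probe expression simple, since the timespec structure is built entirely inside the target process.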
|
From: Dave W. <dwo...@us...> - 2004-02-27 21:19:03
|
Steve
Can you send me the test case for your third question (sleep vs __sleep).
I'll need the source for your target or an explanation of how to build a
simple one and the source for your client. I'll find some time next week
to look at it. You can send it directly to me rather than to the mailing
list.
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-02-27 20:37:36
|
I'm really sorry about the e-mail mess. I think all our latest rif's
in the computer center are catching up with us. Oh well.
Good to hear from DaveW because I really didn't want to feel too all
alone with DPCL (heh,heh). I'll give my best responses to Dave's
questions and I guess we'll go from there.
1) DAVEW: There's no reason why callbacks should stop after 45 executions
of the probe expression. What happens here?
SLC: Everything runs to completion. No crashes. Nothing unusual. It's
just that the callback which is supposed to fire whenever I enter the
'sleep' program (actually 'mysleep', since 'sleep' has its own
problems as mentioned by Dave below) simply doesn't happen after the
45th trip thru 'mysleep'. Now I put a print in 'mysleep' to verify
that I AM in fact going thru 'mysleep' say, e.g., 5000 times, and
I am going thru 'mysleep' 5000 times. But the callback does not
fire off. BTW, the choice of 5000 is arbitrary. Any number > 46
will result in just 46 callbacks. No more, no less. Any number < 46
works just fine. If I call 'mysleep', e.g. 17 times, then I get
17 callbacks and life is good. NOTE: 'mysleep' is just a simple dummy
routine that spins in a 'for' loop for a bit and then returns. Gave
up on actually calling 'sleep' since we surmised that 'sleep' is
signal-driven and that requires more DPCL expertise than I have at
this time.
DAVEW: Does DPCL just hang, crash, or execute normally?
SLC: I think I answered this one above, i.e. no crashes, everything normal.
Just no callbacks after 46 trips thru 'mysleep'.
DAVEW: Probably the best thing to do is put a call to Ais_blog_on just before
you set up the probe and see what is going on. We should see probe exp
activity in the DPCL log.
SLC: Thanks for the tip. I'll definitely give this a try. Of all my known
gotchas right now, this 'callback' limit, or whatever, is the most
problematic because it kinda gets to the heart of DPCL. At least as I
understand it.
2) DAVEW: What happens with the unaligned references on ia64? Do you just
get a warning and DPCL continues, or does DPCL crash?
SLC: This is just a warning on ia64. A warning that performance is taking
a hit and you should redo the code. Functionality is not affected.
DAVEW: If we track these down, we are going to have to do this on a case
by case basis with most of them in the various message pack and unpack
functions where we are building messages between client and daemon.
I understand the reason ia64 is complaining about unaligned references,
but this is the first hardware/software I have seen lately that seems to
complain. If there are no serious side effects to the unaligned
references, is there possibly a function call that could be made to turn
off this warning?
SLC: The minute I saw these hideous messages I sent mail to our ia64apps
newsgroup and asked how I could turn the darn things off. I can't,
or so I'm told. Rewrite the code is about the only response I got.
Sigh.
3) DAVEW: I don't think you can do what you want, i.e. calling sleep()
with DPCL. The problem is that DPCL only knows about functions within
the target executable. It has no knowledge of functions in any shared
libraries. We get around this limitation in AIX by taking advantage of
how library calls are made in AIX applications. On AIX, a call to a
library function is made by calling a small stub module which gets linked
with the application, and the stub then makes the call to the real
library function. Since the stubs are linked with the application, we
can find the stub and put our probes there. Even with this solution,
DPCL can only call functions which are referenced by the application
already. If the application never calls sleep(), then DPCL cannot build
a probe expression to call it since the stub doesn't exist.
On ia32 Linux we can't take advantage of this solution since the generated
assembler code calls the library function directly. I suspect Linux on
ia64 works the same way.
SLC: My bad on this one. I didn't explain clearly what the dynamic probe
is trying to do. The 'mutatee' IS calling 'sleep()' (or, for now, the
dummy 'mysleep'-see NOTE above) and we are just trying to use DPCL
(across a cluster. BTW I've got things working across a partitioned
Altix -not 'sleep' but the 'mysleep' mutator I describe above, even
though it stops at 45 whether across a cluster or not - congrats to
DPCL designers/developers - awesome stuff!!).. trying to use DPCL to
insert a simple little callback at the entry to 'sleep' (or, for now
'mysleep') and this callback in the mutator just increments a global
Count variable which is printed when 'Ais_end_main_loop' is called
upon clean termination. I suspect the 'signal-driven' nature of
'sleep' is problematic.
Anyway, my original query was essentially this: if my mutatee
calls 'sleep', why does the mutator have to look for '__sleep'
in the 'bexpand' phase when I am searching for the entry point
in the symbols? As best we can tell, '__sleep' is the hard external
and 'sleep' is a soft external, which is not found. Dyninst can
find 'sleep' but DPCL can only find '__sleep' and not the soft
version of 'sleep'.
Sorry to be so wordy, Dave. Like you have said many times before,
debugging by email is decidedly non-optimal. But it's all I've been
able to arrange for thus far. Sigh.
SteveC - SGI Tools
|
|
From: Dave W. <dwo...@us...> - 2004-02-27 19:45:26
|
Steve
Your email has not been getting thru until this one. In response to your
questions:
1) There's no reason why callbacks should stop after 45 executions of the
probe expression. What happens here? Does DPCL just hang, crash, or
execute normally? Probably the best thing to do is put a call to
Ais_blog_on just before you set up the probe and see what is going on. We
should see probe exp activity in the DPCL log.
2) What happens with the unaligned references on ia64? Do you just get a
warning and DPCL continues, or does DPCL crash? If we track these down, we
are going to have to do this on a case by case basis with most of them in
the various message pack and unpack functions where we are building
messages between client and daemon.
I understand the reason ia64 is complaining about unaligned references,
but this is the first hardware/software I have seen lately that seems to
complain. If there are no serious side effects to the unaligned
references, is there possibly a function call that could be made to turn
off this warning?
3) I don't think you can do what you want, i.e. calling sleep() with DPCL.
The problem is that DPCL only knows about functions within the target
executable. It has no knowledge of functions in any shared libraries. We
get around this limitation in AIX by taking advantage of how library calls
are made in AIX applications. On AIX, a call to a library function is made
by calling a small stub module which gets linked with the application, and
the stub then makes the call to the real library function. Since the stubs
are linked with the application, we can find the stub and put our probes
there. Even with this solution, DPCL can only call functions which are
referenced by the application already. If the application never calls
sleep(), then DPCL cannot build a probe expression to call it since the
stub doesn't exist.
On ia32 Linux we can't take advantage of this solution since the generated
assembler code calls the library function directly. I suspect Linux on
ia64 works the same way.
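On the question in 2) about a function call to turn the warning off: Linux has a per-process prctl() option, PR_SET_UNALIGN, whose PR_UNALIGN_NOPRINT value asks the kernel to fix up unaligned user accesses without logging them. A minimal sketch follows, assuming the kernel on the ia64 system in question supports that option (nothing in this thread confirms it); 'silence_unaligned_messages' is a hypothetical helper name. The accesses would still be fixed up in software, so the performance cost would remain.

```cpp
#include <cerrno>
#include <sys/prctl.h>

// Hypothetical helper: asks the kernel to stop printing "unaligned access"
// messages for this process. Returns 0 on success, otherwise the errno value
// (for example EINVAL on architectures that do not offer this control).
int silence_unaligned_messages()
{
#if defined(PR_SET_UNALIGN) && defined(PR_UNALIGN_NOPRINT)
    if (prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT, 0, 0, 0) == 0)
        return 0;
    return errno;
#else
    return ENOSYS;   // the headers on this system do not define the option
#endif
}
```

The call would need to be made early in each process that produces the messages, i.e. the client and possibly the daemon.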
Dave
|
|
From: Steve C. <sl...@sg...> - 2004-02-26 22:15:09
|
I've been having e-mail problems recently so I suspect my
postings here have not been received by everyone. So I am going
to re-post and hope for better results. I am currently trying to
address 3 DPCL issues as described below. Thanks as always to
DaveW, JamesW and others who might relieve my current clueless
state regarding any or all of these DPCL issues.
SteveC
SGI Tools
1.
I have a DPCL testcase (aka Dyninst mutator) that simply attempts
to insert a call to 'sleep' into the 'mutatee'. This seems to work
great until I ask for more than 45 calls to 'sleep'. Even if I ask
for 4000 calls or just 47 calls to sleep, I only get 45 'callbacks'
to occur to my mutator (or client or tool, etc.). It's like the
callback mechanism works for some limit of approximately 45 times.
I KNOW that I am entering the 'sleep' function 5000 times but only
45 callbacks from the instrumentation are occurring. Only 45 callbacks
occur if I just specify 47 calls to sleep. There <seems> to be a
barrier at '45' when it comes to the number of callbacks that will
or can occur. I have no clue on this one.
2.
Some of the DPCL code (example provided below) is causing the
IA64 hardware to emit 'unaligned access' errors to the screen.
An example of the code occurs in ~dpcl/src/lib/src/ModuleId.C
in routine 'ModuleId unpack_ModuleId(char **buffer)', to wit:
ModuleId
unpack_ModuleId(char **buffer)
{
    char *data = *buffer;
    char *uniqstr = data;
    data = data + 1 + strlen(uniqstr); // don't forget the NULL character
    int *uint_p = (int *) data;
    data = data + sizeof(int);
    ModuleId new_mid = ModuleId(uniqstr, *uint_p);
    .....
This last statement, which dereferences an 'int' pointer (*uint_p) that is
not int-aligned, seems to upset the ia64 hardware and I get something like this:
mutator(16567): unaligned access to 0x600000000000aa36, ip=0x2000000000277bb0
Now this code can be rewritten to avoid this hardware complaint, but it
is probably a change that needs to be made in a number of places. Getting
changes accepted on a voluminous scale would seem to be problematic. Is
that true, or am I just being a little paranoid?
3.
I have a DPCL testcase (aka Dyninst mutator) that simply attempts
to insert a call to 'sleep' into the 'mutatee'. We have discovered that
'weak' symbols such as 'sleep' are not found by DPCL but they are found
by Dyninst. To wit:
The problem with sleep() rather than __sleep() is that the former is a
weak symbol:
[hope] /scratch/wdh/Test/DPCL: objdump -t /lib/libc.so.6.1 | grep "sleep"
...
0000000000160890 l F .text 00000000000003a0 __sleep
...
0000000000160890 w F .text 00000000000003a0 sleep
...
Now I'm not sure why Dyninst works fine with the weak sleep(), but DPCL
doesn't.
(Note: the above analysis by Bill Hachfeld at SGI)
DaveW/James - have you seen any problem of this sort in the past involving
'weak' symbols? As a reminder, we are running on an ia64 machine.
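On item 2: a common way to read an integer out of a packed byte buffer without tripping alignment checks is to copy it with memcpy instead of dereferencing a cast pointer. The following is a minimal sketch under that assumption, not a patch to ModuleId.C. `unpack_int` and `unpack_cstring` are hypothetical helper names that do not exist in DPCL.

```cpp
#include <cassert>
#include <cstring>

// Return the NUL-terminated string at the cursor and advance the cursor
// past it, including the terminating NUL.
const char *unpack_cstring(char **cursor) {
    assert(cursor != nullptr && *cursor != nullptr);
    const char *str = *cursor;
    *cursor += std::strlen(str) + 1;
    return str;
}

// Read an int stored at an arbitrary, possibly unaligned, position and
// advance the cursor past it. memcpy copies byte by byte, so the source
// address does not need to be int-aligned.
int unpack_int(char **cursor) {
    assert(cursor != nullptr && *cursor != nullptr);
    int value;
    std::memcpy(&value, *cursor, sizeof value);
    *cursor += sizeof value;
    return value;
}
```

A helper of this kind would let each unpack routine replace its cast-and-dereference with a single call, which might keep a change that touches many places mechanical and easy to review.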
|
|
From: Nikhil B. <bh...@cs...> - 2004-02-25 09:32:35
|
Hi,
I am using DPCL to connect to an application in binary form to instrument
it, but I was curious whether it is possible for DPCL to attach to an
already running application, e.g. an application started as a batch job
with LoadLeveler.
Thanks,
Best Regards,
Nikhil
***********************************************************************
Nikhil Bhatia                  Contact Information:
cl307, Claxton 203             Phone Numbers:
Dept. of Computer Science      865-742-3649 (C)
University of Tennessee        865-974-0517 (O)
Knoxville, TN                  email address:
USA                            bh...@cs...
***********************************************************************
|
|
From: Dave W. <dwo...@us...> - 2004-01-30 22:37:54
|
Steve
There is no compelling reason for you to pick up the hybrid code strictly
for the 64 bit library build directory change right now. This was done
solely to put in place some build structure changes that are needed for
systems which support both 32 bit and 64 bit processes running at the same
time, the primary system at this time being AIX. For systems such as i386
Linux, which supports only 32 bit processes, there is no benefit.
I understand ia64 Linux supports both i386 32 bit processes in emulation
mode and ia64 64 bit processes in native mode. That will introduce
complications of its own, since that means supporting two different
instruction architectures on the same machine. At a minimum it means the
build needs to build the library in i386 mode with one set of compiler and
associated tools and then build it again in ia64 mode using the native
compiler and associated tools. I suggest we not worry about that
combination right now.
Note that this does not mean you should avoid the hybrid code, since the
only changes are to build structure and you should be able to build the
ia64 DPCL using the new build structure. If there are changes in the
hybrid code that you are waiting for, then that is a good reason to use the
latest hybrid code.
Dave
Steve Collins <sl...@cl...>
Sent by: dpc...@ww...
01/30/2004 02:20 PM
To: dpc...@ww...
cc: sl...@cl...
Subject: [Dpcl-develop] 64-bit DPCL
JamesW informed me recently that the hybrid_121503_2.tgz has
the 64 bit library build directory from DaveW. Should I pick this
up now or wait until all the 64 bit work is done?
My status right now is that I have a simple DPCL program which
does no more than a 'bcreate' and I have run this successfully
'off chip' between two partitions on an in-house 64-bit Altix.
Exciting stuff. But now I have a little bit more of a DPCL test
which does the 'bcreate' but also a 'bexpand', 'get_program_object',
'bexpand' and 'binstall_probe' of a simple expression that does
nothing more than insert a call to the 'sleep' function. Now I
realize I may be way ahead of myself here, but is it worth
getting hybrid_121503_2.tgz and its 64-bit code and will that
be enough to successfully do the 'sleep' instrumentation?
As always, thanks to DaveW, JamesW, AlbertF, ToddM for their
incredible support.
Thanks again - SteveC
SGI Compilers/Tools
|
|
From: Steve C. <sl...@cl...> - 2004-01-30 22:23:08
|
JamesW informed me recently that the hybrid_121503_2.tgz has
the 64 bit library build directory from DaveW. Should I pick this
up now or wait until all the 64 bit work is done?
My status right now is that I have a simple DPCL program which
does no more than a 'bcreate' and I have run this successfully
'off chip' between two partitions on an in-house 64-bit Altix.
Exciting stuff. But now I have a little bit more of a DPCL test
which does the 'bcreate' but also a 'bexpand', 'get_program_object',
'bexpand' and 'binstall_probe' of a simple expression that does
nothing more than insert a call to the 'sleep' function. Now I
realize I may be way ahead of myself here, but is it worth
getting hybrid_121503_2.tgz and its 64-bit code and will that
be enough to successfully do the 'sleep' instrumentation?
As always, thanks to DaveW, JamesW, AlbertF, ToddM for their
incredible support.
Thanks again - SteveC
SGI Compilers/Tools
|
|
From: Dave W. <dwo...@us...> - 2004-01-29 23:23:06
|
Steve
The DPCL daemon log should not be deleted for any reason. If Ais_blog_on
successfully starts the log, it should be present in /tmp (by default,
overridable by changing the path in /etc/xinetd.d.dpclSD). Are you certain
that the DPCL daemon is running to the point that logging is started? Note
that if the daemon does not get far enough to start handling requests from
the client, then logging will not start.
The DPCL super daemon log does get deleted, but only on successful exit by
the super daemon.
The only unlink call in the DPCL code that could be deleting any log file
is in the DPCL super daemon.
Is the file in fact gone, or is it zero length?
What are the symptoms of the failure?
Dave
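The policy described above for the super daemon log (removed only on a successful exit) can be sketched as follows. This is an illustration of the idea only, not the DPCL implementation: `cleanup_log_on_exit` is a hypothetical name, and the real code calls unlink from within the super daemon.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Remove the log file only when the process is exiting cleanly.
// On any non-zero exit status the file is left in place for post-mortem
// inspection. Returns true if the file was removed.
bool cleanup_log_on_exit(const std::string& log_path, int exit_status) {
    assert(!log_path.empty());
    if (exit_status != 0) {
        return false;  // abnormal exit: keep the log
    }
    return std::remove(log_path.c_str()) == 0;
}
```

Under a policy like this a crash never reaches the cleanup call at all, so a log that has been opened should still be on disk afterwards. That is why the questions above focus on whether logging was ever started.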
Steve Collins <sl...@cl...>
Sent by: dpc...@ww...
01/29/2004 03:01 PM
To: dpc...@ww...
cc: sl...@cl...
Subject: [Dpcl-develop] daemon log file
Is there a way to ensure the dpclsd.nnnn file is not thrown away
when the daemon crashes?? The DEBUG_DAEMON isn't overly helpful and
I know a lot of 'log_write's are being done and then being discarded.
Thanks for any help.
SteveC - SGI
|
|
From: Steve C. <sl...@cl...> - 2004-01-29 23:04:48
|
Is there a way to ensure the dpclsd.nnnn file is not thrown away
when the daemon crashes?? The DEBUG_DAEMON isn't overly helpful and
I know a lot of 'log_write's are being done and then being discarded.
Thanks for any help.
SteveC - SGI
|
|
From: Steve C. <sl...@cl...> - 2004-01-28 00:17:31
|
The previous version that I sent out apparently wasn't quite right and the
new gcc 3.3.2 compile caught the problem. The following version (using named
operands ONLY - no numbered operands) seems to work well:
#define __atomic_fool_gcc(x) (*(volatile struct { int a[100]; } *)x)

int
_check_lock(atomic_p addr, int old_val, int new_val)
{
    long interim_val = 777;
    int result = 0; /* default is success */

    __asm__ __volatile__ ( "mf");
    __asm__ __volatile__ (
        "mov ar.ccv = %[old]"
        :
        : [old] "r"(old_val));
    __asm__ __volatile__ (
        "cmpxchg4.rel %[interim] =%[mem],%[new],ar.ccv"
        : [interim] "=r" (interim_val)
        : [mem] "m" (__atomic_fool_gcc(addr)), [new] "r" (new_val));
    __asm__ __volatile__ ( "mf");

    if( interim_val == old_val) {
        /* lock memory value was old_val */
        /* exchange to new_val done - new lock in place */
        result = 0; /* false */
    }
    else {
        /* lock memory value was NOT old_val */
        result = 1; /* true */
    }
    return result;
}
I am now back to where I was with gcc 2.96 and, actually, a little further.
The following daemon log shows some real things happening, but of course
the 'attachProcess' does not work with the present IA64 Dyninst and things
eventually break. But for grins, here is the logfile:
Tue Jan 27 14:08:22 2004: opened log /tmp/dpclsd.14530
@Timing started connect:14525
enter connect_cb: key(66) connect_stopped(0) pid(14525) client(0)
connect_cb(): create a new ProcessD for pid 14525
PModEntryInt (0x 25260)
enter ProcessD::connect()
cannot find the specified client socket 0, maybe it is a new one
popen found "/tmp/slc/hybrid_121503/dpcl/src/samples/hello/hello
"
bpatch attach: /tmp/slc/hybrid_121503/dpcl/src/samples/hello/hello 14525
enter prepare_new_process_shm()
enter locate_process_shm()
Dyninst-MD message: #0 (level 3): "Processing an executable file"
Dyninst-MD message: #0 (level 3): "Parsing object file: /proc/14525/exe"
Dyninst-MD message: #0 (level 3): "sorting modules"
Dyninst-MD message: #0 (level 3): "winnowing functions"
Dyninst-MD message: #0 (level 3): "defining modules"
Dyninst-MD message: #0 (level 3): "ready"
Dyninst-MD message: #0 (level 3): "PID=14525, initializing shared objects"
Dyninst-MD message: #0 (level 3): "parsing shared object files"
Dyninst-MD message: #0 (level 3): "Processing a shared object file"
Dyninst-MD message: #0 (level 3): "Parsing object file: /lib/libc.so.6.1"
Dyninst-MD message: #0 (level 3): "sorting modules"
Dyninst-MD message: #0 (level 3): "winnowing functions"
Tue Jan 27 14:08:23 2004: Dyninst-MD message: #0 (level 3): "defining modules"
Dyninst-MD message: #0 (level 3): "ready"
Dyninst-MD message: #0 (level 3): "Processing a shared object file"
Dyninst-MD message: #0 (level 3): "Parsing object file: /lib/ld-linux-ia64.so.2"
Dyninst-MD message: #0 (level 3): "sorting modules"
Dyninst-MD message: #0 (level 3): "winnowing functions"
Dyninst-MD message: #0 (level 3): "defining modules"
Dyninst-MD message: #0 (level 3): "ready"
Dyninst-MD message: #0 (level 3): "PID=14525, loading dyninst library"
Dyninst-MD message: #0 (level 2): "process 14525 has terminated on signal 11
"
Dyninst-MD message: #0 (level 3): "process 14525 has terminated on signal 11
"
Dyninst-MD message: #0 (level 2): " 0 total points used
"
Dyninst-MD message: #0 (level 2): " 0 mini-tramps used
"
Dyninst-MD message: #0 (level 2): " 0 tramp bytes
"
Dyninst-MD message: #0 (level 2): " 1 ptrace other calls
"
Dyninst-MD message: #0 (level 2): " 56 ptrace write calls
"
Dyninst-MD message: #0 (level 2): " 1630 ptrace bytes written
"
Dyninst-MD message: #0 (level 2): " 0 instructions generated
"
Dyninst-MD message: #0 (level 2): " 0.000000 time used to generate instrumentation
Thanks to DaveW/Albert/BillH for their copious help!!
SteveC
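For comparison with the `_check_lock` routine earlier in this message, the same compare-and-swap contract can be written portably with C++11 atomics, which postdate this thread. This is a sketch for readers who want to experiment on other platforms, not a replacement for the ia64 assembly. `check_lock_portable` is a hypothetical name, and it takes a `std::atomic<int>*` rather than DPCL's `atomic_p`.

```cpp
#include <atomic>
#include <cassert>

// Same return convention as _check_lock above:
//   0 (false) if *addr held old_val and was atomically set to new_val,
//   1 (true)  if *addr held some other value and was left unchanged.
int check_lock_portable(std::atomic<int> *addr, int old_val, int new_val) {
    assert(addr != nullptr);
    int expected = old_val;
    // Sequentially consistent ordering stands in for the surrounding
    // "mf" fences in the ia64 version.
    const bool swapped = addr->compare_exchange_strong(
        expected, new_val, std::memory_order_seq_cst);
    return swapped ? 0 : 1;
}
```

The first caller to move a lock word from 0 to 1 gets 0 back. A second caller attempting the same transition gets 1 and the lock word stays at 1.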
|