|
From: Nuno L. <nun...@sa...> - 2008-05-01 22:27:39
|
Hi,

Just a quick note to say that trunk's drd goes crazy on ppc64. Running the
regtests on a PS3 leads to memory thrashing, as even the simple tests make
drd consume GBs of memory, filling all the swap.
This is a regression, as a few weeks ago the tests would run successfully.

Regards,
Nuno
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-02 06:01:53
|
On Fri, May 2, 2008 at 12:27 AM, Nuno Lopes <nun...@sa...> wrote:
>
> Just a quick note to say that trunk's drd goes crazy on ppc64. Running the
> regtests on a PS3 leads to memory thrashing, as even the simple tests make
> drd consume GBs of memory, filling all the swap.
> This is a regression, as a few weeks ago the tests would run successfully.

Thanks for reporting this. Unfortunately I do not have access to any
ppc32 or ppc64 hardware. I assume you are working with the Subversion
trunk ? It would already be a help if you could tell me the revision
numbers of the version that works and the version that doesn't work
properly.

Bart.
|
|
From: Nuno L. <nun...@sa...> - 2008-05-02 14:34:44
|
>> Just a quick note to say that trunk's drd goes crazy on ppc64. Running
>> the regtests on a PS3 leads to memory thrashing, as even the simple
>> tests make drd consume GBs of memory, filling all the swap.
>> This is a regression, as a few weeks ago the tests would run
>> successfully.
>
> Thanks for reporting this. Unfortunately I do not have access to any
> ppc32 or ppc64 hardware. I assume you are working with the Subversion
> trunk ? It would already be a help if you could tell me the revision
> numbers of the version that works and the version that doesn't work
> properly.

Yes, I was working with latest trunk.
Doing some binary search on the revisions I found the following:
- r7839 (01/April) is the latest good revision
- r7840 crashes with the error below
- r7841 and newer revisions consume at least 1 GB of RAM even with a simple
  hello world

exp-drd: drd_thread.c:326 (thread_set_pthreadid): Assertion 'ptid !=
INVALID_POSIX_THREADID' failed.
==18697==    at 0x38019888: report_and_quit (m_libcassert.c:140)
==18697==    by 0x38019C00: vgPlain_assert_fail (m_libcassert.c:205)
==18697==    by 0x38010DC4: thread_set_pthreadid (drd_thread.c:326)
==18697==    by 0x3800777C: drd_handle_client_request (drd_clientreq.c:121)
==18697==    by 0x3803DEF0: do_client_request (scheduler.c:1388)
==18697==    by 0x3803F934: vgPlain_scheduler (scheduler.c:987)
==18697==    by 0x3805501C: run_a_thread_NORETURN (syswrap-linux.c:89)

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable
==18697==    at 0xFF6E06C: _init (drd_pthread_intercepts.c:237)
==18697==    by 0xFFBE6F8: call_init (in /lib/ld-2.6.so)
==18697==    by 0xFFBE8BC: _dl_init (in /lib/ld-2.6.so)
==18697==    by 0xFFC6B70: _start (in /lib/ld-2.6.so)

I'm sorry but I can't give you access to the PS3, as it isn't mine either.
Anyway I can test the patches for you if needed.

Regards,
Nuno
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-02 17:36:00
|
On Fri, May 2, 2008 at 4:34 PM, Nuno Lopes <nun...@sa...> wrote:
> Yes, I was working with latest trunk.
> Doing some binary search on the revisions I found the following:
> - r7839 (01/April) is the latest good revision
> - r7840 crashes with the error below
> - r7841 and newer revisions consume at least 1 GB of RAM even with a
> simple hello world
>
> exp-drd: drd_thread.c:326 (thread_set_pthreadid): Assertion 'ptid !=
> INVALID_POSIX_THREADID' failed.

Was the above assert triggered by a client program that was not linked
against libpthread.so ? For such programs drd_pthread_intercepts.c still
calls pthread_self(), but reaches the version provided by glibc. This
version of pthread_self() returns 0, which triggered the assert. More
recent versions of exp-drd correctly handle single-threaded programs that
are not linked against libpthread.so.

The actual problem (r7841 and later) might be triggered by the function
highest_used_stack_address(), more specifically if the stack pointer this
function obtains via VG_(get_StackTrace)() is wrong. I have added an
assert statement in highest_used_stack_address() that checks the validity
of the obtained stack pointer. Can you please try to run the regression
tests on the latest trunk version (revision 7990) ? An assert statement
should now be triggered instead of the consumption of lots of RAM.

Bart.
|
|
From: Nuno L. <nun...@sa...> - 2008-05-02 17:57:15
|
> On Fri, May 2, 2008 at 4:34 PM, Nuno Lopes <nun...@sa...> wrote:
>> Yes, I was working with latest trunk.
>> Doing some binary search on the revisions I found the following:
>> - r7839 (01/April) is the latest good revision
>> - r7840 crashes with the error below
>> - r7841 and newer revisions consume at least 1 GB of RAM even with a
>> simple hello world
>>
>> exp-drd: drd_thread.c:326 (thread_set_pthreadid): Assertion 'ptid !=
>> INVALID_POSIX_THREADID' failed.
>
> Was the above assert triggered by a client program that was not linked
> against libpthread.so ?

yes.

> The actual problem (r7841 and later) might be triggered by the
> function highest_used_stack_address(), more specifically if the stack
> pointer this function obtains via VG_(get_StackTrace)() is wrong. I
> have added an assert statement in highest_used_stack_address() that
> checks the validity of the obtained stack pointer. Can you please try
> to run the regression tests on the latest trunk version (revision
> 7990) ? An assert statement should now be triggered instead of the
> consumption of lots of RAM.

Bingo!
I now get the following:

exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.
==6340==    at 0x3801A708: report_and_quit (m_libcassert.c:140)
==6340==    by 0x3801AA80: vgPlain_assert_fail (m_libcassert.c:205)
==6340==    by 0x380077EC: highest_used_stack_address (drd_clientreq.c:100)
==6340==    by 0x38007DC4: drd_handle_client_request (drd_clientreq.c:130)
==6340==    by 0x380403D0: do_client_request (scheduler.c:1414)
==6340==    by 0x38041EA4: vgPlain_scheduler (scheduler.c:1013)
==6340==    by 0x3805857C: run_a_thread_NORETURN (syswrap-linux.c:89)

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable
==6340==    at 0xFF6C498: init (drd_pthread_intercepts.c:244)
==6340==    by 0xFF6CD94: (within /home/avexe/nuno/valgrind/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
==6340==    by 0xFF65CF8: (within /home/avexe/nuno/valgrind/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
==6340==    by 0xFFBE6F8: call_init (in /lib/ld-2.6.so)
==6340==    by 0xFFBE87C: _dl_init (in /lib/ld-2.6.so)
==6340==    by 0xFFC6B70: _start (in /lib/ld-2.6.so)

Nuno
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-02 18:26:50
|
On Fri, May 2, 2008 at 7:57 PM, Nuno Lopes <nun...@sa...> wrote:
>
> Bingo!
> I now get the following:
>
> exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
> 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
> husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.
> ==6340== at 0x3801A708: report_and_quit (m_libcassert.c:140)
> ==6340== by 0x3801AA80: vgPlain_assert_fail (m_libcassert.c:205)
> ==6340== by 0x380077EC: highest_used_stack_address (drd_clientreq.c:100)
> ==6340== by 0x38007DC4: drd_handle_client_request (drd_clientreq.c:130)
> ==6340== by 0x380403D0: do_client_request (scheduler.c:1414)
> ==6340== by 0x38041EA4: vgPlain_scheduler (scheduler.c:1013)
> ==6340== by 0x3805857C: run_a_thread_NORETURN (syswrap-linux.c:89)
>
> sched status:
> running_tid=1
>
> Thread 1: status = VgTs_Runnable
> ==6340== at 0xFF6C498: init (drd_pthread_intercepts.c:244)
> ==6340== by 0xFF6CD94: (within
> /home/avexe/nuno/valgrind/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
> ==6340== by 0xFF65CF8: (within
> /home/avexe/nuno/valgrind/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
> ==6340== by 0xFFBE6F8: call_init (in /lib/ld-2.6.so)
> ==6340== by 0xFFBE87C: _dl_init (in /lib/ld-2.6.so)
> ==6340== by 0xFFC6B70: _start (in /lib/ld-2.6.so)
Hello Julian,
Shouldn't the code below fill in sps[0] and fps[0] ? See also
coregrind/m_stacktrace.c, starting at line 101.
   /* Assertion broken before main() is reached in pthreaded programs; the
    * offending stack traces only have one item.  --njn, 2002-aug-16 */
   /* vg_assert(fp_min <= fp_max); */
   if (fp_min + 512 >= fp_max) {
      /* If the stack limits look bogus, don't poke around ... but
         don't bomb out either. */
      ips[0] = ip;
      return 1;
   }
Bart.
|
|
From: Julian S. <js...@ac...> - 2008-05-02 18:35:52
|
> Shouldn't the code below fill in sps[0] and fps[0] ? See also
> coregrind/m_stacktrace.c, starting at line 101.
>
> /* Assertion broken before main() is reached in pthreaded programs; the
> * offending stack traces only have one item. --njn, 2002-aug-16 */
> /* vg_assert(fp_min <= fp_max);*/
> if (fp_min + 512 >= fp_max) {
> /* If the stack limits look bogus, don't poke around ... but
> don't bomb out either. */
> ips[0] = ip;
> return 1;
> }
Er, yes it probably should, for consistency. I can't think what
the consequences would be if it didn't; presumably returning junk
to the caller. Which isn't good. Well spotted.
J
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-02 19:21:31
|
On Fri, May 2, 2008 at 7:57 PM, Nuno Lopes <nun...@sa...> wrote:
> Bingo!
> I now get the following:
>
> exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
> 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
> husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.

Hello Nuno,

Do you have the time to run another test ? In revision 7993 the above
assert might be solved (I'm not sure of this).

Bart.
|
|
From: Nuno L. <nun...@sa...> - 2008-05-02 20:59:45
|
> On Fri, May 2, 2008 at 7:57 PM, Nuno Lopes <nun...@sa...> wrote:
>> Bingo!
>> I now get the following:
>>
>> exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
>> 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid)
>> <= husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.
>
> Hello Nuno,
>
> Do you have the time to run another test ? In revision 7993 the above
> assert might be solved (I'm not sure of this).

Sure! I'm still able to trigger the assertion with threaded tests, although
simple non-threaded programs seem to run fine.
Most (if not all) regtests fail. e.g.:

$ cat exp-drd/tests/tc01_simple_race.stderr.out

WARNING: DRD has only been tested on x86-linux and amd64-linux.

exp-drd: drd_clientreq.c:107 (highest_used_stack_address): Assertion
'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
husa && husa < VG_(thread_get_stack_max)(vg_tid)' failed.
   at 0x........: report_and_quit (m_libcassert.c:?)
   by 0x........: vgPlain_assert_fail (m_libcassert.c:?)
   by 0x........: highest_used_stack_address (drd_clientreq.c:?)
   by 0x........: drd_handle_client_request (drd_clientreq.c:?)
   by 0x........: do_client_request (scheduler.c:?)
   by 0x........: vgPlain_scheduler (scheduler.c:?)
   by 0x........: run_a_thread_NORETURN (syswrap-linux.c:89)

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable
   at 0x........: main (drd_pthread_intercepts.c:?)
   by 0x........: (below main) (in /...libc...)

Nuno
|
|
From: Julian S. <js...@ac...> - 2008-05-03 05:48:18
|
On Friday 02 May 2008 21:21, Bart Van Assche wrote:
> On Fri, May 2, 2008 at 7:57 PM, Nuno Lopes <nun...@sa...> wrote:
> > Bingo!
> > I now get the following:
> >
> > exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
> > 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid)
> > <= husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.

I tried to look into this a bit. Not directly related, but: one thing I
noticed is that all the drd regression tests fail on a Fedora 8 machine,
because the prereq test fails, because it assumes /usr/bin/getconf exists,
and it doesn't:

$ ./exp-drd/tests/supported_libpthread ; echo $?
1

$ ls /usr/bin/getconf*
ls: cannot access /usr/bin/getconf*: No such file or directory

$ which getconf
/usr/bin/which: no getconf in (/home/sewardj/Bin:/usr/kerberos/bin:/usr/lib/ccache:/usr/local/bin:/bin:/usr/bin:/home/sewardj/bin)

$ perl tests/vg_regtest exp-drd
-- Running tests in exp-drd/tests -------------------------------------
drd_bitmap_test: (skipping, prereq failed: ./supported_libpthread)
fp_race: (skipping, prereq failed: ./supported_libpthread)
fp_race2: (skipping, prereq failed: ./supported_libpthread)
hg01_all_ok: (skipping, prereq failed: ./supported_libpthread)
hg02_deadlock: (skipping, prereq failed: ./supported_libpthread)
hg03_inherit: (skipping, prereq failed: ./supported_libpthread)
hg04_race: (skipping, prereq failed: ./supported_libpthread)
hg05_race2: (skipping, prereq failed: ./supported_libpthread)
hg06_readshared: (skipping, prereq failed: ./supported_libpthread)
linuxthreads_det: valgrind ./linuxthreads_det
*** linuxthreads_det failed (stderr) ***
matinv: (skipping, prereq failed: ./supported_libpthread)
memory_allocation: (skipping, prereq failed: ./supported_libpthread)
omp_matinv: (skipping, prereq failed: ./run_openmp_test ./omp_matinv)
omp_matinv_racy: (skipping, prereq failed: ./run_openmp_test ./omp_matinv)
omp_prime_racy: (skipping, prereq failed: ./run_openmp_test ./omp_prime)
pth_barrier: (skipping, prereq failed: ./supported_libpthread)
pth_barrier2: (skipping, prereq failed: ./supported_libpthread)
pth_barrier3: (skipping, prereq failed: ./supported_libpthread)
pth_broadcast: (skipping, prereq failed: ./supported_libpthread)
pth_cond_race: (skipping, prereq failed: ./supported_libpthread)
pth_cond_race2: (skipping, prereq failed: ./supported_libpthread)
pth_create_chain: (skipping, prereq failed: ./supported_libpthread)
pth_detached: (skipping, prereq failed: ./supported_libpthread)
pth_detached2: (skipping, prereq failed: ./supported_libpthread)
pth_detached_sem: (skipping, prereq failed: ./supported_libpthread)
recursive_mutex: (skipping, prereq failed: ./supported_libpthread)
rwlock_race: (skipping, prereq failed: ./supported_libpthread)
sem_as_mutex: (skipping, prereq failed: ./supported_libpthread)
sem_as_mutex2: (skipping, prereq failed: ./supported_libpthread)
sigalrm: (skipping, prereq failed: ./supported_libpthread)
tc01_simple_race: (skipping, prereq failed: ./supported_libpthread)
tc02_simple_tls: (skipping, prereq failed: ./supported_libpthread)
tc03_re_excl: (skipping, prereq failed: ./supported_libpthread)
tc04_free_lock: (skipping, prereq failed: ./supported_libpthread)
tc05_simple_race: (skipping, prereq failed: ./supported_libpthread)
tc06_two_races: (skipping, prereq failed: ./supported_libpthread)
tc07_hbl1: (skipping, prereq failed: ./supported_libpthread)
tc08_hbl2: (skipping, prereq failed: ./supported_libpthread)
tc09_bad_unlock: (skipping, prereq failed: ./supported_libpthread)
tc10_rec_lock: (skipping, prereq failed: ./supported_libpthread)
tc11_XCHG: (skipping, prereq failed: ./supported_libpthread)
tc12_rwl_trivial: (skipping, prereq failed: ./supported_libpthread)
tc13_laog1: (skipping, prereq failed: ./supported_libpthread)
tc15_laog_lockdel: (skipping, prereq failed: ./supported_libpthread)
tc16_byterace: (skipping, prereq failed: ./supported_libpthread)
tc17_sembar: (skipping, prereq failed: ./supported_libpthread)
tc18_semabuse: (skipping, prereq failed: ./supported_libpthread)
tc19_shadowmem: (skipping, prereq failed: ./supported_libpthread)
tc20_verifywrap: (skipping, prereq failed: ./supported_libpthread)
tc20_verifywrap2: (skipping, prereq failed: ./supported_libpthread)
tc21_pthonce: (skipping, prereq failed: ./supported_libpthread)
tc22_exit_w_lock: (skipping, prereq failed: ./supported_libpthread)
tc23_bogus_condwait: (skipping, prereq failed: ./supported_libpthread)
tc24_nonzero_sem: (skipping, prereq failed: ./supported_libpthread)
trylock: (skipping, prereq failed: ./supported_libpthread)
-- Finished tests in exp-drd/tests -------------------------------------

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-03 08:27:18
|
On Sat, May 3, 2008 at 7:42 AM, Julian Seward <js...@ac...> wrote:
> On Friday 02 May 2008 21:21, Bart Van Assche wrote:
>> On Fri, May 2, 2008 at 7:57 PM, Nuno Lopes <nun...@sa...> wrote:
>> >
>> > exp-drd: drd_clientreq.c:102 (highest_used_stack_address): Assertion
>> > 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid)
>> > <= husa && husa <= VG_(thread_get_stack_max)(vg_tid)' failed.
>
> I tried to look into this a bit. Not directly related, but: one thing I
> noticed is that all the drd regression tests fail on a Fedora 8 machine,
> because the prereq test fails, because it assumes /usr/bin/getconf
> exists, and it doesn't:
>
> $ ls /usr/bin/getconf*
> ls: cannot access /usr/bin/getconf*: No such file or directory

Apparently openSUSE distributes /usr/bin/getconf via the glibc RPM, and
for Fedora 7 getconf is included in the glibc-common RPM. /usr/bin/getconf
might have been moved to the glibc-devel RPM in later Fedora versions.

I will see what I can do to let the "supported_libpthread" script print a
more verbose error message.

Bart.
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-03 10:55:40
|
On Fri, May 2, 2008 at 10:59 PM, Nuno Lopes <nun...@sa...> wrote:
> Sure! I'm still able to trigger the assertion with threaded tests,
> although simple non-threaded programs seem to run fine.
> Most (if not all) regtests fail. e.g.:
>
> exp-drd: drd_clientreq.c:107 (highest_used_stack_address): Assertion
> 'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
> husa && husa < VG_(thread_get_stack_max)(vg_tid)' failed.
>    at 0x........: report_and_quit (m_libcassert.c:?)
>    by 0x........: vgPlain_assert_fail (m_libcassert.c:?)
>    by 0x........: highest_used_stack_address (drd_clientreq.c:?)
>    by 0x........: drd_handle_client_request (drd_clientreq.c:?)
>    by 0x........: do_client_request (scheduler.c:?)
>    by 0x........: vgPlain_scheduler (scheduler.c:?)
>    by 0x........: run_a_thread_NORETURN (syswrap-linux.c:89)
>
> sched status:
> running_tid=1
>
> Thread 1: status = VgTs_Runnable
>    at 0x........: main (drd_pthread_intercepts.c:?)
>    by 0x........: (below main) (in /...libc...)

Thanks, this gives some extra information: stack walking by
VG_(get_StackTrace)() in the main thread works as expected, but the stack
pointers returned by VG_(get_StackTrace)() for threads created by
pthread_create() can be out of range. Julian, is it normal that
running_tid == 1 for a created thread ?

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-05-03 18:04:38
|
> Thanks, this gives some extra information: stack walking by
> VG_(get_StackTrace)() in the main thread works as expected, but the
> stack pointers returned by VG_(get_StackTrace)() for threads created
> by pthread_create() can be out of range. Julian, is it normal that
> running_tid == 1 for a created thread ?

Er, I'm afraid I don't really understand the question. "Is
running_tid == 1 for a created thread" at exactly where in the
source code?

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-03 19:00:33
|
On Sat, May 3, 2008 at 7:59 PM, Julian Seward <js...@ac...> wrote:
>
>> Thanks, this gives some extra information: stack walking by
>> VG_(get_StackTrace)() in the main thread works as expected, but the
>> stack pointers returned by VG_(get_StackTrace)() for threads created
>> by pthread_create() can be out of range. Julian, is it normal that
>> running_tid == 1 for a created thread ?
>
> Er, I'm afraid I don't really understand the question. "Is
> running_tid == 1 for a created thread" at exactly where in the
> source code?

I was referring to the last output posted by Nuno. This output shows an
assertion failure. The first stack frame shown is run_a_thread_NORETURN(),
so this must be a call stack of a thread created via pthread_create(). Such
threads have a thread ID that is equal to or higher than 2. Yet in the same
output there is the message "sched status: running_tid=1", which means that
at the time of the assertion failure VG_(running_tid) == 1. This is a
contradiction -- it looks like there is something wrong with the thread ID
handling.

Posting the output of the following command on powerpc64 will definitely
help, since it shows the thread IDs Valgrind assigns to threads:

./vg-in-place --tool=exp-drd -d exp-drd/tests/fp_race 2>&1 | tail -n 30

Bart.
|
|
From: Nuno L. <nun...@sa...> - 2008-05-03 22:08:33
Attachments:
drd-ppc64.txt
|
> Posting the output of the following command on powerpc64 will
> definitely help, since it shows the thread ID's Valgrind assigns to
> threads:
>
> ./vg-in-place --tool=exp-drd -d exp-drd/tests/fp_race 2>&1 | tail -n 30

Ok, so please find the dump of that command attached.

Regards,
Nuno
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-04 07:14:54
|
On Sun, May 4, 2008 at 12:08 AM, Nuno Lopes <nun...@sa...> wrote:
>> Posting the output of the following command on powerpc64 will
>> definitely help, since it shows the thread ID's Valgrind assigns to
>> threads:
>>
>> ./vg-in-place --tool=exp-drd -d exp-drd/tests/fp_race 2>&1 | tail -n 30
>
> Ok, so please find the dump of that command in attach.

Thanks. I see a mysterious line in the ppc64 output that does not appear
in the amd64 output:

--22020:1:signals extending a stack base 0x7fefff000 down by 4096

The address 0x7fefff000 looks to me like an address on Valgrind's stack.
Is it normal that VG_(extend_stack)() gets called when Valgrind's stack is
extended ? Shouldn't this function be called only for client stacks ?

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-05-04 08:24:43
|
> Thanks. I see a mysterious line in the ppc64 output, that does not
> appear in the amd64 output:
>
> --22020:1:signals extending a stack base 0x7fefff000 down by 4096
>
> The address 0x7fefff000 looks to me like an address on Valgrind's
> stack. Is it normal that VG_(extend_stack)() gets called when
> Valgrind's stack is extended ? Shouldn't this function be called only
> for client stacks ?
I think it is a client stack. From earlier in the trace there is:
--22020:1:initimg Setup client stack: size will be 8388608
and then
--22020:1:sched sched_init_phase2: tid_main=1, cls_end=0x7ff000fff,
cls_sz=8388608
so the first thread's client stack is placed ending at 0x7ff000fff,
and so 0x7fefff000 is just 1fff (2 pages) before the end.
Also this is visible from the printed-out map:
--22020:1:aspacem 20: RSVN 07fe801000-07feffefff 8380416 ----- SmUpper
--22020:1:aspacem 21: anon 07fefff000-07ff000fff 8192 rwx--
Sections with lowercase names ("anon") belong to the client, and those
with uppercase names ("RSVN") belong to Valgrind. What this shows is
that there is an 8k client stack area belonging to the client and
immediately before it a "reservation" of 8388608 - 8192 == 8380416
belonging to Valgrind. The reservation is the place where V will expand
the stack into, on demand.
------
To answer your question re running_tid=1, at least in this example,
I would guess there is only one thread (no others have been created
yet) and so no ambiguity. In the kind of crash message that Nuno posted,
there is only one thread stack ("Thread 1: status = VgTs_Runnable" ..),
and so this is the root thread, not one created by pthread_create. If
there is > 1 thread then there would be > 1 thread stack shown.
------
Can you clarify the link between this mechanism for finding the highest
address in a stack, and why drd takes lots of memory on ppc? I assume
you have some hypothesis in mind, but I don't know what it is.
J
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-04 08:41:50
|
On Sun, May 4, 2008 at 10:19 AM, Julian Seward <js...@ac...> wrote:
>
> Can you clarify the link between this mechanism for finding the highest
> address in a stack, and why drd takes lots of memory on ppc? I assume
> you have some hypothesis in mind, but I don't know what it is.

As you know, on Linux the NPTL allocates space at the top of the stack for
NPTL-private data. This data is accessed by more than one thread. In order
to avoid false positives on this NPTL-private data, I let DRD suppress data
race reports on data accesses in the NPTL-private data area. There is no
easy way to find out where this area is allocated, so what happens in DRD
is to suppress all accesses to data in the range
(highest_used_stack_address() .. (top of stack)). This address range
contains a little bit more than the NPTL-private data area, but it contains
at least that area. Suppression happens by setting one bit in a bitmap for
every address to be suppressed.

So my hypothesis about the cause of the out-of-memory error on ppc is that
the function highest_used_stack_address() was returning a pointer that was
far out of range of the stack addresses. The following error message
confirms that VG_(get_StackTrace)() returns a stack pointer that is out of
range (see also the source code of highest_used_stack_address() in the
source file exp-drd/drd_clientreq.c):

exp-drd: drd_clientreq.c:107 (highest_used_stack_address): Assertion
'VG_(thread_get_stack_max)(vg_tid) - VG_(thread_get_stack_size)(vg_tid) <=
husa && husa < VG_(thread_get_stack_max)(vg_tid)' failed.

Do you already have a clue about why VG_(get_StackTrace)() shows such
behavior on ppc ?

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-05-04 09:31:23
|
> As you know on Linux the NPTL allocates space on the top of the stack
> for NPTL-private data. This data is accessed by more than one thread.
> In order to avoid false positives on this NPTL-private data I let DRD
> suppress data race reports on data accesses in the NPTL-private data
> area.
What happens if NPTL puts some thread private data in some other
place? Then DRD complains again. I saw the same problem in
Helgrind but simply decided to suppress all errors that it
reports inside libpthread.
But anyway. I think
husa = (nframes >= 1 ? sps[nframes - 1] : VG_(get_SP)(vg_tid));
is not good. I added this
VG_(printf)("\n\ngetting stack for tid %d\n", (Int)vg_tid);
VG_(pp_ExeContext)( VG_(record_ExeContext)( vg_tid, 0 ));
just before your call to VG_(get_StackTrace), and
VG_(printf)("nframes = %d\n", nframes);
{Int i; for (i = 0; i < 10; i++)
VG_(printf)("sps[%d] = %p\n", i, sps[i]);}
just after. What it shows is:
getting stack for tid 1
==27412== at 0xFF7C4D8: init (drd_pthread_intercepts.c:244)
==27412== by 0xFF7CDD4:
(within /home/sewardj/VgTRUNK/trunk/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
==27412== by 0xFF75D3C:
(within /home/sewardj/VgTRUNK/trunk/exp-drd/vgpreload_exp-drd-ppc32-linux.so)
==27412== by 0xFFCEC28: call_init (in /lib/ld-2.7.so)
==27412== by 0xFFCEDAC: _dl_init (in /lib/ld-2.7.so)
==27412== by 0xFFD70A0: _start (in /lib/ld-2.7.so)
nframes = 6
sps[0] = 0xFEA04790
sps[1] = 0xFEA048D0
sps[2] = 0xFEA048E0
sps[3] = 0xFEA04910
sps[4] = 0xFEA04940
sps[5] = 0x0
sps[6] = 0x0
sps[7] = 0x0
sps[8] = 0x46AAD70
sps[9] = 0xFFFFFFFF
So you get husa = sps[n_frames - 1] = sps[5] = 0, which is bogus.
This strikes me as much safer:
   if (0) {
      husa = (nframes >= 1 ? sps[nframes - 1] : VG_(get_SP)(vg_tid));
   } else {
      UInt i;
      tl_assert(nframes >= 1 && nframes <= n_ips);
      husa = sps[0];
      for (i = 1; i < nframes; i++) {
         if (sps[i] == 0) break;
         if (sps[i] > husa) husa = sps[i];
      }
   }
This produces husa = 0xFEA04940, the assertion does not fail, and drd
does not go into outer space. (It was taking about 1GB just to run
/bin/ls before this).
Nuno, can you try that?
J
|
|
From: Nuno L. <nun...@sa...> - 2008-05-04 12:36:20
|
> This strikes me as much safer:
>
>    if (0) {
>       husa = (nframes >= 1 ? sps[nframes - 1] : VG_(get_SP)(vg_tid));
>    } else {
>       UInt i;
>       tl_assert(nframes >= 1 && nframes <= n_ips);
>       husa = sps[0];
>       for (i = 1; i < nframes; i++) {
>          if (sps[i] == 0) break;
>          if (sps[i] > husa) husa = sps[i];
>       }
>    }
>
> This produces husa = 0xFEA04940, the assertion does not fail, and drd
> does not go into outer space. (It was taking about 1GB just to run
> /bin/ls before this).
>
> Nuno, can you try that?
That worked, yes!
Now I just get the following:
$ cat exp-drd/tests/fp_race.stderr.diff
1a2,8
>
> WARNING: DRD has only been tested on x86-linux and amd64-linux.
>
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
10a18,21
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
> get_Dwarf_Reg(ppc64-linux)(31)
Otherwise tests seem to be working. Now I'm rebuilding the latest SVN trunk
after Bart's changes to check if it works as well.
Nuno
|
|
From: Nuno L. <nun...@sa...> - 2008-05-04 13:50:09
|
> That worked, yes!
> Now I just get the following:
>
> $ cat exp-drd/tests/fp_race.stderr.diff
> 1a2,8
>>
>> WARNING: DRD has only been tested on x86-linux and amd64-linux.
>>
>> get_Dwarf_Reg(ppc64-linux)(31)
>> get_Dwarf_Reg(ppc64-linux)(31)
>> get_Dwarf_Reg(ppc64-linux)(31)
>> get_Dwarf_Reg(ppc64-linux)(31)
These prints are weird. They come from the following code:
/* FIXME: duplicates logic in readdwarf.c: copy_convert_CfiExpr_tree
and {FP,SP}_REG decls */
static Bool get_Dwarf_Reg( /*OUT*/Addr* a, Word regno, RegSummary* regs )
{
   vg_assert(regs);
#  elif defined(VGP_ppc64_linux)
   if (regno == 1/*SP*/) { *a = regs->sp; return True; }
   VG_(printf)("get_Dwarf_Reg(ppc64-linux)(%ld)\n", regno);
   if (regno == 31) return False;
   vg_assert(0);
#  endif
   return False;
}
Shouldn't that printf be moved after the regno == 31 test?
> Otherwise tests seem to be working. Now I'm rebuilding the latest SVN
> trunk
> after Bart's changes to check if it works as well.
Ok, so latest svn version seems to be working fine! Thanks for fixing the
problem :)
Nuno
|
|
From: Bart V. A. <bar...@gm...> - 2008-05-04 12:00:23
|
On Sun, May 4, 2008 at 11:26 AM, Julian Seward <js...@ac...> wrote:
>
>> As you know on Linux the NPTL allocates space on the top of the stack
>> for NPTL-private data. This data is accessed by more than one thread.
>> In order to avoid false positives on this NPTL-private data I let DRD
>> suppress data race reports on data accesses in the NPTL-private data
>> area.
>
> What happens if NPTL puts some thread private data in some other
> place? Then DRD complains again. I saw the same problem in
> Helgrind but simply decided to suppress all errors that it
> reports inside libpthread.

One of the design goals of the NPTL was to make the creation of new
threads as fast as possible. That is why the NPTL puts thread-private data
on thread stacks instead of allocating it via a separate memory
allocation. Any other approach would make the NPTL slightly slower.

With regard to future NPTL versions: the approach of suppressing all
errors in libpthread is not future-proof. If some day some of the NPTL
functions that access thread-private data are implemented as inline
functions, suppressing data races on the basis of call-stack pattern
matching won't work anymore.

> nframes = 6
> sps[0] = 0xFEA04790
> sps[1] = 0xFEA048D0
> sps[2] = 0xFEA048E0
> sps[3] = 0xFEA04910
> sps[4] = 0xFEA04940
> sps[5] = 0x0
> sps[6] = 0x0
> sps[7] = 0x0
> sps[8] = 0x46AAD70
> sps[9] = 0xFFFFFFFF

Thanks for the workaround -- by this time I have checked in a slightly
modified variant of it. But the above output confirms that
VG_(get_StackTrace)() returns values in sps[] that are not valid stack
pointers for the relevant thread. I did not expect this behavior from
VG_(get_StackTrace)(). If this is the intended behavior of
VG_(get_StackTrace)(), can you please document it ?

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-05-05 00:05:59
|
> With regard to future NPTL versions: the approach of suppressing
> all errors in libpthread is not future-proof. If some day some of the
> NPTL functions that access thread-private data are implemented as
> inline functions, suppressing data races on the basis of call-stack
> pattern matching won't work anymore.

Hmm, good point. I hadn't thought of that.

> VG_(get_StackTrace)() returns values in sps[] that are not valid stack
> pointers for the relevant thread. I did not expect this behavior
> from VG_(get_StackTrace)(). If this is the intended behavior of
> VG_(get_StackTrace)(), can you please document it ?

Will do.

J