|
From: Dimitri Papadopoulos-O. <pap...@sh...> - 2004-08-26 09:24:47
|
Hi,

I'm running Valgrind 2.1.2 on Red Hat 9. I'm using it to debug crashes in distcc 2.17 that occur under high load and are not always reproducible. I've replaced the distcc program with a script that launches distcc under Valgrind.

But now Valgrind crashes with the following message. Is this a known issue, or should I file a bug? I wasn't able to relate this crash to any of the FAQ items.

Note that the distcc stack probably gets corrupted: I couldn't get meaningful stack traces from the core files in any debugger.

==10615== Memcheck, a memory error detector for x86-linux.
==10615== Copyright (C) 2002-2004, and GNU GPL'd, by Julian Seward et al.
==10615== Using valgrind-2.1.2, a program supervision framework for x86-linux.
==10615== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward et al.
==10615== For more details, rerun with: -v
==10615==
distcc[10615] ERROR: Connect timeout

valgrind: vg_to_ucode.c:5285 (disInstr): Assertion `abyte == 0' failed.
==10615==    at 0xB002AA26: vgPlain_skin_assert_fail (vg_mylibc.c:1169)
==10615==    by 0xB002AA25: assert_fail (vg_mylibc.c:1165)
==10615==    by 0xB002AA63: vgPlain_core_assert_fail (vg_mylibc.c:1176)
==10615==    by 0xB00511DA: disInstr (vg_to_ucode.c:7300)

sched status:

Thread 1: status = Runnable, associated_mx = 0x0, associated_cv = 0x0
==10615==    at 0x52BFE0A4: ???

Note: see also the FAQ.txt in the source distribution.
It contains workarounds to several common problems.

If that doesn't help, please report this bug to: valgrind.kde.org

In the bug report, send all the above text, the valgrind version,
and what Linux distro you are using. Thanks.

Regards,
Dimitri
|
|
From: Tom H. <th...@cy...> - 2004-08-26 09:46:46
|
In message <412...@sh...>
Dimitri Papadopoulos-Orfanos <pap...@sh...> wrote:
> But now Valgrind crashes with the following message. Is this a known
> issue, or should I file a bug? I wasn't able to relate this crash
> to any of the FAQ items.
Please file a bug.
> valgrind: vg_to_ucode.c:5285 (disInstr): Assertion `abyte == 0' failed.
> ==10615== at 0xB002AA26: vgPlain_skin_assert_fail (vg_mylibc.c:1169)
> ==10615== by 0xB002AA25: assert_fail (vg_mylibc.c:1165)
> ==10615== by 0xB002AA63: vgPlain_core_assert_fail (vg_mylibc.c:1176)
> ==10615== by 0xB00511DA: disInstr (vg_to_ucode.c:7300)
That's an ENTER instruction with a non-zero nesting level. It sounds
like a pretty unusual instruction to be using - are you sure your program
isn't jumping through a bad pointer somewhere?
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Dimitri Papadopoulos-O. <pap...@sh...> - 2004-08-26 10:05:14
|
Hi,

>> But now Valgrind crashes with the following message. Is this a known
>> issue, or should I file a bug? I wasn't able to relate this crash
>> to any of the FAQ items.
>
> Please file a bug.

>> valgrind: vg_to_ucode.c:5285 (disInstr): Assertion `abyte == 0' failed.
>> ==10615== at 0xB002AA26: vgPlain_skin_assert_fail (vg_mylibc.c:1169)
>> ==10615== by 0xB002AA25: assert_fail (vg_mylibc.c:1165)
>> ==10615== by 0xB002AA63: vgPlain_core_assert_fail (vg_mylibc.c:1176)
>> ==10615== by 0xB00511DA: disInstr (vg_to_ucode.c:7300)
>
> That's an ENTER instruction with a non-zero nesting level. It sounds
> like a pretty unusual instruction to be using - are you sure your program
> isn't jumping through a bad pointer somewhere?

Alas, it's not my program:
http://distcc.samba.org/

The crash probably occurs in some error handling routine, since it happens after this message:
distcc[10615] ERROR: Connect timeout

Since I don't know where distcc crashes, it's hard to tell what the actual code looks like.

Note that I have some 40 instances of distcc running in parallel under Valgrind. Among the distcc instances that crash, some cause Valgrind to crash at exactly the same vg_to_ucode.c line; others don't.

Regards,
Dimitri
|
|
From: Nicholas N. <nj...@ca...> - 2004-08-26 09:48:59
|
On Thu, 26 Aug 2004, Dimitri Papadopoulos-Orfanos wrote:

> But now Valgrind crashes with the following message. Is this a known issue,
> or should I file a bug? I wasn't able to relate this crash to any of the
> FAQ items.

It's new.

> valgrind: vg_to_ucode.c:5285 (disInstr): Assertion `abyte == 0' failed.
> ==10615== at 0xB002AA26: vgPlain_skin_assert_fail (vg_mylibc.c:1169)
> ==10615== by 0xB002AA25: assert_fail (vg_mylibc.c:1165)
> ==10615== by 0xB002AA63: vgPlain_core_assert_fail (vg_mylibc.c:1176)
> ==10615== by 0xB00511DA: disInstr (vg_to_ucode.c:7300)

Ah, the instruction "enter" has two forms. The common form is:

  enter $n, $0

which creates a stack frame, although it's not used very often (eg. gcc doesn't generate it AFAICT). Valgrind supports that fine. But if the second argument is non-zero, eg:

  enter $n, $1

then it creates a weird "nested stack frame" which involves copying multiple old frame pointers. Valgrind doesn't handle this case because it's (a) a pain to simulate, and (b) so rare -- you're the first person who's come across it, AFAIK.

I'll create a bug report for it. Hopefully someone will take it upon themselves to implement it. If you're feeling adventurous, you could try making a patch for it.

If you can prevent your program from using the 2nd form of "enter", that would be a workaround.

N
|
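To make the difference between the two forms of "enter" concrete, here is a rough toy model in Python. It assumes 32-bit operands and a 32-bit stack, and follows the usual description of the instruction (push the old frame pointer, copy nesting_level - 1 saved frame pointers, push the new frame pointer, then reserve space for locals). The names enter, regs and mem are made up for this sketch; it is not Valgrind's code and not a proposed patch.

```python
def enter(regs, mem, alloc_size, nesting_level):
    """Toy model of 32-bit x86 ENTER.

    regs is a dict with 'esp' and 'ebp'; mem maps an address to a 32-bit word.
    Both are hypothetical stand-ins for real machine state.
    """
    level = nesting_level % 32  # only the low 5 bits of the second operand count

    def push(value):
        regs["esp"] -= 4
        mem[regs["esp"]] = value

    push(regs["ebp"])           # save the caller's frame pointer
    frame_temp = regs["esp"]    # this becomes the new frame pointer

    if level > 0:
        # Nested form: copy the enclosing frames' saved frame pointers ...
        for _ in range(1, level):
            regs["ebp"] -= 4
            push(mem[regs["ebp"]])
        # ... and then push the new frame pointer itself.
        push(frame_temp)

    regs["ebp"] = frame_temp
    regs["esp"] -= alloc_size   # reserve space for locals


# Common form, "enter $8, $0": the same effect as push %ebp; mov %esp,%ebp; sub $8,%esp
plain = {"esp": 0x1000, "ebp": 0x2000}
plain_mem = {}
enter(plain, plain_mem, 8, 0)

# Nested form, "enter $8, $1": one extra word (the new frame pointer) lands on the stack
nested = {"esp": 0x1000, "ebp": 0x2000}
nested_mem = {}
enter(nested, nested_mem, 8, 1)
```

With a nesting level of 0 the loop and the extra push are skipped entirely, which is why that form is so much easier to simulate than the nested one.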
|
From: Nicholas N. <nj...@ca...> - 2004-08-26 09:56:51
|
On Thu, 26 Aug 2004, Nicholas Nethercote wrote:

> If you can prevent your program from using the 2nd form of "enter", that
> would be a workaround.

Or, as Tom says, you might have buggy code that is jumping to a random location that just happens to look like a nested "enter".

N
|
|
From: Dimitri Papadopoulos-O. <pap...@sh...> - 2004-08-26 10:38:28
|
Hi,

>> If you can prevent your program from using the 2nd form of "enter",
>> that would be a workaround.
>
> Or, as Tom says, you might have buggy code that is jumping to a random
> location that just happens to look like a nested "enter".

For your information, I've also attempted to instrument the code with Insure++. It didn't prove very helpful either... Something seems to be wrong in a timeout signal handler in distcc. Here is the error message from the instrumented code before Insure++ crashes:

### Unix/Signal.cc:332: panic: received signal 11 while in runtime
### @(#)$RCSfile: Signal.cc,v $ $Revision: 32.52 $ $Date: 2003/07/28 16:15:14 $
### ThisThread.cc:593: abort
### @(#)$RCSfile: ThisThread.cc,v $ $Revision: 32.119.2.3 $ $Date: 2003/08/01 22:37:30 $

This is getting off-topic, but does anyone know how to debug such errors?

Dimitri
|
|
From: Nicholas N. <nj...@ca...> - 2004-08-26 13:13:41
|
On Thu, 26 Aug 2004, Dimitri Papadopoulos-Orfanos wrote:

> For your information, I've also attempted to instrument the code with
> Insure++. It didn't prove very helpful either... Something seems to be wrong
> in a timeout signal handler in distcc.

One thing you could try: modify Valgrind so that when the assert triggers, you print out the value of "eip". Then, use "objdump -d" to look at the original machine code (if it's in a shared object it's a little trickier to work out how the 'eip' maps to the code) and see if it really is a nested "enter" instruction.

That would at least determine, with some confidence, whether it's a nested "enter" or a jump into non-code that's troubling Valgrind.

N
|
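As a rough illustration of that check, the sketch below decodes the bytes found at the reported "eip" and says whether they form an "enter" and with what nesting level. It relies on the x86 encoding of ENTER (opcode 0xC8, then a 16-bit frame size, then an 8-bit nesting level). The helper names decode_enter and objdump_address are invented for this example, and the shared-object mapping assumes the object is linked at base address 0 and that its load address is known (for instance from /proc/<pid>/maps); the load base used in the demo is made up.

```python
import struct

ENTER_OPCODE = 0xC8  # x86 ENTER: C8 iw ib (imm16 frame size, imm8 nesting level)


def decode_enter(code):
    """Return (frame_size, nesting_level) if `code` starts with ENTER, else None."""
    if len(code) < 4 or code[0] != ENTER_OPCODE:
        return None
    # Little-endian 16-bit frame size followed by an 8-bit nesting level.
    frame_size, level = struct.unpack_from("<HB", code, 1)
    return frame_size, level % 32


def objdump_address(eip, load_base):
    """Map a runtime eip inside a shared object to the address objdump -d shows.

    Assumption: the object is linked at base 0, so only the load base
    needs to be subtracted.
    """
    return eip - load_base


# "enter $16, $1": the nested form that trips the assertion
nested_form = decode_enter(bytes([0xC8, 0x10, 0x00, 0x01]))

# "push %ebp; mov %esp,%ebp": an ordinary prologue, not an ENTER at all
not_enter = decode_enter(bytes([0x55, 0x89, 0xE5, 0x90]))

# Hypothetical load base, purely for illustration
offset = objdump_address(0x52BFE0A4, 0x52BF0000)
```

If the bytes at the faulting address do not decode as ENTER in context (for example, objdump shows the address falling in the middle of another instruction or in a data section), a jump through a bad pointer becomes the more likely explanation.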
|
From: John A. <ja...@mb...> - 2004-09-01 07:00:55
|
On Thu, 26 Aug 2004 14:13:35 +0100 (BST), Nicholas Nethercote <nj...@ca...> wrote:

>On Thu, 26 Aug 2004, Dimitri Papadopoulos-Orfanos wrote:
>
>> For your information, I've also attempted to instrument the code with
>> Insure++. It didn't prove very helpful either... Something seems to be wrong
>> in a timeout signal handler in distcc.
>
>One thing you could try: modify Valgrind so that when the assert triggers,
>you print out the value of "eip". Then, use "objdump -d" to look at the
>original machine code (if it's in a shared object it's a little trickier
>to work out how the 'eip' maps to the code) and see if it really is a
>nested "enter" instruction.
>
>That would at least determine, with some confidence, whether it's a nested
>"enter" or a jump into non-code that's troubling Valgrind.
>
>N

One machine I worked on had a "lagging address pointer", which held the second most recent eip value. It was great for identifying what happened just before random branches. Maybe valgrind could add a prior eip stack which could be dumped after an error.

[I'm just an observer here.. love to read about this interesting project.]

john alvord
|
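One way to picture the "prior eip stack" idea is a small fixed-size ring buffer that remembers the last few addresses executed and is dumped when an error fires. The sketch below is only an illustration of that data structure; the class name EipHistory and its methods are invented here and do not correspond to anything in Valgrind.

```python
from collections import deque


class EipHistory:
    """Keep the most recent `depth` eip values, discarding older ones."""

    def __init__(self, depth=8):
        # A deque with maxlen drops the oldest entry automatically on overflow.
        self._recent = deque(maxlen=depth)

    def record(self, eip):
        """Call once per executed basic block (or instruction)."""
        self._recent.append(eip)

    def dump(self):
        """Return the recorded addresses as hex strings, most recent first."""
        return [hex(eip) for eip in reversed(self._recent)]


history = EipHistory(depth=3)
for eip in (0x10, 0x20, 0x30, 0x40):
    history.record(eip)
trail = history.dump()  # the oldest entry, 0x10, has been pushed out
```

The second entry of such a dump plays the role of the "lagging address pointer": it is the address executed just before the bad branch landed.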
|
From: Tom H. <th...@cy...> - 2004-08-26 10:50:32
|
In message <412...@sh...>
Dimitri Papadopoulos-Orfanos <pap...@sh...> wrote:
> Mmmh... Since the crash seems to occur in an error handler, that could
> be the case. If I understand correctly, you're saying that distcc
> attempts to jump to an uninitialized value. Doesn't Valgrind detect that
> an uninitialized value is used as an address to jump to?
If the pointer is undefined and valgrind realises that it is undefined
then it should warn you. It could be well defined but bogus however, in
which case valgrind wouldn't be able to help.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Nicholas N. <nj...@ca...> - 2004-08-26 11:14:27
|
On Thu, 26 Aug 2004, Tom Hughes wrote:

>> Mmmh... Since the crash seems to occur in an error handler, that could
>> be the case. If I understand correctly, you're saying that distcc
>> attempts to jump to an uninitialized value. Doesn't Valgrind detect that
>> an uninitialized value is used as an address to jump to?
>
> If the pointer is undefined and valgrind realises that it is undefined
> then it should warn you. It could be well defined but bogus however, in
> which case valgrind wouldn't be able to help.

Just to clarify, it would warn you if:

- the address given to the jump contains uninitialised bits
- the address jumped to is not addressable.

But if the address is wrong but initialised, and you happen to land in addressable memory (eg. in the middle of a heap block) you won't get a warning.

N
|
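The two conditions above can be written down as a tiny decision function. This is a toy model of the rule as stated in this thread, not Memcheck's implementation: the function name jump_warnings and the warning strings are invented for illustration and are not Valgrind's actual messages.

```python
def jump_warnings(target_defined, target_addressable):
    """Return the warnings a checker following the stated rule would emit.

    target_defined: every bit of the jump target value is initialised.
    target_addressable: the memory at the jump target may be accessed.
    """
    warnings = []
    if not target_defined:
        warnings.append("jump target contains uninitialised bits")
    if not target_addressable:
        warnings.append("jump target is not addressable")
    return warnings


# A wrong but initialised pointer that lands in the middle of a heap block:
# both checks pass, so nothing is reported.
silent_case = jump_warnings(target_defined=True, target_addressable=True)

# An uninitialised function pointer that happens to point at mapped memory.
undefined_case = jump_warnings(target_defined=False, target_addressable=True)
```

The silent case is the one that matters for this crash: a bogus but fully initialised address can lead straight into bytes that merely look like a nested "enter".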
|
From: Dimitri Papadopoulos-O. <pap...@sh...> - 2004-08-26 10:08:56
|
Hi,

>> If you can prevent your program from using the 2nd form of "enter",
>> that would be a workaround.
>
> Or, as Tom says, you might have buggy code that is jumping to a random
> location that just happens to look like a nested "enter".

Mmmh... Since the crash seems to occur in an error handler, that could be the case. If I understand correctly, you're saying that distcc attempts to jump to an uninitialized value. Doesn't Valgrind detect that an uninitialized value is used as an address to jump to?

Dimitri
|