|
From: Eyal L. <ey...@ey...> - 2005-01-19 23:07:27
|
I get this report from a run:

==2005-01-20 08:04:14.204 32619== Thread 9:
==2005-01-20 08:04:14.220 32619== Syscall param socketcall.send(msg) points to uninitialised byte(s)
==2005-01-20 08:04:14.220 32619== at 0x1C043A8E: send (in /lib/tls/libpthread-0.60.so)
==2005-01-20 08:04:14.220 32619== Address 0x219C9749 is 57 bytes inside a block of size 12288 alloc'd
==2005-01-20 08:04:14.220 32619== at 0x1B906FE5: calloc (vg_replace_malloc.c:175)

I know that I am sending uninitialised data, but in the past I got
a proper stack trace rather than just the 'send' message. Even the
'calloc' message, without a stack, is not so helpful.

Am I missing a new option? or is there a reason for this change?

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
If attaching .zip rename to .dat
|
|
From: Jeremy F. <je...@go...> - 2005-01-19 23:15:31
|
On Thu, 2005-01-20 at 10:07 +1100, Eyal Lebedinsky wrote:
> I get this report from a run:
>
> ==2005-01-20 08:04:14.204 32619== Thread 9:
> ==2005-01-20 08:04:14.220 32619== Syscall param socketcall.send(msg) points to uninitialised byte(s)
> ==2005-01-20 08:04:14.220 32619== at 0x1C043A8E: send (in /lib/tls/libpthread-0.60.so)
> ==2005-01-20 08:04:14.220 32619== Address 0x219C9749 is 57 bytes inside a block of size 12288 alloc'd
> ==2005-01-20 08:04:14.220 32619== at 0x1B906FE5: calloc (vg_replace_malloc.c:175)
>
> I know that I am sending uninitialised data, but in the past I got
> a proper stack trace rather than just the 'send' message. Even the
> 'calloc' message, without a stack, is not so helpful.
>
> Am I missing a new option? or is there a reason for this change?

I think libpthread is compiled with -fomit-frame-pointer, which makes it
hard to get good stack traces. I'm thinking about experimenting with
libunwind to see if we can use it for stack traces; it understands the
unwind info that gcc puts into new .o files, which should make it
possible to get good backtraces in these cases.

I'm not sure why calloc isn't getting a bit more backtrace. Make sure
there are no -fomit-frame-pointers in the Valgrind makefiles.

Oh, and that you're not using --num-callers=1.

J
|
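A toy illustration of why -fomit-frame-pointer hurts backtraces, offered as a sketch only. It is not Valgrind's unwinder: the `memory` dict stands in for the stack, and the addresses, the `walk_frames` name and the frame layout (saved frame pointer followed by the return address, 4-byte words as on x86-32) are assumptions made for the example.

```python
WORD = 4  # bytes per stack slot on 32-bit x86, the platform in this thread


def walk_frames(memory, fp, max_callers=32):
    """Collect return addresses by following saved frame pointers.

    memory maps an address to the word stored there (a stand-in for the
    real stack); fp is the value of the frame-pointer register.
    """
    trace = []
    while len(trace) < max_callers and fp in memory and fp + WORD in memory:
        trace.append(memory[fp + WORD])  # return address sits above the saved fp
        caller_fp = memory[fp]
        if caller_fp <= fp:  # the stack grows down, so callers sit higher up
            break
        fp = caller_fp
    return trace


# Three nested calls, each frame laid out as [saved fp][return address].
# All values are made up for the illustration.
stack = {
    0x0E00: 0x0F00, 0x0E04: 0xCCC,  # innermost frame
    0x0F00: 0x1000, 0x0F04: 0xBBB,
    0x1000: 0x0000, 0x1004: 0xAAA,  # outermost frame ends the chain
}

full = walk_frames(stack, 0x0E00)    # register really points at a frame
broken = walk_frames(stack, 0x1234)  # register reused for something else
```

When the register holds a genuine frame address the walk recovers every caller; when the compiler has reused it as a general-purpose register the chain cannot be followed, which would leave only the innermost location, much like the single `send` line in the report. The `max_callers` parameter plays the role of --num-callers.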
|
From: Eyal L. <ey...@ey...> - 2005-01-20 00:32:47
|
Jeremy Fitzhardinge wrote: > On Thu, 2005-01-20 at 10:07 +1100, Eyal Lebedinsky wrote: > >>I get this report from a run: >> >>==2005-01-20 08:04:14.204 32619== Thread 9: >>==2005-01-20 08:04:14.220 32619== Syscall param socketcall.send(msg) points to uninitialised byte(s) >>==2005-01-20 08:04:14.220 32619== at 0x1C043A8E: send (in /lib/tls/libpthread-0.60.so) >>==2005-01-20 08:04:14.220 32619== Address 0x219C9749 is 57 bytes inside a block of size 12288 alloc'd >>==2005-01-20 08:04:14.220 32619== at 0x1B906FE5: calloc (vg_replace_malloc.c:175) >> >>I know that I am sending uninitialised data, but in the past I got >>a proper stack trace rather than just the 'send' message. Even the >>'calloc' message, without a stack, is not so helpful. >> >>Am I missing a new option? or is there a reason for this change? > > > I think libpthread is compiled with -fomit-frame-pointer, which makes it > hard to get good stack traces. I'm thinking about experimenting with > libunwind to see if we can use it for stack traces; it understands the > unwind info that gcc puts into new .o files, which should make it > possible to get good backtraces in these cases. > > I'm not sure why calloc isn't getting a bit more backtrace. Make sure > there are no -fomit-frame-pointers in the Valgrind makefiles. For vg I do a different build than normal. I build with '-O0' and nothing else (just some extra warn requests): -W -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wconversion -Wredundant-decls -ansi -D_XOPEN_SOURCE=1 -D_GNU_SOURCE=1 -O0 -fno-inline -g I should say I used to get the trace, this laconic report is recent. > Oh, and that you're not using --num-callers=1. I use '--num-callers=32' which I find good enough. > J -- Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/> If attaching .zip rename to .dat |
|
From: Jeremy F. <je...@go...> - 2005-01-20 01:13:22
|
On Thu, 2005-01-20 at 11:32 +1100, Eyal Lebedinsky wrote:
> For vg I do a different build than normal. I build with '-O0' and nothing
> else (just some extra warn requests):
> -W -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wconversion -Wredundant-decls -ansi -D_XOPEN_SOURCE=1 -D_GNU_SOURCE=1 -O0 -fno-inline -g

No, I mean when building Valgrind itself.

> > Oh, and that you're not using --num-callers=1.
>
> I use '--num-callers=32' which I find good enough.

Yeah, I was pretty sure that wasn't it, but it's always worth checking...

J
|
|
From: Eyal L. <ey...@ey...> - 2005-01-20 02:07:10
|
Jeremy Fitzhardinge wrote:
> On Thu, 2005-01-20 at 11:32 +1100, Eyal Lebedinsky wrote:
>
>>For vg I do a different build than normal. I build with '-O0' and nothing
>>else (just some extra warn requests):
>> -W -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wconversion -Wredundant-decls -ansi -D_XOPEN_SOURCE=1 -D_GNU_SOURCE=1 -O0 -fno-inline -g
>
>
> No, I mean when building Valgrind itself.

I do

./autogen.sh || exit 1
./configure || exit 1
make || exit 1
make install || exit 1

If the defaults are unsuitable then I would have a bad build. Here is a snippet of a build:

if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I../coregrind -I../coregrind -I../coregrind/x86 \
-I../coregrind/linux -I../coregrind/x86-linux -I../include -I../include \
-I../include/x86 -I../include/linux -I../include/x86-linux \
-DVG_LIBDIR="\"/usr/local/lib/valgrind"\" -I./demangle -DKICKSTART_BASE=0xb0000000 \
-DVG_PLATFORM="\"x86-linux"\" -Winline -Wall -Wshadow -O -g -mpreferred-stack-boundary=2 \
-DELFSZ=32 -MT stage2-vg_dummy_profile.o -MD -MP -MF ".deps/stage2-vg_dummy_profile.Tpo" \
-c -o stage2-vg_dummy_profile.o `test -f 'vg_dummy_profile.c' || echo './'`vg_dummy_profile.c
then mv -f ".deps/stage2-vg_dummy_profile.Tpo" ".deps/stage2-vg_dummy_profile.Po"
else rm -f ".deps/stage2-vg_dummy_profile.Tpo"
exit 1
fi

I will repeat - this was not a problem until recently. I am rather sure the stable 2.2.0
gives good backtraces.

> J

I would like to offer another observation. I just created a simple program
in an attempt to demonstrate the laconic report problem. Instead, it crashed
(sig 11) on a return.

After repeating it a few times, I noticed that my big test is hanging again.
I killed it and deleted the semaphore it holds (somehow it is never released
after a crash).

The tiny test program now works (no sig 11).

Is it possible that vg uses some semaphore that all instances share and it
gets into trouble after a while? My test suite always fails after a number of
tests finish successfully, and every program thereafter gets sig 11. Every
single valgrind run. If I kill everything (and remove [ipcrm -s] my own
semaphore that my tests use) then I can continue with the tests (well, at
least for a while).

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
If attaching .zip rename to .dat
|
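For the manual `ipcrm -s` cleanup described above, a small sketch of how leftover System V semaphore sets might be listed first. It assumes a Linux system that exposes /proc/sysvipc/sem with a header row naming `semid` and `uid` columns; the function names and the sample rows are made up for the illustration and are not taken from the thread.

```python
def parse_sysvipc_sem(text):
    """Turn the table from /proc/sysvipc/sem into a list of dicts of ints."""
    lines = text.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, map(int, line.split())))
            for line in lines[1:] if line.strip()]


def semids_owned_by(text, uid):
    """Return the semids owned by uid, as candidates for `ipcrm -s <semid>`."""
    return [row["semid"] for row in parse_sysvipc_sem(text)
            if row.get("uid") == uid]


# Hypothetical contents; a real run would read /proc/sysvipc/sem instead.
sample = """\
       key      semid perms      nsems   uid   gid  cuid  cgid      otime      ctime
 305419896      32768   600          1  1000  1000  1000  1000 1106186827 1106186000
         0      65537   666          2     0     0     0     0          0 1106180000
"""

candidates = semids_owned_by(sample, 1000)
```

Each returned id could then be passed to `ipcrm -s` by hand. The sketch only lists sets by owner; it cannot tell whether a set is actually stale, so that judgment stays with the person running the tests.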
|
From: Jeremy F. <je...@go...> - 2005-01-20 06:45:46
|
On Thu, 2005-01-20 at 13:06 +1100, Eyal Lebedinsky wrote:
> I will repeat - this was not a problem until recently. I am rather sure the stable 2.2.0
> gives good backtraces.

Oh, I believe you, but I don't think anything has changed recently which
would have affected this; at least not for calloc.

> I would like to offer another observation. I just created a simple program
> in an attempt to demonstrate the laconic report problem. Instead, it crashed
> (sig 11) on a return.
>
> After repeating it a few times, I noticed that my big test is hanging again.
> I killed it and deleted the semaphore it hold (somehow it is never released
> after a crash).
>
> The tiny test program now works (no sig 11).
>
> Is it possible that vg uses some semaphore that all instances share and it
> gets into trouble after a while? My test suit always fails after a number of
> tests finish successfully, and every program thereafter gets sig 11. Every
> single valgrind run. If I kill everything (and remove [ipcrm -s] my own
> semaphore that my tests use) then I can continue with the tests (well, at
> least for a while).

Valgrind doesn't use semaphores itself, and it should just be passing
your syscalls through to the kernel untouched. It also respects the
CLONE_SYSVSEMA flag, so that should be OK.

I'll note that both FC2 and SUSE 9.2 2.6 kernels seem to show sporadic
problems with delivering signals without proper siginfo information.
That will cause your program to spontaneously SIGSEGV when it tries to
grow the stack, which almost every program will need to do. The kernel
will stay in this state for some indeterminate amount of time, but then
will spontaneously start working again.

You can test for this state by running none/tests/faultstatus (natively,
not under Valgrind). If it doesn't pass everything, then your kernel is
in a buggy state.

I have never seen this with stock kernel.org kernels. Is your kernel a
Debian-supplied one, or one you've built yourself?

J
|
|
From: Eyal L. <ey...@ey...> - 2005-01-20 07:33:16
|
Jeremy Fitzhardinge wrote: [trimmed] > I have never seen this with stock kernel.org kernels. Is your kernel a > Debian-supplied one, or one you've built yourself? I build my own kernels, I now am on $ uname -a Linux e7 2.6.10-ac9 #1 SMP Fri Jan 14 08:56:38 EST 2005 i686 GNU/Linux Just to remove this worry I will now boot into 2.6.10. > J -- Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/> If attaching .zip rename to .dat |
|
From: Eyal L. <ey...@ey...> - 2005-01-20 08:58:06
|
Eyal Lebedinsky wrote:
> Jeremy Fitzhardinge wrote:
> [trimmed]
>
>> I have never seen this with stock kernel.org kernels. Is your kernel a
>> Debian-supplied one, or one you've built yourself?
>
>
> I build my own kernels, I now am on
>
> $ uname -a
> Linux e7 2.6.10-ac9 #1 SMP Fri Jan 14 08:56:38 EST 2005 i686 GNU/Linux
>
> Just to remove this worry I will now boot into 2.6.10.

Same thing with vanilla 2.6.10.

This time, when my tests hang and zz35 fails, I did

killall -9 valgrind ; sh zz35.sh

and zz35 succeeded. In other words, the killall (which killed my hanging
test suite programs running under vg) immediately fixed the situation. No
delay.

FYI

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
If attaching .zip rename to .dat
|
|
From: Eyal L. <ey...@ey...> - 2005-01-20 08:21:19
Attachments:
zz35.sh
zz35.tar.bz2
|
Jeremy Fitzhardinge wrote:
[trimmed]
> I'll note that both FC2 and SUSE 9.2 2.6 kernels seem to show sporadic
> problems with delivering signals without proper siginfo information.
> That will cause your program to spontaneously SIGSEGV when it tries to
> grow the stack, which almost every program will need to do. The kernel
> will stay in this state for some indeterminate amount of time, but then
> will spontaneously start working again.
>
> J

In case it is the same thing, let me describe how I tested it just now.

I have a small test zz35.sh (attached) that simply creates an uninited
error, and which I use to see that I get a proper backtrace. I now use
it to investigate a different problem where my regression test suite
hangs after a number of successful runs and will not proceed until I
shut down all my servers (thus stopping all valgrind instances).

- run my tests. It should do 11-12 of them before failing
- wait for my tests to hang
- run my zz35 which fails sig 11 (zz35-sig11.log). It fails
  consistently for as long as I want.
- 'killall -9 valgrind' to release my failed tests/servers
- without any waiting run my zz35 which works OK again (zz35-ok.log)

So, stopping the running valgrind instances allowed zz35 to run
OK. There does not seem to be a period where the kernel is in
'a mood', but rather one needs to ensure all valgrind instances
are stopped.

Which suggests that some sort of global resource (internal to vg)
is associated with the failure. Naturally, it could still be a
kernel bug that does this.

This is with vanilla 2.6.10-ac9.

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
If attaching .zip rename to .dat
|
|
From: Jeremy F. <je...@go...> - 2005-01-22 17:17:18
|
On Thu, 2005-01-20 at 19:21 +1100, Eyal Lebedinsky wrote: > - run my tests. It should do 11-12 of them before failing > - wait for my tests to hang > - run my zz35 which fails sig 11 (zz35-sig11.log). It fails > consistently for as long as I want. > - 'killall -9 valgrind' to release my failed tests/servers > - without any waiting run my zz35 which works OK again (zz35-ok.log) > > So, stopping the running valgrind instances allowed zz35 to run > OK. There does not seem to be a period where the kernel is in > 'a mood', but rather one needs to ensure all valgrind instances > are stopped. When it is in this state, before you killall -9 valgrind, please run none/tests/faultstatus (natively, not under Valgrind). If this fails, it is definitely a kernel bug. J |
|
From: Eyal L. <ey...@ey...> - 2005-01-23 00:53:39
|
Jeremy Fitzhardinge wrote:
> When it is in this state, before you killall -9 valgrind, please run
> none/tests/faultstatus (natively, not under Valgrind). If this fails,
> it is definitely a kernel bug.
>
> J

Done this, and faultstatus is not failing 'in that state':

root@e7:/data2/valgrind/valgrind/none/tests# ./faultstatus
Test 0: PASS 1
Test 1: PASS 2
Test 2: PASS 3
Test 3: PASS 4
Test 4: PASS 5

However, running as non-root it does fail 'in that state':

eyal@e7:~$ /data2/valgrind/valgrind/none/tests/faultstatus
Test 0: FAIL: expected si_code==1, not 0
Test 1: FAIL: expected si_code==2, not 0
Test 2: FAIL: expected si_code==2, not 0
Test 3: FAIL: expected si_code==2, not 0
Test 4: FAIL: expected si_code==1, not 0

My application runs non-root.

'killall' makes it work instantly.

This is on vanilla 2.6.11-rc2.

Was this reported to linux-kernel? Who is handling it there? I am
ready to do testing/investigation as I can reproduce it consistently.

I cannot report it myself as I do not know enough about the nature
of the problem (or what faultstatus does). However, if this test
program is a simple one that should always work but sometimes does
not (and can be built stand-alone) then it may be enough for a
report.

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
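As a reading aid for the FAIL lines above, a minimal sketch that decodes the si_code numbers. It assumes the failing tests raise SIGSEGV, for which Linux defines SEGV_MAPERR as 1 and SEGV_ACCERR as 2; if a test raises another signal such as SIGBUS, the same numbers mean something else. The `explain` helper and its wording are illustrative only and are not part of faultstatus.

```python
import re

# si_code meanings for SIGSEGV on Linux; 0 is SI_USER, which carries
# no fault details from the kernel.
SI_CODES = {
    0: "SI_USER (no kernel fault details attached)",
    1: "SEGV_MAPERR (address not mapped)",
    2: "SEGV_ACCERR (bad permissions for mapped address)",
}

LINE = re.compile(r"Test (\d+): FAIL: expected si_code==(\d+), not (\d+)")


def explain(line):
    """Return (test number, expected meaning, actual meaning), or None
    when the line is not a FAIL line."""
    m = LINE.search(line)
    if m is None:
        return None
    test, expected, actual = map(int, m.groups())
    return (test,
            SI_CODES.get(expected, "unknown"),
            SI_CODES.get(actual, "unknown"))


report = explain("Test 0: FAIL: expected si_code==1, not 0")
```

Read this way, every failing test received a signal whose siginfo had lost the fault-specific code, which matches the "signals without proper siginfo information" symptom described earlier in the thread.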
|
From: Jeremy F. <je...@go...> - 2005-01-23 01:54:33
|
On Sun, 2005-01-23 at 11:53 +1100, Eyal Lebedinsky wrote: > However, running as non-root it does fail 'in that state': > > eyal@e7:~$ /data2/valgrind/valgrind/none/tests/faultstatus > Test 0: FAIL: expected si_code==1, not 0 > Test 1: FAIL: expected si_code==2, not 0 > Test 2: FAIL: expected si_code==2, not 0 > Test 3: FAIL: expected si_code==2, not 0 > Test 4: FAIL: expected si_code==1, not 0 > > My application runs non-root. > > 'killall' makes it work instantly. > > This is on vanilla 2.6.11-rc2. OK, definitely a kernel problem. faultstatus shouldn't depend on UID at all. > Was this reported to linux-kernel? Who is handling it there? I am > ready to do testing/investigation as I can reproduce it consistently. I haven't seen it reported against a stock kernel, so I haven't reported it to linux-kernel. > I cannot report it myself as I do not know enough about the nature > of the problem (or what faultstatus does). However, if this test > program is a simple one that should always work but sometimes does > not (and can be built stand-alone) then it may be enough for a > report. Yes. What does your test suite do? How much can you simplify it and get the same result? J |
|
From: Eyal L. <ey...@ey...> - 2005-01-23 06:10:44
|
Jeremy Fitzhardinge wrote: [trimmed] > What does your test suite do? How much can you simplify it and > get the same result? > > J I have tried it before and failed. If you still want me to try (after you finish your investigation) then i will give it another go. -- Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/> attach .zip as .dat |
|
From: Jeremy F. <je...@go...> - 2005-01-23 02:32:07
|
On Sun, 2005-01-23 at 11:53 +1100, Eyal Lebedinsky wrote:
> Was this reported to linux-kernel? Who is handling it there? I am
> ready to do testing/investigation as I can reproduce it consistently.
>
> I cannot report it myself as I do not know enough about the nature
> of the problem (or what faultstatus does). However, if this test
> program is a simple one that should always work but sometimes does
> not (and can be built stand-alone) then it may be enough for a
> report.

Hm, I've worked out what it is, and it probably is a Valgrind bug.
There's a system-wide limit to the number of queued signals; if that
limit gets hit, then signals are delivered without extra info. I can
reproduce it easily if I change faultstatus to send itself a lot of
blocked signals before doing the main part of the test.

Does your test do a lot of thread exits?

J
|
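One way to watch for the queue filling up, sketched under assumptions: later Linux kernels than the 2.6.10 discussed here report a `SigQ: queued/limit` line in /proc/<pid>/status, and the helper names and the `headroom` threshold below are invented for the example. The sample text is made up, not captured from the thread.

```python
import re


def parse_sigq(status_text):
    """Return (queued, limit) from the SigQ line of /proc/<pid>/status,
    or None when the kernel does not report it."""
    m = re.search(r"^SigQ:\s*(\d+)/(\d+)", status_text, re.MULTILINE)
    return (int(m.group(1)), int(m.group(2))) if m else None


def queue_nearly_full(status_text, headroom=16):
    """True when fewer than `headroom` signal-queue slots remain."""
    sigq = parse_sigq(status_text)
    return sigq is not None and sigq[1] - sigq[0] < headroom


# Hypothetical status excerpts for illustration.
busy = "Name:\tvalgrind\nSigQ:\t1021/1024\nSigPnd:\t0000000000000000\n"
idle = "Name:\tdate\nSigQ:\t0/31365\n"

busy_state = parse_sigq(busy)
```

On such a kernel, polling this for the processes under test while the suite runs could show the queued count climbing toward the limit before the first spurious sig 11. If the count is kept per user rather than system-wide, that may also be why the root run of faultstatus passed while the non-root run failed, though the thread does not confirm this.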
|
From: Eyal L. <ey...@ey...> - 2005-01-23 06:09:02
|
Jeremy Fitzhardinge wrote:
> On Sun, 2005-01-23 at 11:53 +1100, Eyal Lebedinsky wrote:
>
>>Was this reported to linux-kernel? Who is handling it there? I am
>>ready to do testing/investigation as I can reproduce it consistently.
>>
>>I cannot report it myself as I do not know enough about the nature
>>of the problem (or what faultstatus does). However, if this test
>>program is a simple one that should always work but sometimes does
>>not (and can be built stand-alone) then it may be enough for a
>>report.
>
>
> Hm, I've worked out what it is, and it probably is a Valgrind bug.
> There's a system-wide limit to the number of queued signals; if that
> limit gets hit, then signals are delivered without extra info. I can
> reproduce it easily if I change faultstatus to send itself a lot of
> blocked signals before doing the main part of the test.
>
> Does your test do a lot of thread exits?

Probably. My test suite runs clients against servers, and each client
session gets a thread; a typical test will have no more than 10 threads
active on the server at a time.

I have seen the failure after as few as 11 tests, and it has managed
as many as 25.

> J
>

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
|
From: Eyal L. <ey...@ey...> - 2005-01-20 22:38:20
Attachments:
date.tar.bz2
|
Jeremy Fitzhardinge wrote:
[trimmed]
> I'll note that both FC2 and SUSE 9.2 2.6 kernels seem to show sporadic
> problems with delivering signals without proper siginfo information.
> That will cause your program to spontaneously SIGSEGV when it tries to
> grow the stack, which almost every program will need to do. The kernel
> will stay in this state for some indeterminate amount of time, but then
> will spontaneously start working again.
Some further observations. I ran
strace valgrind --tool=memcheck date >date.strace 2>&1
when a crash is reported, and then when one is not. Comparing the
two logs (attached) I note the point when the two diverge:
non-crashing
============
fstat(3, {st_mode=S_IFREG|0644, st_size=78233, ...}) = 0
readlink("/proc/self/fd/3", "/lib/tls/libpthread-0.60.so", 4096) = 27
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
gettid() = 31890
old_mmap(0x52bfd000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x52bfd000
crashing
========
fstat(3, {st_mode=S_IFREG|0644, st_size=78233, ...}) = 0
readlink("/proc/self/fd/3", "/lib/tls/libpthread-0.60.so", 4096) = 27
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
gettid() = 31744
gettid() = 31744
old_getrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=2147483647}) = 0
getpid() = 31744
write(1016, "==31744== \n", 11==31744==
) = 11
getpid() = 31744
write(1016, "==31744== Process terminating wi"..., 73==31744== Process terminating with default action of signal 11 (SIGSEGV)
) = 73
I note that there is always a "SIGSEGV (Segmentation fault)"
present, even in a good run. It is the reaction to it that
differs. Is it possible that there is a special SIGSEGV
(in vg or glibc) that is overloaded (not a real segfault)
and should be handled specially?
BTW, The 'good' run was done after a fresh boot, to ensure
the kernel is not in any 'funny' state. Just in case.
--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
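The log comparison described in this message can be sketched as a small script. This is a rough heuristic under stated assumptions: `normalise` masks any decimal number of four or more digits so that PIDs do not count as differences, which could also hide a real difference in a large return value. The function names are made up; the sample lines are shortened from the two strace excerpts above.

```python
import re
from itertools import zip_longest


def normalise(line):
    """Mask long decimal numbers (PIDs, sizes) that differ between runs."""
    return re.sub(r"\b\d{4,}\b", "N", line.strip())


def first_divergence(good, bad):
    """Return (index, good_line, bad_line) at the first mismatch, else None."""
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g is None or b is None or normalise(g) != normalise(b):
            return i, g, b
    return None


good = [
    "--- SIGSEGV (Segmentation fault) @ 0 (0) ---",
    "gettid() = 31890",
    "old_mmap(0x52bfd000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, "
    "MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x52bfd000",
]
bad = [
    "--- SIGSEGV (Segmentation fault) @ 0 (0) ---",
    "gettid() = 31744",
    "gettid() = 31744",
    "old_getrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=2147483647}) = 0",
]

result = first_divergence(good, bad)
```

On these excerpts the first mismatch is the line after the first `gettid()`: the good run maps a page with `old_mmap`, while the failing run goes on to `old_getrlimit(RLIMIT_CORE, ...)` and the "Process terminating" message.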
|
|
From: Jeremy F. <je...@go...> - 2005-01-24 08:15:22
|
On Thu, 2005-01-20 at 13:06 +1100, Eyal Lebedinsky wrote: > I would like to offer another observation. I just created a simple program > in an attempt to demonstrate the laconic report problem. Instead, it crashed > (sig 11) on a return. OK, I just checked in a fix for this too. J |
|
From: Eyal L. <ey...@ey...> - 2005-01-24 09:31:59
|
Jeremy Fitzhardinge wrote: > On Thu, 2005-01-20 at 13:06 +1100, Eyal Lebedinsky wrote: > >>I would like to offer another observation. I just created a simple program >>in an attempt to demonstrate the laconic report problem. Instead, it crashed >>(sig 11) on a return. > > OK, I just checked in a fix for this too. I started my testing, I should know how it goes in 1-2h. The pasta sauce (now simmering happily) should also be ready by then... > J Thanks -- Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/> attach .zip as .dat |
|
From: Eyal L. <ey...@ey...> - 2005-01-24 10:35:37
|
Jeremy Fitzhardinge wrote:
> On Thu, 2005-01-20 at 13:06 +1100, Eyal Lebedinsky wrote:
>
>>I would like to offer another observation. I just created a simple program
>>in an attempt to demonstrate the laconic report problem. Instead, it crashed
>>(sig 11) on a return.
>
> OK, I just checked in a fix for this too.
>
> J

It still crashes, but differently. faultstatus does not fail
anymore, but my tests still fail. The pattern is that one program
fails with

/ssa/builds/20050118g-vgi/bin/showtime: line 81: 10967 Segmentation fault $valgrind_prefix --logfile-fd=9 $orig "$@" 9>>$log

The log does not show any errors. Actually, this last run of
showtime has not a single line from VG in the log, as if vg
itself died quietly.

And after this I cannot run anything. This is due to the segfault
happening when the failed program was holding a semaphore. This
is most unusual because that sem is held for a very brief time,
and I do not expect a random crash to happen just then. The fact
is that these crashes are very common, actually the earlier
problem (the sig11 now patched) was causing exactly the same
thing, and I always had to manually remove the sem. Almost as if
the fix is not just right.

In short, I still have a problem, and it happens about the same
time into my tests as the sig11 did.

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
|
From: Jeremy F. <je...@go...> - 2005-01-25 18:24:59
|
On Mon, 2005-01-24 at 21:35 +1100, Eyal Lebedinsky wrote: > It still crashes, but differently. faultstatus does not fail > anymore, but my tests still fail. The pattern is that one program > fails with > > /ssa/builds/20050118g-vgi/bin/showtime: line 81: 10967 Segmentation fault $valgrind_prefix --logfile-fd=9 $orig "$@" 9>>$log > > The log does not show any errors. Actually, this last run of > showtime has not a single line from VG in the log, as if vg > itself died quietly. > > And after this I cannot run anything. This is due to the segfault > happening when the failed program was holding a semaphore. This > is most unusual because that sem is held for a very brief time, > and I do not expect a random crash to happen just then. The fact > is that these crashes are very common, actually the earlier > problem (the sig11 now patched) was causing exactly the same > thing, and I always had to manually remove the sem. Almost as if > the fix is not just right. Well, I did fix a real bug, but there could be another one. Lots of bugs have "died with SIGSEGV" as their symptom. Could you file a bug for this? And if you can give me some way to reproduce this, that would be helpful. There are only a few places within Valgrind where it would quietly die with SIGSEGV without being able to say anything. I'm not sure what it might be in this case. J |
|
From: Eyal L. <ey...@ey...> - 2005-01-26 08:14:41
Attachments:
vg.tar.bz2
|
Jeremy Fitzhardinge wrote:
> On Mon, 2005-01-24 at 21:35 +1100, Eyal Lebedinsky wrote:
>
>>It still crashes, but differently. faultstatus does not fail
>>anymore, but my tests still fail. The pattern is that one program
>>fails with
>>
>>/ssa/builds/20050118g-vgi/bin/showtime: line 81: 10967 Segmentation fault $valgrind_prefix --logfile-fd=9 $orig "$@" 9>>$log
>>
>>The log does not show any errors. Actually, this last run of
>>showtime has not a single line from VG in the log, as if vg
>>itself died quietly.
>>
>>And after this I cannot run anything. This is due to the segfault
>>happening when the failed program was holding a semaphore. This
>>is most unusual because that sem is held for a very brief time,
>>and I do not expect a random crash to happen just then. The fact
>>is that these crashes are very common, actually the earlier
>>problem (the sig11 now patched) was causing exactly the same
>>thing, and I always had to manually remove the sem. Almost as if
>>the fix is not just right.
>
>
> Well, I did fix a real bug, but there could be another one. Lots of
> bugs have "died with SIGSEGV" as their symptom.
>
> Could you file a bug for this? And if you can give me some way to
> reproduce this, that would be helpful.

I do not have an easy way to reproduce, however I can provide better logs.
The attached one shows a successful run of the currently failing program
followed by one that aborted quietly.

The failure in the ipc area may explain why my program always dies while
holding a semaphore, which I must remove manually before my system is
usable again.

> There are only a few places within Valgrind where it would quietly die
> with SIGSEGV without being able to say anything. I'm not sure what it
> might be in this case.

Hope the above helps, I will turn on my own logging to see what I was
trying to do at the time of the crash. It is "good" that the crash is
very consistent now, it used to be very variable before. Yes, I do see
the cup half full...

> J

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
|
From: Jeremy F. <je...@go...> - 2005-01-26 18:49:47
|
On Wed, 2005-01-26 at 19:14 +1100, Eyal Lebedinsky wrote: > I do not have an easy way to reproduce, however I can provide better logs. > The attached one shows a successfull run of the currently failing program > followed by one that aborted quietly. > > The failure in the ipc area may explain why my program always dies while > holding a semaphore, which I must remove manually before my system is > usable again. Is your program multithreaded? Your trace makes it appear that it dies in the sys_ipc syscall doing a semget, but Valgrind does literally nothing in this case - there's nothing for it to check, so it does nothing. Could you try again with --trace-signals=yes. And could you please, please, please file a bug and attach these logs to it. > > There are only a few places within Valgrind where it would quietly die > > with SIGSEGV without being able to say anything. I'm not sure what it > > might be in this case. > > Hope the above helps, I will turn on my own logging to see what I was > trying to do at the time of the crash. It is "good" that the crash is > very consistent now, it used to be very variable before. Yes, I do see > the cup half full... Yes, that will definitely help. J |
|
From: Eyal L. <ey...@ey...> - 2005-01-27 01:02:29
|
Jeremy Fitzhardinge wrote:
> On Wed, 2005-01-26 at 19:14 +1100, Eyal Lebedinsky wrote:
>
>>I do not have an easy way to reproduce, however I can provide better logs.
>>The attached one shows a successfull run of the currently failing program
>>followed by one that aborted quietly.
>>
>>The failure in the ipc area may explain why my program always dies while
>>holding a semaphore, which I must remove manually before my system is
>>usable again.
>
> Is your program multithreaded? Your trace makes it appear that it dies
> in the sys_ipc syscall doing a semget, but Valgrind does literally
> nothing in this case - there's nothing for it to check, so it does
> nothing. Could you try again with --trace-signals=yes.

I now have more logs. ssashut.1 is a log of a good run, ssashut.2 is a similar
run that failed. The logs include both VG traces and my own. You will see none
of my traces in the failed one because it failed as it was establishing the
framework for our trace system when we access our shared memory object.

I will try and get a more detailed log.

> And could you please, please, please file a bug and attach these logs to
> it.

Will do soon. Promise.

> J

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
|
From: Eyal L. <ey...@ey...> - 2005-01-27 05:32:27
Attachments:
logs.tar.bz2
|
An earlier reply of mine is not showing on the list, and it was also
missing the attachments, so this is not a big loss... Here it is with
more details.
Jeremy Fitzhardinge wrote:
> On Wed, 2005-01-26 at 19:14 +1100, Eyal Lebedinsky wrote:
>
>> I do not have an easy way to reproduce, however I can provide better logs.
>> The attached one shows a successfully run of the currently failing program
>> followed by one that aborted quietly.
>>
>> The failure in the ipc area may explain why my program always dies while
>> holding a semaphore, which I must remove manually before my system is
>> usable again.
>
> Is your program multithreaded?
Sure is. But the one dying ('2') never managed to progress to the point of
starting a thread. Look at log '1' to compare to a good run.
> Your trace makes it appear that it dies
> in the sys_ipc syscall doing a semget, but Valgrind does literally
> nothing in this case - there's nothing for it to check, so it does
> nothing. Could you try again with --trace-signals=yes.
I now have more logs. '1' is a log of a good run, '2' is a similar run that
failed. The logs include both VG traces and my own. You will see none of my
traces in the failed one because it failed as it was establishing the
framework for our trace system when we access our shared memory object.
I will try and get a more detailed log.
[later] I confirmed that the log '2' has none of my traces because it never
got that far. Log '1' shows library calls issued by my program, like
iodll[18641/---]> share.c(2266): CALL getpid
SYSCALL[18641,1]( 4) --> 45 (0x2D)
SYSCALL[18641,1]( 20):sys_getpid () --> 18641 (0x48D1)
SYSCALL[18641,1]( 20):sys_getpid () --> 18641 (0x48D1)
SYSCALL[18641,1]( 4) mayBlock:sys_write ( 2, 0x52BFB4F0, 45 ) --> ...
iodll[18641/---]> share.c(2266): CALL RETURN
as well as our own internal calls, like
iodll[0/0]> main.c(892) ssamain_trace_init: ENTER
You will notice that log '2' never got even this far, never logging
any of our tracing.
I am now running with '--trace-signals=yes'.
> And could you please, please, please file a bug and attach these logs to
> it.
Done
Bug 97975 has been added to the database
> J
--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|
|
From: Eyal L. <ey...@ey...> - 2005-01-27 06:23:36
Attachments:
logs-34.tar.bz2
|
Jeremy Fitzhardinge wrote:
> Could you try again with --trace-signals=yes.

Done. '3' is a good run, '4' is a failed run.

Mmm, I still do not see my posting with logs 1/2 on the list. Hope it gets
there in due time.

--
Eyal Lebedinsky (ey...@ey...) <http://samba.org/eyal/>
attach .zip as .dat
|