|
From: Howard C. <hy...@hi...> - 2004-06-28 23:14:35
|
I was trying to use valgrind 2.0.0/helgrind and it was crashing on me when
I tried to add a suppression for Helgrind. So I grabbed the 2.1.1 tarball
and built that, but it complained about fix_auxv and didn't do anything
useful at all. Looking at the valgrind-users archives it seems this is a
problem with the 2.2 kernel I'm running. I didn't see any notes about
support for 2.2 being dropped in any of the NEWS/release notes. Upgrading
the kernel is not an option on the machine I'm testing at the moment.

I took a crack at fix_auxv, and made it insert a few of the missing
entries. This got stage2 loaded and running, but things die pretty
quickly after that: (btw, this is kernel 2.2.25)

valgrind -v --tool=memcheck ../servers/slapd/.libs/lt-slapd -d7
==4931== Memcheck, a memory error detector for x86-linux.
==4931== Copyright (C) 2002-2004, and GNU GPL'd, by Julian Seward.
==4931== Using valgrind-2.1.1, a program supervision framework for x86-linux.
==4931== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward.
==4931== Valgrind library directory: /usr/local/lib/valgrind
==4931== Command line
==4931== /home/hyc/OD/sasl2/servers/slapd/.libs/lt-slapd
==4931== -d7
==4931== Startup, with flags:
==4931== -v
==4931== --tool=memcheck
==4931== Reading syms from /home/hyc/OD/sasl2/servers/slapd/.libs/lt-slapd (0x8048000)
==4931== Reading syms from /lib/ld-2.3.2.so (0x3C000000)
==4931== Reading syms from /lib/ld-2.3.2.so (0xB0000000)
==4931== Reading syms from /lib/libdl-2.3.2.so (0xB0016000)
==4931== Reading syms from /lib/libc-2.3.2.so.debug (0xB0019000)
==4931== Reading syms from /opt/local/lib/valgrind/vgskin_memcheck.so (0xB024F000)
==4931== Reading syms from /opt/local/lib/valgrind/stage2 (0xB8000000)
==4931== Reading suppressions file: /usr/local/lib/valgrind/default.supp
==4931== REDIRECT soname:libc.so.6(__GI___errno_location) to soname:libpthread.so.0(__errno_location)
==4931== REDIRECT soname:libc.so.6(__errno_location) to soname:libpthread.so.0(__errno_location)
==4931== REDIRECT soname:libc.so.6(__GI___h_errno_location) to soname:libpthread.so.0(__h_errno_location)
==4931== REDIRECT soname:libc.so.6(__h_errno_location) to soname:libpthread.so.0(__h_errno_location)
==4931== REDIRECT soname:libc.so.6(__GI___res_state) to soname:libpthread.so.0(__res_state)
==4931== REDIRECT soname:libc.so.6(__res_state) to soname:libpthread.so.0(__res_state)
==4931== REDIRECT soname:libc.so.6(stpcpy) to *vgpreload_memcheck.so*(stpcpy)
==4931== REDIRECT soname:libc.so.6(strnlen) to *vgpreload_memcheck.so*(strnlen)
==4931== REDIRECT soname:ld-linux.so.2(stpcpy) to *vgpreload_memcheck.so*(stpcpy)
==4931== REDIRECT soname:ld-linux.so.2(strchr) to *vgpreload_memcheck.so*(strchr)
==4931==
==4931==
==4931== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==4931== at 0x3C005A1C: _dl_map_object_internal (dl-load.c:1669)
==4931== by 0x3C0025A2: dl_main (rtld.c:951)
==4931== by 0x3C00DA4D: _dl_sysdep_start (../sysdeps/generic/dl-sysdep.c:195)
==4931== by 0x3C000CBD: _dl_start_final (rtld.c:248)
==4931==
==4931== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==4931== malloc/free: in use at exit: 0 bytes in 0 blocks.
==4931== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==4931==
--4931-- TT/TC: 0 tc sectors discarded.
--4931-- 249 chainings, 0 unchainings.
--4931-- translate: new 472 (7892 -> 104490; ratio 132:10)
--4931-- discard 0 (0 -> 0; ratio 0:10).
--4931-- dispatch: 2956 jumps (bb entries), of which 567 (19%) were unchained.
--4931-- 1/488 major/minor sched events. 480 tt_fast misses.
--4931-- reg-alloc: 104 t-req-spill, 19492+806 orig+spill uis, 2565 total-reg-r.
--4931-- sanity: 2 cheap, 1 expensive checks.
--4931-- ccalls: 1534 C calls, 55% saves+restores avoided (4992 bytes)
--4931-- 2066 args, avg 0.86 setup instrs each (560 bytes)
--4931-- 0% clear the stack (4602 bytes)
--4931-- 718 retvals, 25% of reg-reg movs avoided (356 bytes)

What now?
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Tom H. <th...@cy...> - 2004-06-29 18:48:45
|
In message <40E...@hi...>
Howard Chu <hy...@hi...> wrote:
> I took a crack at fix_auxv, and made it insert a few of the missing
> entries. This got stage2 loaded and running, but things die pretty
> quickly after that: (btw, this is kernel 2.2.25)
If you look back in the archives you should find somewhere the results
of my last attempts to get things going on a 2.2 kernel.
I suspect that the problem you're seeing is not actually down to the
kernel at all, but to the version of gcc that was used to build the
system C library, combined with a trick that valgrind tries to pull
with the auxv entries.
The problem is the fake auxv entries that valgrind adds for its own
use - if you reduce the value of those to less than 32 then it will
work as glibc will no longer wander off into invalid memory.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Howard C. <hy...@sy...> - 2004-07-02 03:26:56
Attachments:
dif.txt
|
Tom Hughes wrote:
> In message <40E...@hi...>
> Howard Chu <hy...@hi...> wrote:
>>I took a crack at fix_auxv, and made it insert a few of the missing
>>entries. This got stage2 loaded and running, but things die pretty
>>quickly after that: (btw, this is kernel 2.2.25)
> If you look back in the archives you should find somewhere the results
> of my last attempts to get things going on a 2.2 kernel.
>
> I suspect that the problem you're seeing is not actually down to the
> kernel at all, but to the version of gcc that was used to build the
> system C library, combined with a trick that valgrind tries to pull
> with the auxv entries.
>
> The problem is the fake auxv entries that valgrind adds for its own
> use - if you reduce the value of those to less than 32 then it will
> work as glibc will no longer wander off into invalid memory.

I didn't totally understand your point, but after playing with it on a
separate SuSE 9.1 system and comparing the results to my 2.2 kernel,
looking at prmap etc., I got it working. Here's my patch which inserts
the missing auxv entries, and then adds 16 pages to the stack so that
the client process can start up without an immediate SEGV.

Of course not everything works yet - signal handling appears to be
broken.
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Tom H. <th...@cy...> - 2004-07-02 06:21:12
|
In message <40E...@sy...>
Howard Chu <hy...@sy...> wrote:
> Tom Hughes wrote:
>
> > The problem is the fake auxv entries that valgrind adds for its own
> > use - if you reduce the value of those to less than 32 then it will
> > work as glibc will no longer wander off into invalid memory.
>
> I didn't totally understand your point, but after playing with it on a
> separate SuSE 9.1 system and comparing the results to my 2.2 kernel,
> looking at prmap etc., I got it working. Here's my patch which inserts
> the missing auxv entries, and then adds 16 pages to the stack so that
> the client process can startup without an immediate SEGV.
Adding 16 pages to the stack is a gross hack though - the real fix
is to change AT_UME_PADFD and AT_UME_EXECFD to 30 and 31 instead of
the current values (0xff01 and 0xff02) so that glibc doesn't write
off the end of a bitfield during startup.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
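
The "less than 32" point can be illustrated with a small model. The sketch below assumes, as the message above suggests, that glibc's startup code records each auxv entry type as one bit in a 32-bit field; the function names are hypothetical and this is not taken from the glibc or valgrind sources. It only computes which byte, counted from the start of that field, a given entry value would touch.

```python
# Illustrative model only: assumes a 32-bit bitfield indexed by the auxv
# entry type, as described in the thread. All function names are hypothetical.

BITFIELD_BITS = 32  # the thread says entry values must be "less than 32"

# Values quoted in the thread for valgrind's private auxv entries.
OLD_VALUES = {"AT_UME_PADFD": 0xFF01, "AT_UME_EXECFD": 0xFF02}
NEW_VALUES = {"AT_UME_PADFD": 30, "AT_UME_EXECFD": 31}


def fits_in_bitfield(entry_type: int, width_bits: int = BITFIELD_BITS) -> bool:
    """Return True if setting bit `entry_type` stays inside the bitfield."""
    return 0 <= entry_type < width_bits


def byte_offset_of_bit(entry_type: int) -> int:
    """Byte offset, from the start of the bitfield, that holds this bit."""
    return entry_type // 8


if __name__ == "__main__":
    for label, values in (("old", OLD_VALUES), ("new", NEW_VALUES)):
        for name, value in values.items():
            print(f"{label} {name}={value:#x}: "
                  f"in range={fits_in_bitfield(value)}, "
                  f"byte offset={byte_offset_of_bit(value)}")
```

Under this model the old values land several kilobytes past the four-byte field, while 30 and 31 stay inside it. Whether that is what actually happens in a given glibc build depends on how the library was compiled, as noted above.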
|
|
From: Howard C. <hy...@sy...> - 2004-07-02 07:01:08
|
Tom Hughes wrote:
>>Tom Hughes wrote:
>>>The problem is the fake auxv entries that valgrind adds for its own
>>>use - if you reduce the value of those to less than 32 then it will
>>>work as glibc will no longer wander off into invalid memory.
>>I didn't totally understand your point, but after playing with it on a
>>separate SuSE 9.1 system and comparing the results to my 2.2 kernel,
>>looking at prmap etc., I got it working. Here's my patch which inserts
>>the missing auxv entries, and then adds 16 pages to the stack so that
>>the client process can startup without an immediate SEGV.
> Adding 16 pages to the stack is a gross hack though - the real fix
> is to change AT_UME_PADFD and AT_UME_EXECFD to 30 and 31 instead of
> the current values (0xff01 and 0xff02) so that glibc doesn't write
> off the end of a bitfield during startup.

Ahhhhh, now I understand what you were referring to by "less than 32"
... Thanks for clearing that up.

Hm, no; I just recompiled with this change and it still crashes the same
way I reported originally. Whereas with the increased stack, it runs
most of the regtests.
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Howard C. <hy...@sy...> - 2004-07-04 16:44:35
|
Something else I've noticed on my desktop machine running kernel 2.2.25:
valgrind 2.1.0 and 2.1.1 use almost no CPU time; "top" shows the system
at 99% idle, and the target program runs very very slowly. I saw some
old references saying this may happen on laptops with power management
enabled, but that's not the case here. Also, running a CPU hog to force
the CPU to work harder doesn't change anything; the CPU hog gets 99% of
the CPU and valgrind still uses less than 1%.

One of the big differences I see between 2.0.0 and 2.1 is that the
vg_scheduler.c uses poll instead of select, but there've been so many
other changes it's hard to see exactly what is causing this behavior.
Does anyone else have any insight into this?
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Howard C. <hy...@sy...> - 2004-07-05 19:20:35
|
While looking for memory leaks in my code I found that valgrind reported
519 lost blocks attributed to a line of code that is only executed once.
In this case it is a strdup that copies a single command-line argument,
in the startup phase of the program's main().

Needless to say, this was a bit puzzling; but since I know there cannot
be 519 passes thru this bit of code, I pretty much ignored that part of
the report. But seeing that makes me very skeptical about the other
leaks that it reported. Any idea why it's so far from reality?
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Nicholas N. <nj...@ca...> - 2004-07-05 20:21:50
|
On Mon, 5 Jul 2004, Howard Chu wrote:
> While looking for memory leaks in my code I found that valgrind reported 519
> lost blocks attributed to a line of code that is only executed once. In this
> case it is a strdup that copies a single command-line argument, in the
> startup phase of the program's main().
>
> Needless to say, this was a bit puzzling; but since I know there cannot be
> 519 passes thru this bit of code, I pretty much ignored that part of the
> report. But seeing that makes me very skeptical about the other leaks that it
> reported. Any idea why it's so far from reality?

It's pretty much impossible to say without more information. Can you give
a sample program that exhibits the behaviour?

N
|
|
From: Tom H. <th...@cy...> - 2004-07-05 22:14:49
|
In message <40E...@sy...>
Howard Chu <hy...@sy...> wrote:
> While looking for memory leaks in my code I found that valgrind reported
> 519 lost blocks attributed to a line of code that is only executed once.
> In this case it is a strdup that copies a single command-line argument,
> in the startup phase of the program's main().
>
> Needless to say, this was a bit puzzling; but since I know there cannot
> be 519 passes thru this bit of code, I pretty much ignored that part of
> the report. But seeing that makes me very skeptical about the other
> leaks that it reported. Any idea why it's so far from reality?
Bear in mind the leak resolution is set to low by default, which means
that any leaks which share the same two locations at the bottom of the
stack trace will be merged - try using --leak-resolution=high and see
if that changes things.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Howard C. <hy...@sy...> - 2004-07-06 02:17:52
|
Tom Hughes wrote:
> In message <40E...@sy...>
> Howard Chu <hy...@sy...> wrote:
>>While looking for memory leaks in my code I found that valgrind reported
>>519 lost blocks attributed to a line of code that is only executed once.
>>In this case it is a strdup that copies a single command-line argument,
>>in the startup phase of the program's main().
>>Needless to say, this was a bit puzzling; but since I know there cannot
>>be 519 passes thru this bit of code, I pretty much ignored that part of
>>the report. But seeing that makes me very skeptical about the other
>>leaks that it reported. Any idea why it's so far from reality?
> Bear in mind the leak resolution is set to low by default, which means
> that any leaks which share the same two locations at the bottom of the
> stack trace will be merged - try using --leak-resolution=high and see
> if that changes things.

Here's the actual report:

==13499== 29454 bytes in 519 blocks are definitely lost in loss record 24 of 25
==13499== at 0x7501E8B4: malloc (vg_replace_malloc.c:105)
==13499== by 0x812BE28: ber_memalloc_x (memory.c:232)
==13499== by 0x812BF30: ber_strdup_x (memory.c:662)
==13499== by 0x807368B: ch_strdup (ch_malloc.c:136)
==13499== by 0x804CD9D: main (main.c:412)

This is from OpenLDAP slapd, current CVS HEAD. I didn't expect that
leak-resolution would change anything, because there simply isn't any
other code path possible from this point. No branches, no loops, and no
other function calls. This result was produced with --num-callers=8 so
this stack trace is complete, bottom to top of stack, and there
certainly isn't anything above main().

At the moment I'm unable to test this with memcheck; as I reported
earlier my desktop machine runs too slowly to test (it's stuck on kernel
2.2.25 and would take weeks to get the result) and my laptop doesn't
seem to have enough RAM for the test, it always aborts with a malloc
failure. (I'm waiting for a new machine, a few more days before it gets
delivered...)

The actual invocation was:

valgrind -v --skin=addrcheck --gdb-attach=yes --num-callers=8 --leak-check=yes --gdb-path=~/bin/rungdb --logfile-fd=3 3>&1 slapd -f $CONF1 -h $URI1 -d $LVL > $LOG1 2>&1 < /dev/tty &

CONF1 is ./testrun/slapd.1.conf
URI1 is ldap://:9011
LVL is 4
LOG1 is ./testrun/slapd.1.log

There were no errors; gdb was never invoked during this run. I'll try
again with leak-resolution=high and see if that makes a difference.
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Tom H. <th...@cy...> - 2004-07-06 06:17:33
|
In message <40E...@sy...>
Howard Chu <hy...@sy...> wrote:
> Tom Hughes wrote:
>
> > Bear in mind the leak resolution is set to low by default, which means
> > that any leaks which share the same two locations at the bottom of the
> > stack trace will be merged - try using --leak-resolution=high and see
> > if that changes things.
>
> Here's the actual report:
> ==13499== 29454 bytes in 519 blocks are definitely lost in loss record
> 24 of 25
> ==13499== at 0x7501E8B4: malloc (vg_replace_malloc.c:105)
> ==13499== by 0x812BE28: ber_memalloc_x (memory.c:232)
> ==13499== by 0x812BF30: ber_strdup_x (memory.c:662)
> ==13499== by 0x807368B: ch_strdup (ch_malloc.c:136)
> ==13499== by 0x804CD9D: main (main.c:412)
>
> This is from OpenLDAP slapd, current CVS HEAD. I didn't expect that
> leak-resolution would change anything, because there simply isn't any
> other code path possible from this point. No branches, no loops, and no
> other function calls. This result was produced with --num-callers=8 so
> this stack trace is complete, bottom to top of stack, and there
> certainly isn't anything above main().
All allocations made from malloc called from ber_memalloc_x will be
grouped together in that report, regardless of where ber_memalloc_x
was called from. Changing the leak resolution will stop that.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
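
The grouping described above can be sketched in a few lines. This is a simplified model, not valgrind's implementation: `merge_leaks` is a hypothetical helper, the block sizes and the `hypothetical_caller_*` frames are invented for illustration, and showing the first trace seen for each merged record is an assumption.

```python
from collections import defaultdict

# Stack traces are listed innermost frame first, as in the report.
# Sizes and the hypothetical_caller_* frames are made-up example data.
LEAKS = [
    (57, ["malloc", "ber_memalloc_x", "ber_strdup_x", "ch_strdup", "main"]),
    (40, ["malloc", "ber_memalloc_x", "hypothetical_caller_a", "main"]),
    (40, ["malloc", "ber_memalloc_x", "hypothetical_caller_a", "main"]),
    (24, ["malloc", "ber_memalloc_x", "hypothetical_caller_b", "main"]),
]


def merge_leaks(leaks, frames_to_match):
    """Group leaked blocks whose traces agree on the first N frames.

    frames_to_match=None means every frame must match.
    """
    merged = defaultdict(lambda: {"bytes": 0, "blocks": 0, "shown": None})
    for size, stack in leaks:
        key = tuple(stack if frames_to_match is None else stack[:frames_to_match])
        record = merged[key]
        record["bytes"] += size
        record["blocks"] += 1
        if record["shown"] is None:
            # Assumption: the first trace seen represents the whole group.
            record["shown"] = stack
    return dict(merged)


if __name__ == "__main__":
    for depth in (2, None):
        print(f"frames compared: {depth}")
        for record in merge_leaks(LEAKS, depth).values():
            print(f"  {record['bytes']} bytes in {record['blocks']} blocks, "
                  f"shown via {' <- '.join(record['shown'])}")
```

With two frames compared, every block allocated through ber_memalloc_x lands in one record that is displayed with a single trace, which is how a line executed once can appear to own hundreds of blocks.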
|
|
From: Julian S. <js...@ac...> - 2004-07-05 20:31:05
|
> Needless to say, this was a bit puzzling; but since I know there cannot
> be 519 passes thru this bit of code, I pretty much ignored that part of
> the report. But seeing that makes me very skeptical about the other
> leaks that it reported. Any idea why it's so far from reality?

That is a little strange. One question is, if you re-run it with
memcheck instead of addrcheck, do you get more plausible results?
Reason I ask is that the leak checkers have to scan all address
space they believe is accessible, to search for pointers to blocks.
Since memcheck tracks definedness as well as addressability, it
may wind up scanning less memory and therefore produce more accurate
results. Just a thought.

J
|
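
The scanning idea can be sketched as a toy model. It is a simplification under stated assumptions: `find_lost_blocks` and the example memory words are hypothetical, and the model ignores details such as pointers into the middle of a block or chains of pointers through other blocks.

```python
# Toy model of a conservative leak scan. Each memory word is a tuple of
# (value, addressable, defined); all example data below is made up.

def find_lost_blocks(blocks, words, require_defined):
    """Return the heap blocks that no scanned word points to.

    require_defined=False models a tool that tracks addressability only;
    require_defined=True models one that also tracks definedness.
    """
    referenced = set()
    for value, addressable, defined in words:
        if not addressable:
            continue  # never scanned by either kind of tool
        if require_defined and not defined:
            continue  # skipped only when definedness is tracked
        if value in blocks:
            referenced.add(value)
    return blocks - referenced


if __name__ == "__main__":
    blocks = {0x1000, 0x2000}
    words = [
        (0x1000, True, True),    # live, initialised pointer
        (0x2000, True, False),   # stale value in uninitialised memory
        (0x3000, False, False),  # unaddressable word
    ]
    print("addressability only:", find_lost_blocks(blocks, words, False))
    print("with definedness:   ", find_lost_blocks(blocks, words, True))
```

In this model the stale value hides the second block from the addressability-only scan, while the scan that also requires definedness reports it as lost.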
|
From: Crispin F. <val...@fl...> - 2004-07-05 22:09:55
|
On Mon, 2004-07-05 at 21:32 +0100, Julian Seward wrote:
> > Needless to say, this was a bit puzzling; but since I know there cannot
> > be 519 passes thru this bit of code, I pretty much ignored that part of
> > the report. But seeing that makes me very skeptical about the other
> > leaks that it reported. Any idea why it's so far from reality?
>
> That is a little strange. One question is, if you re-run it with
> memcheck instead of addrcheck, do you get more plausible results?
> Reason I ask is that the leak checkers have to scan all address
> space they believe is accessible, to search for pointers to blocks.
> Since memcheck tracks definedness as well as addressability, it
> may wind up scanning less memory and therefore produce more accurate
> results. Just a thought.

Isn't this just the --leak-resolution defaulting to just the top 3 (or
is it 2) stack frames? Does it help if you use --leak-resolution=high
on the command line?

Crispin
|
|
From: Howard C. <hy...@sy...> - 2004-07-06 08:41:37
|
Tom Hughes wrote:
> All allocations made from malloc called from ber_memalloc_x will be
> grouped together in that report, regardless of where ber_memalloc_x
> was called from. Changing the leak resolution will stop that.

Thanks. I had actually used --leak-resolution=high in an older test
script, I just seem to have forgotten about it this time around. My
mistake.

This may be a silly question, but why does this leak-resolution option
even exist? When is it ever useful to have leak-resolution and
num-callers set differently? E.g., it's pointless to have
leak-resolution=high and num-callers < 4. Or num-callers > 2 and
leak-resolution=low, or num-callers > 4 and leak-resolution=med.

It would make more sense to just have the num-callers option, perhaps
with num-callers=0 meaning "the entire stack" which would be the
equivalent of leak-resolution=high.
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Howard C. <hy...@sy...> - 2004-07-06 11:07:24
|
Tom Hughes wrote:
> All allocations made from malloc called from ber_memalloc_x will be
> grouped together in that report, regardless of where ber_memalloc_x
> was called from. Changing the leak resolution will stop that.

Well, 5 hours later, the test is complete and the results are more
sensible. Thanks for clarifying this situation for me.

Just posting this to verify. (I already found and fixed the leaks using
FunctionCheck). The nice thing about valgrind is that it doesn't require
recompiling your target, and since it emulates the target it can drop
you into gdb at the moment Something Bad happens (such as touching freed
memory). But damn is it slow... Using FunctionCheck is a pain sometimes
because it requires a specially compiled binary, but it runs on the real
CPU so it only takes 30 minutes to run my test. It can't protect itself
from a wildly misbehaving program though. Oh well.
--
  -- Howard Chu
  Chief Architect, Symas Corp.  Director, Highland Sun
  http://www.symas.com          http://highlandsun.com/hyc
  Symas: Premier OpenSource Development and Support
|
|
From: Nicholas N. <nj...@ca...> - 2004-07-06 11:35:15
|
On Tue, 6 Jul 2004, Howard Chu wrote:
> Just posting this to verify. (I already found and fixed the leaks using
> FunctionCheck). The nice thing about valgrind is that it doesn't require
> recompiling your target, and since it emulates the target it can drop you
> into gdb at the moment Something Bad happens (such as touching freed memory).
> But damn is it slow... Using FunctionCheck is a pain sometimes because it
> requires a specially compiled binary, but it runs on the real CPU so it only
> takes 30 minutes to run my test. It can't protect itself from a wildly
> misbehaving program though. Oh well.

No tool is perfect. Use a combination of tools that works for you.

N
|
|
From: Dennis L. <pla...@tz...> - 2004-07-06 12:17:56
|
On Tue, 2004-07-06 at 12:55, Nicholas Nethercote wrote:
> I don't know how much people use --leak-resolution. It would be
> interesting to hear what people think:
>
> - is the default trace merging confusing

Too often, yes.

> - does anyone use --leak-resolution=high?

Yes, I have my own start script for valgrind (so that I can link to the
script and valgrind uses the skin named by the link) and in the memcheck
case I always add this option there.

> And any other relevant opinions.

I don't know if it's relevant, but I would like this option to be
configurable with a number, rather than high/med/low. And this number
should always be at least what num-callers was set to, since otherwise
it would be even more confusing.

If there was a way to determine where this leak logically occurs (like
whenever calling do_sort_something_out(), which re-sorts and forgets to
deallocate something, but is called from different places), the merging
would be nice to make it clearer, but I don't know if there's such a
way.

> N
|
|
From: Nicholas N. <nj...@ca...> - 2004-07-06 12:35:31
|
On Tue, 6 Jul 2004, Dennis Lubert wrote:
> I don't know if it's relevant, but I would like this option to be
> configurable with a number, rather than high/med/low. And this number
> should always be at least what num-callers was set to, since otherwise
> it would be even more confusing.

If this number is always higher than num-callers, then it would make
sense to remove --leak-resolution altogether, as no traces would ever be
merged. Perhaps this is the sensible thing to do, if people are always
using --leak-resolution=high anyway.

> If there was a way to determine where this leak logically occurs (like
> whenever calling do_sort_something_out(), which re-sorts and forgets to
> deallocate something, but is called from different places), the merging
> would be nice to make it clearer, but I don't know if there's such a
> way.

Unfortunately identifying when a leak occurs is very difficult to do
without being horribly slow.

N
|
|
From: Nicholas N. <nj...@ca...> - 2004-07-06 10:55:46
|
On Tue, 6 Jul 2004, Howard Chu wrote:
> This may be a silly question, but why does this leak-resolution option even
> exist? When is it ever useful to have leak-resolution and num-callers set
> differently? E.g., it's pointless to have leak-resolution=high and
> num-callers < 4. Or num-callers > 2 and leak-resolution=low, or num-callers
> > 4 and leak-resolution=mid.
>
> It would make more sense to just have the num-callers option, perhaps with
> num-callers=0 meaning "the entire stack" which would be the equivalent of
> leak-resolution=high.

The manual describes --leak-resolution like this:

  When doing leak checking, determines how willing Memcheck is to
  consider different backtraces to be the same. When set to low, the
  default, only the first two entries need match. When med, four entries
  have to match. When high, all entries need to match.

  For hardcore leak debugging, you probably want to use
  --leak-resolution=high together with --num-callers=40 or some such
  large number. Note however that this can give an overwhelming amount
  of information, which is why the defaults are 4 callers and
  low-resolution matching.

I don't know how much people use --leak-resolution. It would be
interesting to hear what people think:

- is the default trace merging confusing?

- does anyone use --leak-resolution=high?

And any other relevant opinions.

N
|
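
How the two options interact can be sketched as below. The frame counts for low, med and high come from the manual text quoted above; the `frames_compared` helper is hypothetical, and the idea that only the frames recorded under --num-callers are available for comparison is an assumption, not something confirmed from the valgrind sources.

```python
# Frame counts per the quoted manual text: low = 2, med = 4, high = all.
RESOLUTION_FRAMES = {"low": 2, "med": 4, "high": None}


def frames_compared(leak_resolution: str, num_callers: int) -> int:
    """Number of stack frames that must match for two leaks to be merged.

    Assumption: a trace holds at most `num_callers` frames, so no more
    than that can ever be compared.
    """
    limit = RESOLUTION_FRAMES[leak_resolution]
    if limit is None:
        return num_callers
    return min(limit, num_callers)


if __name__ == "__main__":
    for resolution, callers in (("low", 8), ("med", 8), ("high", 8), ("high", 40)):
        print(f"--leak-resolution={resolution} --num-callers={callers}: "
              f"{frames_compared(resolution, callers)} frames compared")
```

Under this model a run with --num-callers=8 and the default low resolution still compares only two frames, so a full eight-frame trace can be printed for a record that was merged on its innermost two.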
|
From: Crispin F. <val...@fl...> - 2004-07-06 11:03:22
|
> - is the default trace merging confusing?

I certainly find it confusing, as when using C++, you have the new() call
in the frame before the malloc() so a lot of things are merged wrongly.

> - does anyone use --leak-resolution=high?

I use this all the time, so much that I have it in my .valgrindrc file
(along with --num-callers=40) :-)

Crispin
|