|
From: Paul F. <pa...@fr...> - 2006-10-20 09:23:46
|
Hi
I have a problem with recent versions of valgrind - my AUT works with
Valgrind 3.0.1, but fails with Valgrind 3.2.1 and 3.2.0.
I don't have access to the source code for the dynamic library where the
error is happening. However, it's possible to get a fairly good picture of
what is going on with the --trace-signals=yes --trace-cfi=yes
--trace-syscalls=yes options. Here's a snippet of the valgrind output for
3.0.1:
--1721-- REDIR: 0x1CA046E0 (mallopt) redirected to 0x1B9000FA (mallopt)
SYSCALL[1721,1](191) sys_getrlimit ( 3, 0x52BDEBF8 )[sync] --> Success(0x0)
--1721-- signal 11 arrived ... si_code=1, EIP=0x1BBEE9B9, eip=0xB1EBC279
--1721-- SIGSEGV: si_code=1 faultaddr=0x52B5EBC0 tid=1 ESP=0x52B5EBC0 seg=0x52BC7000-0x52C00000 fl=34 shad=0x52D00000-0xB0000000
--1721-- -> extended stack base to 0x52B5E000
... lots more deleted ...
--1721-- signal 11 arrived ... si_code=1, EIP=0x1BBEE9B9, eip=0xB1EBC279
--1721-- SIGSEGV: si_code=1 faultaddr=0x50D5E440 tid=1 ESP=0x50D5E440 seg=0x50DDE000-0x52C00000 fl=34 shad=0x52D00000-0xB0000000
--1721-- -> extended stack base to 0x50D5E000
SYSCALL[1721,1](183) sys_getcwd ( 0x52BDCAD0, 4095 )[sync] --> Success(0x1C)
And the same snippet, from 3.2.1:
--4712-- REDIR: 0x52EF6E0 (mallopt) redirected to 0x401BEB4 (mallopt)
SYSCALL[4712,1](191) sys_getrlimit ( 3, 0xBEFD99B8 )[sync] --> Success(0x0)
--4712-- signal 11 arrived ... si_code=1, EIP=0x43FCDC9, eip=0x65021323
--4712-- SIGSEGV: si_code=1 faultaddr=0xBEF59970 tid=1 ESP=0xBEF59970 seg=0xBDFFA000-0xBEFC1FFF
--4712-- -> extended stack base to 0xBEF59000
... lots more ...
--4712-- signal 11 arrived ... si_code=1, EIP=0x43FCDC9, eip=0x650214DC
--4712-- SIGSEGV: si_code=1 faultaddr=0xBDFD93A0 tid=1 ESP=0xBDFD93A0 seg=NULL
--4712-- delivering signal 11 (SIGSEGV):1 to thread 1
--4712-- push_signal_frame (thread 1): signal 11
==4712== at 0x43FCDC9: simpleRecurse (main.c:1570)
--4712-- delivering signal 11 (SIGSEGV) to thread 1: on ALT STACK (0x54DF010-0x54E1010; 8192 bytes)
SYSCALL[4712,1]( 8) sys_creat ( 0x96E95BD(vams_ms-stacktrace.dump), 438 ) --> [async] ...
There are 32 signals handled before the last one where seg=NULL. For that
signal, the stack does not get extended, and the AUT gets the signal
instead, and terminates.
I've started poking around the valgrind code, but I haven't yet found where
this limit is, or how to get around it.
A+
Paul
|
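For readers following the trace: the lines above suggest that a fault is treated as stack growth only when the faulting address falls inside a known segment, and that the new stack base is the fault address rounded down to a page boundary. Below is a minimal sketch of that reading. It is not Valgrind's actual code; the names page_floor and fault_in_segment and the 4096-byte page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size; it matches the rounding seen in the trace. */
#define SKETCH_PAGE_SIZE 4096u

/* Round an address down to the start of its page. */
uint32_t page_floor(uint32_t addr)
{
    return addr & ~(uint32_t)(SKETCH_PAGE_SIZE - 1u);
}

/* Return 1 if fault lies inside [seg_start, seg_end], with both
 * ends inclusive, as the trace prints them. Return 0 otherwise. */
int fault_in_segment(uint32_t fault, uint32_t seg_start, uint32_t seg_end)
{
    assert(seg_start <= seg_end);
    return fault >= seg_start && fault <= seg_end;
}
```

With the 3.2.1 numbers, 0xBEF59970 lies inside 0xBDFFA000-0xBEFC1FFF and rounds down to 0xBEF59000, which matches the "extended stack base" line. The final fault at 0xBDFD93A0 is below 0xBDFFA000, which would be consistent with seg=NULL if the segment still started at that address when the last signal arrived.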
|
From: Julian S. <js...@ac...> - 2006-10-20 09:41:29
|
Are you sure you're not trying to allocate too much stuff onto the stack?
3.2.1 allows you a max stack of 16M and you get to segfault after that.
J
On Friday 20 October 2006 10:23, Paul Floyd wrote:
> Hi
>
> I have a problem with recent versions of valgrind - my AUT works with
> Valgrind 3.0.1, but fails with Valgrind 3.2.1 and 3.2.0.
>
> I don't have access to the source code for the dynamic library where the
> error is happening. However, it's possible to get a fairly good picture of
> what is going on with the --trace-signals=yes --trace-cfi=yes
> --trace-syscalls=yes options. Here's a snippet of the valgrind output for
> 3.0.1:
>
> --1721-- REDIR: 0x1CA046E0 (mallopt) redirected to 0x1B9000FA (mallopt)
> SYSCALL[1721,1](191) sys_getrlimit ( 3, 0x52BDEBF8 )[sync] --> Success(0x0)
> --1721-- signal 11 arrived ... si_code=1, EIP=0x1BBEE9B9, eip=0xB1EBC279
> --1721-- SIGSEGV: si_code=1 faultaddr=0x52B5EBC0 tid=1 ESP=0x52B5EBC0
> seg=0x52BC7000-0x52C00000 fl=34 shad=0x52D00000-0xB0000000
> --1721-- -> extended stack base to 0x52B5E000
> ... lots more deleted ...
> --1721-- signal 11 arrived ... si_code=1, EIP=0x1BBEE9B9, eip=0xB1EBC279
> --1721-- SIGSEGV: si_code=1 faultaddr=0x50D5E440 tid=1 ESP=0x50D5E440
> seg=0x50DDE000-0x52C00000 fl=34 shad=0x52D00000-0xB0000000
> --1721-- -> extended stack base to 0x50D5E000
> SYSCALL[1721,1](183) sys_getcwd ( 0x52BDCAD0, 4095 )[sync] -->
> Success(0x1C)
>
> And the same snippet, from 3.2.1:
> --4712-- REDIR: 0x52EF6E0 (mallopt) redirected to 0x401BEB4 (mallopt)
> SYSCALL[4712,1](191) sys_getrlimit ( 3, 0xBEFD99B8 )[sync] --> Success(0x0)
> --4712-- signal 11 arrived ... si_code=1, EIP=0x43FCDC9, eip=0x65021323
> --4712-- SIGSEGV: si_code=1 faultaddr=0xBEF59970 tid=1 ESP=0xBEF59970
> seg=0xBDFFA000-0xBEFC1FFF
> --4712-- -> extended stack base to 0xBEF59000
> ... lots more ...
> --4712-- signal 11 arrived ... si_code=1, EIP=0x43FCDC9, eip=0x650214DC
> --4712-- SIGSEGV: si_code=1 faultaddr=0xBDFD93A0 tid=1 ESP=0xBDFD93A0
> seg=NULL
> --4712-- delivering signal 11 (SIGSEGV):1 to thread 1
> --4712-- push_signal_frame (thread 1): signal 11
> ==4712== at 0x43FCDC9: simpleRecurse (main.c:1570)
> --4712-- delivering signal 11 (SIGSEGV) to thread 1: on ALT STACK
> (0x54DF010-0x54E1010; 8192 bytes)
> SYSCALL[4712,1]( 8) sys_creat ( 0x96E95BD(vams_ms-stacktrace.dump), 438 )
> --> [async] ...
>
> There are 32 signals handled before the last one where seg=NULL. For that
> signal, the stack does not get extended, and the AUT gets the signal
> instead, and terminates.
>
> I've started poking around the valgrind code, but I haven't yet found where
> this limit is, or how to get around it.
>
> A+
> Paul
|
|
From: Paul F. <pa...@fr...> - 2006-10-20 09:53:27
|
>
> Are you sure you're not trying to allocate too much stuff onto the stack?
> 3.2.1 allows you a max stack of 16M and you get to segfault after that.
Hi
[Julian - sorry for reply-to-sender, just noticed as the message was sent]
It's quite possible that we're over 16M. I'll look into that.
I just noticed another thing. If I use --sanity-check=3 I get
# --30200:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
# --30200:0:aspacem NSegment{file, start=0x8048000, end=0x9A40FFF,
smode=SmFixed, dev=65043, ino=2147487009, offset=0, fnIdx=1, hasR=1, hasW=0,
hasX=1, hasT=0, mark=0, name="/view/PFView/vobs/adms/ixl-dbg/sim/aut"}
# --30200:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's
seg:
# --30200:0:aspacem start=0x8048000 end=0x9A40FFF prot=5 dev=63 ino=12330004
offset=0
name="/ccstores/viewstore_1/admsviews/pfloyd/PFView.vws/.s/00011/80000d2144ec700aaut"
# --30200:0:aspacem sync check at m_aspacemgr/aspacemgr.c:2031
(vgPlain_am_get_advisory): FAILED
# --30200:0:aspacem
# --30200:0:aspacem Valgrind: FATAL: aspacem assertion failed:
# --30200:0:aspacem VG_(am_do_sync_check)
(__PRETTY_FUNCTION__,__FILE__,__LINE__)
# --30200:0:aspacem at m_aspacemgr/aspacemgr.c:2031 (vgPlain_am_get_advisory)
# --30200:0:aspacem Exiting now.
Could this be related? It looks like this is due to the fact that I'm running
the test within ClearCase. No problem either here with valgrind 3.0.1.
A+
Paul
|
|
From: Tom H. <to...@co...> - 2006-10-20 10:08:33
|
In message <116...@im...>
Paul Floyd <pa...@fr...> wrote:
> I just noticed another thing. If I use --sanity-check=3 I get
>
> # --30200:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
> # --30200:0:aspacem NSegment{file, start=0x8048000, end=0x9A40FFF,
> smode=SmFixed, dev=65043, ino=2147487009, offset=0, fnIdx=1, hasR=1, hasW=0,
> hasX=1, hasT=0, mark=0, name="/view/PFView/vobs/adms/ixl-dbg/sim/aut"}
> # --30200:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's
> seg:
> # --30200:0:aspacem start=0x8048000 end=0x9A40FFF prot=5 dev=63 ino=12330004
> offset=0
> name="/ccstores/viewstore_1/admsviews/pfloyd/PFView.vws/.s/00011/80000d2144ec700aaut"
> # --30200:0:aspacem sync check at m_aspacemgr/aspacemgr.c:2031
> (vgPlain_am_get_advisory): FAILED
> # --30200:0:aspacem
> # --30200:0:aspacem Valgrind: FATAL: aspacem assertion failed:
> # --30200:0:aspacem VG_(am_do_sync_check)
> (__PRETTY_FUNCTION__,__FILE__,__LINE__)
> # --30200:0:aspacem at m_aspacemgr/aspacemgr.c:2031 (vgPlain_am_get_advisory)
> # --30200:0:aspacem Exiting now.
>
> Could this be related? It looks like this is due to the fact that I'm running
> the test within ClearCase. No problem either here with valgrind 3.0.1.
I think this is a separate issue. It indicates that valgrind's record
of the memory map does not match the kernel's record for some reason.
The problem here is the device and inode numbers, which probably
aren't the most critical bits, but it is odd that they don't match.
The kernel has the mapping as device 63 and inode 12330004 which
equates to:
/ccstores/viewstore_1/admsviews/pfloyd/PFView.vws/.s/00011/80000d2144ec700aaut
and valgrind has it as device 65043 and inode 2147487009 which equates
to:
/view/PFView/vobs/adms/ixl-dbg/sim/aut
So valgrind has it as a mapping from the ClearCase virtual file system
thing (from what I've heard of CC - I have no experience of it) while
the kernel view is perhaps of the real underlying file? Is /ccstores a
real physical path on disk?
It would be interesting if you could start a program (I assume aut is
the program you are tracing?) from the CC view and then cat /proc/pid/maps
for it and see what path is reported...
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2006-10-20 10:09:13
|
> It's quite possible that we're over 16M. I'll look into that.
You really shouldn't allocate lots of stuff on the stack -- it's
wildly unportable. However, if you really have to have a bigger
stack, find this at coregrind/m_main.c:2180
SizeT m16 = 16 * m1;
and change appropriately.
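Before patching anything, it may be worth checking how the shell's stack limit compares with the 16M figure. The sketch below reads the soft RLIMIT_STACK with getrlimit() and compares a requested size against a 16 MiB constant. SKETCH_STACK_CAP is a name invented here for the figure quoted in this thread; it is not a Valgrind identifier, and how Valgrind combines the two limits internally is not shown.

```c
#include <assert.h>
#include <sys/resource.h>

/* The 16M figure quoted in this thread for 3.2.1 (hypothetical name). */
#define SKETCH_STACK_CAP ((rlim_t)16 * 1024 * 1024)

/* Return the soft stack limit of the current process in bytes.
 * An unlimited stack is reported as RLIM_INFINITY. */
rlim_t stack_soft_limit(void)
{
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_STACK, &rl);

    assert(rc == 0); /* RLIMIT_STACK is a valid resource */
    (void)rc;
    return rl.rlim_cur;
}

/* Return 1 if a stack of wanted bytes would not fit under the cap. */
int exceeds_cap(rlim_t wanted)
{
    return wanted > SKETCH_STACK_CAP;
}
```

Calling exceeds_cap(stack_soft_limit()) then tells you whether the limit set in the shell is already larger than what the thread says 3.2.1 will allow.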
> I just noticed another thing. If I use --sanity-check=3 I get
>
> # --30200:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
> # --30200:0:aspacem NSegment{file, start=0x8048000, end=0x9A40FFF,
> smode=SmFixed, dev=65043, ino=2147487009, offset=0, fnIdx=1, hasR=1, hasW=0,
> hasX=1, hasT=0, mark=0, name="/view/PFView/vobs/adms/ixl-dbg/sim/aut"}
> # --30200:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's
> seg:
> # --30200:0:aspacem start=0x8048000 end=0x9A40FFF prot=5 dev=63 ino=12330004
> offset=0
> name="/ccstores/viewstore_1/admsviews/pfloyd/PFView.vws/.s/00011/80000d2144ec700aaut"
The device and inode numbers don't match; this could be caused by ClearCase
(or by NFS? I don't know), but in any case I think it's unrelated.
> running the test within ClearCase. No problem either here with valgrind
> 3.0.1.
That just means 3.0.1 doesn't check to the same degree as 3.2.1.
J
|
|
From: Ashley P. <as...@qu...> - 2006-10-20 10:04:32
|
On Fri, 2006-10-20 at 10:41 +0100, Julian Seward wrote:
> Are you sure you're not trying to allocate too much stuff onto the stack?
> 3.2.1 allows you a max stack of 16M and you get to segfault after that.
Is this limitation something that's here to stay? I've not spotted it
as a problem but I'm a few months behind head-of-tree currently.
Ashley,
|
|
From: Tom H. <to...@co...> - 2006-10-20 10:13:59
|
In message <1161338659.24137.20.camel@localhost.localdomain>
Ashley Pittman <as...@qu...> wrote:
> On Fri, 2006-10-20 at 10:41 +0100, Julian Seward wrote:
>> Are you sure you're not trying to allocate too much stuff onto the stack?
>> 3.2.1 allows you a max stack of 16M and you get to segfault after that.
>
> Is this limitation something that's here to stay? I've not spotted it
> as a problem but I'm a few months behind head-of-tree currently.
It's not a particularly new limitation - it has been in ever since
the address space manager rewrite.
Personally I always have a 10 MB stack limit set anyway so that
my debugger has some chance of unwinding the stack if I get a
runaway recursion.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2006-10-20 10:17:04
|
On Friday 20 October 2006 11:04, Ashley Pittman wrote:
> On Fri, 2006-10-20 at 10:41 +0100, Julian Seward wrote:
> > Are you sure you're not trying to allocate too much stuff onto the stack?
> > 3.2.1 allows you a max stack of 16M and you get to segfault after that.
>
> Is this limitation something that's here to stay? I've not spotted it
> as a problem but I'm a few months behind head-of-tree currently.
Well, it's been like that for quite a long time now, probably since
3.1.0 at least. It's not a recent change. Is it causing you or
potentially causing you a problem? I know that Fortran HPC folks
tend to complain about it from time to time as putting huge arrays
on the stack is apparently a culturally acceptable thing to do in
Fortran-world, but apart from that it doesn't seem to cause many problems.
J
|
|
From: Ashley P. <as...@qu...> - 2006-10-20 10:34:16
|
On Fri, 2006-10-20 at 11:16 +0100, Julian Seward wrote:
> On Friday 20 October 2006 11:04, Ashley Pittman wrote:
> > On Fri, 2006-10-20 at 10:41 +0100, Julian Seward wrote:
> > > Are you sure you're not trying to allocate too much stuff onto the stack?
> > > 3.2.1 allows you a max stack of 16M and you get to segfault after that.
> >
> > Is this limitation something that's here to stay? I've not spotted it
> > as a problem but I'm a few months behind head-of-tree currently.
>
> Well, it's been like that for quite a long time now, probably since
> 3.1.0 at least. It's not a recent change. Is it causing you or
> potentially causing you a problem? I know that Fortran HPC folks
> tend to complain about it from time to time as putting huge arrays
> on the stack is apparently a culturally acceptable thing to do in
> Fortran-world, but apart from that it doesn't seem to cause many problems.
It's exactly those people I was thinking of; it's not something I do
myself but I have seen stacks measured in GB. More of a potential
problem than a real one.
One reproducer I had sent to me recently was this one:
int main() {
    unsigned long eqs[1309732];
    g_elanBaseP = elan_baseInit(0);
    printf ("elan_baseInit returned: %#lx\n", g_elanBaseP);
    return 0;
}
As it turns out in this case the problem was a simple ulimit one but it
goes to show that people really do try some unexpected things.
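To put a number on that reproducer: the eqs array alone needs 1309732 times sizeof(unsigned long) bytes of stack. The small helper below, whose name is chosen here for illustration, does that arithmetic for a 4-byte and an 8-byte unsigned long, which is what one would typically expect on ia32 and on a 64-bit Linux target respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for count elements of elem_size bytes each. */
uint64_t array_bytes(uint64_t count, uint64_t elem_size)
{
    /* Guard against overflow of the multiplication. */
    assert(elem_size == 0 || count <= UINT64_MAX / elem_size);
    return count * elem_size;
}
```

array_bytes(1309732, 4) gives 5238928 bytes (about 5 MB) and array_bytes(1309732, 8) gives 10477856 bytes (about 10 MB), so a modest ulimit -s setting is enough to make the program fall over before valgrind is even involved.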
Valgrind seems to get this mostly right but does complain when
elan_baseInit() tries to read its parameter (a 64 bit int). It also
seems to get the line numbers wrong on ia64, base.c only has 864 lines,
857 is the } closing the elan_baseInit() function.
I'll try and get together a reproducer for this second issue.
On ia32 valgrind shows this:
==4376== Warning: client switching stacks? SP change: 0xBEFFF9E8 --> 0xBEB00934
==4376== to suppress, use: --max-stackframe=5238964 or greater
==4376== Invalid write of size 4
==4376== at 0x80484B2: main (test.c:8)
==4376== Address 0xBEB00934 is on thread 1's stack
==4376==
==4376== Invalid read of size 4
==4376== at 0x4063F00: elan_baseInit (base.c:593)
==4376== Address 0xBEB00934 is on thread 1's stack
==4376== Warning: set address range perms: large range 134225920 (readable)
==4376== Warning: set address range perms: large range 134225920 (readable)
==4376== Warning: set address range perms: large range 134217728 (readable)
==4376== Warning: client switching stacks? SP change: 0xBEB00930 --> 0xBEFFF9E8
==4376== to suppress, use: --max-stackframe=5238968 or greater
On ia64 it shows this:
==17920== Warning: client switching stacks? SP change: 0x7FF0008A0 --> 0x7FE602778
==17920== to suppress, use: --max-stackframe=10477864 or greater
==17920== Invalid write of size 8
==17920== at 0x400638: main (test.c:8)
==17920== Address 0x7FE602778 is on thread 1's stack
==17920== Warning: set address range perms: large range 134225920 (defined)
==17920== Warning: set address range perms: large range 134225920 (defined)
==17920== Warning: set address range perms: large range 134217728 (defined)
==17920==
==17920== Invalid read of size 8
==17920== at 0x4B5B5AC: elan_baseInit (base.c:857)
==17920== Address 0x7FE602778 is on thread 1's stack
==17920== Warning: client switching stacks? SP change: 0x7FE602780 --> 0x7FF0008A0
==17920== to suppress, use: --max-stackframe=10477856 or greater
==17920==
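The --max-stackframe values in these warnings appear to be simply the distance between the two stack pointer values. The sketch below recomputes that distance and also how much larger it is than the eqs array itself (1309732 elements of 4 or 8 bytes). Both function names are invented for this illustration, and the interpretation of the remainder as ordinary frame overhead is a guess.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute distance between two stack pointer values. */
uint64_t sp_change(uint64_t old_sp, uint64_t new_sp)
{
    return old_sp > new_sp ? old_sp - new_sp : new_sp - old_sp;
}

/* Bytes of the SP change not accounted for by the local array. */
uint64_t frame_slack(uint64_t sp_delta, uint64_t local_bytes)
{
    assert(sp_delta >= local_bytes);
    return sp_delta - local_bytes;
}
```

For the 32-bit run, sp_change(0xBEFFF9E8, 0xBEB00934) is 5238964, which is the array's 5238928 bytes plus 36. For the 64-bit run, sp_change(0x7FF0008A0, 0x7FE602778) is 10477864, which is the array's 10477856 bytes plus 8. In other words the "client switching stacks?" warning is triggered by one very large frame, not by an actual stack switch.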
Ashley,
|
|
From: Julian S. <js...@ac...> - 2006-10-20 10:39:43
|
> Valgrind seems to get this mostly right
Presumably if you give it --max-stackframe= as it requests then
it shuts up a bit.
> but does complain when
> elan_baseInit() tries to read its parameter (a 64 bit int). It also
> seems to get the line numbers wrong on ia64, base.c only has 864 lines,
> 857 is the } closing the elan_baseInit() function.
>
> I'll try and get together a reproducer for this second issue.
Ok. I take it you mean amd64 and not ia64 :-)
J
|
|
From: Ashley P. <as...@qu...> - 2006-10-20 11:00:52
|
On Fri, 2006-10-20 at 11:39 +0100, Julian Seward wrote:
> > Valgrind seems to get this mostly right
>
> Presumably if you give it --max-stackframe= as it requests then
> it shuts up a bit.
I'm assuming so. Our V tests take about 12 hours to run so we drive them
from a weekly cron job. I should be able to answer this question on
Monday.
> > but does complain when
> > elan_baseInit() tries to read its parameter (a 64 bit int). It also
> > seems to get the line numbers wrong on ia64, base.c only has 864 lines,
> > 857 is the } closing the elan_baseInit() function.
> >
> > I'll try and get together a reproducer for this second issue.
>
> Ok. I take it you mean amd64 and not ia64 :-)
Yes.
Ashley,
|
|
From: Paul F. <pa...@fr...> - 2006-10-20 11:35:38
|
Quoting Julian Seward <js...@ac...>:
>
> > Valgrind seems to get this mostly right
>
> Presumably if you give it --max-stackframe= as it requests then
> it shuts up a bit.
Doesn't the hard-coded 16M limit that you just mentioned override this? The
default (given by valgrind --help) is 2000000 or 32M.
A+
Paul
|