|
From: Richard C. <ric...@pr...> - 2005-08-08 16:48:52
|
Hi,

One of our tools fails some tests on one of our machines. The failures are consistent and reproducible: they are segmentation faults (signal 11), and in some cases I get a message from glibc:

*** glibc detected *** free(): invalid pointer: 0x082587f4 ***

These messages always show the same memory address.

The interesting thing is that when I run the same test under valgrind, I don't get any failures, and the tools pass the test as expected. Initially valgrind found a 'memcpy' with overlapping memory, which is now fixed, but other than that there were no issues when using --tool=memcheck.

The machine is relatively new (Intel Xeon, hyper-threading P4, 3.2GHz) and I have had problems keeping it cool; I get a lot of messages from syslogd about passing the temperature threshold.

I need to find out whether it is the overheating of the CPU or something else which causes the failure.

My question is: what could valgrind be doing that might stop the problem from occurring?

I've been running with --tool=memcheck; should I try something else?

Regards,

Richard

--
Richard Corden
|
|
From: Julian S. <js...@ac...> - 2005-08-08 17:10:36
|
> The machine is relatively new (Intel Xeon hyper threading P4 3.2GHz)
> and I have had problems keeping it cool, I do get a lot of messages from
> syslogd about passing temperature threshold.

Usually if you have overheating problems, it'll kill the system at unpredictable times and you wind up with either a complete hang of the system or a spontaneous reboot.

Anyway, are you sure? P4s have pretty sophisticated thermal management and fall back to half-speed or less if the temperature gets too high; if it gets worse they shut down completely.

> I need to find out if it is the overheating of the CPU or ?something
> else? which causes the failure?
>
> My question is, what could valgrind be doing that might stop the problem
> from occurring?

It might be worth trying to cut out some of the variables by trying an identical software setup on a different machine to see if that makes any difference.

If you are worried about the hardware, also run memtest86 for a couple of hours and see if it picks up any memory problems. It's very good at doing so (imo) and easy to use; for one thing, recent SuSE install CDs have it.

J
|
|
From: Dennis L. <pla...@tz...> - 2005-08-08 17:30:36
|
Hi,

Unfortunately valgrind cannot catch all errors; as described in some FAQs, some (quite rare) cases are missed. Although the address in your glibc message suggests a programming error, maybe you do have temperature problems (as you described). To check this, run memtest86+ to test your memory and prime95 (from mersenne.org) in torture-test mode to test the CPU. If neither reports errors, you are a bit on your own with the classical debugging techniques (gdb, examining core dumps, running the program with MALLOC_CHECK_ set, etc.).

greets
Dennis

--
Dennis Lubert <pla...@tz...>
|
|
From: Nicholas N. <nj...@cs...> - 2005-08-08 18:28:59
|
On Mon, 8 Aug 2005, Dennis Lubert wrote:
> unfortunately valgrind cannot catch all errors, as described in some
> FAQs, some (quite rare) cases are missed.

Memcheck should give a warning before any seg-fault-causing memory error, though.

Some programs do run differently under Valgrind, although it's not very common. For example, Memcheck pads the start and end of heap blocks with 16 extra bytes so that overruns/underruns are less likely to corrupt the heap metadata, which may allow a program to run successfully under Valgrind when it crashes normally. But Memcheck should give warnings about any such overruns/underruns, so it's unclear what the problem is.

N
|
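The padding effect Nicholas describes can be pictured with a toy model. This is only a sketch: it is not Memcheck's allocator or glibc's, and the layout, the names (kHeaderSize, overrun_corrupts_metadata) and the sizes other than the 16-byte padding are assumptions made up for illustration. Two blocks are laid out in one arena, and a "buggy" write runs past the end of the first payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: real allocators and Memcheck's replacement malloc are laid
// out differently. All names and sizes here are illustrative assumptions.
constexpr std::size_t kHeaderSize = 8;     // pretend per-block metadata
constexpr std::size_t kPayloadSize = 16;   // bytes the program asked for
constexpr unsigned char kHeaderByte = 0xAB;

// Lays out two adjacent blocks inside one arena:
//   [header0][pad][payload0][pad][header1][pad][payload1][pad]
// then writes `overrun` bytes past the end of payload0 and reports whether
// header1 (the neighbouring block's metadata) was damaged.
bool overrun_corrupts_metadata(std::size_t pad, std::size_t overrun) {
    const std::size_t payload0 = kHeaderSize + pad;
    const std::size_t header1 = payload0 + kPayloadSize + pad;
    std::vector<unsigned char> arena(
        header1 + kHeaderSize + pad + kPayloadSize + pad, 0);

    std::memset(arena.data(), kHeaderByte, kHeaderSize);
    std::memset(arena.data() + header1, kHeaderByte, kHeaderSize);

    // The "buggy program": writes kPayloadSize + overrun bytes into payload0.
    // The write stays inside the arena, so the model itself is well defined.
    assert(payload0 + kPayloadSize + overrun <= arena.size());
    std::memset(arena.data() + payload0, 0x00, kPayloadSize + overrun);

    for (std::size_t i = 0; i < kHeaderSize; ++i) {
        if (arena[header1 + i] != kHeaderByte) return true;
    }
    return false;
}
```

In this model a 4-byte overrun damages the neighbouring header when there is no padding, but lands harmlessly in the padding when 16 bytes are added, which matches the way a program can crash natively yet survive under Memcheck. Memcheck would still be expected to report the overrun itself.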
|
From: Richard C. <ric...@pr...> - 2005-08-11 08:55:31
|
Thanks to all who replied.

I already had a copy of memtest86 and I have run it a few times overnight without any memory failures.

I hadn't heard of prime95 before, but I ran it at higher priority for 5 hours without a failure. During the prime95 run there were many messages about the CPU temperature being high, but as Julian Seward mentioned, the CPU thermal management must be working.

The only other difference with this machine is that the HD is SATA, but I now consider it a long shot that the bug is not present in our code! :(

Again, thanks for the advice. When (if) I track down the problem I'll let you know, especially if it's something that valgrind may be able to catch.

Regards,

Richard

--
Richard Corden
Programming Research Ltd.
ric...@pr...
+ 44 845 0048478
|
|
From: Richard C. <ric...@pr...> - 2005-09-13 15:03:20
|
Hi,

> I need to find out if it is the overheating of the CPU or ?something
> else? which causes the failure?

I'm relatively certain now that overheating was not the cause of the failures.

We have found one issue in our code, and fixing it makes some of the tests pass: we were not closing all our files. We have some global 'ostream*' objects, so I'm not even sure whether their destructors would get called.

Can valgrind be run in a way which catches this kind of problem?

Regards,

Richard

--
Richard Corden
Programming Research Ltd.
ric...@pr...
+ 44 845 0048478
|
|
From: Nicholas N. <nj...@cs...> - 2005-09-13 15:06:12
|
On Tue, 13 Sep 2005, Richard Corden wrote:
> We have found one issue with our code which fixes some tests. We were not
> closing all our files. We have some global 'ostream*' - and so I'm not even
> sure if the destructor for them would get called.
>
> Can valgrind be run in a way which catches this kind of problem?

--track-fds=yes turns on file descriptor tracking. That might be what you're after.

Nick
|
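For the global 'ostream*' case raised above, one possible shape of a fix is sketched below. It assumes C++14 and uses hypothetical names (g_report, open_report, write_report, close_report) that are not from Richard's code. The idea is that a stream owned by a smart pointer gets its destructor run, which flushes buffered output and closes the file, whereas a raw pointer that is never deleted does neither.

```cpp
#include <cassert>
#include <fstream>
#include <memory>
#include <string>

// Hypothetical stand-in for a raw global 'std::ostream*' that is never
// deleted. Owning the stream through std::unique_ptr means its destructor
// runs, which flushes buffered output and closes the underlying file.
std::unique_ptr<std::ofstream> g_report;

void open_report(const std::string& path) {
    g_report = std::make_unique<std::ofstream>(path);
    assert(g_report->is_open());
}

void write_report(const std::string& text) {
    assert(g_report && "open_report() must be called first");
    *g_report << text;
}

// Explicit close point; without it the stream is still destroyed (and the
// file closed) when static objects are torn down at normal program exit.
void close_report() {
    g_report.reset();
}
```

Running a test binary as `valgrind --track-fds=yes ./your_test` (the binary name is a placeholder) lists file descriptors that are still open at exit, so it is one way to check whether a change like this actually closes everything.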
|
From: Richard C. <ric...@pr...> - 2005-09-21 11:09:00
|
Problem solved!!!!!
I've now found the cause of the problem - and I offer my sins up to you
in order to see if I could have used valgrind in a way which would have
caught the problem.
I have a cache for a routine, which used a hash based on a memory
location to see if it had already handled that particular item before.
The code was:
int getKey (); // returns &m_data or something similar
const int cache_size = 127;
static CacheType cache[cache_size];
int entry = (getKey () >> 2) % cache_size;
if (cache[entry] != 0) { ... }
The code didn't fail in valgrind as '&m_data' is apparently always less
than INT_MAX there. However, when running on the machine normally,
sometimes '&m_data' was above INT_MAX, so the key became negative once
stored in an 'int' and 'entry' was negative too.
It might be asking a lot to have valgrind catch this - but I am
overjoyed that I've finally found the issue!
All the best!
Richard
--
Richard Corden
Programming Research Ltd.
ric...@pr...
+ 44 845 0048478
|
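The negative index Richard describes can be reproduced on any machine with a small sketch. It assumes, as the message suggests, a 32-bit build where getKey() squeezes an address into a signed 'int'; the function names (buggy_entry, fixed_entry) and the sample address are invented for illustration, and fixed_entry is only one possible repair, not necessarily the one Richard used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kCacheSize = 127;

// Mirrors the original expression on a 32-bit machine, where the key is an
// address stored in a signed 32-bit int.
int buggy_entry(std::uint32_t address) {
    // Addresses above INT_MAX wrap to a negative value here.
    const std::int32_t key = static_cast<std::int32_t>(address);
    // '>>' on a negative value keeps the sign on mainstream compilers
    // (guaranteed since C++20), and '%' takes the sign of its left operand,
    // so the result can be a negative array index.
    return (key >> 2) % kCacheSize;
}

// Sketch of a fix: keep the address unsigned from start to finish, so the
// remainder always lies in [0, kCacheSize).
std::size_t fixed_entry(std::uintptr_t address) {
    const std::size_t entry =
        static_cast<std::size_t>((address >> 2) % kCacheSize);
    assert(entry < static_cast<std::size_t>(kCacheSize));
    return entry;
}
```

With an address in the upper half of a 32-bit address space, such as 0xBFFF0000, buggy_entry returns a negative value while fixed_entry stays in range. A negative index into a static array like 'cache' reads neighbouring static data rather than heap memory, and Memcheck does not bounds-check static arrays, which is probably why it stayed silent even when the bad access happened.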