|
From: Tom H. <to...@co...> - 2007-12-06 00:11:15
|
I get an assertion in memcheck trying to valgrind a windows program under wine. This is the current trunk code with a tweaked version of the patch from http://wiki.winehq.org/Wine_and_Valgrind applied. I'm just running "valgrind --trace-children=yes notepad" and I get: Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed. Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC) ==9850== at 0x38018ACD: report_and_quit (m_libcassert.c:140) ==9850== by 0x8: ??? ==9850== by 0x9: ??? ==9850== by 0x9: ??? ==9850== by 0x40000005: ??? ==9850== by 0xFFFFFFFE: ??? Any ideas? Tom -- Tom Hughes (to...@co...) http://www.compton.nu/ |
|
From: Julian S. <js...@ac...> - 2007-12-06 00:22:52
|
Hmm, Memcheck does very occasionally keel over like that, but the reason for it has never been established. Is it repeatable? Were there any invalid reads/writes prior to this point, that might have trashed any Memcheck-specific data structures? Nick may well have something more constructive to add. J On Thursday 06 December 2007 01:11, Tom Hughes wrote: > I get an assertion in memcheck trying to valgrind a windows program > under wine. This is the current trunk code with a tweaked version of > the patch from > http://wiki.winehq.org/Wine_and_Valgrind applied. > > I'm just running "valgrind --trace-children=yes notepad" and I get: > > Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed. > Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC) > > ==9850== at 0x38018ACD: report_and_quit (m_libcassert.c:140) > ==9850== by 0x8: ??? > ==9850== by 0x9: ??? > ==9850== by 0x9: ??? > ==9850== by 0x40000005: ??? > ==9850== by 0xFFFFFFFE: ??? > > Any ideas? > > Tom |
|
From: Nicholas N. <nj...@cs...> - 2007-12-06 01:15:01
|
On Thu, 6 Dec 2007, Tom Hughes wrote: > I get an assertion in memcheck trying to valgrind a windows program > under wine. This is the current trunk code with a tweaked version of > the patch from > http://wiki.winehq.org/Wine_and_Valgrind applied. > > I'm just running "valgrind --trace-children=yes notepad" and I get: > > Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed. > Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC) > > ==9850== at 0x38018ACD: report_and_quit (m_libcassert.c:140) > ==9850== by 0x8: ??? > ==9850== by 0x9: ??? > ==9850== by 0x9: ??? > ==9850== by 0x40000005: ??? > ==9850== by 0xFFFFFFFE: ??? > > Any ideas? It's a problem with the secondary V bits table in Memcheck. That table holds the full V bits for all memory bytes that are partially defined. It's happened a couple of times, but always in situations that are impossible for me to reproduce. If you are able to reduce it to a small test, or are able to do any debugging yourself, that would be very helpful. One thing you could do is try commenting out all these lines and see if it goes away. If it does, you could then experiment with which combinations make it go away. #define PERF_FAST_LOADV 1 #define PERF_FAST_STOREV 1 #define PERF_FAST_SARP 1 #define PERF_FAST_STACK 1 #define PERF_FAST_STACK2 1 Nick |
|
From: Tom H. <to...@co...> - 2007-12-06 09:34:39
|
On Dec 6, 2007 1:14 AM, Nicholas Nethercote <nj...@cs...> wrote: > On Thu, 6 Dec 2007, Tom Hughes wrote: > > > I get an assertion in memcheck trying to valgrind a windows program > > under wine. This is the current trunk code with a tweaked version of > > the patch from > > http://wiki.winehq.org/Wine_and_Valgrind applied. > > > > I'm just running "valgrind --trace-children=yes notepad" and I get: > > > > Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed. > > Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC) > > > > ==9850== at 0x38018ACD: report_and_quit (m_libcassert.c:140) > > ==9850== by 0x8: ??? > > ==9850== by 0x9: ??? > > ==9850== by 0x9: ??? > > ==9850== by 0x40000005: ??? > > ==9850== by 0xFFFFFFFE: ??? > > > > Any ideas? > > It's a problem with the secondary V bits table in Memcheck. That table > holds the full V bits for all memory bytes that are partially defined. > It's happened a couple of times, but always in situations that are > impossible for me to reproduce. If you are able to reduce it to a small > test, or are able to do any debugging yourself, that would be very helpful. It is 100% repeatable for me but, interestingly, only on my machine at home. My machine at work doesn't have the same problem. Both are x86_64 machines with two cores and 4Gb of memory and both are running Fedora 8! > One thing you could do is try commenting out all these lines and see if it > goes away. If it does, you could then experiment with which combinations > make it go away. > > #define PERF_FAST_LOADV 1 > #define PERF_FAST_STOREV 1 > > #define PERF_FAST_SARP 1 > > #define PERF_FAST_STACK 1 > #define PERF_FAST_STACK2 1 I've commented them all out and it makes no difference. There are also no bad writes reported beforehand, which rules out Julian's suggestion. Any suggestions for the best way to debug it? Tom -- Tom Hughes (to...@co...) http://www.compton.nu/ |
|
From: Julian S. <js...@ac...> - 2007-12-06 15:18:13
|
> It is 100% repeatable for me but, interestingly, only on my machine at > home. My machine at work doesn't have the same problem. Both are > x86_64 machines with two cores and 4Gb of memory and both are running > Fedora 8! This probably sounds like a really dumbass question, but .. do the two machines have the same CPUs? I ask because just recently Christoph Bartoschek (IIRC) reported some strangeness to do with memory corruption and CPUs. Or something. Christoph, can you summarise what it is you found? J |
|
From: Tom H. <to...@co...> - 2007-12-06 15:26:47
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
>> It is 100% repeatable for me but, interestingly, only on my machine at
>> home. My machine at work doesn't have the same problem. Both are
>> x86_64 machines with two cores and 4Gb of memory and both are running
>> Fedora 8!
>
> This probably sounds like a really dumbass question, but .. do the two
> machines have the same CPUs? I ask because just recently Christoph
> Bartoschek (IIRC) reported some strangeness to do with memory
> corruption and CPUs. Or something. Christoph, can you summarise
> what it is you found?
No - the home machine (where it is failing) is lot newer. It has:
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping : 2
cpu MHz : 1000.000
cache size : 512 KB
The work machine has two single core Opterons:
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 250
stepping : 10
cpu MHz : 2393.189
cache size : 1024 KB
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Christoph B. <bar...@or...> - 2007-12-06 15:35:05
Attachments:
excess.C
|
Am Donnerstag, 6. Dezember 2007 schrieb Julian Seward: > > It is 100% repeatable for me but, interestingly, only on my machine at > > home. My machine at work doesn't have the same problem. Both are > > x86_64 machines with two cores and 4Gb of memory and both are running > > Fedora 8! > > This probably sounds like a really dumbass question, but .. do the two > machines have the same CPUs? I ask because just recently Christoph > Bartoschek (IIRC) reported some strangeness to do with memory > corruption and CPUs. Or something. Christoph, can you summarise > what it is you found? We could resolve the strangeness by updating the linux kernel. The kernel comming with opensuse 10.2 had the error. Updating Opensuse to 10.3 resolved the issues. Our sysadmin tells me that all distributions with kernels smaller than or equal to 2.6.18 showed the error. I have attached the testprogram that forces the error. This program forces the error after running about 10 - 60 minutes. If it runs longer than 60 minutes it does not fail. The program just allocates a big array and increments the values inside. The program fails if the expected value is not in a memory cell. The program shows the error when it is run natively. It is not run as a valgrind client. I've only used in our machines with 64 GiB of memory and used 67 as the parameter. I would expect that you have to change the program to allocate only 4.1 GiB on a 4 GiB machine. Currently it accepts only full GiB steps. You should also adapt the number of threads to the available number of cores. Greetings Christoph |
|
From: Tom H. <to...@co...> - 2007-12-06 15:39:21
|
In message <200...@or...>
Christoph Bartoschek <bar...@or...> wrote:
> Am Donnerstag, 6. Dezember 2007 schrieb Julian Seward:
>> > It is 100% repeatable for me but, interestingly, only on my machine at
>> > home. My machine at work doesn't have the same problem. Both are
>> > x86_64 machines with two cores and 4Gb of memory and both are running
>> > Fedora 8!
>>
>> This probably sounds like a really dumbass question, but .. do the two
>> machines have the same CPUs? I ask because just recently Christoph
>> Bartoschek (IIRC) reported some strangeness to do with memory
>> corruption and CPUs. Or something. Christoph, can you summarise
>> what it is you found?
>
> We could resolve the strangeness by updating the linux kernel.
> The kernel comming with opensuse 10.2 had the error. Updating Opensuse to 10.3
> resolved the issues. Our sysadmin tells me that all distributions with
> kernels smaller than or equal to 2.6.18 showed the error.
Both the machines in question are running 2.6.23.1-49.fc8 kernels.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Nicholas N. <nj...@cs...> - 2007-12-06 21:43:28
|
On Thu, 6 Dec 2007, Tom Hughes wrote: >>> Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed. >>> Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC) >> >> It's a problem with the secondary V bits table in Memcheck. That table >> holds the full V bits for all memory bytes that are partially defined. >> It's happened a couple of times, but always in situations that are >> impossible for me to reproduce. If you are able to reduce it to a small >> test, or are able to do any debugging yourself, that would be very helpful. > > It is 100% repeatable for me but, interestingly, only on my machine at > home. My machine at work doesn't have the same problem. Both are > x86_64 machines with two cores and 4Gb of memory and both are running > Fedora 8! > [...] > Any suggestions for the best way to debug it? The relevant code starts with this line, around line 838: /* --------------- Secondary V bit table ------------ */ It's a fairly basic data structure, the only notable thing is that we periodically garbage collect (GC) it, ie. remove stale nodes. The easy first thing to try is to turn off the GC, ie. make gcSecVBitTable() do nothing. If that makes the problem go away, then we know that the GC is removing nodes it shouldn't. It might also be useful if you can run with -v. The "memcheck GC" lines indicate when GCs are happening. Nick |
|
From: Tom H. <to...@co...> - 2007-12-06 22:50:39
|
In message <Pin...@mu...>
Nicholas Nethercote <nj...@cs...> wrote:
> On Thu, 6 Dec 2007, Tom Hughes wrote:
>
> >>> Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed.
> >>> Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC)
> >>
> >> It's a problem with the secondary V bits table in Memcheck. That table
> >> holds the full V bits for all memory bytes that are partially defined.
> >> It's happened a couple of times, but always in situations that are
> >> impossible for me to reproduce. If you are able to reduce it to a small
> >> test, or are able to do any debugging yourself, that would be very helpful.
> >
> > It is 100% repeatable for me but, interestingly, only on my machine at
> > home. My machine at work doesn't have the same problem. Both are
> > x86_64 machines with two cores and 4Gb of memory and both are running
> > Fedora 8!
> > [...]
> > Any suggestions for the best way to debug it?
>
> The relevant code starts with this line, around line 838:
>
> /* --------------- Secondary V bit table ------------ */
>
> It's a fairly basic data structure, the only notable thing is that we
> periodically garbage collect (GC) it, ie. remove stale nodes. The easy
> first thing to try is to turn off the GC, ie. make gcSecVBitTable() do
> nothing. If that makes the problem go away, then we know that the GC is
> removing nodes it shouldn't.
Well I made that function do nothing, and it still happens...
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2007-12-07 00:54:39
|
Here's some background. Apologies if you know this already.
The "secondary V bits table" holds V (definedness) bits for selected
few parts of the process' address space. Just the parts of the
address space where bytes are partially defined, that is, neither
completely undefined nor completely defined. There are relatively
few of these.
The table (secVBitTable) is actually an OSet, essentially an AVL
tree which maps guest addresses to the V bits for that address.
Because it would be rather wasteful of space to have one tree node
for each partially defined byte in the address space, instead
each node contains the definedness data for BYTES_PER_SEC_VBIT_NODE
(16) bytes at a time. Accordingly the associated OSet key is
rounded down to the nearest 16 byte boundary.
Memcheck is bombing in "get_sec_vbits8(Addr a)" because, following
consultation of other data structures, it has determined that the
byte at "a" is partially defined, so it needs to look up in
said table, its exact definedness info. Problem is there is no
entry in the table.
That means, either:
1. no entry was ever made for "a"
(really, for VG_ROUNDDN(a, BYTES_PER_SEC_VBIT_NODE)), or
2. there was an entry, but it has since been deleted, or
3. some other snafu.
Let's chase (1) first: in set_sec_vbits8 I'd add
VG_(printf)("setting line %p\n", aAligned)
let it run, presumably accumulating a large log file. When it borks,
have a look in the log file, to see if the aAligned causing the assertion in
get_sec_vbits8 was actually entered in the first place. Yell if that
don't make sense.
If that looks OK (iow, there is at least one corresponding log file
entry), let's consider 2. Periodically gcSecVBitTable() walks over
said AVL tree. If all 16 bytes in a given chunk are completely
defined or completely undefined, then the chunk is redundant, and
can be deleted from the tree. There are two complications, though:
(a) we don't want to be chucking lines out of the tree too
enthusiastically, for performance reasons. So a line has to
have no part-defined bytes for MAX_STALE_AGE consecutive
checks before it gets dumped.
(b) we can't delete nodes from the tree at the same time we're
iterating over it (using VG_(OSet_Next)). So instead, the
survivor lines are copied into a new tree (OSet) and the old one
is nuked afterwards.
So anyway, you see "if ( keep )" at line 918. In the case (!keep),
add a printf to show "n->a" of the line being dumped and see if
any dumped line matches the missing one causing the assertion failure.
Hmm, on rereading previous messages, all of (2) is irrelevant if
you disabled gcSecVBitTable and the problem still exists. So
it's either 1. or 3. Can you at least try 1. ?
J
On Thursday 06 December 2007 22:43, Nicholas Nethercote wrote:
> On Thu, 6 Dec 2007, Tom Hughes wrote:
> >>> Memcheck: mc_main.c:957 (get_sec_vbits8): Assertion 'n' failed.
> >>> Memcheck: get_sec_vbits8: no node for address 0x6FA9EA0 (0x6FA9EAC)
> >>
> >> It's a problem with the secondary V bits table in Memcheck. That table
> >> holds the full V bits for all memory bytes that are partially defined.
> >> It's happened a couple of times, but always in situations that are
> >> impossible for me to reproduce. If you are able to reduce it to a small
> >> test, or are able to do any debugging yourself, that would be very
> >> helpful.
> >
> > It is 100% repeatable for me but, interestingly, only on my machine at
> > home. My machine at work doesn't have the same problem. Both are
> > x86_64 machines with two cores and 4Gb of memory and both are running
> > Fedora 8!
> > [...]
> > Any suggestions for the best way to debug it?
>
> The relevant code starts with this line, around line 838:
>
> /* --------------- Secondary V bit table ------------ */
>
> It's a fairly basic data structure, the only notable thing is that we
> periodically garbage collect (GC) it, ie. remove stale nodes. The easy
> first thing to try is to turn off the GC, ie. make gcSecVBitTable() do
> nothing. If that makes the problem go away, then we know that the GC is
> removing nodes it shouldn't.
>
> It might also be useful if you can run with -v. The "memcheck GC" lines
> indicate when GCs are happening.
>
> Nick
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by:
> Check out the new SourceForge.net Marketplace.
> It's the best place to buy or sell services for
> just about anything Open Source.
> http://sourceforge.net/services/buy/index.php
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers
|
|
From: Nicholas N. <nj...@cs...> - 2007-12-07 03:07:05
|
On Fri, 7 Dec 2007, Julian Seward wrote: > 1. no entry was ever made for "a" > (really, for VG_ROUNDDN(a, BYTES_PER_SEC_VBIT_NODE)), or > > 2. there was an entry, but it has since been deleted, or > > 3. some other snafu. > > Hmm, on rereading previous messages, all of (2) is irrelevant if > you disabled gcSecVBitTable and the problem still exists. So > it's either 1. or 3. Can you at least try 1. ? Hopefully it's 1. If it's 3, then the secVbits set/get calls match up, which (I think) means the secVBits table is getting corrupted somehow. Or maybe the set/get calls themselves are buggy? Nick |
|
From: Tom H. <to...@co...> - 2007-12-07 14:06:29
|
On Dec 7, 2007 12:53 AM, Julian Seward <js...@ac...> wrote:
> That means, either:
>
> 1. no entry was ever made for "a"
> (really, for VG_ROUNDDN(a, BYTES_PER_SEC_VBIT_NODE)), or
>
> 2. there was an entry, but it has since been deleted, or
>
> 3. some other snafu.
>
> Let's chase (1) first: in set_sec_vbits8 I'd add
> VG_(printf)("setting line %p\n", aAligned)
> let it run, presumably accumulating a large log file. When it borks,
> have a look in the log file, to see if the aAligned causing the assertion in
> get_sec_vbits8 was actually entered in the first place. Yell if that
> don't make sense.
Done that, and it looks like it is being created - first we get this:
--31740-- setting line 0x75D0EA0
and then a bit later this:
Memcheck: mc_main.c:959 (get_sec_vbits8): Assertion 'n' failed.
Memcheck: get_sec_vbits8: no node for address 0x75D0EA0 (0x75D0EAC)
==31740== at 0x3801444D: report_and_quit (m_libcassert.c:140)
> Hmm, on rereading previous messages, all of (2) is irrelevant if
> you disabled gcSecVBitTable and the problem still exists. So
> it's either 1. or 3. Can you at least try 1. ?
The GC is disabled, so it shouldn't be that..
It's getting odder and odder really.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Christoph B. <bar...@or...> - 2007-12-07 14:15:30
|
Am Freitag, 7. Dezember 2007 schrieb Tom Hughes: > > Done that, and it looks like it is being created - first we get this: > > --31740-- setting line 0x75D0EA0 > > and then a bit later this: > > Memcheck: mc_main.c:959 (get_sec_vbits8): Assertion 'n' failed. > Memcheck: get_sec_vbits8: no node for address 0x75D0EA0 (0x75D0EAC) > > ==31740== at 0x3801444D: report_and_quit (m_libcassert.c:140) > > > Hmm, on rereading previous messages, all of (2) is irrelevant if > > you disabled gcSecVBitTable and the problem still exists. So > > it's either 1. or 3. Can you at least try 1. ? > > The GC is disabled, so it shouldn't be that.. > > It's getting odder and odder really. You could run valgrind in the debugger. And set a watchpoint to the node that holds the address 0x75D0EA0 and its parent in the AVL tree. At some point the link from the root node to the node seems to get lost. This only works if the address is constant between two runs of valgrind. If this is the case you could also search for the address in the AVL tree after all modification steps and use binary search over time to find the location where it gets lost. Christoph |
|
From: Tom H. <to...@co...> - 2007-12-07 15:01:29
|
On Dec 7, 2007 2:06 PM, Tom Hughes <to...@co...> wrote:
> On Dec 7, 2007 12:53 AM, Julian Seward <js...@ac...> wrote:
>
> > That means, either:
> >
> > 1. no entry was ever made for "a"
> > (really, for VG_ROUNDDN(a, BYTES_PER_SEC_VBIT_NODE)), or
> >
> > 2. there was an entry, but it has since been deleted, or
> >
> > 3. some other snafu.
> >
> > Let's chase (1) first: in set_sec_vbits8 I'd add
> > VG_(printf)("setting line %p\n", aAligned)
> > let it run, presumably accumulating a large log file. When it borks,
> > have a look in the log file, to see if the aAligned causing the assertion in
> > get_sec_vbits8 was actually entered in the first place. Yell if that
> > don't make sense.
>
> Done that, and it looks like it is being created - first we get this:
>
> --31740-- setting line 0x75D0EA0
>
> and then a bit later this:
>
> Memcheck: mc_main.c:959 (get_sec_vbits8): Assertion 'n' failed.
> Memcheck: get_sec_vbits8: no node for address 0x75D0EA0 (0x75D0EAC)
>
> ==31740== at 0x3801444D: report_and_quit (m_libcassert.c:140)
>
> > Hmm, on rereading previous messages, all of (2) is irrelevant if
> > you disabled gcSecVBitTable and the problem still exists. So
> > it's either 1. or 3. Can you at least try 1. ?
>
> The GC is disabled, so it shouldn't be that..
>
> It's getting odder and odder really.
It seems that the problem is that the AVL tree is getting out of
order. I made get_sec_vbits8 walk the oset when it detects the problem
and dump the addresses to the log and this is what I get:
0x28D448D0
0x28D44950
...
0x28E81BF0
0x28E8C910
0x7F22ECE0
0x7F22ED30
...
0x7F22F930
0x7F22F960
0xFEDD6D30
0xFEDD6D70
0x75D0EA0
At that point it stopped as it noticed the address going backwards.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Tom H. <to...@co...> - 2007-12-07 15:15:36
|
On Dec 7, 2007 3:01 PM, Tom Hughes <to...@co...> wrote: > It seems that the problem is that the AVL tree is getting out of > order. I made get_sec_vbits8 walk the oset when it detects the problem > and dump the addresses to the log and this is what I get: > > 0x28D448D0 > 0x28D44950 > ... > 0x28E81BF0 > 0x28E8C910 > 0x7F22ECE0 > 0x7F22ED30 > ... > 0x7F22F930 > 0x7F22F960 > 0xFEDD6D30 > 0xFEDD6D70 > 0x75D0EA0 > > At that point it stopped as it noticed the address going backwards. I've now made set_sec_vbits8 dump the tree before and after setting the line and it looks very simple: 0xFEC7CD30 0xFEC7CD70 --11681-- setting line 0x75D0EA0 0xFEC7CD30 0xFEC7CD70 0x75D0EA0 So we have a tree with 2 notes, insert a third and get an unordered tree out. Somehow VG_(OSetGen_Lookup) still works at that point though... Tom -- Tom Hughes (to...@co...) http://www.compton.nu/ |
|
From: Julian S. <js...@ac...> - 2007-12-07 18:20:45
|
> I've now made set_sec_vbits8 dump the tree before and after setting > the line and it looks very simple: > > 0xFEC7CD30 > 0xFEC7CD70 > --11681-- setting line 0x75D0EA0 > 0xFEC7CD30 > 0xFEC7CD70 > 0x75D0EA0 > > So we have a tree with 2 notes, insert a third and get an unordered tree > out. Sheesh. That's very strange, because the AVL stuff has been intensively hammered on since it was installed, well over a year ago. Can you send your tree-check routine? I am curious to use it in mc_expensive_sanity_check to see if it shows the problem happening but undetected on any other programs. I had a brief check through m_oset.c, looking for word size and signedness issues to do with fast-case comparisons of keys (as is used here), but saw nothing suspicious. J |
|
From: Nicholas N. <nj...@cs...> - 2007-12-07 23:52:53
|
On Fri, 7 Dec 2007, Julian Seward wrote: > I had a brief check through m_oset.c, looking for word size and signedness > issues to do with fast-case comparisons of keys (as is used here), but > saw nothing suspicious. I think it is a problem with the fast comparison. I've reproduced the bug, and when I added an explicit slow comparison function, it behaves correctly. I'll keep looking... N |
|
From: Julian S. <js...@ac...> - 2007-12-07 23:58:04
|
> I think it is a problem with the fast comparison. > I've reproduced the bug, Coolness. How did you manage that? J |
|
From: Tom H. <to...@co...> - 2007-12-08 00:40:23
|
In message <Pin...@mu...>
Nicholas Nethercote <nj...@cs...> wrote:
> On Fri, 7 Dec 2007, Julian Seward wrote:
>
> > I had a brief check through m_oset.c, looking for word size and signedness
> > issues to do with fast-case comparisons of keys (as is used here), but
> > saw nothing suspicious.
>
> I think it is a problem with the fast comparison. I've reproduced the bug,
> and when I added an explicit slow comparison function, it behaves correctly.
> I'll keep looking...
I came to that conclusion as well, but had to go out before I had a
chance to send another mail.
Thinking about it as I was walking to the station I suddenly realised
that the fast case comparison is treating the addresses as Word types
which are signed, so the high address in this case becomes negative
and appears to come before the other address.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Nicholas N. <nj...@cs...> - 2007-12-08 00:51:03
|
On Sat, 8 Dec 2007, Tom Hughes wrote:
>> I think it is a problem with the fast comparison. I've reproduced the bug,
>> and when I added an explicit slow comparison function, it behaves correctly.
>> I'll keep looking...
>
> I came to that conclusion as well, but had to go out before I had a
> chance to send another mail.
>
> Thinking about it as I was walking to the station I suddenly realised
> that the fast case comparison is treating the addresses as Word types
> which are signed, so the high address in this case becomes negative
> and appears to come before the other address.
The thing is, as long as the comparison is always done the same way, it
should work. The ordering won't be right if we treat the values as
unsigned addresses, but they will be right if we treat the values as signed
words.
I thought I had reproduced the bug, but all I've reproduced is the
unexpected ordering -- I can't actually get the assertion failure on the
lookup.
AFAICT, 'fast_cmp' and 'avl_lookup' are the only two functions that do the
fast comparisons. And they seem to be doing the same thing -- they treat
the words as signed, do a subtraction, the result is treated as signed, and
the result is checked for < 0 or > 0.
Tom, can you try the following. In createSecVBitTable(), change the NULL
parameter passed to OSetGen_Create to 'mycmp', and define 'mycmp' as
follows:
Word mycmp( void* key, void* elem )
{
Addr a1 = *(Addr*)key;
SecVBitNode* n = (SecVBitNode*)elem;
Addr a2 = n->a;
if (a1 < a2) return -1;
if (a1 > a2) return 1;
return 0;
}
This does give the ordering that we expect for Addrs. Does the assertion
still occur in this case?
Nick
|
|
From: Tom H. <to...@co...> - 2007-12-08 01:18:29
|
On Dec 8, 2007 12:50 AM, Nicholas Nethercote <nj...@cs...> wrote:
> Tom, can you try the following. In createSecVBitTable(), change the NULL
> parameter passed to OSetGen_Create to 'mycmp', and define 'mycmp' as
> follows:
>
> Word mycmp( void* key, void* elem )
> {
> Addr a1 = *(Addr*)key;
> SecVBitNode* n = (SecVBitNode*)elem;
> Addr a2 = n->a;
> if (a1 < a2) return -1;
> if (a1 > a2) return 1;
> return 0;
> }
>
> This does give the ordering that we expect for Addrs. Does the assertion
> still occur in this case?
No - that fixes the original assertion that I was seeing.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
> they treat > the words as signed, do a subtraction, the result is treated as signed, and > the result is checked for < 0 or > 0. If the subtraction overflows then it is incorrect to compare the "result" (while forgetting about the Carry) to 0. -- |
|
From: Nicholas N. <nj...@cs...> - 2007-12-08 05:41:50
Attachments:
diff
|
On Sat, 8 Dec 2007, Julian Seward wrote: >>> they treat >>> the words as signed, do a subtraction, the result is treated as signed, >>> and the result is checked for < 0 or > 0. >> >> If the subtraction overflows then it is incorrect to compare the "result" >> (while forgetting about the Carry) to 0. > > Hmm, yes. Wasn't there a giant storm in a teacup on the gcc list a while > back about whether gcc should optimise on the basis that overflow of signed > arithmetic is undefined? I don't understand why the carry bit is important... But here's a patch that should fix the problem. It uses a slower but safer and easier-to-understand method to compute the result. Tom, can you apply and see if it fixes the problem? If so, I guess it should be committed for 3.3.0. Nick |
> --- coregrind/m_oset.c (revision 7273)
> +++ coregrind/m_oset.c (working copy)
> @@ -175,7 +175,15 @@
> // Compare the first word of each element. Inlining is *crucial*.
> static inline Word fast_cmp(void* k, AvlNode* n)
> {
> - return ( *(Word*)k - *(Word*)elem_of_node(n) );
> + Word w1 = *(Word*)k;
> + Word w2 = *(Word*)elem_of_node(n);
> + // In previous versions, we tried to do this faster by doing
> + // "return w1 - w2". But it didn't always work, due to the carry bit
> + // when the subtraction overflowed(?) The branching version is slightly
> + // slows, but safer and easier to understand.
> + if (w1 > w2) return 1;
> + if (w1 < w2) return -1;
> + return 0;
> }
For a closed subroutine of which the callers have no knowledge other than
arguments and result, and the specific values -1, 0, +1 are important:
return (w1 > w2) - (w1 < w2);
which looks cute, and fully exploits the value of a relational expression
(1 if true, 0 if false). On i686 a good compiler should generate
movl w1,%eax
cmpl w2,%eax # w1 vs w2 (AT&T syntax reverses args to 'cmp')
sgt %al # %al= (w1 > w2); /* use 'sa' for unsigned */
slt %cl # %cl= (w1 < w2); /* use 'sb' for unsigned */
subb %cl,%al # %al -= %cl; /* 8-bit arithmetic */
movsbl %al,%eax # sign extend 8 to 32 bits /* does not change CC */
featuring no branches (therefore no penalties for mis-prediction)
but requiring two registers.
If you are restricted to only one register, but addressing is inexpensive
and the CPU+compiler has a conditional move instruction:
Word rv = 0;
if (w1 > w2) rv = 1;
if (w1 < w2) rv = -1;
return rv;
which becomes
.rodata
zero: Int 0
pos1: Int +1
neg1: Int -1
.text
movl w1,%eax
cmpl w2,%eax # w1 vs w2
movl zero,%eax # assume (w1==w2) /* no change in CC */
cmovgt pos1,%eax # if (w1 > w2) then %eax = +1; /* no change in CC */
cmovlt neg1,%eax # if (w1 < w2) then %eax = -1; /* no change in CC */
If the result of fast_cmp is used to index an array, then either method
above works well.
However, the declaration says "static inline" and the caller uses the result
as the decision variable for a branch tree. Gcc knows how to expand the
function inline and propagate the constant values across 'return',
so why was there any need for "fast" cmp in the first place?
Indeed, a single compare instruction 'cmp' ( with no set-characteristic-value,
and no conditional-move) is enough. If the usage of the specific function
is that important, then prepare a separate version of the caller which
has the comparison function written literally, fully integrated with
the caller's decision tree.
--
|