|
From: Francis G. <fra...@gm...> - 2012-08-12 13:31:26
Attachments:
smime.p7s
|
Hi,
Valgrind reports an invalid read, while I think the program is valid.
The program uses stpncpy() instead of strcpy():
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
char *dst = (char *) calloc(4, 1);
char *src = (char *) calloc(4, 1);
src[0] = 'a';
src[1] = 'b';
src[2] = 'c';
src[3] = '\0';
dst = stpncpy(dst, src, 3);
printf("%s %s\n", src, dst);
}
The error is the following:
$ valgrind ./a.out
==21946== Memcheck, a memory error detector
==21946== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==21946== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==21946== Command: ./a.out
==21946==
==21946== Invalid read of size 8
==21946== at 0x4ECA554: __stpncpy_sse2_unaligned
(strcpy-sse2-unaligned.S:297)
==21946== by 0x400608: main (in /data/francis/workspace/augeas/a.out)
==21946== Address 0x51ef090 is 0 bytes inside a block of size 4 alloc'd
==21946== at 0x4C29DB4: calloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21946== by 0x4005C4: main (in /data/francis/workspace/augeas/a.out)
==21946==
==21946== Invalid read of size 8
==21946== at 0x4ECA558: __stpncpy_sse2_unaligned
(strcpy-sse2-unaligned.S:298)
==21946== by 0x400608: main (in /data/francis/workspace/augeas/a.out)
==21946== Address 0x51ef0a0 is 12 bytes after a block of size 4 alloc'd
==21946== at 0x4C29DB4: calloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21946== by 0x4005C4: main (in /data/francis/workspace/augeas/a.out)
==21946==
abc
In the stpncpy routine, there are two key instructions that are executed:
<__stpncpy_sse2_unaligned+45> pcmpeqb (%rsi), xmm1
<__stpncpy_sse2_unaligned+49> pmovmskb %xmm1,%edx
The result in %edx is: 0xfff8, that shows that all bytes are zero,
except the last 3. The whole point here is: this instruction reads 8
bytes, no matter the size of the string (here 4). Could it mislead
valgrind? Does stpncpy assumes the memory is word aligned, that mean
even with a string of 4 bytes, 8 are actually allocated, and that it's
safe to read those 8 bytes?
Thanks!
Francis Giraldeau
|
|
From: songbird <son...@an...> - 2012-08-12 15:41:55
|
Francis Giraldeau wrote:
...
> Hi,
>
> Valgrind reports an invalid read, while I think the program is valid.
> The program uses stpncpy() instead of strcpy():
>
> #include <string.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> int main(int argc, char **argv) {
> char *dst =3D (char *) calloc(4, 1);
> char *src =3D (char *) calloc(4, 1);
> src[0] =3D 'a';
> src[1] =3D 'b';
> src[2] =3D 'c';
> src[3] =3D '\0';
> dst =3D stpncpy(dst, src, 3);
> printf("%s %s\n", src, dst);
> }
>
> The error is the following:
....
from the manpage:
"RETURN VALUE
stpncpy() returns a pointer to the terminating null byte in dest, or,
if dest is not null-terminated, dest + n."
songbird
|
|
From: Francis G. <fra...@gm...> - 2012-08-12 16:12:01
Attachments:
smime.p7s
|
Le 2012-08-12 17:41, songbird a écrit :
> Francis Giraldeau wrote:
> ...
>> Hi,
>>
>> Valgrind reports an invalid read, while I think the program is valid.
>> The program uses stpncpy() instead of strcpy():
>>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv) {
>> char *dst =3D (char *) calloc(4, 1);
>> char *src =3D (char *) calloc(4, 1);
>> src[0] =3D 'a';
>> src[1] =3D 'b';
>> src[2] =3D 'c';
>> src[3] =3D '\0';
>> dst =3D stpncpy(dst, src, 3);
>> printf("%s %s\n", src, dst);
>> }
>>
>> The error is the following:
> ....
>
> from the manpage:
>
> "RETURN VALUE
> stpncpy() returns a pointer to the terminating null byte in dest, or,
> if dest is not null-terminated, dest + n."
>
Of course you are right! Thought, I just tested without updating dst and
the problem is raised anyway. The error is not caused in the printf(),
but in the stpncpy().
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char **argv) {
char *dst = (char *) calloc(4, 1);
char *src = (char *) calloc(4, 1);
src[0] = 'a';
src[1] = 'b';
src[2] = 'c';
src[3] = '\0';
stpncpy(dst, src, 3);
//printf("%s %s\n", src, dst);
free(dst);
free(src);
}
Cheers,
Francis
|
|
From: Philippe W. <phi...@sk...> - 2012-08-12 16:56:18
|
On Sun, 2012-08-12 at 18:10 +0200, Francis Giraldeau wrote: > Of course you are right! Thought, I just tested without updating dst and > the problem is raised anyway. The error is not caused in the printf(), > but in the stpncpy(). String functions might be optimised very specially by the compiler and then might not be redirected by Valgrind. There has been some changes in the area of sse instructions and some replacement of string functions in 3.8.0. => would be good to try with the 3.8.0. If problem is still there, you might try to compile telling gcc to not use string builtins or similar. You can also (with or without the previous builtin trial) examine the output of Valgrind (-v) to see if the stpncpy function is properly redirected to the valgrind one. >From what I can see, there is no replacement for stpncpy. So, the optimised code is then not replaced. And as it is optimised a lot, it can gives false positive. (this is one of the reasons for which some "string" functions are replaced. See memcheck/mc_replace_strmem.c) Philippe |
|
From: Francis G. <fra...@gm...> - 2012-08-12 17:17:21
Attachments:
smime.p7s
|
Le 2012-08-12 18:56, Philippe Waroquiers a écrit :
> On Sun, 2012-08-12 at 18:10 +0200, Francis Giraldeau wrote:
>> Of course you are right! Thought, I just tested without updating dst and
>> the problem is raised anyway. The error is not caused in the printf(),
>> but in the stpncpy().
> String functions might be optimised very specially by the compiler
> and then might not be redirected by Valgrind.
> There has been some changes in the area of sse instructions
> and some replacement of string functions in 3.8.0.
> => would be good to try with the 3.8.0.
I just tested and it produces the same error with valgrind 3.8.0.
> You can also (with or without the previous builtin trial)
> examine the output of Valgrind (-v) to see if the stpncpy
> function is properly redirected to the valgrind one.
>
>>From what I can see, there is no replacement for stpncpy.
> So, the optimised code is then not replaced.
I confirm this too. I did a trivial implementation of stpcpy/stpncpy
with strcpy and it do not raise the issue. Maybe they can be a basis for
an additional REDIR?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
static char *_stpcpy(char *dest, const char *src) {
strcpy(dest, src);
size_t len = strlen(src);
return dest + len;
}
static char *_stpncpy(char *dest, const char *src, size_t len) {
strcpy(dest, src);
return dest + len;
}
int main(int argc, char **argv) {
char *dst = (char *) calloc(4, 1);
char *src = (char *) calloc(4, 1);
src[0] = 'a';
src[1] = 'b';
src[2] = 'c';
src[3] = '\0';
_stpncpy(dst, src, 3);
printf("%s %s\n", src, dst);
free(src);
free(dst);
}
At least in this particular case it's a false positive and thus, I can
ignore safely the warning. Thanks a lot for your feedback!
Cheers,
Francis
|
|
From: Philippe W. <phi...@sk...> - 2012-08-12 18:03:02
|
On Sun, 2012-08-12 at 19:15 +0200, Francis Giraldeau wrote: > I confirm this too. I did a trivial implementation of stpcpy/stpncpy > with strcpy and it do not raise the issue. Maybe they can be a basis for > an additional REDIR? The best is to file a bug on bugzilla, with the test program to reproduce the false positive and additional details such as gcc version, glibc version, distro version, compilation options used. Thanks Philippe |
|
From: Patrick J. L. <lop...@gm...> - 2012-08-12 18:24:05
|
On Sun, Aug 12, 2012 at 11:03 AM, Philippe Waroquiers < phi...@sk...> wrote: > On Sun, 2012-08-12 at 19:15 +0200, Francis Giraldeau wrote: > > > The best is to file a bug on bugzilla, > with the test program to reproduce the false positive > and additional details such as gcc version, glibc version, > distro version, compilation options used. > There are already two bug reports for exactly this issue: https://bugs.kde.org/show_bug.cgi?id=294285 https://bugs.kde.org/show_bug.cgi?id=264936 The correct fix for this problem is to set "--partial-loads-ok=yes". Except that doesn't work because "--partial-loads-ok=yes" does not support 16-bit loads (SSE operations). I would suggest (a) verify that "--partial-loads-ok=yes" does not fix your problem; (b) attach your test case as an additional example on bug 294285; (c) add your vote to bug 294285. (But then again, I am biased because it is my own report.) The smarter the compiler, the more likely it is to emit code like this. - Pat |
|
From: John R. <jr...@bi...> - 2012-08-12 20:44:00
|
> The correct fix for this problem is to set "--partial-loads-ok=yes".
Perhaps not for all purposes. The original traceback for _this_ case:
==21946== Invalid read of size 8
==21946== at 0x4ECA554: __stpncpy_sse2_unaligned (strcpy-sse2-unaligned.S:297)
implicates hand-written assembly-language code (note the filename ending in ".S")
for a closed subroutine (stpncpy) which shares many of the properties of strcpy
and strncpy.
memcheck implements intercepting re-directs for both strcpy and strncpy.
The re-directs remove the possibility of any false positive complaints,
and also are significantly faster than emulating instructions.
Therefore, I suggest that the best fix for _this_ case is for memcheck
to implement a re-direct for stpncpy.
> Except that doesn't work because "--partial-loads-ok=yes" does not support 16-bit loads (SSE operations).
>
> I would suggest (a) verify that "--partial-loads-ok=yes" does not fix your problem; (b) attach your test case as an additional example on bug 294285; (c) add your vote to bug 294285. (But then again, I am biased because it is my own report.)
>
> The smarter the compiler, the more likely it is to emit code like this.
The code which triggered _this_ complaint from memcheck
was not generated by a compiler program. Although "--partial-loads-ok=yes"
might remove the false positive complaint, it would also tend to hide the
performance problem with the emulation of stpncpy.
--
|
|
From: Patrick J. L. <lop...@gm...> - 2012-08-13 01:44:29
|
On Sun, Aug 12, 2012 at 1:44 PM, John Reiser <jr...@bi...> wrote: > > memcheck implements intercepting re-directs for both strcpy and strncpy. > The re-directs remove the possibility of any false positive complaints, > and also are significantly faster than emulating instructions. > Therefore, I suggest that the best fix for _this_ case is for memcheck > to implement a re-direct for stpncpy. What fraction of time does a typical program -- or indeed, any real-world program -- spend in stpncpy, do you think? Amdahl's Law would like to know. Hey, I have a brilliant idea. Let's implement redirects for all of the C standard, C++ standard, POSIX standard, X11, Qt, KDE, zlib, OpenSSL, and Boost libraries. Think how much faster so many programs could run under Valgrind. Or, you know, we could instead spend effort fixing Valgrind's machine emulator to match the memory model of real machines. > Although "--partial-loads-ok=yes" > might remove the false positive complaint, it would also tend to hide the > performance problem with the emulation of stpncpy. What performance problem? - Pat |
|
From: John R. <jr...@bi...> - 2012-08-13 02:09:17
|
On 08/12/2012 06:44 PM, Patrick J. LoPresti wrote: > Or, you know, we could instead spend effort fixing Valgrind's machine emulator to match the memory model of real machines. Your idea of "real machine" is not the same as mine. -- |
|
From: Julian S. <js...@ac...> - 2012-08-13 08:18:28
|
> Of course, it is free software, so I guess I should fix it myself. But > that is not a trivial undertaking and I really do not have the time. So as > this problem hits people over and over and over, I encourage those > encountering it to add their voices to mine to see it fixed once, > permanently. I lose track of what is actually required. Is it to implement, for vector loads, the same thing that you did for scalar loads? That is, don't complain about naturally aligned word size loads that partially overlap the end of a block, and instead simply mark the part of the register corresponding to the area beyond the end of the block, as undefined? Or something else? J |
|
From: John R. <jr...@bi...> - 2012-08-13 15:19:06
|
On 08/13/2012 01:07 AM, Julian Seward wrote:
>
>> Of course, it is free software, so I guess I should fix it myself. But
>> that is not a trivial undertaking and I really do not have the time. So as
>> this problem hits people over and over and over, I encourage those
>> encountering it to add their voices to mine to see it fixed once,
>> permanently.
>
> I lose track of what is actually required. Is it to implement, for
> vector loads, the same thing that you did for scalar loads? That is,
> don't complain about naturally aligned word size loads that partially
> overlap the end of a block, and instead simply mark the part of the
> register corresponding to the area beyond the end of the block, as
> undefined?
What is a "vector load"? Perhaps this means "any load to %xmmK or %ymmK",
but it might be better to say, "a fetch of 8 or more bytes to %xmmK or %ymmK".
It is not enough to mark bytes that are "over-fetched" (beyond the end of
an allocated block) as Uninit, while being silent about the portion of the
access which is beyond the end of the block. Those uninit bits must be
propagated carefully during subsequent instructions, including "not equal".
In the case of st*cpy, a typical sequence is:
-----glibc-2.15/sysdeps/i386/i686/multiarch/strcpy-sse2.S
movdqa (%esi, %ecx), %xmm1
movaps 16(%esi, %ecx), %xmm2
movdqu %xmm1, (%edi, %ecx)
pcmpeqb %xmm2, %xmm0
pmovmskb %xmm0, %edx
add $16, %ecx
sub $48, %ebx
jbe L(CopyFrom1To16BytesCase2OrCase3)
test %edx, %edx
jnz L(CopyFrom1To16BytesUnalignedXmm2)
-----
Notice that any over-fetching into %xmm2 affects the byte-parallel pcmpeqb
and the sign-bit selector pmovmskb (128-to-16 bit reduction), which feeds
the "test; jnz". If the pcmpeqb-pmovmskb detects differences in bytes that
are Defined, then the test-jnz must succeed even though some high bits
of %edx are Uninit. memcheck must *not* say, "some bits are Uninit,
therefore the result is Uninit". Instead, memcheck must recognize that
"some *Defined* bits are different, therefore the result is NotZero."
--
|
|
From: Patrick J. L. <lop...@gm...> - 2012-08-13 17:10:07
|
On Mon, Aug 13, 2012 at 1:07 AM, Julian Seward <js...@ac...> wrote:
>
> I lose track of what is actually required. Is it to implement, for
> vector loads, the same thing that you did for scalar loads? That is,
> don't complain about naturally aligned word size loads that partially
> overlap the end of a block, and instead simply mark the part of the
> register corresponding to the area beyond the end of the block, as
> undefined?
>
Yes, to a first approximation. At least, if it were done, I think I could
come up with patches for the rest.
I would propose the following approach.
Step 1: Add the infrastructure to allow 128-bit return values from helper
functions. We do not yet need 128-bit _arguments_, because addresses are
still just 64 bits. If you want to be forward-looking, allow 256-bit
return values too, since AVX2 is coming.
This step is the only thing I do not think I know how to do myself. If it
were finished, I would be willing to take a crack at the rest.
Step 2: Instead of emulating 128-bit loads with two calls to mc_LOADV64,
implement mc_LOADV128 returning a 128-bit result.
Step 3: Fix the logic for when partial_loads_ok takes effect. Right now,
the final test looks like this:
if (szB == VG_WORDSIZE && VG_IS_WORD_ALIGNED(a)
&& n_addrs_bad < VG_WORDSIZE) {
In other words, the special handling only applies for aligned loads of the
natural word size of the machine. But this is not correct; on a real
system, any aligned load of any size can never "partially fault". So the
correct logic looks something like this:
if (0 == (a % szB) && n_addrs_bad < szB) {
That is, if the address is aligned to the size of the load and the load was
only partially bad, then trigger the special handling. If you want to be
very thorough, add assertions for 0 == (szB & (szB -1)) and (szB <
VKI_PAGE_SIZE). That is, assert that the load size is a power of 2 and
less than the page size. (Although if either of these assertions triggers
on any actual system ever, I promise to eat my hat and shave my eyebrows.)
Step 4: As John Reiser points out in another message, the validity bits
need more precise propagation. PCMPEQB needs to propagate the undefined
_bytes_. PMOVMSKB needs to propagate the undefined _bits_. "Compare
against zero" and "find first set" must give the right answer if they have
enough defined bits to determine it. This would be enough to handle every
case I have actually seen (*).
The test case attached to https://bugs.kde.org/show_bug.cgi?id=294285 exercises
all of these except "compare against zero".
Thanks.
- Pat
(*) For SSE2, anyway. Some non-vector implementations use standard 8-byte
registers and then do something like subtract 0x0101010101010101. Over
time, these should become decreasingly common compared to the superior
vectorized versions. In particular, the Intel compiler I am using never
generates them when it can use SSE2 instead.
|
|
From: Patrick J. L. <lop...@gm...> - 2012-08-13 04:08:24
|
On Sun, Aug 12, 2012 at 7:10 PM, John Reiser <jr...@bi...> wrote: > On 08/12/2012 06:44 PM, Patrick J. LoPresti wrote: > > > Or, you know, we could instead spend effort fixing Valgrind's machine > emulator to match the memory model of real machines. > > Your idea of "real machine" is not the same as mine. > > Perhaps. My definition does, however, encompass the machine on your desktop, and the machine on my desktop, and the machine on the desktop of approximately every person reading this text. In particular, it includes every every PowerPC based system since 1998 or so (AltiVec) and every x86 based system since 1996 or so (MMX). That is, assuming it is running an operating system like OS X or Windows or BSD or Linux. I apologize for using such a vague term as "real" to describe such exotic systems. ... OK, to be serious for a moment. I am sorry if I am being a jerk, but I really do think any argument for interceptors based on "speed" is nonsense. And arguments based on "we can't emulate that", when there is in fact a clear way to emulate that, increasingly sound to me like nonsense. I am using a compiler, right now, today, that generates these optimizations routinely. GCC has not caught up yet, but I continue to believe that it will. Actually, I hope it does soon, because obviously Valgrind's developers will not care about this issue until then, and meanwhile Valgrind is completely unusable for me. Interceptors are (a) a band-aid solution that will need to be applied over and over and over, and (b) useless for fixing my problem (i.e. my compiler is too smart). Of course, it is free software, so I guess I should fix it myself. But that is not a trivial undertaking and I really do not have the time. So as this problem hits people over and over and over, I encourage those encountering it to add their voices to mine to see it fixed once, permanently. - Pat |
|
From: John R. <jr...@bi...> - 2012-08-13 15:53:12
|
On 08/12/2012 09:08 PM, Patrick J. LoPresti wrote: > ... I really do think any argument for interceptors based on "speed" is nonsense. Amdahl's Law says that interceptor speed does matter. Amdahl's Law also applies when slowing down operations, as well as when speeding them up. One paraphrase of Amdahl's Law is, "If you speed up every floating point operation by a factor of 10, then the speed of a whole program will increase by at most 9%, because floating point operations tend to contribute at most 10% to the running time of the whole program." In the other direction, slowing down every floating point operation by a factor of 20 (and where floating point is 10% of the usual running time) will slow down the whole program by a factor of 2.9!! (10% * 20) + ((100% - 10%) * 1) = 290%. *This* is the really eye-opening application of Amdahl's Law! In the case at hand, we are interested in string operations, not floating point. Suppose that stpncpy contributes 0.25% (1/400) of the normal running time. Then instruction-by-instruction emulation of stpncpy by memcheck is about 20 * 0.25% = 5% of the normal running time. This exceeds the common human perception threshold of 3%. In contrast, an intercepted (re-directed) stpncpy would be imperceptible. -- |
|
From: Patrick J. L. <lop...@gm...> - 2012-08-13 17:21:40
|
On Mon, Aug 13, 2012 at 8:54 AM, John Reiser <jr...@bi...> wrote: > > In the case at hand, we are interested in string operations, not floating > point. > Suppose that stpncpy contributes 0.25% (1/400) of the normal running time. > Then > instruction-by-instruction emulation of stpncpy by memcheck is about 20 * > 0.25% = 5% > of the normal running time. Yes, for the (approximately) one application on the planet for which (the non-standard) stpncpy consumes 1/400 of the running time, you would be right. If, that is, the rest of the application were not also running 20 times slower. Also, note that the same "argument" applies to every function in the standard library, leading to my earlier modest proposal to add interceptors for absolutely everything. Since I am starting to repeat myself, I guess I am done with this discussion. Feel free to have the last word. - Pat |