|
From: Florian K. <br...@ac...> - 2011-10-19 16:01:01
|
Hi, I'd like to formalize the names that we give to the .exp files in the regression test buckets. There are at least three reasons.

First, when vg_regtest matches the .out file obtained by running the test against the existing .exp files, it reports success if *any* of the existing .exp files matches. That is the wrong thing because it can lead to false positives (claiming that a test passes when in fact it does not). In the future, if we're running on, say, ARM and there is an ARM-specific .exp file, then vg_regtest should compare against that one only and not against any others. To do that we must be able to extract the architecture from the file name.

Second, we currently have a bunch of platform-specific .exp files that contain a result that is actually incorrect, e.g. the line numbers in a backtrace are off by one. vg_regtest reports success because the actual result matches the incorrect .exp file, and the issue is hidden. I've begun to name these files by appending -kfail to the name. With this notation I'm piggy-backing on notation invented for POSIX 1003.3 conformance testing. See here:
http://www.delorie.com/gnu/docs/dejagnu/dejagnu_6.html
This enables us to change vg_regtest to summarize how many testcases passed and how many kfail'ed.

Third, consistency. Always a good thing. And it simplifies vg_regtest's file name matching.

I propose to name .exp files like so:

BASE.STREAM.exp[-PLATFORM[-FAILCODE][-REASON]]

Stuff written in all-caps indicates a syntactic variable, [...] is something optional, and | means 'or'. All other characters stand for themselves.

BASE     := basename of the test; e.g. for foo.vgtest BASE would be foo
STREAM   := stdout | stderr | stdoutB | stderrB
PLATFORM := x86 | x86-linux | x86-darwin | amd64 | arm | ppc32 | ppc64 | s390x
FAILCODE := kfail
REASON   := any character string; needed e.g. if we need more than one .exp file for a given platform

As usual, matching is greedy. E.g. foo.exp-x86-linux means PLATFORM=x86-linux. It does not mean PLATFORM=x86, REASON=linux.

The matching in vg_regtest would start with the most specialized .exp file and move towards the generic one, stopping as soon as a match is obtained. E.g. when looking for a stderr .exp file for foo.vgtest while running on x86 Linux, vg_regtest would attempt to match in this order:

foo.stderr.exp-x86-linux.*
foo.stderr.exp-x86-*
foo.stderr.exp-32bit-*
foo.stderr.exp-*

There is still a possibility for false positives in the case where there are several .exp files that differ only in the REASON code. To handle that properly we would have to formalize the REASON code. I have no plans to do that as this happens rarely.

Does anybody see a problem with this approach? Comments welcome.

Florian
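The proposed most-specific-first search is easy to prototype. The following Python sketch is purely illustrative (vg_regtest itself is a Perl script); the helper names and the use of "32bit" as a generic tag are assumptions based on the examples above:

```python
# Illustrative sketch of the proposed matching order: try the most
# specialized exp file first and stop at the first one that exists.
# Helper names are hypothetical; "32bit" is the generic platform from
# the example above, not an official PLATFORM value.
import os

def candidate_exp_names(base, stream, arch, osname, bits):
    """Yield exp file names from most to least specialized."""
    yield "%s.%s.exp-%s-%s" % (base, stream, arch, osname)  # foo.stderr.exp-x86-linux
    yield "%s.%s.exp-%s" % (base, stream, arch)             # foo.stderr.exp-x86
    yield "%s.%s.exp-%dbit" % (base, stream, bits)          # foo.stderr.exp-32bit
    yield "%s.%s.exp" % (base, stream)                      # fully generic fallback

def find_exp_file(base, stream, arch, osname, bits, exists=os.path.exists):
    """Return the first candidate that exists, or None."""
    for name in candidate_exp_names(base, stream, arch, osname, bits):
        if exists(name):
            return name
    return None
```

So on x86 Linux, find_exp_file("foo", "stderr", "x86", "linux", 32) would probe foo.stderr.exp-x86-linux first and fall back towards foo.stderr.exp.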
|
From: Bart V. A. <bva...@ac...> - 2011-10-19 16:16:39
|
On Wed, Oct 19, 2011 at 6:00 PM, Florian Krohm <br...@ac...> wrote:
> Does anybody see a problem with this approach? Comments welcome.
Makes a lot of sense to me. But my opinion is still that a well-designed regression test has the same output on all platforms.
Bart.
|
From: Philippe W. <phi...@sk...> - 2011-10-19 21:29:13
|
Having a naming convention for these files looks good.
> I propose to name exp files like so:
>
> BASE.STREAM.exp[-PLATFORM[-FAILCODE][-REASON]]
In the above, we have both . and - as separators, while - is also used
in the platform (e.g. x86-linux) and in the test name (e.g. threaded-fork.stderr.exp).
=> It might be easier for automatic processing and/or wildcard matching
to systematically use the . to separate the pieces.
Wouldn't a more flexible format be needed?
With the above it is e.g.
mandatory to give the platform for a kfail,
and if multiple platforms share the same output file, we need
two (or more) identical files.
So, maybe something like:
BASE.STREAM.exp[.[PLATFORM][.FAILCODE[.REASON]]]
and then in PLATFORM, allow an "or" of platforms (with an _ to separate them?).
An empty PLATFORM (i.e. two successive . characters) would match any
platform. It is not clear to me whether you expect that the REASON can only
be given when a FAILCODE is given. The above assumes that.
It might also be better to have a proposal which keeps the "fixed" file
extension parts at the end, so that e.g.
*.exp
*.stdout.exp
match
(otherwise, you always need twice as many wildcards, e.g. *.exp *.exp.*).
It is not clear, however, in which order to put what at the end.
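To make the dot-separated variant concrete, a parser for it could look roughly like the hypothetical Python sketch below (not vg_regtest code; the REASON string "glibc215" used when testing it is invented for illustration). It assumes REASON only appears after a FAILCODE, as discussed above:

```python
# Hypothetical parser for BASE.STREAM.exp[.[PLATFORM][.FAILCODE[.REASON]]].
# PLATFORM may be an "or" of platforms joined with "_"; an empty PLATFORM
# (two successive dots) matches any platform and is represented as None.
def parse_exp_name(name):
    parts = name.split(".")
    i = parts.index("exp")              # assumes BASE/STREAM contain no "exp" component
    base = ".".join(parts[:i - 1])
    stream = parts[i - 1]
    rest = parts[i + 1:]
    platforms = rest[0].split("_") if rest and rest[0] else None
    failcode = rest[1] if len(rest) > 1 else None
    reason = rest[2] if len(rest) > 2 else None
    return base, stream, platforms, failcode, reason
```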
> PLATFORM := x86 | x86-linux | x86-darwin | amd64 | arm | ppc32 | ppc64 |
> s390x
...
> The matching in vg_regtest would start with the most specialized exp
> file towards the generic one and it would stop as soon as a match is
> obtained. E.g. when looking for a stderr exp file for foo.vgtest while
> running on x86 Linux, vg_regtest would attempt to match in this order:
>
> foo.stderr.exp-x86-linux.*
> foo.stderr.exp-x86-*
> foo.stderr.exp-32bit-*
32bit is not one of the PLATFORM values above. I guess such 'generic' platforms
are however needed. vg_regtest will have the knowledge to translate a specific
platform into one or more of such generic platforms.
> foo.stderr.exp-*
I do not understand the matching logic and how you will avoid false positives.
This last line will e.g. match foo.stderr.exp-arm.
And on Linux, foo.stderr.exp-x86-* will match foo.stderr.exp-x86-darwin.
> There is still a possibility for false positives in the case where
> there are several exp files that differ only in the REASON code. To
> handle that properly we would have to formalize the REASON code.
> I have no plans to do that as this happens rarely.
I am not sure I understand the concept of false positive you are looking at
(mentioned here, but also at the beginning of the proposal).
I understand the one at the beginning of the mail, i.e. that matching any file can
cause a regression on one platform to match a "kfail" of another platform, and so
be considered a "normal expected failure" while it is in fact a brand new regression.
I do not understand this second "false positive" concept:
Imagine that I have a test which (for bad reasons, but let's assume ones that
are not easy to fix) can randomly give one of two outputs:
foo.stderr.exp-x86-linux-kfail-notfixablebehaviour1
foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2
These differ only in the REASON code. Why/how can this give a false positive?
Philippe
|
|
From: Florian K. <br...@ac...> - 2011-10-22 15:57:16
|
On 10/19/2011 05:29 PM, Philippe Waroquiers wrote:
> Having a naming convention for these files looks good.
>
>> I propose to name exp files like so:
>>
>> BASE.STREAM.exp[-PLATFORM[-FAILCODE][-REASON]]
>
> In the above, we have both . and - as separators, while - is also used
> in the platform (e.g. x86-linux) and in the test name (e.g.
> threaded-fork.stderr.exp).
> => It might be easier for automatic processing and/or wildcard matching
> to systematically use the . to separate the pieces.

That is true. It will simplify things, and that's good.

> Wouldn't a more flexible format be needed?

I don't want to go there. As Bart has written in his reply, testcases ought to be written such that they pass everywhere. That will not always be possible, but we should strive towards that goal. Unless of course we're testing something platform-specific, in which case the testcases should go into a subdirectory. In essence a platform-specific .exp file is an exception, and I'd rather not come up with a complex scheme to support it.

> With the above it is e.g.
> mandatory to give the platform for a kfail.

Yes. I think that can be salvaged. We actually have a few .exp-kfail files already.

> and if multiple platforms share the same output file, then we need
> two (or more) identical files.
>
> So, maybe something like:
> BASE.STREAM.exp[.[PLATFORM][.FAILCODE[.REASON]]]
>
> and then in PLATFORM, allow an "or" of platforms (with an _ to separate
> them?)
> An empty PLATFORM (i.e. two successive . characters) would match any
> platform. Not clear to me if you expect that the REASON can only be
> given when a FAILCODE is given. The above assumes that.

There are currently only two testcases, memcheck/tests/origin5-bz2 and memcheck/tests/partiallydefinedeq, where this would be helpful. Not enough to get me excited.

>> vg_regtest would attempt to match in this order:
>>
>> foo.stderr.exp-x86-linux.*
>> foo.stderr.exp-x86-*
>> foo.stderr.exp-32bit-*
>
> 32bit is not one of the PLATFORM values above. I guess such 'generic'
> platforms are however needed. vg_regtest will have the knowledge to
> translate a specific platform into one or more of such generic platforms.

Yes, right.

>> foo.stderr.exp-*
>
> I do not understand the matching logic and how you will avoid false
> positives. This last line will e.g. match foo.stderr.exp-arm.
> And on Linux, foo.stderr.exp-x86-* will match foo.stderr.exp-x86-darwin.

I think this will be resolved by adopting your suggestion from above, i.e. using '.' as separators. I was planning to do that.

>> There is still a possibility for false positives in the case where
>> there are several exp files that differ only in the REASON code. To
>> handle that properly we would have to formalize the REASON code.
>> I have no plans to do that as this happens rarely.
>
> I am not sure I understand the concept of false positive you are
> looking at (mentioned here, but also at the beginning of the proposal).
> ...snip...
> I do not understand this second "false positive" concept:
> Imagine that I have a test which (for bad reasons, but let's assume not
> easy to fix) can randomly give two outputs:
> foo.stderr.exp-x86-linux-kfail-notfixablebehaviour1
> foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2
> These differ only in the REASON code. Why/how can this give a false
> positive?

A test that gives random outputs (i.e. behaves non-deterministically) should not be in the test suite to begin with. Now suppose the output is deterministic and you're running the tests on two x86-linux based platforms, P1 and P2. On P1 the output is foo.stderr.exp-x86-linux-kfail-notfixablebehaviour1. On P2 the output is foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2.

We cannot express that pairing today, and the proposal provides no machinery to do that. All we can do is compare the stderr.out file against all provided stderr.exp files (here ...-notfixablebehaviour1 and ...-notfixablebehaviour2). Should the stderr.out from running on P1 match foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2, we would report that as a passing testcase. But it isn't one. Hence a false positive.

Florian
|
From: Julian S. <js...@ac...> - 2011-10-25 08:09:58
|
>> can randomly give two outputs:
>> foo.stderr.exp-x86-linux-kfail-notfixablebehaviour1
>> foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2
>>
>> These differ only in the REASON code. Why/how can this give a false
>> positive?
>
> A test that gives random outputs (i.e. behaves non-deterministically)
> should not be in the test suite to begin with.

If the test suite were composed purely of well-behaved, correct programs, that might be so. But it also contains programs which are broken in various weird ways, and I suspect some level of nondeterminism is unavoidable. I'm thinking mostly of the threaded tests, viz. the DRD and Helgrind tests.

J
|
From: Philippe W. <phi...@sk...> - 2011-10-23 18:56:42
|
>> Wouldn't a more flexible format be needed?
>
> I don't want to go there. As Bart has written in his reply, testcases [...]

Looks reasonable to me.

>>> vg_regtest would attempt to match in this order:
>>>
>>> foo.stderr.exp-x86-linux.*
>>> foo.stderr.exp-x86-*
>>> foo.stderr.exp-32bit-*
>>
>> 32bit is not one of the PLATFORM values above. I guess such 'generic'
>> platforms are however needed. vg_regtest will have the knowledge to
>> translate a specific platform into one or more of such generic platforms.
>
> Yes, right.
>
>>> foo.stderr.exp-*
>>
>> I do not understand the matching logic and how you will avoid false
>> positives. This last line will e.g. match foo.stderr.exp-arm.
>> And on Linux, foo.stderr.exp-x86-* will match foo.stderr.exp-x86-darwin.
>
> I think this will be resolved by adopting your suggestion from above,
> i.e. using '.' as separators. I was planning to do that.

I do not think replacing - by . will avoid false positives, as we will then have e.g. on x86 foo.stderr.exp.* that will match foo.stderr.exp.arm. I think the matching logic should use the platform plus a bunch of "generic" platforms (e.g. 32bit), and then try to match with these platforms. It should never try to match the platform part with a *, so foo.stderr.exp.* is not OK, even after having tried the "specific" ones such as foo.stderr.exp.x86.

> On P1 the output is foo.stderr.exp-x86-linux-kfail-notfixablebehaviour1.
> On P2 the output is foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2.
>
> We cannot express that pairing today, and the proposal provides no
> machinery to do that. All we can do is compare the stderr.out file
> against all provided stderr.exp files (here ...-notfixablebehaviour1 and
> ...-notfixablebehaviour2). Should the stderr.out from running on P1 match
> foo.stderr.exp-x86-linux-kfail-notfixablebehaviour2, we would report that
> as a passing testcase. But it isn't one. Hence a false positive.

The failure will just "change reason"; the test will not become successful. So I do not think this is a problem.

Philippe
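The matching logic sketched in the paragraph above can be made concrete (illustrative Python only; helper names are made up, and "32bit" stands in for whatever generic platforms vg_regtest derives): build the set of acceptable platform tags from the current platform plus its generic platforms, and never wildcard the platform part itself.

```python
# Only accept exp files whose platform part is the current platform or
# one of its generic platforms (e.g. 32bit); the bare BASE.STREAM.exp is
# always acceptable.  The platform part is never matched with a *, so
# e.g. foo.stderr.exp.arm can never match on x86.
def acceptable_exp_files(base, stream, platform, generics, candidates):
    prefix = "%s.%s.exp" % (base, stream)
    tags = [platform] + list(generics)
    exact = set([prefix] + ["%s.%s" % (prefix, t) for t in tags])
    return [c for c in candidates
            if c in exact
            or any(c.startswith("%s.%s." % (prefix, t)) for t in tags)]
```

The startswith clause admits FAILCODE/REASON suffixes such as foo.stderr.exp.x86.kfail while still pinning the platform part.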
|
From: Julian S. <js...@ac...> - 2011-10-25 07:56:56
|
Interesting. For sure something better than what we have now would be good. Your proposal seems a bit complex, but I haven't digested all the details yet.

> PLATFORM := x86 | x86-linux | x86-darwin | amd64 | arm | ppc32 | ppc64 |
> s390x

That gives a terminological inconsistency with how "platform" is used in the rest of the code base. The factoring used is

  arch -- cpu architecture, e.g. x86, arm, etc.
  os   -- the OS (often as a proxy for the ABI): linux, darwin

Then a platform is an (arch, os) pairing. This is the basis of the preprocessor symbols VGA_<arch>, VGO_<os> and VGP_<arch>_<os>. Recently I had to add "platform variant" so as to do Android, which is very similar to arm-linux but not identical, using VGPV_<arch>_<os>_<variant>, where "variant" is normally "vanilla". Anyway, it would be nice to maintain terminological consistency w.r.t. "platform" as it is already used.

> The matching in vg_regtest would start with the most specialized exp
> file towards the generic one and it would stop as soon as a match is
> obtained. E.g. when looking for a stderr exp file for foo.vgtest while
> running on x86 Linux, vg_regtest would attempt to match in this order:
>
> foo.stderr.exp-x86-linux.*
> foo.stderr.exp-x86-*
> foo.stderr.exp-32bit-*
> foo.stderr.exp-*

Doesn't this assume that you can always uniquely identify a "most specialized" exp file? IOW, you're saying that the set of exp files forms a finite partial order which has both a top point (least specialized, foo.stderr.exp) and a bottom point (most specialized). But what if you have the following 3 files:

foo.stderr.exp-x86
foo.stderr.exp-linux
foo.stderr.exp

What's the most-specialized exp file here? I think the idea is a good one, but you need to nail down the semantics a bit. Also, your approach implies that the perl driver script will have the ability to compute these partial ordering relations on the exp filenames, which sounds a bit complex. How do other variants fit in? For example, just the other day we had to do a new-glibc vs older-glibc .stdout.exp split (because of problems with -ve nan printing, iirc).

J
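The ambiguity Julian raises can be stated compactly: model each exp file by the set of attributes it constrains (arch, os, ...) and ask for a unique maximal element. A minimal sketch, assuming the attribute sets have already been extracted from the file names:

```python
def most_specialized(attr_sets):
    """Return the unique most specialized attribute set, or None when the
    files form only a partial order with no unique bottom point."""
    best = max(attr_sets, key=len)           # a candidate with the most attributes
    if all(s <= best for s in attr_sets):    # every other file must be a generalization
        return best
    return None                              # ambiguous, e.g. {arch} vs {os}
```

For foo.stderr.exp-x86 / foo.stderr.exp-linux / foo.stderr.exp the sets are {arch}, {os} and {}: neither {arch} nor {os} contains the other, so there is no well-defined "most specialized" file.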
|
From: Ashley P. <as...@pi...> - 2011-10-25 09:36:57
|
On 19 Oct 2011, at 17:00, Florian Krohm wrote:
> Does anybody see a problem with this approach? Comments welcome.

This is more a general comment about testing Valgrind than about your proposal, but I think it's relevant given that you are looking at this area. Some time ago, when I used Valgrind daily as part of my job, I did a lot of work with the XML output of Valgrind, which generally isn't well tested. One thing I was able to do was write a utility which would convert the XML output back into the normal, non-XML output. In doing so I was able to find several bugs and build a moderately good XML parser, which I was then able to use for other things.

It was easy enough to modify the test suite to run each test twice, once in XML mode and once in normal mode; I then converted the XML output back to normal output and could diff the two to find errors. At that point it occurred to me that once the above is in place and correct, it would be possible to check all the different expected outputs programmatically using the XML rather than doing a series of string compares.

I think I got as far as posting the testing changes to this list, and I've definitely posted the XML-to-normal output converter. I'm sure either or both could be re-created in a couple of days, or I could dig them out and submit them again if requested.

Ashley,