Re: [Smtcomp-discussion] [SMT-COMP] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers
Tools and documentation of the annual SMT competition
Brought to you by:
webertj
From: Tjark W. <tja...@it...> - 2017-03-27 12:53:14
|
Andrew, On Sun, 2017-03-26 at 18:03 -0500, Andrew Reynolds wrote: > On Fri, Mar 24, 2017 at 6:13 PM, Tjark Weber <tja...@it...> > wrote: > On the other hand, one might argue that an unsound solver > should not be trusted in any division. [...] > > Here, I would argue that the implementation of SMT solvers tends to be > fairly modular, meaning that if an SMT solver gets a wrong answer for > logic A, then the bug is very likely in the component that > specifically handles A, and the solver is likely still sound for > everything besides A. [...] I agree with your point about modularity. But note that the solver's (likely) soundness for everything besides A is only established in hindsight, through the results of the competition. The issue, of course, is that the solver was originally advertised as sound for everything including A. Once the competition results are known, it is easy to turn a partially unsound solver into a useful tool by restricting it to appropriate logics. But it is hard to know in advance for which logics a solver is (likely) sound if the developer's claims cannot be relied on. > Let's consider another case: > Logic A, B with weights 1,1. > Solver X gets 1 wrong answer in both A and B, it's total score is -1*1 > + -1*1 = -2 > Solver Y solves all benchmarks in A, but gets 4 wrong answers in B, > its total score is 1 + -4*1 = -3 > Here, I would argue that Solver Y is the better solver: it is (likely) > sound for logic A, whereas Solver X is definitely unsound for both A > and B. Note that one could also argue the opposite point of view: suppose you are asked to solve a benchmark that will be chosen at random from SMT-LIB (including logics A and B) by applying one of these solvers. (You'll have to take the solver's answer at face value.) Which solver would you rather use, X or Y? > I do think it is very important, and I do believe that an unsound > solver is worse than a solver that solves no benchmarks. My opposition > is to multiplying by the #wrong benchmarks, since this skews the > results unfairly. A punishment of -1*(weight of logic) is still very > severe, but is more appropriate for overall score, in my opinion. I tend to agree that simply multiplying by #wrong benchmarks is awkward: we already have a weight for each logic that takes the number of benchmarks in that logic into account. I suspect the original motivation was to punish solvers that get many benchmarks wrong more severely than solvers that get only a few benchmarks wrong. This is easy to achieve for the division rankings, but I am not sure whether this property is worth retaining for the competition-wide ranking. Arguably, any solver that is unsound in some division shouldn't be used to solve benchmarks from that division. Best, Tjark |