Re: [Smtcomp-discussion] [SMT-COMP] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers
Tools and documentation of the annual SMT competition
Brought to you by:
webertj
From: Andrew R. <and...@gm...> - 2017-03-30 17:21:53
|
Hi Tjark, On Mon, Mar 27, 2017 at 7:53 AM, Tjark Weber <tja...@it...> wrote: > Andrew, > > On Sun, 2017-03-26 at 18:03 -0500, Andrew Reynolds wrote: > > > On Fri, Mar 24, 2017 at 6:13 PM, Tjark Weber <tja...@it...> > > wrote: > > On the other hand, one might argue that an unsound solver > > should not be trusted in any division. [...] > > > > Here, I would argue that the implementation of SMT solvers tends to be > > fairly modular, meaning that if an SMT solver gets a wrong answer for > > logic A, then the bug is very likely in the component that > > specifically handles A, and the solver is likely still sound for > > everything besides A. [...] > > I agree with your point about modularity. But note that the solver's > (likely) soundness for everything besides A is only established in > hindsight, through the results of the competition. The issue, of > course, is that the solver was originally advertised as sound for > everything including A. > > Once the competition results are known, it is easy to turn a partially > unsound solver into a useful tool by restricting it to appropriate > logics. But it is hard to know in advance for which logics a solver is > (likely) sound if the developer's claims cannot be relied on. > > One point here is that the competition should encourage innovation. So if there is a difficult new logic (e.g. floating points or strings), and I'd like to enter, but I'm not 100% confident in my solver for that logic, it still would be nice for me to be able to enter without having to risk the score for the logics where I am sure my solver is correct. > > Let's consider another case: > > > Logic A, B with weights 1,1. > > Solver X gets 1 wrong answer in both A and B, it's total score is -1*1 > > + -1*1 = -2 > > Solver Y solves all benchmarks in A, but gets 4 wrong answers in B, > > its total score is 1 + -4*1 = -3 > > > Here, I would argue that Solver Y is the better solver: it is (likely) > > sound for logic A, whereas Solver X is definitely unsound for both A > > and B. > > Note that one could also argue the opposite point of view: suppose you > are asked to solve a benchmark that will be chosen at random from > SMT-LIB (including logics A and B) by applying one of these solvers. > (You'll have to take the solver's answer at face value.) Which solver > would you rather use, X or Y? > > The use case of randomly selecting a benchmark in any logic isn't very interesting, in my opinion. In practice, I should be able to select a solver based on the logic. So in this case I would be confident in solver Y for A, and if I had a problem in logic B, neither solver would be a good option. I wouldn't even necessarily say that Solver X is much better than Solver Y for logic B, despite the 4 vs 1 wrong answers, I wouldn't use either one. > > I do think it is very important, and I do believe that an unsound > > solver is worse than a solver that solves no benchmarks. My opposition > > is to multiplying by the #wrong benchmarks, since this skews the > > results unfairly. A punishment of -1*(weight of logic) is still very > > severe, but is more appropriate for overall score, in my opinion. > > I tend to agree that simply multiplying by #wrong benchmarks is > awkward: we already have a weight for each logic that takes the number > of benchmarks in that logic into account. > > I suspect the original motivation was to punish solvers that get many > benchmarks wrong more severely than solvers that get only a few > benchmarks wrong. This is easy to achieve for the division rankings, > but I am not sure whether this property is worth retaining for the > competition-wide ranking. Arguably, any solver that is unsound in some > division shouldn't be used to solve benchmarks from that division. > > Right, this is my point. Another option would be to multiply by something like -( #wrong / ( 1 + #wrong ))*(weight of logic), where you'd get a factor -1/2, -2/3, -3/4, ... as number of wrong benchmarks increase, but this is limited by -1. Let me know what you think. Cheers, Andy > Best, > Tjark > > > |