Re: [Smtcomp-discussion] [SMT-COMP] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi Tjark,

On Mon, Mar 27, 2017 at 7:53 AM, Tjark Weber <tja...@it...> wrote:

> Andrew,
>
> On Sun, 2017-03-26 at 18:03 -0500, Andrew Reynolds wrote:
>
> > On Fri, Mar 24, 2017 at 6:13 PM, Tjark Weber <tja...@it...>
> > wrote:
> >         On the other hand, one might argue that an unsound solver
> >         should not be trusted in any division. [...]
> >
> > Here, I would argue that the implementation of SMT solvers tends to be
> > fairly modular, meaning that if an SMT solver gets a wrong answer for
> > logic A, then the bug is very likely in the component that
> > specifically handles A, and the solver is likely still sound for
> > everything besides A. [...]
>
> I agree with your point about modularity. But note that the solver's
> (likely) soundness for everything besides A is only established in
> hindsight, through the results of the competition. The issue, of
> course, is that the solver was originally advertised as sound for
> everything including A.
>
> Once the competition results are known, it is easy to turn a partially
> unsound solver into a useful tool by restricting it to appropriate
> logics. But it is hard to know in advance for which logics a solver is
> (likely) sound if the developer's claims cannot be relied on.
>
>
One point here is that the competition should encourage innovation.  So if
there is a difficult new logic (e.g. floating points or strings), and I'd
like to enter, but I'm not 100% confident in my solver for that logic, it
still would be nice for me to be able to enter without having to risk the
score for the logics where I am sure my solver is correct.

> > Let's consider another case:
>
> > Logic A, B with weights 1,1.
> > Solver X gets 1 wrong answer in both A and B, it's total score is -1*1
> > + -1*1 = -2
> > Solver Y solves all benchmarks in A, but gets 4 wrong answers in B,
> > its total score is 1 + -4*1 = -3
>
> > Here, I would argue that Solver Y is the better solver: it is (likely)
> > sound for logic A, whereas Solver X is definitely unsound for both A
> > and B.
>
> Note that one could also argue the opposite point of view: suppose you
> are asked to solve a benchmark that will be chosen at random from
> SMT-LIB (including logics A and B) by applying one of these solvers.
> (You'll have to take the solver's answer at face value.) Which solver
> would you rather use, X or Y?
>
>
The use case of randomly selecting a benchmark in any logic isn't very
interesting, in my opinion.

In practice, I should be able to select a solver based on the logic. So in
this case I would be confident in solver Y for A, and if I had a problem in
logic B, neither solver would be a good option.  I wouldn't even
necessarily say that Solver X is much better than Solver Y for logic B,
despite the 4 vs 1 wrong answers, I wouldn't use either one.

> > I do think it is very important, and I do believe that an unsound
> > solver is worse than a solver that solves no benchmarks. My opposition
> > is to multiplying by the #wrong benchmarks, since this skews the
> > results unfairly. A punishment of -1*(weight of logic) is still very
> > severe, but is more appropriate for overall score, in my opinion.
>
> I tend to agree that simply multiplying by #wrong benchmarks is
> awkward: we already have a weight for each logic that takes the number
> of benchmarks in that logic into account.
>
> I suspect the original motivation was to punish solvers that get many
> benchmarks wrong more severely than solvers that get only a few
> benchmarks wrong. This is easy to achieve for the division rankings,
> but I am not sure whether this property is worth retaining for the
> competition-wide ranking. Arguably, any solver that is unsound in some
> division shouldn't be used to solve benchmarks from that division.
>
>
Right, this is my point.

Another option would be to multiply by something like -( #wrong / ( 1 +
#wrong ))*(weight of logic), where you'd get a factor -1/2, -2/3, -3/4, ...
as number of wrong benchmarks increase, but this is limited by -1.

Let me know what you think.

Cheers,
Andy

> Best,
> Tjark
>
>
>

Re: [Smtcomp-discussion] [SMT-COMP] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers

Tools and documentation of the annual SMT competition

Re: [Smtcomp-discussion] [SMT-COMP] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers