Thread: [Smtcomp-discussion] SMT-COMP 2017: Call for Comments, Benchmarks, Solvers
Tools and documentation of the annual SMT competition
From: Tjark W. <tja...@it...> - 2017-03-15 15:06:01
======================================================================

    12th International Satisfiability Modulo Theories Competition
                           (SMT-COMP'17)

                          July 23, 2017
                       Heidelberg, Germany

                        CALL FOR COMMENTS
                       CALL FOR BENCHMARKS
                   PRELIMINARY CALL FOR SOLVERS

======================================================================

SMT-COMP is the annual competition among Satisfiability Modulo
Theories (SMT) solvers. The goals of SMT-COMP are to spur solver
advances, collect benchmarks, and encourage adoption of the SMT-LIB
standard, through friendly competition.

SMT-COMP'17 will end in time for its results to be described and
announced at the SMT Workshop (July 22-23), which is affiliated with
the International Conference on Computer-Aided Verification (CAV
2017).

SMT-COMP'17 is organized under the direction of the SMT Steering
Committee. The organizing team for SMT-COMP'17 is

 . Matthias Heizmann - University of Freiburg, Germany
 . Giles Reger - University of Manchester, UK
 . Tjark Weber - Uppsala University, Sweden

This is a call for three things:

CALL FOR COMMENTS:

The organizing team is preparing the schedule and rules for 2017. Any
comments you may have to improve the competition over past years or
to redirect its focus are welcome and will be considered by the team.
We particularly appreciate comments received by

   *** March 31, 2017 ***

For SMT-COMP'17, we are especially interested in comments regarding:

- Benchmarks with unknown status: Previous SMT-COMPs have only used
  benchmarks whose status (sat/unsat) was known. One might argue that
  this favors imitation of existing solvers over innovation. Should
  benchmarks with unknown status be used in the competition? If so,
  how should answers for such benchmarks be scored? Can scoring for
  one solver remain independent of the answers given by other
  solvers, or should disagreements between solvers lead to negative
  scores for a solver?

- Benchmark weighting: To avoid over-representation of certain
  (large) benchmark families, a weighting scheme for benchmarks was
  introduced in 2016. This causes division scores to be real numbers
  rather than simply an integer count of the benchmarks solved, and
  the winning solver in a division is (intentionally) no longer
  necessarily the solver that solved the largest number of
  benchmarks. Do the methodological benefits of this weighting scheme
  justify the increased complexity in interpreting division scores?
  (See the sketch after this announcement.)

- Scope: The new (draft) version 2.6 of the SMT-LIB standard includes
  support for algebraic datatypes. Are corresponding benchmarks and
  sufficiently stable solver implementations available for SMT-COMP
  to feature an (experimental) datatype division in 2017? Should the
  scope or emphasis of the competition be adapted in other ways?

CALL FOR BENCHMARKS:

The competition is enhanced by new sets of benchmarks that add to
those already in use (and used in past competitions). A selection of
benchmarks in various categories is used for the competition
(according to rules that are currently being revised).

If you have benchmarks that you think would be useful to SMT-LIB and
SMT-COMP (and may be made public), please let us know as soon as
possible, even if the material is not quite ready. We will work in
close cooperation with the SMT-LIB maintainers to integrate such
benchmarks into SMT-LIB in time for them to be taken into account for
the competition. The deadline for benchmarks to be used in the 2017
competition will be announced when the schedule and rules are
finalized.

PRELIMINARY CALL FOR SOLVERS:

A submission deadline for solvers will be announced along with the
rules. However, it is useful to the organizing team to know in
advance which and how many solvers may be entering. Thus we request
that you let us know at your earliest convenience if you think you
may be submitting one or more solvers to SMT-COMP'17, particularly if
you have not submitted a solver before, or if you think there may be
unusual circumstances.

COMMUNICATION:

The competition web-site will be at www.smtcomp.org.

Public email regarding the competition may be sent to
smt...@cs....

Announcements will be sent on both smt...@cs... and sm...@cs....

Non-confidential email to the organizers may be sent to
smt...@li....

Sincerely,
The organizing team
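[The family-based weighting mentioned in the announcement can be
illustrated with a small sketch. The scheme below (each family gets
an equal total weight, divided evenly among its benchmarks) is an
illustrative assumption, not necessarily the exact 2016 rule, and the
family names and sizes are hypothetical.]

    # Sketch: family-based benchmark weighting (illustrative
    # assumption, not necessarily the exact SMT-COMP 2016 rule).
    # Each family receives the same total weight, split evenly among
    # its benchmarks, so a family with 10,000 benchmarks cannot
    # dominate one with 50.

    # Hypothetical families: name -> number of benchmarks
    families = {"small_family": 50, "large_family": 10_000}

    def benchmark_weight(family_size: int, num_families: int) -> float:
        """Weight of one benchmark: its family's equal share,
        divided by the family's size."""
        return (1.0 / num_families) / family_size

    for name, size in families.items():
        w = benchmark_weight(size, len(families))
        # Solving an entire family contributes the same total
        # regardless of the family's size.
        print(f"{name}: per-benchmark weight {w:.6f}, "
              f"family total {w * size:.2f}")

[This is why division scores become real numbers: a solved benchmark
contributes a fraction that depends on its family, not a flat 1.]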
From: Andrew R. <and...@gm...> - 2017-03-16 15:38:47
Hello organizers,

I have the following opinions about the competition this year:

(1) Unknown benchmarks

I am in favor of including "unknown" status benchmarks in the
competition. Regarding disagreements between solvers: last year, as I
recall, there was a separate discussion period where most (or all)
disagreements between solvers were resolved, e.g., a developer would
say "my solver was wrong for this one", etc. Perhaps the competition
should:

- Run all benchmarks/solvers, and note the set D of unknown-status
  benchmarks that two solvers disagree on.
- Have a rebuttal period where solver developers can justify their
  answers for benchmarks in D.
- Based on the discussion, set the status of the benchmarks in D, and
  then compute the final scores.

There are a number of ways to determine which solver is correct,
e.g., if A says "sat" and B says "unsat", then check whether A also
says "sat" for B's unsat core, and whether B also says "unsat" for
the benchmark conjoined with A's model. (A sketch of this cross-check
follows this message.) If the status cannot be resolved even after
discussion, then the benchmark's status remains "unknown" and it is
not counted in the scoring.

(2) Benchmark weighting

The benchmark weighting scheme is fine, but I'd like it if the number
of solved benchmarks was also shown in the results tables, e.g., on
pages such as
http://smtcomp.sourceforge.net/2016/results-AUFLIRA.shtml?v=1467876482
in addition to the "Correctly solved score".

(3) Datatypes

I am highly in favor of including datatypes in SMT-LIB and SMT-COMP.
CVC4 has full support for datatypes in combination with all of its
other theories. I have numerous benchmark sets involving datatypes,
accumulated over the years, that I have been unable to add to SMT-LIB
due to the lack of a standard. I can provide these if you want.

(4) Scoring

One thing that has bothered me in previous competitions is how the
overall score is computed with respect to wrong benchmarks. Say I
have logics A, B, C, D with weights 1, and I solve all benchmarks in
A, B, C, but I get 3 wrong benchmarks in D. Then my overall score is:

  1 + 1 + 1 + -3*1 = 0

I would prefer it if wrong benchmarks were punished slightly less, so
that if a solver gets *any* number of wrong benchmarks in a logic,
then its overall score includes the negative of the weight of that
logic; instead I'd have:

  1 + 1 + 1 + -1*1 = 2

I think this is better since it still punishes heavily for wrong
benchmarks, but doesn't skew the results so that a solver's overall
score can be entirely ruined by one logic. (Here, my solver was
correct for A, B, C, so I should be rewarded for that.) I would also
argue that a solver that gets 1 wrong benchmark is not much better
than one that gets 3 wrong benchmarks, since both are unsound.

I believe this would encourage more participation in logics. For
instance, CVC4 may in the future get support for floating points, but
we would be discouraged from entering QF_FP, since there is a chance
that a handful of wrong answers there could single-handedly negate
our overall score.

Cheers,
Andrew Reynolds

On Wed, Mar 15, 2017 at 9:46 AM, Tjark Weber <tja...@it...> wrote:
> [the announcement above, quoted in full]
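[The cross-check Andrew describes in point (1) can be sketched as
follows. This is a minimal sketch under stated assumptions: `run`,
`get_model`, `get_unsat_core`, and `conjoin` are hypothetical callback
placeholders; a real harness would drive the solvers through SMT-LIB 2
commands such as (get-model) and (get-unsat-core).]

    # Sketch: gather evidence for a sat-vs-unsat disagreement on an
    # unknown-status benchmark, where solver_a answered "sat" and
    # solver_b answered "unsat". All helpers are hypothetical
    # placeholders supplied by the caller.

    def gather_evidence(benchmark, solver_a, solver_b,
                        run, get_model, get_unsat_core, conjoin):
        """Collect evidence for the rebuttal period; the final status
        decision is still made in the discussion."""
        evidence = {}

        # B claims unsat: re-run A on B's (usually much smaller)
        # unsat core. If A still says "sat", the disagreement is
        # reproduced on a small instance that is easier to inspect.
        core = get_unsat_core(solver_b, benchmark)
        evidence["a_on_b_core"] = run(solver_a, core)

        # A claims sat: conjoin A's model with the benchmark and
        # re-run B. If B says "unsat" on the strengthened problem,
        # B directly refutes A's concrete witness.
        model = get_model(solver_a, benchmark)
        evidence["b_on_benchmark_plus_model"] = run(
            solver_b, conjoin(benchmark, model))

        return evidence

[If neither check is decisive, the benchmark stays "unknown" and is
excluded from scoring, as proposed above.]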
From: Tjark W. <tja...@it...> - 2017-03-24 23:13:26
Andrew,

Thank you for your feedback!

On Thu, 2017-03-16 at 10:38 -0500, Andrew Reynolds wrote:
> Regarding disagreements between solvers, last year, as I recall,
> there was a separate discussion period where most (or all)
> disagreements between solvers were resolved, e.g., a developer
> would say "my solver was wrong for this one", etc. Perhaps the
> competition should:
> - Run all benchmarks/solvers, and note the set D of unknown-status
>   benchmarks that two solvers disagree on.
> - Have a rebuttal period where solver developers can justify their
>   answers for benchmarks in D.
> - Based on the discussion, set the status of the benchmarks in D,
>   and then compute the final scores.

I am concerned about some of the practical implications of this
suggestion. Preliminary results would have to be available probably
1-2 weeks earlier than in the past (to allow sufficient time for the
steps above); the competition would depend on status updates by the
SMT-LIB maintainers in a time-critical fashion; and solvers would be
run against one SMT-LIB release but scored against another.

It's not impossible, of course. But given the size of the competition
(with more than 200,000 benchmarks and 1.5M job pairs in 2016), these
issues have the potential to cause some headache. I would much prefer
a simpler scoring solution that doesn't involve manual review.

> (4) Scoring
>
> One thing that has bothered me in previous competitions is how the
> overall score is computed with respect to wrong benchmarks. Say I
> have logics A, B, C, D with weights 1, and I solve all benchmarks
> in A, B, C, but I get 3 wrong benchmarks in D. Then my overall
> score is:
>
>   1 + 1 + 1 + -3*1 = 0
>
> I would prefer it if wrong benchmarks were punished slightly less,
> so that if a solver gets *any* number of wrong benchmarks in a
> logic, then its overall score includes the negative of the weight
> of that logic; instead I'd have:
>
>   1 + 1 + 1 + -1*1 = 2

On the other hand, one might argue that an unsound solver should not
be trusted in any division. There is an even harsher alternative to
the current scoring rule that has been considered before: to
disqualify unsound solvers entirely. In other words, the current
scoring rule already is a compromise.

I would be interested to hear further opinions on this issue. How
much emphasis should the competition place on soundness of solvers?

Best,
Tjark
From: Andrew R. <and...@gm...> - 2017-03-26 23:03:40
Hi Tjark,

On Fri, Mar 24, 2017 at 6:13 PM, Tjark Weber <tja...@it...> wrote:
> I am concerned about some of the practical implications of this
> suggestion. Preliminary results would have to be available probably
> 1-2 weeks earlier than in the past (to allow sufficient time for
> the steps above); the competition would depend on status updates by
> the SMT-LIB maintainers in a time-critical fashion; and solvers
> would be run against one SMT-LIB release but scored against
> another.
>
> It's not impossible, of course. But given the size of the
> competition (with more than 200,000 benchmarks and 1.5M job pairs
> in 2016), these issues have the potential to cause some headache. I
> would much prefer a simpler scoring solution that doesn't involve
> manual review.

Right, I understand this would be a challenge. The proposal I
mentioned is more of a wish than a suggestion. :)

> On the other hand, one might argue that an unsound solver should
> not be trusted in any division. There is an even harsher
> alternative to the current scoring rule that has been considered
> before: to disqualify unsound solvers entirely. In other words, the
> current scoring rule already is a compromise.

Here, I would argue that the implementation of SMT solvers tends to
be fairly modular, meaning that if an SMT solver gets a wrong answer
for logic A, then the bug is very likely in the component that
specifically handles A, and the solver is likely still sound for
everything besides A. Historically, this has been the case for CVC4.
In my recollection:

- In SMT-COMP 2012, CVC4 had a wrong answer in QF_BV related to an
  unsound bit-vector rewrite.
- In SMT-COMP 2012, CVC4 had a wrong answer in QF_LIA related to a
  bug in the arithmetic theory solver.
- In SMT-COMP 2016, CVC4 had a wrong answer in QF_NIA related to the
  encoding of QF_NIA constraints as QF_BV ones.

In each of these cases, CVC4 was still (to my knowledge) sound for
everything besides the logics that had the bug. Of course this is not
guaranteed to be the case; there could be a solver with a bug that
affects all logics, but this is much less likely.

Let's consider another case:

Logics A, B with weights 1, 1.
Solver X gets 1 wrong answer in each of A and B; its total score is
-1*1 + -1*1 = -2.
Solver Y solves all benchmarks in A, but gets 4 wrong answers in B;
its total score is 1 + -4*1 = -3.

Here, I would argue that Solver Y is the better solver: it is
(likely) sound for logic A, whereas Solver X is definitely unsound
for both A and B. However, Solver X wins because it got lucky that
only 1 benchmark exercised its unsoundness in each of A and B,
whereas Solver Y was unlucky and had 4. If instead we used a
punishment of -1*(weight) for the logic regardless of the number of
wrong benchmarks, Solver Y would have a total score of 0, and Solver
X would still have -2, which I believe would be more appropriate.
(A sketch comparing the two rules follows this message.)

> I would be interested to hear further opinions on this issue. How
> much emphasis should the competition place on soundness of solvers?

I do think it is very important, and I do believe that an unsound
solver is worse than a solver that solves no benchmarks. My
opposition is to multiplying by the #wrong benchmarks, since this
skews the results unfairly. A punishment of -1*(weight of logic) is
still very severe, but is more appropriate for the overall score, in
my opinion.

Let me know what you think,
Andy
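[The Solver X / Solver Y example works out as follows under the two
penalty rules being compared. A minimal sketch; the per-logic tuples
encode the hypothetical results from the example above.]

    # Sketch comparing the two penalty rules from the example above.
    # Per logic: (weight, positive_score, num_wrong), where
    # positive_score is whatever the solver earned when it had no
    # wrong answers in that logic.

    def overall(results, capped: bool) -> float:
        """Overall score. capped=False: each wrong answer costs the
        full logic weight (-#wrong * weight, the current rule).
        capped=True: any number of wrong answers costs exactly
        -1 * weight (Andrew's proposal)."""
        total = 0.0
        for weight, positive, wrong in results:
            if wrong == 0:
                total += weight * positive
            else:
                total += -weight * (1 if capped else wrong)
        return total

    # Solver X: 1 wrong answer in each of logics A and B.
    solver_x = [(1, 0, 1), (1, 0, 1)]
    # Solver Y: all of A solved, 4 wrong answers in B.
    solver_y = [(1, 1, 0), (1, 0, 4)]

    for name, r in [("X", solver_x), ("Y", solver_y)]:
        print(name,
              "current rule:", overall(r, capped=False),
              " capped rule:", overall(r, capped=True))
    # Current rule: X = -2, Y = -3 (X ranks higher despite being
    # unsound in both logics). Capped rule: X = -2, Y = 0 (Y ranks
    # higher, as argued above).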
From: Mate S. <soo...@gm...> - 2017-03-26 23:35:00
Maybe I shouldn't interject here, but I think we could use a bit of
fuzzing on SMT solvers. The year when STP had a bug, we learned our
lesson and started using fuzzsmt (http://fmv.jku.at/fuzzsmt/). It
generated a small instance that exposed the bug within about 3
minutes. It now runs on every push to the repo. Of course fuzzsmt
doesn't expose all bugs, but it's a start. I personally also AFL-fuzz
the underlying SAT solver, which catches some more. Running the
fuzzsmt instances under valgrind (or clang's sanitizers) also helps,
especially when it catches undefined behaviour (typically an
uninitialized variable used to make a decision).

There is probably some work missing on SMT fuzzers, though; maybe
that could be encouraged? (A special track, maybe?) It's actually
hard to write them, see csmith (https://embed.cs.utah.edu/csmith/ --
note, one of the authors, John Regehr, was at CAV in 2016) and the
number of papers that came out of that. C is a complicated language,
but frankly, the SMT spec isn't exactly for the faint of heart,
either.

In the long run, it'd be good to make the solvers usable in industry,
and that means trust is really important, so building tools and
possibly also infrastructure for this could be interesting. There is
already infrastructure for checking speed (StarExec), so doing
something similar for validity checking is not unimaginable.

I guess what I'm saying is: instead of only punishing solvers that do
the wrong thing, we could encourage building tools and systems that
make it easier to do the right thing.

Just my 2 cents,

Mate

On 27 March 2017 at 00:03, Andrew Reynolds <and...@gm...> wrote:
> [Andrew's message above, quoted in full]
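[A minimal sketch of the kind of differential fuzzing loop Mate
describes: generate a random instance, run two solvers on it, and
flag any sat/unsat disagreement or crash. The fuzzsmt invocation and
the solver command lines below are assumptions for illustration;
adapt them to the actual tools.]

    # Sketch of a differential fuzzing loop in the spirit of Mate's
    # setup. Assumptions (adjust to your environment): fuzzsmt is
    # invoked as "java -jar fuzzsmt.jar <logic>" and writes an
    # SMT-LIB benchmark to stdout; each solver reads a benchmark file
    # and prints "sat", "unsat", or "unknown" as its first token.

    import subprocess
    import tempfile

    SOLVERS = [["solverA"], ["solverB", "--some-flag"]]  # hypothetical

    def generate(logic: str) -> str:
        out = subprocess.run(["java", "-jar", "fuzzsmt.jar", logic],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def ask(solver_cmd, path: str) -> str:
        try:
            out = subprocess.run(solver_cmd + [path],
                                 capture_output=True, text=True,
                                 timeout=60)
        except subprocess.TimeoutExpired:
            return "timeout"
        if out.returncode != 0:
            return "crash"  # crashes are findings too
        tokens = out.stdout.split()
        return tokens[0] if tokens else "empty"

    for i in range(1000):
        with tempfile.NamedTemporaryFile("w", suffix=".smt2",
                                         delete=False) as f:
            f.write(generate("QF_BV"))
            path = f.name
        answers = [ask(cmd, path) for cmd in SOLVERS]
        definite = {a for a in answers if a in ("sat", "unsat")}
        if len(definite) > 1 or "crash" in answers:
            # Keep the instance for triage (and shrinking).
            print(f"disagreement or crash on {path}: {answers}")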
From: Tjark W. <tja...@it...> - 2017-03-27 12:53:14
Andrew,

On Sun, 2017-03-26 at 18:03 -0500, Andrew Reynolds wrote:
> Here, I would argue that the implementation of SMT solvers tends to
> be fairly modular, meaning that if an SMT solver gets a wrong
> answer for logic A, then the bug is very likely in the component
> that specifically handles A, and the solver is likely still sound
> for everything besides A. [...]

I agree with your point about modularity. But note that the solver's
(likely) soundness for everything besides A is only established in
hindsight, through the results of the competition. The issue, of
course, is that the solver was originally advertised as sound for
everything including A.

Once the competition results are known, it is easy to turn a
partially unsound solver into a useful tool by restricting it to
appropriate logics. But it is hard to know in advance for which
logics a solver is (likely) sound if the developer's claims cannot be
relied on.

> Let's consider another case:
>
> Logics A, B with weights 1, 1.
> Solver X gets 1 wrong answer in each of A and B; its total score is
> -1*1 + -1*1 = -2.
> Solver Y solves all benchmarks in A, but gets 4 wrong answers in B;
> its total score is 1 + -4*1 = -3.
>
> Here, I would argue that Solver Y is the better solver: it is
> (likely) sound for logic A, whereas Solver X is definitely unsound
> for both A and B.

Note that one could also argue the opposite point of view: suppose
you are asked to solve a benchmark that will be chosen at random from
SMT-LIB (including logics A and B) by applying one of these solvers.
(You'll have to take the solver's answer at face value.) Which solver
would you rather use, X or Y?

> I do think it is very important, and I do believe that an unsound
> solver is worse than a solver that solves no benchmarks. My
> opposition is to multiplying by the #wrong benchmarks, since this
> skews the results unfairly. A punishment of -1*(weight of logic) is
> still very severe, but is more appropriate for the overall score,
> in my opinion.

I tend to agree that simply multiplying by #wrong benchmarks is
awkward: we already have a weight for each logic that takes the
number of benchmarks in that logic into account.

I suspect the original motivation was to punish solvers that get many
benchmarks wrong more severely than solvers that get only a few
benchmarks wrong. This is easy to achieve for the division rankings,
but I am not sure whether this property is worth retaining for the
competition-wide ranking. Arguably, any solver that is unsound in
some division shouldn't be used to solve benchmarks from that
division.

Best,
Tjark
From: Andrew R. <and...@gm...> - 2017-03-30 17:21:53
Hi Tjark,

On Mon, Mar 27, 2017 at 7:53 AM, Tjark Weber <tja...@it...> wrote:
> I agree with your point about modularity. But note that the
> solver's (likely) soundness for everything besides A is only
> established in hindsight, through the results of the competition.
> The issue, of course, is that the solver was originally advertised
> as sound for everything including A.
>
> Once the competition results are known, it is easy to turn a
> partially unsound solver into a useful tool by restricting it to
> appropriate logics. But it is hard to know in advance for which
> logics a solver is (likely) sound if the developer's claims cannot
> be relied on.

One point here is that the competition should encourage innovation.
So if there is a difficult new logic (e.g. floating points or
strings) and I'd like to enter, but I'm not 100% confident in my
solver for that logic, it still would be nice to be able to enter
without having to risk the score for the logics where I am sure my
solver is correct.

> Note that one could also argue the opposite point of view: suppose
> you are asked to solve a benchmark that will be chosen at random
> from SMT-LIB (including logics A and B) by applying one of these
> solvers. (You'll have to take the solver's answer at face value.)
> Which solver would you rather use, X or Y?

The use case of randomly selecting a benchmark in any logic isn't
very interesting, in my opinion. In practice, I should be able to
select a solver based on the logic. So in this case I would be
confident in solver Y for A, and if I had a problem in logic B,
neither solver would be a good option. I wouldn't even necessarily
say that Solver X is much better than Solver Y for logic B; despite
the 4 vs. 1 wrong answers, I wouldn't use either one.

> I tend to agree that simply multiplying by #wrong benchmarks is
> awkward: we already have a weight for each logic that takes the
> number of benchmarks in that logic into account.
>
> I suspect the original motivation was to punish solvers that get
> many benchmarks wrong more severely than solvers that get only a
> few benchmarks wrong. This is easy to achieve for the division
> rankings, but I am not sure whether this property is worth
> retaining for the competition-wide ranking. Arguably, any solver
> that is unsound in some division shouldn't be used to solve
> benchmarks from that division.

Right, this is my point. Another option would be to multiply by
something like -( #wrong / ( 1 + #wrong ) )*(weight of logic), where
you'd get a factor of -1/2, -2/3, -3/4, ... as the number of wrong
benchmarks increases, but this is bounded by -1. (A small sketch of
this factor follows this message.)

Let me know what you think.

Cheers,
Andy
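[For concreteness, the bounded penalty factor Andrew proposes gives
the following values. A small sketch; the numbers of wrong answers
are hypothetical.]

    # Andrew's bounded penalty factor: -(#wrong / (1 + #wrong)).
    # It grows with the number of wrong answers but never exceeds 1
    # in magnitude, so one bad logic cannot dominate the
    # competition-wide score.

    def penalty_factor(num_wrong: int) -> float:
        return -num_wrong / (1 + num_wrong)

    for w in [0, 1, 2, 3, 10, 100]:
        print(w, penalty_factor(w))
    # 0 -> 0.0, 1 -> -0.5, 2 -> -0.667, 3 -> -0.75,
    # 10 -> -0.909, 100 -> -0.990 (approaching but never reaching -1)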
From: Tjark W. <tja...@it...> - 2017-03-31 13:08:47
Andrew,

On Thu, 2017-03-30 at 12:21 -0500, Andrew Reynolds wrote:
> One point here is that the competition should encourage innovation.
> So if there is a difficult new logic (e.g. floating points or
> strings) and I'd like to enter, but I'm not 100% confident in my
> solver for that logic, it still would be nice to be able to enter
> without having to risk the score for the logics where I am sure my
> solver is correct.

Of course, I see your point. If you are only concerned about one or
two logics, there is (usually) the option to enter the solver into
these logics as non-competing. But in general, I think it makes sense
that solver developers need to commit to a set of logics in advance.

> The use case of randomly selecting a benchmark in any logic isn't
> very interesting, in my opinion.

The random choice is merely intended to model some real-world
application that generates problems in logics A and B. (This is based
on the assumption that SMT-LIB benchmarks in some sense approximate
real-world problems, which is certainly debatable.)

> Another option would be to multiply by something like
> -( #wrong / ( 1 + #wrong ) )*(weight of logic), where you'd get a
> factor of -1/2, -2/3, -3/4, ... as the number of wrong benchmarks
> increases, but this is bounded by -1.

In principle, other functions to compute the factor could perhaps
also be considered, as long as they are (i) monotonic and (ii) take
values between 0 and 1 (or -1, as the case may be). For instance, the
function to compute the factor for correct answers is
(#correct/#total)^2. Your proposals seem more appropriate, though:
they give a greater (negative) factor as soon as there is at least
one incorrect answer, in line with the view that a solver shouldn't
be used for a logic for which it isn't sound.

I wonder how the "value" of a correct answer compares to the "cost"
of an incorrect answer in a typical application of SMT solvers. It is
hard to make this precise, of course, but my hunch is that the latter
is typically far greater. Your original motivation to encourage
innovation notwithstanding, to me this would suggest another factor
(some constant > 1) to place more emphasis on negative scores for
incorrect answers.

I would also like to caution that the competition-wide ranking is in
a sense comparing apples and oranges. As this discussion shows,
mapping the performance of solvers that support very different logics
to a single number necessarily comes with some challenges. Anyone
interested in specific logics should probably look at the
competition's results for those logics instead.

Best,
Tjark
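[Putting the pieces of this exchange together: a sketch of one
logic's contribution to the competition-wide score, combining the
(#correct/#total)^2 factor Tjark mentions for correct answers with
the bounded penalty under discussion, scaled by a soundness-emphasis
constant c > 1. This is a reconstruction from the thread, not the
official SMT-COMP formula; the value of c and the penalty shape are
exactly the points still under debate.]

    # Sketch: per-logic contribution to the competition-wide score,
    # as reconstructed from this thread (not the official rule).
    # With no wrong answers, the factor is (#correct/#total)^2, as
    # Tjark notes; with wrong answers, it is a bounded negative
    # factor scaled by an assumed constant c > 1 to weight soundness
    # more heavily (his suggestion).

    def logic_score(weight: float, correct: int, wrong: int,
                    total: int, c: float = 2.0) -> float:
        if wrong == 0:
            return weight * (correct / total) ** 2
        return -c * weight * (wrong / (1 + wrong))

    # Andrew's A, B, C, D example (weights 1), with a hypothetical
    # 100 benchmarks per logic and 3 wrong answers in D:
    logics = [(1, 100, 0), (1, 100, 0), (1, 100, 0), (1, 97, 3)]
    print(sum(logic_score(w, ok, bad, 100) for w, ok, bad in logics))
    # 1 + 1 + 1 - 2*(3/4) = 1.5 with c = 2, versus 0 under the
    # current -#wrong rule and 2 under a flat -1 cap.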