From: Julian S. <js...@ac...> - 2003-04-04 21:06:41
On Friday 04 April 2003 8:17 pm, John Regehr wrote:
> > -- you've no idea how much that helps. Reproducing problems that people report is the #1 problem we have in debugging V; once we reproduce a problem, tracking it down is simple.
>
> Is it just reproducing the problem that's hard, or do you mean "reproducing in a reasonable sized program"?

Reproducing it at all. Quite often we get reports of the form

   I have a 1/2 million line fortran program for doing geophysics calculations. Under some obscure circumstances, this causes V to bomb out with ... assertion failure. I am running on MutantLinux 12.34.567 (with foobar-1.9 patch) and the code is compiled by ExpensiveRealMoneyCompiler v 41.97. Our code is proprietary, so unfortunately we can't send you the source. Can you help us?

and in these circumstances there's practically nothing we can do apart from note the bug and hope that someone finds a more tractable test case for it. Even if we could have the sources, setting up the precise environment to repro it is very time-consuming, and we all have day jobs (etc).

Interestingly, one solution to the above is for the bug reporter to make me an account on their machine and allow me to ssh in, so I can reproduce the bug in place. This has proved very effective in the half-dozen or so times I've done it, and I appreciate the trust of those who allow it. I bet not many people can say they have used emacs at a distance of 12000 miles: in the most recent case, the bug was in New Zealand and I'm in the UK.

> If the latter, then there are techniques that might be able to help. They basically perform a space-wise or time-wise binary search in order to narrow down the problem, exploiting the fact that we have a known-correct implementation of an x86.

Yes, that's how V was debugged in the first place. I knew from the start that making the virtual CPU work properly would be a problem.
So a fundamental design decision was that a program running on valgrind has a memory layout that allows switching over to the real CPU at any point. By changing the switchover point, you can do a binary search to find the exact basic block that is being mistranslated. This is controlled by the --stop-after= flag. Without that, V would never have worked.

Design for debuggability / verifiability, I say. Automated debugging is the way to go.

J
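To make the switchover-point bisection concrete, here is a minimal sketch of the search loop. It is not valgrind's code. It assumes a hypothetical harness `run_ok(n)` that runs the program on the virtual CPU for the first `n` basic blocks, switches to the real CPU (what --stop-after= controls), and reports whether the final result is correct. It also assumes the failure is monotone: once a mistranslated block has executed on the virtual CPU, every later switchover point also gives a wrong result. Under those assumptions, `run_ok(0)` (all native, i.e. the known-correct x86) is true, and the search returns the smallest failing `n`, i.e. the 1-based index of the suspect block.

```python
from typing import Callable, Optional


def find_mistranslated_block(total_blocks: int,
                             run_ok: Callable[[int], bool]) -> Optional[int]:
    """Bisect on the switchover point to locate a mistranslated basic block.

    run_ok(n) is a hypothetical stand-in for: "simulate the first n basic
    blocks, then hand over to the real CPU, and check the program's result".
    Assumes run_ok(0) is True (pure native run is correct) and that failure
    is monotone in n.  Returns the smallest failing n (the 1-based index of
    the suspect block), or None if even a fully simulated run is correct.
    """
    if run_ok(total_blocks):
        return None  # nothing to find: full simulation gives the right answer

    # Invariant: run_ok(lo) is True and run_ok(hi) is False.
    lo, hi = 0, total_blocks
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_ok(mid):
            lo = mid  # the bad block comes after mid
        else:
            hi = mid  # the bad block is at or before mid
    return hi


if __name__ == "__main__":
    # Fake harness for illustration: pretend block 37 is mistranslated.
    print(find_mistranslated_block(1000, lambda n: n < 37))
```

Each probe is one full run of the program, so the search needs only about log2(N) runs; for a million basic blocks that is roughly 20 runs, which is what makes this practical even on large programs.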