Re: [Valgrind-users] Possible bug in valgrind-1.9.4

On Friday 04 April 2003 8:17 pm, John Regehr wrote:
> > -- you've no idea how much that helps.  Reproducing problems that people
> > report is the #1 problem we have in debugging V; once we reproduce a
> > problem, tracking it down is simple.
>
> Is it just reproducing the problem that's hard, or do you mean
> "reproducing in a reasonable sized program"?

Reproducing it at all.  Quite often we get reports of the form

  I have a 1/2 million line fortran program for doing geophysics
  calculations.  Under some obscure circumstances, this causes
  V to bomb out with ... assertion failure.  I am running on 
  MutantLinux 12.34.567 (with foobar-1.9 patch) and the code is
  compiled by ExpensiveRealMoneyCompiler v 41.97.  Our code is
  proprietary, so unfortunately we can't send you the source.

  Can you help us?

and in these circumstances there's practically nothing we can do apart
from noting the bug and hoping that someone finds a more tractable test
case for it.  Even if we could have the sources, setting up the precise
environment to repro it is very time-consuming, and we all have
day jobs (etc).

Interestingly, one solution to the above is for the bug reporter to
make me an account on their machine and allow me to ssh in, so I can
reproduce the bug in-place.  This has proved very effective in the
half-dozen or so times I've done it, and I appreciate the trust of
those who allow it.  I bet not many people can say they have used
emacs at a distance of 12000 miles -- in the most recent example of this,
the bug was in New Zealand, and I'm in the UK.

> If the latter, then there are techniques that might be able to help.
> They basically perform a space-wise or time-wise binary search in order to
> narrow down the problem, exploiting the fact that we have a known-correct
> implementation of an x86.

Yes, that's how V was debugged in the first place.  I knew from the 
start that making the virtual CPU work properly would be a problem.
So a fundamental design decision was that the program, when run on
valgrind, has a memory layout which allows switching over to the
real CPU at any point.  By changing the switchover point, you can 
do a binary search to find the exact basic block which is being
mistranslated.  This is controlled by the --stop-after= flag.
Without that, V would never have worked.
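
To make the search concrete, here is a rough sketch of the bisection
over the switchover point.  It is an illustration only, not V's actual
code: find_first_bad_block and the fails_when_stopping_after callback
are hypothetical names, and the callback stands in for "run the program
with the first n basic blocks on the simulated CPU, the rest on the
real CPU, and report whether it misbehaved".  It also assumes the
failure is monotone: once the bad block is on the simulated side, every
larger n fails too.

```python
def find_first_bad_block(fails_when_stopping_after, total_blocks):
    """Return the smallest n such that simulating the first n blocks fails.

    Assumptions (hypothetical harness, not Valgrind's API):
      - simulating 0 blocks (everything on the real CPU) works,
      - simulating all total_blocks blocks fails,
      - failure is monotone in n.
    """
    lo, hi = 0, total_blocks  # invariant: lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_when_stopping_after(mid):
            hi = mid  # the bad block is among the first mid blocks
        else:
            lo = mid  # the first mid blocks are translated correctly
    return hi  # the hi-th block executed is the first one to go wrong


# Stand-in for a real run: pretend the 31337th block is mistranslated.
probes = []

def fake_run(n):
    probes.append(n)
    return n >= 31337

bad = find_first_bad_block(fake_run, 1_000_000)
print(bad, len(probes))
```

With a million blocks executed, this needs about 20 runs rather than a
million, which is what makes the approach practical.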

Design for debuggability / verifiability, I say.  Automated debugging
is the way to go.

J