Say foo() calls bar(), but foo()'s result does not depend
upon bar()'s result. There is no dependence for foo() on
the function body of bar(). So, if bar() is changed, foo()
can still get a cache hit. However, a fatal bug may be
introduced into bar(), and foo()'s cache hit would hide
that fatal bug. So a developer could introduce a bug, but still
build successfully. The bug would only surface once
the cache is flushed (likely after the developer released
a seemingly functional build).
Fundamentally, the issue is that the fatality of a function
is not considered to be part of its result. And the cache
does not record dependencies upon a function's fatality.
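The problem can be sketched in Python with a minimal model of a result-dependency cache (a hypothetical illustration, not the Vesta evaluator; `cached_call`, `bar_v1`, and `bar_v2` are invented names):

```python
# Minimal model: cache entries are keyed only on the values the
# result actually depends on, so a discarded callee leaves no trace.

cache = {}  # (function name, recorded-dependency values) -> result

def cached_call(name, deps, compute):
    """Look up or compute a result keyed only on recorded dependencies."""
    key = (name, tuple(sorted(deps.items())))
    if key not in cache:
        cache[key] = compute()
    return cache[key]

def bar_v1():
    return 42

def bar_v2():
    raise RuntimeError("fatal bug introduced into bar()")

def foo(bar):
    bar()        # called for effect; result is discarded
    return 1     # result depends on nothing bar() produced

# First build: foo() calls the working bar(); the result is cached
# with no recorded dependency on bar()'s body.
r1 = cached_call("foo", {}, lambda: foo(bar_v1))

# Second build: bar() is now broken, but foo()'s recorded
# dependencies are unchanged, so the cache hits and the fatal bug
# is never executed.
r2 = cached_call("foo", {}, lambda: foo(bar_v2))
```

The bug only surfaces once the cache entry is gone and `foo` actually re-executes.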
Ken has prepared a reproducer, included inline below.
It might at first appear sufficient to record dependencies
upon all function bodies that are called, regardless of
any dependency upon their results, but, as Ken pointed
out, this is not sufficient. The secondary inputs of foo()
can affect the code flow, and therefore, whether bar() is
called. A cache entry can be created for a foo() call
which does not call bar(), and has no knowledge that
bar() has any potential to be called. There is no easy way
to record the dependence on the secondary input which
affected the code flow.
One solution might be to defer the fatality of a function.
Return a "fatal" indication as the function result, and
propagate this through the evaluation. Only if the build
result depends on the failing function's return value
would the build fail. Unfortunately, this approach is
nearly impossible to implement. The evaluator's
determination of what the build depends on is
conservative, meaning it might record more
dependencies than are necessary or a coarser-grained
dependency than necessary. This would leave open the
possibility that a call of foo() which does call bar()
is "conservatively" detected as being dependent on
bar(), and therefore can be fatal based on bar(), when it is
not truly dependent. In this case, a cache entry could
exist for a foo() call which does not call bar() and hits for
this falsely-dependent-and-therefore-fatal build. So, the
evaluator's dependence analysis cannot be used to
determine the fatality of a bug. (At least that's the
conclusion I got to.)
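The deferred-fatality idea can be sketched in Python (a hypothetical model; `Err`, `add`, and `discard_and_return_one` are illustrative names, not part of Vesta):

```python
# Deferred fatality: an error becomes a value that only fails the
# build if the final result actually uses it.

class Err:
    """Stand-in for an SDL ERR value carrying the original failure."""
    def __init__(self, reason):
        self.reason = reason

def add(a, b):
    # Any operation that *uses* an Err propagates it.
    if isinstance(a, Err):
        return a
    if isinstance(b, Err):
        return b
    return a + b

def discard_and_return_one(x):
    # The result does not use x, so an Err in x is silently dropped.
    return 1

e = Err("bar() failed")
# Fatality propagates when the value is used...
propagated = add(e, 2)
# ...and vanishes when it is not.
dropped = discard_and_return_one(e)
```

The difficulty described above is that the evaluator's conservative dependency analysis may treat a value as "used" when it is not, which is exactly what this simple model cannot capture.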
It would seem the only feasible solution is to remove
fatality from the build process altogether. This means
not just failing _run_tool() invocations, but any SDL
errors for a function as well. This would not exactly be
a backward compatible change, but it would only
change the behavior of previously-failing builds. This is
probably not a real problem. SDL could check the
return values for failure (ERR?), but special-casing
failure is not required. Even if not explicitly checked,
the error would be reported to stderr. If the SDL is not
written to be resilient to errors, this error would likely
lead to many others downstream. As with many
compilers/interpreters, only the first bug is guaranteed
to be meaningful, but it would be a potential benefit to
see multiple bugs from a single evaluation. (Or the
evaluator could hide these.) The additional stderr output
would be clutter, but existing error reports have a clutter
issue already. It would be good for the evaluator to flag
the error and disable subsequent generation of cache
entries. This would enable regeneration of the error
output. Also, a flag to make errors fatal would be fine
too -- wait, this gets us right back where we started --
yes, but now there is a non-fatal mode of operation that
is guaranteed to produce a reproducible result.
So, maybe all we need is to add a "-non-fatal" flag that
will, on previously-fatal-error, disable caching, and return
out of the enclosing function with an ERR return value.
Builds without this flag would have the reproducibility
issue, but with this flag would not. And no current
behavior would be affected. A project must consider a
build "good" independent of whether the build would be fatal
without -non-fatal, as it is not possible to prevent these "hidden"
errors from sneaking in. Any rebuilds of successful
builds, as well as integration builds (which merge
successful builds), should use -non-fatal. General
building for development might or might not use this
flag, but code should not be released that has visible
errors. Looking at errors from things like integration
builds is still useful. Unfortunately, it would be difficult
to distinguish hidden errors from those released by
someone who did not follow proper procedures, so there
could still be some misdirected blame. And this very
complex issue is exposed to the user through the
-non-fatal flag. But it seems to be our best option.
Note: The -k option to "vesta" controls fatality of
_run_tool() calls.
Ken's reproducer:
{
    /**nocache**/
    good()
    {
        _ = _print("good function called");
        return 1;
    };

    call_but_dont_use(b)
    {
        // Note: the function has to be passed in a complex data
        // structure like a binding so it doesn't become part of
        // the PK of this function.
        _ = b/f();
        // Return something which doesn't depend on the function
        // so we don't record any dependency on it.
        return 1;
    };

    return <call_but_dont_use([f=good]), // This populates the cache.
            call_but_dont_use([f="This isn't even a function!"])>;
            // This would fail, but it is instead a cache hit!!!
}
Logged In: YES
user_id=95236
This issue is not a surprise to me, and I find it hard to
get excited about it. Well, I admit it's a reproducibility
issue: there's a chance that you did a build once that
should have failed but didn't, and later when you try to
recreate it, it does fail.
This is bad, but as a consolation, it should be easy to fix
the failing build -- the error should appear when evaluating
the bad function, and since the build result does not depend
on that function in any way, the fix is simply to delete the
call to it. I guess you can create scenarios in which the
call is hard to remove because of some elaborate code
structure, and you also can't delete the function body
because it's called elsewhere. Sigh.
Another small consolation is that it shouldn't happen much
since it should be fairly uncommon to call a function and
discard its result. Shouldn't it?
I didn't find the idea of making it possible for failures
to return a noncacheable ERR and continue very
convincing. This is actually an idea we had in the initial
Vesta SDL design (ERR might not even exist if we'd never had
the idea), but it didn't work out well and we discarded it.
I suppose it's possible we just didn't understand how to
make it work, but it still seems kind of questionable. One
specific problem is that if you hit an early failure but
there is still a ton more work that could be done on the
evaluation, you'll do all that work instead of stopping and
reporting the failure back to the user early. I think the
latter is usually what people would prefer.
Logged In: YES
user_id=304837
Steve Hoover wrote:
> One solution might be to defer the fatality of a function.
> Return a "fatal" indication as the function result, and
> propagate this through the evaluation. Only if the build
> result depends on the failing function's return value
> would the build fail. Unfortunately, this approach is
> nearly impossible to implement.
I don't think this is "nearly impossible to implement",
though it has some potential problems.
Once upon a time, run-time errors during SDL evaluation were
not immediately fatal. Instead, an operation with an error
resulted in the special error value ERR. It would not be
hard to go back to that way of doing things, as making such
problems fatal was done by making the constructor for the
ErrorVC class throw an exception. (See line 1250 in
/vesta/vestasys.org/vesta/eval/72/src/Val.C.)
I'm not sure I recall the exact motivations for this change
(it was made more than 4 years ago, before Vesta was
released as free software and before I became the primary
maintainer of it), but I think they were:
- Failing immediately rather than propagating ERR values up
the call chain makes it easier to debug errors, because you
can see precisely where the error occurs. (When the ERR
value was instead generated, it might propagate up through
several levels of the SDL call stack losing the exact point
where the error occurred.)
- Bridges would sometimes handle or generate ERR values,
which added a bunch of additional complexity to them. (For
an example of this see line 21 of
/vesta/vestasys.org/bridges/mtex/1/build.ves.)
- Tool failures were already immediately fatal (which also
simplified bridges and debugging), and it seemed confusing
to have two different policies on how failures would affect
the continuation of an evaluation.
I don't think it would be too hard to deal with these issues
and still allow execution to proceed past errors. Aside
from the idea Steve (the submitter) proposed of having a
single fatal/non-fatal switch for error handling (and we
could use the existing "-k" flag for that), here are some
other ideas:
- We could have the error value record and carry with it a
stack trace of where it came from. This would make it
possible to determine the cause of an error even if it makes
its way all the way up to the final result of a build. To
really make this work we would have to have all the
operators and primitive functions handle ERR as an input
value and add more information to this stored error context,
rather than simply recording a new error context and
discarding the old one.
- In the "non-fatal" case we could have a failed _run_tool
return ERR rather than its normal result.
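The stack-trace-carrying error value suggested in the first idea above can be sketched in Python (a hypothetical model; `with_context` is an illustrative name, not a Vesta API):

```python
# An error value that accumulates a trace as it propagates: each
# operator or primitive that receives it appends its own frame
# instead of discarding the old error context.

class Err:
    def __init__(self, message, trace=None):
        self.message = message
        self.trace = trace or []

def with_context(val, frame):
    """Pass non-errors through; extend an error's recorded trace."""
    if isinstance(val, Err):
        return Err(val.message, val.trace + [frame])
    return val

e = Err("division by zero", ["bar()"])
e = with_context(e, "foo()")
e = with_context(e, "build()")
# e.trace now records the full path the error took, even if it
# propagates all the way up to the final result of the build.
```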
Of course much of this might become moot if the evaluator
had a built-in debugger:
https://sourceforge.net/tracker/index.php?func=detail&aid=1223244&group_id=34164&atid=410430
This whole problem seems to me to represent a fundamental
tension between the desired behavior while a user is making
changes and when they wish to repeat an existing build.
While actively working on changes, errors should be reported
immediately and precisely and impede forward progress. When
repeating a previous build, errors should only be an issue
if the result actually depends upon them.
Tim Mann wrote:
> This is bad, but as a consolation, it should be easy to
> fix the failing build -- the error should appear when
> evaluating the bad function, and since the build result
> does not depend on that function in any way, the fix is
> simply to delete the call to it.
While you're right that it shouldn't be too hard for someone
knowledgeable in SDL to fix any individual instance of this,
I don't think we can just brush this off for several
reasons:
1. In our experience, the vast majority of users aren't
very knowledgeable about SDL. This means that most users
would be confounded by running into such a mysterious
failure. I think it may also make it more likely for this
sort of problem to occur in the first place: a novice SDL
writer is more likely to write code which calls a function
but doesn't use its result.
2. Suppose that we did just accept that any such broken but
previously successful builds would have to be patched up
when a user revisited it. It seems likely to me that for
some builds this could happen repeatedly, which could really
waste the user's time. How would a user find a previously
patched-up version of the build they're interested in?
3. This could go unnoticed for a long time. The longer the
cache lasts (and these days a catastrophic loss of the cache
is much less common than it used to be), the more such
broken but previously successful builds could be created.
It could even take place at multiple sites simultaneously,
if they all first evaluated the necessary successful build
to cause a cache hit in the right place.
4. Records referencing specific builds (such as issue
tracking databases) could be left with pointers to broken
but previously successful builds. In general it would be
impractical to find and fix such references.
I can imagine some paranoid users wanting to qualify builds
by using an empty cache or after discovering such a problem
wanting to alter the immutable SDL file which holds the
broken function to fix up everything at once.
I have to agree with Steve that it's worth doing something
about this.
Tim Mann wrote:
> Another small consolation is that it shouldn't happen much
> since it should be fairly uncommon to call a function and
> discard its result. Shouldn't it?
How would you know how often it happens? The only builds
you can be sure don't have this problem are ones built
against an empty cache. That doesn't happen very often in
practice.
Logged In: YES
user_id=95236
Thinking about this some more, I was way too dismissive
about it. Sorry. I'm sure we did think about this problem
early on, and it's why ERR got into the language in the
first place. We had some problems working the idea out
correctly and eventually dropped it, but I'm thinking now
that that was a mistake.
It would make sense, I think, to expand the -k flag so that
"fatal" errors return a noncacheable ERR instead. Maybe
that's what we should have done in the first place, instead
of switching from always returning ERR on these errors to
always failing immediately.
There were a few points that we (or at least I) found
confusing when we had the feature of errors returning ERR
and continuing.
One basic one is that we made ERR a constant in the language
that you could write explicitly. This created a lot of
confusion between ERR caused by an error and ERR written by
a programmer intentionally as an out of band value like NULL.
It was especially confusing since you could write things
like "foo == ERR". If foo is an ERR value propagated from
an error, should foo == ERR return TRUE or continue to
propagate the ERR? Similarly, should _isbool(foo) return
FALSE or ERR?
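The two readings of "foo == ERR" can be contrasted in a small Python model (hypothetical; `eq_as_sentinel` and `eq_as_failure` are illustrative names):

```python
# Interpretation 1: ERR is an ordinary out-of-band value, like NULL,
# so equality tests against it yield a boolean.
# Interpretation 2: any operation touching ERR propagates it, so
# "foo == ERR" can never yield a boolean.

ERR = object()   # stand-in for the language constant

def eq_as_sentinel(a, b):
    # ERR as a testable sentinel value.
    return a is b

def eq_as_failure(a, b):
    # ERR as a propagating failure.
    if a is ERR or b is ERR:
        return ERR
    return a == b

foo = ERR  # suppose this ERR propagated from some failed operation
# Under one reading the comparison answers the question; under the
# other it swallows it.
sentinel_result = eq_as_sentinel(foo, ERR)   # a boolean
failure_result = eq_as_failure(foo, ERR)     # still ERR
```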
It would be nice to get rid of the ERR constant to avoid
this confusion, but it's been in the language for years and
so there could be lots of code that uses it. One idea is to
say that the literal ERR is simply a way to intentionally
generate a fatal error. If you evaluate an ERR in the "stop
on errors" mode (no -k flag), the evaluation would stop
right there. This is an attractive notion but may break
existing code. (I wonder if we ever had a version of the
system that did this? I'm not sure.) Anyway, in the
following I'm going to ignore non-fatal ERR written by the
programmer; I'll use ERR only to mean a real error that
would be fatal in the current evaluator.
In general, specifying when an ERR input to a function
should propagate and when it should be discarded is a tricky
issue. In order to fix the problem you're concerned with, I
think ERR must be discarded whenever the result would have
no dependency on the ERR and kept whenever it would. It
seems like this makes the definition
implementation-dependent -- it depends on how fine-grained
and smart the dependency analysis is. I'm not sure, though;
maybe there's a clear way to specify it. Maybe what we had
in the language spec before we made errors always fatal
actually was right, or close, despite being confusingly worded.
Hmm, I see that wording is still in SRC-1997-005c. I guess
we didn't take it out until the book draft became the
maintained version.
Note that it has to be possible to carry a fatal ERR around
in a binding, since you can have an SDL program that
evaluates several things, puts them into a binding, then
selects one of them and ignores the rest, so that the result
has no dependency on those. Is the same true for lists?
I think when an ERR is discarded the result can be made
cacheable again. In fact, you need this so that evaluations
of the sort you're concerned about will get cached when you
reproduce them. Knowing when ERR is fully gone is not
entirely trivial, though, since an ERR can be nested deep
inside a binding. A function has to have a result with no
ERRs in it to be cacheable.
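The "no nested ERRs" cacheability test can be sketched in Python (a hypothetical model, with dicts standing in for SDL bindings and lists for SDL lists):

```python
# A result is cacheable only if no ERR is nested anywhere inside it;
# an ERR can be buried deep within a binding, so the check must be
# recursive.

class Err:
    pass

def contains_err(v):
    """Recursively check a value for a nested ERR."""
    if isinstance(v, Err):
        return True
    if isinstance(v, dict):
        return any(contains_err(x) for x in v.values())
    if isinstance(v, list):
        return any(contains_err(x) for x in v)
    return False

binding = {"a": 1, "b": {"deep": Err()}}   # ERR buried in a binding
# The whole binding is not cacheable, but selecting only "a"
# discards the ERR and the selected result is cacheable again.
whole_cacheable = not contains_err(binding)
selected_cacheable = not contains_err({"a": binding["a"]})
```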
Note that we don't have to worry separately about possibly
caching a value with a dependency on ERR, if the principle I
gave above is correct -- any value that depends on an ERR
has to contain an ERR.
Just some thoughts.