Say foo() calls bar(), but foo()'s result does not depend
upon bar()'s result. There is no dependence for foo() on
the function body of bar(). So, if bar() is changed, foo()
can still get a cache hit. However, a fatal bug may be
introduced into bar(), and foo()'s cache hit would hide
that fatal bug. So a developer could introduce a bug, but still
build successfully. The bug would only surface once
the cache is flushed (likely after the developer released
a seemingly functional build).
Fundamentally, the issue is that the fatality of a function
is not considered to be part of its result. And the cache
does not record dependencies upon a function's fatality.
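The problem can be sketched in Python with a minimal model of a result-dependency cache (a hypothetical illustration, not the Vesta evaluator; `cached_call`, `bar_v1`, and `bar_v2` are invented names):

```python
# Minimal model: cache entries are keyed only on the values the
# result actually depends on, so a discarded callee leaves no trace.

cache = {}  # (function name, recorded-dependency values) -> result

def cached_call(name, deps, compute):
    """Look up or compute a result keyed only on recorded dependencies."""
    key = (name, tuple(sorted(deps.items())))
    if key not in cache:
        cache[key] = compute()
    return cache[key]

def bar_v1():
    return 42

def bar_v2():
    raise RuntimeError("fatal bug introduced into bar()")

def foo(bar):
    bar()        # called for effect; result is discarded
    return 1     # result depends on nothing bar() produced

# First build: foo() calls the working bar(); the result is cached
# with no recorded dependency on bar()'s body.
r1 = cached_call("foo", {}, lambda: foo(bar_v1))

# Second build: bar() is now broken, but foo()'s recorded
# dependencies are unchanged, so the cache hits and the fatal bug
# is never executed.
r2 = cached_call("foo", {}, lambda: foo(bar_v2))
```

The bug only surfaces once the cache entry is gone and `foo` actually re-executes.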
Ken has prepared a reproducer, included inline below.
It might at first appear sufficient to record dependencies
upon all function bodies that are called, regardless of
any dependency upon their results, but, as Ken pointed
out, this is not sufficient. The secondary inputs of foo()
can affect the code flow, and therefore, whether bar() is
called. A cache entry can be created for a foo() call
which does not call bar(), and has no knowledge that
bar() has any potential to be called. There is no easy way
to record the dependence on the secondary input which
affected the code flow.
One solution might be to defer the fatality of a function.
Return a "fatal" indication as the function result, and
propagate this through the evaluation. Only if the build
result depends on the failing function's return value
would the build fail. Unfortunately, this approach is
nearly impossible to implement. The evaluator's
determination of what the build depends on is
conservative, meaning it might record more
dependencies than are necessary or a coarser-grained
dependency than necessary. This would leave open the
possibility that a call of foo() which does call bar()
is "conservatively" detected as being dependent on
bar(), and therefore can be fatal based on bar(), when it is
not truly dependent. In this case, a cache entry could
exist for a foo() call which does not call bar() and hits for
this falsely-dependent-and-therefore-fatal build. So, the
evaluator's dependence analysis cannot be used to
determine the fatality of a bug. (At least that's the
conclusion I got to.)
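The deferred-fatality idea can be sketched in Python (a hypothetical model; `Err`, `add`, and `discard_and_return_one` are illustrative names, not part of Vesta):

```python
# Deferred fatality: an error becomes a value that only fails the
# build if the final result actually uses it.

class Err:
    """Stand-in for an SDL ERR value carrying the original failure."""
    def __init__(self, reason):
        self.reason = reason

def add(a, b):
    # Any operation that *uses* an Err propagates it.
    if isinstance(a, Err):
        return a
    if isinstance(b, Err):
        return b
    return a + b

def discard_and_return_one(x):
    # The result does not use x, so an Err in x is silently dropped.
    return 1

e = Err("bar() failed")
# Fatality propagates when the value is used...
propagated = add(e, 2)
# ...and vanishes when it is not.
dropped = discard_and_return_one(e)
```

The difficulty described above is that the evaluator's conservative dependency analysis may treat a value as "used" when it is not, which is exactly what this simple model cannot capture.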
It would seem the only feasible solution is to remove
fatality from the build process altogether. This means
not just failing _run_tool() invocations, but any SDL
errors for a function as well. This would not exactly be
a backward compatible change, but it would only
change the behavior of previously-failing builds. This is
probably not a real problem. SDL could check the
return values for failure (ERR?), but special-casing
failure is not required. Even if not explicitly checked,
the error would be reported to stderr. If the SDL is not
written to be resilient to errors, this error would likely
lead to many others downstream. As with many
compilers/interpreters, only the first bug is guaranteed
to be meaningful, but it would be a potential benefit to
see multiple bugs from a single evaluation. (Or the
evaluator could hide these.) The additional stderr output
would be clutter, but existing error reports have a clutter
issue already. It would be good for the evaluator to flag
the error and disable subsequent generation of cache
entries. This would enable regeneration of the error
output. Also, a flag to make errors fatal would be fine
too -- wait, this gets us right back where we started --
yes, but now there is a non-fatal mode of operation that
is guaranteed to produce a reproducible result.
So, maybe all we need is to add a "-non-fatal" flag that
will, on previously-fatal-error, disable caching, and return
out of the enclosing function with an ERR return value.
Builds without this flag would have the reproducibility
issue, but with this flag would not. And no current
behavior would be affected. A project must consider a
build "good" independent of whether the build would be fatal
without -non-fatal, as it is not possible to prevent these "hidden"
errors from sneaking in. Any rebuilds of successful
builds, as well as integration builds (which merge
successful builds), should use -non-fatal. General
building for development might or might not use this
flag, but code should not be released that has visible
errors. Looking at errors from things like integration
builds is still useful. Unfortunately, it would be difficult
to distinguish hidden errors from those released by
someone who did not follow proper procedures, so there
could still be some misdirected blame. And this very
complex issue is exposed to the user through the
-non-fatal flag. But it seems to be our best option.
Note: The -k option to "vesta" controls fatality of
_run_tool() calls.
Ken's reproducer:
{
    /**nocache**/
    good()
    {
        _ = _print("good function called");
        return 1;
    };

    call_but_dont_use(b)
    {
        // Note: the function has to be passed in a complex data
        // structure like a binding so it doesn't become part of
        // the PK of this function.
        _ = b/f();
        // Return something which doesn't depend on the function
        // so we don't record any dependency on it.
        return 1;
    };

    return <call_but_dont_use([f=good]), // This populates the cache.
            call_but_dont_use([f="This isn't even a function!"])>;
            // This would fail, but it is instead a cache hit!!!
}
Logged In: YES
user_id=95236
This issue is not a surprise to me, and I find it hard to
get excited about it. Well, I admit it's a reproducibility
issue: there's a chance that you did a build once that
should have failed but didn't, and later when you try to
recreate it, it does fail.
This is bad, but as a consolation, it should be easy to fix
the failing build -- the error should appear when evaluating
the bad function, and since the build result does not depend
on that function in any way, the fix is simply to delete the
call to it. I guess you can create scenarios in which the
call is hard to remove because of some elaborate code
structure, and you also can't delete the function body
because it's called elsewhere. Sigh.
Another small consolation is that it shouldn't happen much
since it should be fairly uncommon to call a function and
discard its result. Shouldn't it?
I didn't find the idea of making it possible for failures
to return a noncacheable ERR and continue very
convincing. This is actually an idea we had in the initial
Vesta SDL design (ERR might not even exist if we'd never had
the idea), but it didn't work out well and we discarded it.
I suppose it's possible we just didn't understand how to
make it work, but it still seems kind of questionable. One
specific problem is that if you hit an early failure but
there is still a ton more work that could be done on the
evaluation, you'll do all that work instead of stopping and
reporting the failure back to the user early. I think the
latter is usually what people would prefer.
Logged In: YES
user_id=304837
Steve Hoover wrote:
> One solution might be to defer the fatality of a function.
> Return a "fatal" indication as the function result, and
> propagate this through the evaluation. Only if the build
> result depends on the failing function's return value
> would the build fail. Unfortunately, this approach is
> nearly impossible to implement.
I don't think this is "nearly impossible to implement",
though it has some potential problems.
Once upon a time, run-time errors during SDL evaluation were
not immediately fatal. Instead, an operation with an error
resulted in the special error value ERR. It would not be
hard to go back to that way of doing things, as making such
problems fatal was done by making the constructor for the
ErrorVC class throw an exception. (See line 1250 in
/vesta/vestasys.org/vesta/eval/72/src/Val.C.)
I'm not sure I recall the exact motivations for this change
(it was made more than 4 years ago, before Vesta was
released as free software and before I became the primary
maintainer of it), but I think they were:
- Failing immediately rather than propagating ERR values up
the call chain makes it easier to debug errors, because you
can see precisely where the error occurs. (When the ERR
value was instead generated, it might propagate up through
several levels of the SDL call stack losing the exact point
where the error occurred.)
- Bridges would sometimes handle or generate ERR values,
which added a bunch of additional complexity to them. (For
an example of this see line 21 of
/vesta/vestasys.org/bridges/mtex/1/build.ves.)
- Tool failures were already immediately fatal (which also
simplified bridges and debugging), and it seemed confusing
to have two different policies on how failures would affect
the continuation of an evaluation.
I don't think it would be too hard to deal with these issues
and still allow execution to proceed past errors. Aside
from the idea Steve (the submitter) proposed of having a
single fatal/non-fatal switch for error handling (and we
could use the existing "-k" flag for that), here are some
other ideas:
- We could have the error value record and carry with it a
stack trace of where it came from. This would make it
possible to determine the cause of an error even if it makes
its way all the way up to the final result of a build. To
really make this work we would have to have all the
operators and primitive functions handle ERR as an input
value and add more information to this stored error context,
rather than simply recording a new error context and
discarding the old one.
- In the "non-fatal" case we could have a failed _run_tool
return ERR rather than its normal result.
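The stack-trace-carrying error value suggested in the first idea above can be sketched in Python (a hypothetical model; `with_context` is an illustrative name, not a Vesta API):

```python
# An error value that accumulates a trace as it propagates: each
# operator or primitive that receives it appends its own frame
# instead of discarding the old error context.

class Err:
    def __init__(self, message, trace=None):
        self.message = message
        self.trace = trace or []

def with_context(val, frame):
    """Pass non-errors through; extend an error's recorded trace."""
    if isinstance(val, Err):
        return Err(val.message, val.trace + [frame])
    return val

e = Err("division by zero", ["bar()"])
e = with_context(e, "foo()")
e = with_context(e, "build()")
# e.trace now records the full path the error took, even if it
# propagates all the way up to the final result of the build.
```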
Of course much of this might become moot if the evaluator
had a built-in debugger:
https://sourceforge.net/tracker/index.php?func=detail&aid=1223244&group_id=34164&atid=410430
This whole problem seems to me to represent a fundamental
tension between the desired behavior while a user is making
changes and when they wish to repeat an existing build.
While actively working on changes, errors should be reported
immediately and precisely and impede forward progress. When
repeating a previous build, errors should only be an issue
if the result actually depends upon them.
Tim Mann wrote:
> This is bad, but as a consolation, it should be easy to
> fix the failing build -- the error should appear when
> evaluating the bad function, and since the build result
> does not depend on that function in any way, the fix is
> simply to delete the call to it.
While you're right that it shouldn't be too hard for someone
knowledgeable in SDL to fix any individual instance of this,
I don't think we can just brush this off for several
reasons:
1. In our experience, the vast majority of users aren't
very knowledgeable about SDL. This means that most users
would be confounded by running into such a mysterious
failure. I think it may also make it more likely for this
sort of problem to occur in the first place: a novice SDL
writer is more likely to write code which calls a function
but doesn't use its result.
2. Suppose that we did just accept that any such broken but
previously successful builds would have to be patched up
when a user revisited it. It seems likely to me that for
some builds this could happen repeatedly, which could really
waste the user's time. How would a user find a previously
patched-up version of the build they're interested in?
3. This could go unnoticed for a long time. The longer the
cache lasts (and these days a catastrophic loss of the cache
is much less common than it used to be), the more such
broken but previously successful builds could be created.
It could even take place at multiple sites simultaneously,
if they all first evaluated the necessary successful build
to cause a cache hit in the right place.
4. Records referencing specific builds (such as issue
tracking databases) could be left with pointers to broken
but previously successful builds. In general it would be
impractical to find and fix such references.
I can imagine some paranoid users wanting to qualify builds
by using an empty cache or after discovering such a problem
wanting to alter the immutable SDL file which holds the
broken function to fix up everything at once.
I have to agree with Steve that it's worth doing something
about this.
Tim Mann wrote:
> Another small consolation is that it shouldn't happen much
> since it should be fairly uncommon to call a function and
> discard its result. Shouldn't it?
How would you know how often it happens? The only builds
you can be sure don't have this problem are ones built
against an empty cache. That doesn't happen very often in
practice.
Logged In: YES
user_id=95236
Thinking about this some more, I was way too dismissive
about it. Sorry. I'm sure we did think about this problem
early on, and it's why ERR got into the language in the
first place. We had some problems working the idea out
correctly and eventually dropped it, but I'm thinking now
that that was a mistake.
It would make sense, I think, to expand the -k flag so that
"fatal" errors return a noncacheable ERR instead. Maybe
that's what we should have done in the first place, instead
of switching from always returning ERR on these errors to
always failing immediately.
There were a few points that we (or at least I) found
confusing when we had the feature of errors returning ERR
and continuing.
One basic one is that we made ERR a constant in the language
that you could write explicitly. This created a lot of
confusion between ERR caused by an error and ERR written by
a programmer intentionally as an out of band value like NULL.
It was especially confusing since you could write things
like "foo == ERR". If foo is an ERR value propagated from
an error, should foo == ERR return TRUE or continue to
propagate the ERR? Similarly, should _isbool(foo) return
FALSE or ERR?
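The two readings of "foo == ERR" can be contrasted in a small Python model (hypothetical; `eq_as_sentinel` and `eq_as_failure` are illustrative names):

```python
# Interpretation 1: ERR is an ordinary out-of-band value, like NULL,
# so equality tests against it yield a boolean.
# Interpretation 2: any operation touching ERR propagates it, so
# "foo == ERR" can never yield a boolean.

ERR = object()   # stand-in for the language constant

def eq_as_sentinel(a, b):
    # ERR as a testable sentinel value.
    return a is b

def eq_as_failure(a, b):
    # ERR as a propagating failure.
    if a is ERR or b is ERR:
        return ERR
    return a == b

foo = ERR  # suppose this ERR propagated from some failed operation
# Under one reading the comparison answers the question; under the
# other it swallows it.
sentinel_result = eq_as_sentinel(foo, ERR)   # a boolean
failure_result = eq_as_failure(foo, ERR)     # still ERR
```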
It would be nice to get rid of the ERR constant to avoid
this confusion, but it's been in the language for years and
so there could be lots of code that uses it. One idea is to
say that the literal ERR is simply a way to intentionally
generate a fatal error. If you evaluate an ERR in the "stop
on errors" mode (no -k flag), the evaluation would stop
right there. This is an attractive notion but may break
existing code. (I wonder if we ever had a version of the
system that did this? I'm not sure.) Anyway, in the
following I'm going to ignore non-fatal ERR written by the
programmer; I'll use ERR only to mean a real error that
would be fatal in the current evaluator.
In general, specifying when an ERR input to a function
should propagate and when it should be discarded is a tricky
issue. In order to fix the problem you're concerned with, I
think ERR must be discarded whenever the result would have
no dependency on the ERR and kept whenever it would. It
seems like this makes the definition
implementation-dependent -- it depends on how fine-grained
and smart the dependency analysis is. I'm not sure, though;
maybe there's a clear way to specify it. Maybe what we had
in the language spec before we made errors always fatal
actually was right, or close, despite being confusingly worded.
Hmm, I see that wording is still in SRC-1997-005c. I guess
we didn't take it out until the book draft became the
maintained version.
Note that it has to be possible to carry a fatal ERR around
in a binding, since you can have an SDL program that
evaluates several things, puts them into a binding, then
selects one of them and ignores the rest, so that the result
has no dependency on those. Is the same true for lists?
I think when an ERR is discarded the result can be made
cacheable again. In fact, you need this so that evaluations
of the sort you're concerned about will get cached when you
reproduce them. Knowing when ERR is fully gone is not
entirely trivial, though, since an ERR can be nested deep
inside a binding. A function has to have a result with no
ERRs in it to be cacheable.
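The "no nested ERRs" cacheability test can be sketched in Python (a hypothetical model, with dicts standing in for SDL bindings and lists for SDL lists):

```python
# A result is cacheable only if no ERR is nested anywhere inside it;
# an ERR can be buried deep within a binding, so the check must be
# recursive.

class Err:
    pass

def contains_err(v):
    """Recursively check a value for a nested ERR."""
    if isinstance(v, Err):
        return True
    if isinstance(v, dict):
        return any(contains_err(x) for x in v.values())
    if isinstance(v, list):
        return any(contains_err(x) for x in v)
    return False

binding = {"a": 1, "b": {"deep": Err()}}   # ERR buried in a binding
# The whole binding is not cacheable, but selecting only "a"
# discards the ERR and the selected result is cacheable again.
whole_cacheable = not contains_err(binding)
selected_cacheable = not contains_err({"a": binding["a"]})
```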
Note that we don't have to worry separately about possibly
caching a value with a dependency on ERR, if the principle I
gave above is correct -- any value that depends on an ERR
has to contain an ERR.
Just some thoughts.