We have now twice seen a very strange problem at Intel with a corrupt cache entry (once in February 2007 and once in April 2007). In both cases the symptom was this assertion failure:
PrefixTbl.C:167: FV2::T* PrefixTbl::Get(int) const: Assertion `0 <= index && index < this->numArcs
In both cases PrefixTbl::Get was called from PickleC::UnpickleDepPath. From the debugger, it was immediately apparent that the index passed to PrefixTbl::Get was invalid. After digging through the bytes of the pickled value with a hex editor, it became clear that a string had been written into the pickle that was longer than the length written just before it. This caused the unpickling code to go wrong: it interpreted bytes that were actually part of the string as other data.
In both cases, the string would have been written by the function PickleC::PickleText. In one case the string was the name in a binding element. In the other it was the embedded name in a text value (the "name" member variable of the TextVC class).
In one case the length written was 10 but 11 bytes of string were written. In the other case the length written was also 10 but 15 bytes of string were written.
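To make the failure mode concrete, here is a minimal, self-contained illustration (hypothetical code, not the real PickleC/PrefixTbl implementation) of how a length that understates the string throws off everything that follows it in the pickle:

    #include <arpa/inet.h>
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>

    // A string is pickled as a 16-bit length in network byte order
    // followed by its bytes.
    static void pickle_string(std::vector<unsigned char>& buf,
                              const std::string& s, uint16_t recorded_len)
    {
        uint16_t net = htons(recorded_len);          // length, network byte order
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&net);
        buf.insert(buf.end(), p, p + sizeof(net));
        buf.insert(buf.end(), s.begin(), s.end());   // all of the string's bytes
    }

    int main()
    {
        std::vector<unsigned char> buf;
        // The correct call would pass s.size(); mimic the corruption seen in
        // the cache by recording 10 when 15 bytes of string are written.
        pickle_string(buf, "fifteen_bytes!!", 10);

        uint16_t next_field = htons(975);            // e.g. a PrefixTbl index
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&next_field);
        buf.insert(buf.end(), p, p + sizeof(next_field));

        // Unpickling trusts the recorded length, then reads the next field.
        size_t off = 0;
        uint16_t len;
        std::memcpy(&len, &buf[off], sizeof(len));
        off += sizeof(len);
        len = ntohs(len);
        std::string s(reinterpret_cast<const char*>(&buf[off]), len);
        off += len;
        uint16_t idx;
        std::memcpy(&idx, &buf[off], sizeof(idx));
        idx = ntohs(idx);

        // idx is now assembled from two leftover string bytes, not 975:
        // exactly the kind of bogus PrefixTbl index the assertion caught.
        std::cout << "string=\"" << s << "\", next field=" << idx << "\n";
        return 0;
    }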
Unfortunately, I'm having a hard time seeing how the code involved could go wrong. The only theories I have at this point require something highly improbable to happen (like the "len" local variable in PickleC::PickleText spontaneously changing in the middle of the function). Let's enumerate some of the functions other than PickleC::PickleText that could theoretically be at fault:
- PickleC::PickleInt16 is used to write the length into the pickle byte stream after converting it to network byte order
- FS::Write (from basics/os) is used to actually write the bytes into the stream
- Basics::BufStream (from basics/basics) is the class used to accumulate the pickled bytes
There don't really seem to be any more likely candidates in those pieces of code either. If the pickling/unpickling code were more modular and re-usable (see the RFE "Unify evaluator pickle reading code"), I suppose we could write a random exerciser for it, but I doubt that approach would get us any answers.
Another possibility we have to consider is that this could have been caused by an evaluator running on a flaky machine (e.g. with bad physical memory). It seems a bit too convenient that the evaluator managed to write just one incorrect byte into data it stored in the cache rather than crashing, but I can't completely rule it out.
If we can't find a fixable bug at fault for this, we should make the evaluator able to tolerate more cases of invalid pickled values. Rather than crashing with an assertion failure, we could detect such a corrupt pickle and treat it like a cache miss. (This is already done in some cases.) While it won't be possible to detect all such cases, we can certainly detect more of them than we do now. Any cases like this one that currently end in an assertion failure could be converted to a more graceful degradation that doesn't stop users in their tracks. While I'm not happy with automatically "sweeping bugs under the rug", the reality is that such a corrupt cache entry has now twice brought build work to a grinding halt until the corrupt entry could be located and removed from the cache.
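As a hedged sketch of what that graceful degradation could look like (the names below are hypothetical, not the evaluator's actual code), the idea is to range-check the index while unpickling and let the caller fall back to re-evaluation:

    #include <stdexcept>

    struct CorruptPickle : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // Hypothetical stand-in for the PrefixTbl::Get call site: report
    // corruption to the caller rather than asserting.
    const char* prefix_get_checked(const char* const* arcs, int numArcs, int index)
    {
        if (index < 0 || index >= numArcs)
            throw CorruptPickle("dependency path index out of range");
        return arcs[index];
    }

    // Hypothetical caller: a failed unpickle becomes a cache miss.
    bool usable_cache_hit(const char* const* arcs, int numArcs, int index)
    {
        try {
            (void) prefix_get_checked(arcs, numArcs, index);
            return true;                 // entry looks sane; use the hit
        } catch (const CorruptPickle&) {
            return false;                // log and re-evaluate instead
        }
    }

    int main()
    {
        const char* arcs[] = {"a", "b"};
        return usable_cache_hit(arcs, 2, 5) ? 1 : 0;   // 5 is out of range -> miss
    }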
I've been investigating another pickle corruption that seems
to be related. The symptom was a different assertion this
time:
Val.C:2543: ValC* LookupPath(DepPath*, const Context&): Assertion `result->vKind == ModelVK' failed.
That assertion indicates that an E type dependency
(expression, i.e. function body) was specified for a value
that was neither a user-defined function nor a model. In
this case the type was actually UnbndVK, which is a special
value type used internally by the evaluator to represent
that a binding lookup found no match for the specified name.
In searching through the rest of the pickle, the only other
dependency paths I find that reference the same index do so
with the '!' (existence test) dependency type, which makes
sense since the name doesn't exist.
At first I had thought that perhaps some earlier corruption
in the pickle had led to misinterpreting bytes as a
dependency path. However, after careful dissection of the
pickled value, the surrounding structure seems to be fine. I
have come to the conclusion that the dependency path stored
in the pickle had its index into the PrefixTbl off by 6
(specifically 969 was stored when it should have been 975).
(Dependency paths are not stored in-line as text in the
pickle, but rather as 16-bit indexes into a PrefixTbl.) I
feel reasonably certain that the path index was off by 6
because several other sub-values had dependency sets that
were identical to this one except for this one index. Also,
based on examination of some of the SDL code involved, I'm
quite certain that other dependencies
in this set should not be present without a dependency that
has a path index that is 6 higher than the one in the
pickle.
The reason I believe this is related to the two earlier
failures is that the same function (PickleC::PickleInt16)
was used to store the integers which seem to be incorrect in
each of the three corrupted pickle cases that I have
examined. (This function was used to store the length of
the text string in the two previous cases I looked at.)
Interestingly, the corrupted pickle in this latest failure
also had a text string with an incorrect length later in the
pickle. Similar to an earlier case, the correct length of
the string was 15 but the length stored in the pickle was
10. (It's possible there are some other problems in it, but
that seems to be the only structural one.)
I'm still at a loss for a fixable bug that could cause this.
I've continued to consider a variety of possible
explanations, but most of them seem pretty unlikely:
- If there was a bug in the Basics::BufStream class or the
FS::Write function, we should see corruption in many places
throughout pickled evaluator values. (Both of these are
also used in other places where we haven't seen such failures.)
- If there was a bug in the network code used to send the
pickled bytes to the server, we would again expect to see
corruption throughout pickled evaluator values (as well as
failures in a variety of other places that use the same
code).
- If there was something wrong with the functions for
converting 16-bit integers to network byte order, we would
see failures in many places.
The best idea I can come up with for how to avoid this is to
have the pickling code read back the bytes it pickled,
convert them back to a host-order integer, and assert that
they are equal to the integer being stored. If the source
of the problem occurs during the pickling on the client end
(which I strongly suspect), then this could catch it before
a bad pickle makes it into the cache.
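Here's a minimal sketch of that read-back check (assuming,
hypothetically, that 16-bit values are appended to a byte
buffer in network byte order; the real PickleC code will
differ in the details):

    #include <arpa/inet.h>
    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    void pickle_int16_checked(std::vector<unsigned char>& buf, uint16_t value)
    {
        uint16_t net = htons(value);
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&net);
        buf.insert(buf.end(), p, p + sizeof(net));

        // Read back the two bytes just written and convert to host order.
        uint16_t readback;
        std::memcpy(&readback, &buf[buf.size() - sizeof(readback)], sizeof(readback));
        readback = ntohs(readback);

        // If something corrupted the value between computing it and storing
        // it, fail here on the client instead of poisoning the cache.
        assert(readback == value);
    }

    int main()
    {
        std::vector<unsigned char> buf;
        pickle_int16_checked(buf, 975);
        return 0;
    }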
It would still be a good idea to be able to detect a bad
pickle and continue with a build treating it as a cache
miss, but at this point I don't think we know how much work
that could be. Though we've identified two cases of an
incorrectly encoded 16-bit integer, I don't think we know all the
possible failure modes they could result in. Discovering
them would require a comprehensive code review.
I've checked in a very simple change which might catch the source of the incorrect pickles, depending on where the problem occurs:
/vesta/vestasys.org/vesta/eval/100
Another thing we may want to consider is modifying the cache server code to remember more about where each cache entry comes from (e.g. which client host added the entry and when the entry was added).
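A rough sketch of that extra per-entry bookkeeping
(hypothetical types; the cache server's real data structures
will differ):

    #include <ctime>
    #include <string>

    struct CacheEntrySource {
        std::string client_addr;   // network address of the evaluator that added the entry
        std::time_t added_at;      // time the entry was added
    };

    // Stamped when the cache server accepts a new entry from a client.
    CacheEntrySource stamp_source(const std::string& client_addr)
    {
        return CacheEntrySource{client_addr, std::time(nullptr)};
    }

    int main()
    {
        CacheEntrySource src = stamp_source("198.51.100.7");   // example address
        return src.client_addr.empty() ? 1 : 0;
    }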
I've had several additional opportunities to study this
problem recently. One that was particularly illuminating
involved two bad cache entries: the normal and special
entries for the same model. These were almost certainly
created by the same evaluator process, as normally the
special entry gets added right after the normal entry.
The key facts I have discovered recently are:
1. Contrary to what I had previously said, the incorrect
bytes are *not* limited to the 16-bit integers encoded in
the pickle bytes. Errors at those points are probably just
more noticeable than errors in the middle of a string,
shortid, or fingerprint, which are essentially opaque data to
the pickling code.
2. In the cases which I've been looking at recently (which
all occur on one server), the corruption is always just one
byte that is incorrect and it is always at a byte offset
within the pickle bytes that ends in 0x974 when printed in
hex. I have searched for an explanation (i.e. a bug that
could cause corruption) of that offset in every place I can
think of, and so far I can't explain it. None of the code
that the pickled bytes pass through (Basics::BufStream,
SRPC, FS::FdStream) seems capable of doing something at this
particular byte offset in every case. It doesn't even seem
to make sense as an offset in the on-disk MPKFile (i.e. with
filesystem corruption on the server as the explanation).
3. The single similar incident we observed on a different
server back in February did *not* share this common factor
of having the incorrect byte at an offset ending in 0x974.
This seems to lend credence to my theory that there is a
client with some sort of hardware fault that is causing
this. While the code I added in eval/100 might catch such a
problem, I now have direct evidence that the corruption
occurs in more places than just the encoded integers, so
that check definitely isn't sufficient on its own. Also, we
don't know whether the
corruption occurs right away as bytes are accumulated or at
some point later when they are copied into a larger memory
buffer.
This is a problem that we should attack in a variety of
ways, so here's a rough to-do list:
- Each value stored in the cache has an extra fingerprint
(the "fp" member variable of VestaVal::T) that's not
currently used for anything useful. The evaluator stores
something in it, but it never checks it. We could change it
to be a fingerprint of the pickled bytes, as a data
integrity check (a sketch of this appears after this list).
If we fingerprint as we build up the
pickled bytes, it would make it possible for an evaluator to
detect some cases of silent data corruption before adding an
entry to the cache. (The key would be to make sure that
we're not simply fingerprinting the buffer we've already
written into, as that may be where the corruption occurs.)
It would also allow an evaluator that gets a cache hit to
perform a data integrity check on the pickled bytes before
trying to unpickle them. We could even have the cache
daemon perform the data integrity check itself; however, we
should consider how that might impact future changes we may
want to make, such as compressing the pickled bytes.
- Identifying the client host that added a cache entry is
not currently possible. In cases like this where there's a
possibility of a client with a hardware fault, it would be
very helpful to have that information. For that reason, I
think the cache server should begin storing an additional
annotation with each cache entry: the network address of the
client which added the entry. As long as we're doing that,
we should probably also store a timestamp that identifies
when the entry was added, as we've wanted that a few times in
the past. (It's worth noting that both of those can be done
without any changes to the cache network protocol.) We
should probably also put the client host in the graph log
entries, including the graph log roots.
- As suggested earlier, we should also make changes to allow
the evaluator to detect and proceed past cache entries which
have the kind of problems we've encountered with this sort
of corrupted pickle. Using the fingerprint to check the
integrity of the pickled bytes is a good first step, but we
really should convert at least the two assertions that have
been the major symptoms of this problem into more graceful
failures.
- If the evaluator detects an entry that seems to be bad, it
would be nice if it could tell the cache server about it and
keep other clients from getting cache hits on it. We could
do this by adding a new procedure which simply marks a CI as
unavailable for future hits until the entry is deleted
(similar to the way the current "hit filter" works during
the weeding process). It would also be useful for
administrators to have a utility which could do this, as we
occasionally have problems with tools that are less
deterministic than we would like and create cache entries
with bad results in them. However, this sort of capability
is a little problematic if there are clients with hardware
problems (and we have to assume that there may be). How can
we differentiate between a fault in the client that added a
cache entry and a fault in the client that reports the entry
as bad?
One solution would be to wait until there are reports from
multiple different client hosts of the same entry being bad
(also sketched after this list).
Two clients might be a good default, though we could make it
configurable. (Of course two wouldn't be a useful value for
single-machine installations.)
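To make the fingerprint idea from the first item above more
concrete, here is a minimal sketch (hypothetical code; the
real implementation would use the evaluator's own
fingerprint type rather than the FNV-1a stand-in used here).
The key point is that each byte is folded into the checksum
as it is produced, not by re-reading a buffer that may
already have been corrupted:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    constexpr uint64_t kFnvOffset = 14695981039346656037ull;   // FNV-1a 64-bit offset basis
    constexpr uint64_t kFnvPrime  = 1099511628211ull;          // FNV-1a 64-bit prime

    struct PickleWriter {
        std::vector<unsigned char> bytes;
        uint64_t fp = kFnvOffset;

        void put(unsigned char b) {
            fp = (fp ^ b) * kFnvPrime;   // fold the byte in as it is produced...
            bytes.push_back(b);          // ...not by re-reading the buffer later
        }
    };

    // Unpickling side: recompute the checksum over the received bytes and
    // compare it to the one stored with the cache entry (the unused "fp").
    uint64_t fingerprint(const std::vector<unsigned char>& bytes)
    {
        uint64_t fp = kFnvOffset;
        for (unsigned char b : bytes) fp = (fp ^ b) * kFnvPrime;
        return fp;
    }

    int main()
    {
        PickleWriter w;
        const unsigned char data[] = {0x00, 0x0f, 'h', 'e', 'l', 'l', 'o'};
        for (unsigned char b : data) w.put(b);

        // Simulate a single flipped byte somewhere after the bytes were
        // produced (bad memory, or a bug between the evaluator and the
        // on-disk MPKFile).
        w.bytes[3] ^= 0x01;

        if (fingerprint(w.bytes) != w.fp)
            std::puts("pickled value failed integrity check; treat as cache miss");
        return 0;
    }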
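And here is a rough sketch of the bad-entry reporting idea
from the last item (again with hypothetical names, not the
cache server's real API): only hide a CI from future hits
once some configurable number of distinct client hosts have
reported it as bad:

    #include <map>
    #include <set>
    #include <string>

    class BadEntryReports {
    public:
        explicit BadEntryReports(unsigned threshold = 2)   // 2 distinct hosts by default
            : threshold_(threshold) {}

        // Record a report that cache index `ci` looks bad, from `client_addr`.
        // Returns true once the entry should be hidden from future hits
        // (like a weeding hit filter), pending deletion.
        bool report(unsigned ci, const std::string& client_addr) {
            std::set<std::string>& reporters = reports_[ci];
            reporters.insert(client_addr);
            return reporters.size() >= threshold_;
        }

    private:
        unsigned threshold_;
        std::map<unsigned, std::set<std::string>> reports_;
    };

    int main() {
        BadEntryReports reports(2);
        bool hide1 = reports.report(4711, "client-a");   // first report: keep serving hits
        bool hide2 = reports.report(4711, "client-a");   // duplicate host doesn't count twice
        bool hide3 = reports.report(4711, "client-b");   // second distinct host: hide the entry
        return (!hide1 && !hide2 && hide3) ? 0 : 1;
    }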