Hello,
After some trouble with fsck, I suspect there is a bug in the colinux
kernel in how the floating point context is preserved across processes.
Here's a very short description of the original problem: quite often, fsck
crashes with a segmentation fault or a floating point exception (sometimes
it just reports signal 11 or signal 8).
After recompiling e2fsck from sources with debugging information, and adding
a lot of traces, here's what I've discovered:
In e2fsprogs package, in lib/ext2fs/icount.c, function get_icount_el(), the
following calculation exists inside a while loop:
range = ((float) (ino - lowval)) / (highval - lowval);
followed later by:
mid = low + ((int) (range * (high-low)));
Now for the bug: after many iterations of the loop, "range" becomes NaN
(not a number), causing "mid" to go wild (negative) and later causing an
exception, because "mid" is used to index an array. Note: I even saw that
"range" is sometimes bad and sometimes good when the parameters of the
calculation are the SAME (which should always give the same result...).
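To check the "same parameters, different result" observation outside of
e2fsck, something like the following could be run inside the guest. It is
only a sketch: the function name count_fp_mismatches() and its parameters
are made up for illustration and are not part of e2fsprogs. It repeats the
same division many times and counts results that are NaN or differ from the
first one.

```c
#include <math.h>

/*
 * Hypothetical stress helper (not part of e2fsprogs): repeat the same
 * float division and count results that are NaN or that differ from
 * the first result. With a sane FPU context this should return 0.
 */
static unsigned long count_fp_mismatches(unsigned int ino,
                                         unsigned int lowval,
                                         unsigned int highval,
                                         unsigned long iterations)
{
    /* volatile forces the division to be redone on every iteration
     * and forces every result through memory as a plain float */
    volatile unsigned int v_ino = ino, v_low = lowval, v_high = highval;
    volatile float reference, range;
    unsigned long i, bad = 0;

    reference = ((float) (v_ino - v_low)) / (v_high - v_low);
    for (i = 0; i < iterations; i++) {
        range = ((float) (v_ino - v_low)) / (v_high - v_low);
        /* NaN never compares equal, but test it explicitly for clarity */
        if (isnan(range) || range != reference)
            bad++;
    }
    return bad;
}
```

Calling it with, say, ino=150, lowval=100, highval=200 and a large
iteration count while other processes are busy with floating point work
should return 0 on a healthy kernel. A non-zero count would point at the
floating point state being disturbed from outside the process.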
POSSIBLE WORKAROUNDS:
Well, anything to get around that floating point should work. The funny
thing is that there is a "#if 0 ... #else ..." there. The code left out (in
the "#if 0") is a simple "mid=(low+high)/2". The active code ("#else...") is
with the floating point. A comment there says "Interpolate for efficiency".
I tried the simple one (changing to "#if 1"), and it looks like it is
working ok. THIS IS A WORKAROUND, and not a fix. The problem is not in
e2fsck... but if you can't mount your image, a workaround is better than
nothing.
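For reference, here is a minimal sketch of a search that uses only the
integer midpoint, as in the "#if 0" branch. This is not the e2fsprogs
code: the function name find_ino() and the plain array of inode numbers
are assumptions made to keep the example short.

```c
/*
 * Hypothetical example (not the e2fsprogs code): binary search over a
 * sorted array of inode numbers using only integer arithmetic.
 * Returns the index of ino in list[0..count-1], or -1 if it is absent.
 */
static int find_ino(const unsigned int *list, int count, unsigned int ino)
{
    int low = 0;
    int high = count - 1;

    while (low <= high) {
        /* same value as (low+high)/2 for non-negative bounds,
         * written so that the sum cannot overflow */
        int mid = low + (high - low) / 2;

        if (list[mid] == ino)
            return mid;
        if (ino < list[mid])
            high = mid - 1;
        else
            low = mid + 1;
    }
    return -1;
}
```

No floating point is involved, so "mid" always stays between "low" and
"high". The price is the one hinted at by the "Interpolate for efficiency"
comment: plain bisection may need more steps than interpolation.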
Note that with some of my traces, fsck worked fine too. It seems that if you
surround the "range" calculation with some extra code, the problem becomes
less likely. But the above workaround is better (I think).
The workaround enabled me to get my filesystem back, but it is not a
solution. I'm afraid that once again this means diving into kernel
debugging. So far I have never attached a debugger to the colinux kernel,
so any tips would help.
Oh, and by the way, here are some things I tried before realizing it is the
floating point calculation... (in case you wondered):
- more memory: usually I run with 32MB. Trying 64MB didn't seem to help at
all.
- defrag windows disk: I thought maybe the cobd driver fails if the image
file is very fragmented on the windows disk (it was...). Obviously, that
wasn't the case - the cobd driver works great.
- using dan's patch (posted here about a week ago as a reply on "Invalid
argument zeroing block") - I'm already using it.
- trying the original 0.5.3 release with the 1 GB debian image - it still
fails if I try to "fsck" it (after "killall5; umount -ar", of course).
Nir