From: Nir P. <ni...@em...> - 2004-02-23 20:01:25
Hello,

After having some trouble with fsck, I suspect there is a bug in the colinux kernel in how the floating point context is handled across processes.

Here's a very short description of the original problem: quite often, fsck crashes with a segmentation fault or a floating point exception (sometimes it just reports signal 11 or signal 8).

After recompiling e2fsck from source with debugging information and adding a lot of traces, here's what I discovered. In the e2fsprogs package, in lib/ext2fs/icount.c, function get_icount_el(), the following calculation exists inside a while loop:

    range = ((float) (ino - lowval)) / (highval - lowval);

followed later by:

    mid = low + ((int) (range * (high-low)));

Now for the bug: after many iterations of the loop, "range" becomes NaN (not a number, i.e. not a valid floating point value). That makes "mid" go wild (negative), and since "mid" is used to index an array, an exception follows.

Note: I even saw "range" come out bad at some times and good at others when the parameters of the calculation were the SAME (which should always give the same result).

POSSIBLE WORKAROUNDS:

Anything that avoids that floating point calculation should work. The funny thing is that there is a "#if 0 ... #else ..." there. The code left out (in the "#if 0") is a simple "mid=(low+high)/2". The active code (in the "#else") is the floating point version. A comment there says "Interpolate for efficiency". I tried the simple one (changing to "#if 1"), and it looks like it is working ok.

THIS IS A WORKAROUND, and not a fix. The problem is not in e2fsck, but if you can't mount your image, a workaround is better than nothing.

Note that with some of my traces in place, fsck worked fine too. It seems that if you surround the "range" calculation with some extra code, the problem becomes less likely. But the above workaround is better (I think). It enabled me to get my filesystem back, but it is not a solution.
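
To make the difference between the two branches concrete, here is a minimal sketch in C. It is not the e2fsprogs source: the function names pick_mid_bisect, pick_mid_interp and pick_mid_guarded are hypothetical, and the guarded variant is only an illustration of how bad "range" values could be caught, not something that exists in icount.c.

```c
#include <math.h>   /* isnan */

/* Hypothetical helper: the simple midpoint from the "#if 0" branch. */
static int pick_mid_bisect(int low, int high)
{
    return (low + high) / 2;
}

/* Hypothetical helper: the interpolated midpoint from the "#else" branch.
 * Assumes lowval <= ino <= highval and lowval < highval. If "range"
 * comes out as NaN, the int conversion is undefined and "mid" is garbage. */
static int pick_mid_interp(int low, int high, unsigned int ino,
                           unsigned int lowval, unsigned int highval)
{
    float range = ((float) (ino - lowval)) / (highval - lowval);
    return low + ((int) (range * (high - low)));
}

/* Illustration only (an assumption, not e2fsprogs code): fall back to
 * plain bisection whenever the interpolation factor is unusable. */
static int pick_mid_guarded(int low, int high, unsigned int ino,
                            unsigned int lowval, unsigned int highval)
{
    float range;

    if (highval <= lowval)              /* would divide by zero */
        return pick_mid_bisect(low, high);

    range = ((float) (ino - lowval)) / (highval - lowval);
    if (isnan(range) || range < 0.0f || range > 1.0f)
        return pick_mid_bisect(low, high);

    return low + ((int) (range * (high - low)));
}
```

With sane inputs both approaches land inside [low, high]; for example, pick_mid_interp(0, 10, 50, 0, 100) and pick_mid_bisect(0, 10) both give 5. The guarded variant only hides the symptom, exactly like the "#if 1" change: if the FPU state really is being corrupted, other floating point code is still at risk.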

I'm afraid that once again this means diving into kernel debugging. So far I have never connected a debugger to the colinux kernel, so any tips would help.

Oh, and by the way, here are some things I tried before realizing it is the floating point calculation (in case you wondered):

- more memory: usually I run with 32MB. Trying 64MB didn't seem to help at all.
- defragmenting the windows disk: I thought maybe the cobd driver fails if the image file is very fragmented on the windows disk (it was). Obviously, that wasn't the case - the cobd driver works great.
- using dan's patch (posted here about a week ago as a reply on "Invalid argument zeroing block") - I'm already using it.
- trying the original 0.5.3 release with the 1GB debian image - it still fails if I try to "fsck" the image itself (after "killall5; umount -ar", of course).

Nir