Hi, I'm not sure if this is a bug or feature: :-)
I formatted a new partition with e2fsprogs-1.29, then
proceeded to run a long compilation job on it
(rebuilding a Linux box from source). At the end of the
compilation, the partition had to be remounted
read-only (probably due to a file being deleted while
in use?), so it wasn't cleanly unmounted. With e2fsprogs
<= 1.28, an fsck would run on the next reboot, and fix
things OK. With e2fsprogs-1.29, I'm getting a message like:
"Inode XXX has INDEX_FL set but is not a directory".
(that's from e2fsck/problem.c:678)
e2fsck then exits with an error forcing me to manually
run e2fsck to repair the filesystem (it then gives me a
LONG list of inodes, asking me if I want to clear the
Htree info on each).
I believe I can reproduce this - if it is indeed a bug,
what other information would you need? (sorry I don't
have a minimal test case yet - this occurred after
running for 18 hours :-).
FYI, this is with linux-2.4.19, glibc-2.2.5 (compiled
vs. kernel 2.4.19). Haven't tried other configs yet,
but I can.
thanks,
frank
Logged In: YES
user_id=628
Hmm..... OK, first of all, we need to figure out why your
filesystem is getting remounted read-only. That indicates
something is going wrong, and the kernel has detected a
filesystem inconsistency. Merely deleting a file which is in
use shouldn't do that. In fact, no normal system activity
should be able to cause a filesystem consistency problem,
except for (a) kernel bugs, or (b) hardware problems/errors.
It is true that e2fsprogs 1.29 is more sensitive to an
incorrect setting of the INDEX_FL flag, whereas older
versions of e2fsprogs won't complain. But the problem is
what's setting the INDEX_FL flag in the first place. If
you're not using the htree patches, the INDEX_FL flag
shouldn't be set at all, and if you are using htree patches,
it should only be set on directories.
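
To make the rule above concrete, here is a minimal sketch of the
condition being complained about: INDEX_FL set on an inode whose mode
is not a directory. The function name and the sample modes are made up
for illustration; the only fixed value is the EXT2_INDEX_FL bit
(0x1000 in the inode's i_flags). This is not e2fsck's actual code.

```python
import stat

# Value of EXT2_INDEX_FL in the ext2 on-disk inode flags (i_flags).
EXT2_INDEX_FL = 0x00001000


def index_fl_misplaced(i_mode: int, i_flags: int) -> bool:
    """Return True if INDEX_FL is set on an inode that is not a directory.

    Hypothetical helper for illustration only; not part of e2fsprogs.
    """
    has_index = bool(i_flags & EXT2_INDEX_FL)
    return has_index and not stat.S_ISDIR(i_mode)


# Hypothetical sample inodes: a symlink (0o120777) and a directory (0o040755).
print(index_fl_misplaced(0o120777, EXT2_INDEX_FL))  # symlink carrying the flag
print(index_fl_misplaced(0o040755, EXT2_INDEX_FL))  # directory carrying the flag
```
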
Can you replicate whatever is causing the filesystem
consistency problem in the first place?
Logged In: YES
user_id=153116
Hi, over the past few weeks, I've managed to fix my
buildscripts to no longer trigger the INDEX_FL fsck
messages. However, I am confident that I can go back now and
take out my fixes (one-by-one) until I trigger the condition
again.
Just to verify a few things from my first post:
(a) This is 100% reproducible (not sure if it's only linux
2.4, or 2.2 also, though - I should check that).
(b) I am using a stock 2.4.19, no htree patches, and no
special flags to mkfs, fsck, etc., to turn on or check for
htrees.
(c) The condition is definitely (well, I'm 99% certain :-)
triggered by deleting files that are in use. I went through
a long process to fix my buildscripts so they now always
move files out of the way before overwriting (e.g. moving
bash before doing 'make install' on a new build). Once I
fixed those build scripts to not overwrite in-use files
(specifically bash,init,klogd,syslogd,perl,devfsd, and
glibc), the problem went away.
The way I finally tracked this down was to do an 'lsof >
/LSOF' before rebooting each time, and spotting the files
marked "deleted" and/or TYPE=DEL and fixing those
buildscripts to not overwrite those in-use files.
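
For anyone wanting to repeat the lsof check, something like the
following sketch could pick the interesting lines out of the saved
listing. It assumes the usual lsof column order (COMMAND PID USER FD
TYPE ...) and that deleted files show up either with "DEL" in the FD
or TYPE column or with a trailing "(deleted)"; the function name and
the sample lines are invented for illustration, not real output.

```python
def deleted_in_use(lsof_lines):
    """Return the lsof lines that refer to deleted-but-still-open files."""
    hits = []
    for line in lsof_lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        # Assumed column order: COMMAND PID USER FD TYPE ...
        is_del = fields[3] == "DEL" or fields[4] == "DEL"
        if is_del or line.rstrip().endswith("(deleted)"):
            hits.append(line.rstrip())
    return hits


# Hypothetical sample listing (not real output from the affected box).
sample = [
    "COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME",
    "bash 101 root txt REG 3,6 512000 1234 /bin/bash (deleted)",
    "init 1 root mem DEL 3,6 5678 /lib/libc-2.2.5.so",
    "syslogd 77 root cwd DIR 3,6 4096 2 /",
]
for hit in deleted_in_use(sample):
    print(hit)
```

To run it against the saved file, pass the lines of /LSOF (e.g. from
open("/LSOF")) instead of the sample list.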
I'll be glad to go back and reproduce this, but can you tell
me what to look for, and what info you need? (It takes ~24
hours per iteration to do this ... I know I need a faster
machine :-)
Logged In: YES
user_id=628
If you could try to trigger the messages, I'd really
appreciate it, since it would help point out a 2.4.19 kernel
bug. Once you can reproduce it, here's what I could really
use, in order of importance:
1) Run dmesg and get me any kernel messages that might
refer to filesystem errors. If the filesystem is being
remounted read-only, the kernel will tell you why it thinks
such a step is necessary, and that's very, very useful
information. Also be on the lookout for any IDE device
driver errors signalling hardware errors reported by the disk
drive. It sounds unlikely given the symptoms you've
reported, but it's always something to look out for.
2) 2.4.20-rc1 has been released, and that kernel has some
ext3 fixups. So once you've tweaked your userspace into
being able to replicate the filesystem corruption, please
try building and booting into the 2.4.20-rc1 kernel, and try
replicating the problem. If the problem has gone away, then
we're done.
3) Reboot into single user mode, but *before* you run fsck
on the partition, run the program "e2image -r /dev/hdXXX - |
bzip2 > /safe-place/to-store/big-file" This will create a
compressed raw image file which I can use to figure out
exactly what's happening. Please read the man page for
e2image before you use this command and send me the results.
It will tell you that it only saves the filesystem metadata,
and no user data. The only sensitive information that I
will see is the filenames in the directories, and of course
I will promise to keep that confidential and private.
Note that you will almost certainly need to mount another
filesystem temporarily that is big enough to store the
compressed e2image file, since (a) the root filesystem is
almost certainly not big enough, and (b) you definitely
don't want to do it on the filesystem that you're trying to
dump, for obvious reasons.
4) If for whatever reason you can't give me the e2image raw
file, try running "e2fsck -n /dev/hdXXX > /tmp/transcript",
and send me the result. Look for any inodes referenced in
the e2fsck transcript, and use the debugfs program to stat
any files that are mentioned with the INDEX_FL flag, and
send me the results. E2fsck should tell you the pathname of
the files in question, but if it doesn't, and only gives you
an inode number, you can use debugfs's ncheck command to
translate an inode number into a pathname. If you can send
me the e2image file, though, all this won't be necessary,
since I'll be able to carry out all of these experiments
(and more) on my own.
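
As a rough sketch of step 4, the inode numbers can be pulled out of
the transcript and turned into debugfs invocations. The regular
expression assumes the message wording quoted at the top of this
report (with "flag" optional), and the helper names are made up; the
commands themselves use debugfs's -R option with its stat and ncheck
requests. The script only builds the command lines, it does not run
them.

```python
import re

# Assumed message shape, based on the text quoted earlier in this thread;
# "flag" is optional so either wording is matched.
INDEX_FL_RE = re.compile(r"Inode (\d+) has INDEX_FL (?:flag )?set")


def index_fl_inodes(transcript: str):
    """Return the unique inode numbers named in INDEX_FL messages, in order."""
    seen = []
    for match in INDEX_FL_RE.finditer(transcript):
        ino = int(match.group(1))
        if ino not in seen:
            seen.append(ino)
    return seen


def debugfs_commands(inodes, device):
    """Build argv lists for debugfs 'stat <N>' and 'ncheck N' requests."""
    cmds = []
    for ino in inodes:
        cmds.append(["debugfs", "-R", f"stat <{ino}>", device])
        cmds.append(["debugfs", "-R", f"ncheck {ino}", device])
    return cmds


# Hypothetical transcript fragment; inode numbers are placeholders.
text = (
    "Inode 1001 has INDEX_FL set but is not a directory.\n"
    "Inode 1002 has INDEX_FL flag set but is not a directory.\n"
    "Inode 1001 has INDEX_FL set but is not a directory.\n"
)
for argv in debugfs_commands(index_fl_inodes(text), "/dev/hdXXX"):
    print(" ".join(argv))
```

The printed lines still need shell quoting around the -R request
(e.g. debugfs -R "stat <1001>" /dev/hdXXX) if pasted into a shell by
hand.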
Thanks for offering to help try to track this down!
Logged In: YES
user_id=153116
Hi, I emailed the requested e2image files to your thunk.org
address - just wanted to make sure you received them. I'm
sure you're busy with other things, but I've been continuing
to work on this, and I noticed something weird - the error
can show up on a filesystem marked as "clean". For example:
$ fsck -f -y /dev/hda6 (hda6 was cleanly unmounted)
<snip ... lots of the same old HTREE messages>
Pass 2: Checking directory structure
Entry 'tputs.3x' in /usr/man/man3 (244852) has an incorrect
filetype (was 7, should be 0).
Fix? yes
<snip>
Pass 3A: Optimizing directories
Optimizing directories: 81695 130630 244852 506526
$ fsck /dev/hda6
(e2fsck reports /dev/hda6 is clean)
$ fsck -f /dev/hda6
fsck 1.31 (08-Nov-2002)
e2fsck 1.31 (08-Nov-2002)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Setting filetype for entry 'tputs.3x' in /usr/man/man3
(244852) to 7.
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hda6: ***** FILE SYSTEM WAS MODIFIED *****
/dev/hda6: 31817/897600 files (0.7% non-contiguous),
301559/1792090 blocks
-------------
Note how fsck sets the type of tputs.3x (244852) to "0" the
first time, and back to "7" the second time (it is a
symlink, so I think 7 is correct). Also, why does 244852
show up after the message about "optimizing directories"? Is
that a bug? Does e2fsck think tputs.3x is a directory?
FWIW, I have been unable to duplicate this problem under
Slackware (on the SAME drive, same partition and same
kernel), so hopefully that rules out a hardware problem (?).
I'm going to try Redhat 8 on the same partition soon.
I *think* the "deleted file" issue was a red herring - it
was only important in that it forced a fsck run. Now I'm
manually forcing fsck to run and am seeing the problem
without any deleted-while-open files present.
Logged In: YES
user_id=153116
OK, I think I found the real cause. I'm attaching several
files showing what I did. The root cause appears to be that
e2fsck is setting INDEX_FL (incorrectly, I think), and the
kernel is propagating the flag to everything under those
directories. See files for more.
e2fsck patch to show when INDEX_FL is set