#41 INDEX_FL set on non-directory w/1.29

closed-out-of-date
e2fsck (61)
5
2004-10-08
2002-10-03
No

Hi, I'm not sure if this is a bug or feature: :-)

I formatted a new partition with e2fsprogs-1.29, then
proceeded to run a long compilation job on it
(rebuilding a Linux box from source). At the end of the
compilation, the partition had to be remounted
read-only (probably due to a file being deleted while
in use?), so wasn't cleanly unmounted. With e2fsprogs
<= 1.28, an fsck would run on the next reboot, and fix
things OK. With e2fsprogs-1.29, I'm getting a message like:

"Inode XXX has INDEX_FL set but is not a directory".

(that's from e2fsck/problem.c:678)

e2fsck then exits with an error forcing me to manually
run e2fsck to repair the filesystem (it then gives me a
LONG list of inodes, asking me if I want to clear the
Htree info on each).

I believe I can reproduce this - if it is indeed a bug,
what other information would you need? (sorry I don't
have a minimal test case yet - this occurred after
running for 18 hours :-).

FYI, this is with linux-2.4.19, glibc-2.2.5 (compiled
vs. kernel 2.4.19). Haven't tried other configs yet,
but I can.

thanks,
frank

Discussion

  • Theodore Ts'o

    Theodore Ts'o - 2002-10-31

    Logged In: YES
    user_id=628

    Hmm..... OK, first of all, we need to figure out why your
    filesystem is getting remounted read-only. That indicates
    something is going wrong, and the kernel has detected a
    filesystem inconsistency. Merely deleting a file which is in
    use shouldn't do that. In fact, no normal system activity
    should be able to cause a filesystem consistency problem,
    except for (a) kernel bugs, or (b) hardware problems/errors.

    It is true that e2fsprogs 1.29 is more sensitive to an
    incorrect setting of the INDEX_FL flag, whereas older
    versions of e2fsprogs won't complain. But the problem is
    what's setting the INDEX_FL flag in the first place. If
    you're not using the htree patches, the INDEX_FL flag
    shouldn't be set at all, and if you are using htree patches,
    it should only be set on directories.

    Can you replicate whatever is causing the filesystem
    consistency problem in the first place?
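
The rule described above (INDEX_FL belongs only on directories) can be sketched as a small check. This is only an illustration, not e2fsck's actual code: the function name and the sample inodes are hypothetical, and the flag value 0x1000 is the EXT2_INDEX_FL constant from the ext2 on-disk format.

```python
import stat

# EXT2_INDEX_FL in the ext2 on-disk format: marks a hash-indexed (htree) directory.
EXT2_INDEX_FL = 0x00001000


def index_flag_is_misplaced(i_mode, i_flags):
    """Return True if INDEX_FL is set on an inode that is not a directory."""
    has_index = bool(i_flags & EXT2_INDEX_FL)
    return has_index and not stat.S_ISDIR(i_mode)


# Hypothetical sample inodes, given as (mode, flags) pairs.
samples = {
    "regular file with INDEX_FL": (stat.S_IFREG | 0o644, EXT2_INDEX_FL),
    "directory with INDEX_FL": (stat.S_IFDIR | 0o755, EXT2_INDEX_FL),
    "symlink without flags": (stat.S_IFLNK | 0o777, 0),
}

for label, (mode, flags) in samples.items():
    print(label, "->", index_flag_is_misplaced(mode, flags))
```

Only the first sample is reported as misplaced, which corresponds to the condition e2fsprogs 1.29 complains about.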

     
  • Anonymous

    Anonymous - 2002-10-31

    Logged In: YES
    user_id=153116

    Hi, over the past few weeks, I've managed to fix my
    buildscripts to no longer trigger the INDEX_FL fsck
    messages. However, I am confident that I can go back now and
    take out my fixes (one-by-one) until I trigger the condition
    again.

    Just to verify a few things from my first post:
    (a) This is 100% reproducible (not sure if it's only linux
    2.4, or 2.2 also, though - I should check that).
    (b) I am using a stock 2.4.19, no htree patches, and no
    special flags to mkfs, fsck, etc., to turn on or check for
    htrees.
    (c) The condition is definitely (well, I'm 99% certain :-)
    triggered by deleting files that are in use. I went through
    a long process to fix my buildscripts so they now always
    move files out of the way before overwriting (e.g. moving
    bash before doing 'make install' on a new build). Once I
    fixed those build scripts to not overwrite in-use files
    (specifically bash,init,klogd,syslogd,perl,devfsd, and
    glibc), the problem went away.

    The way I finally tracked this down was to do an 'lsof >
    /LSOF' before rebooting each time, and spotting the files
    marked "deleted" and/or TYPE=DEL and fixing those
    buildscripts to not overwrite those in-use files.

    I'll be glad to go back and reproduce this, but can you tell
    me what to look for, and what info you need? (It takes ~24
    hours per iteration to do this ... I know I need a faster
    machine :-)
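
The lsof check described above could be automated roughly as follows. This is a sketch under assumptions: it expects the default column layout with TYPE as the fifth field, assumes file names contain no spaces, and the sample text is invented for illustration. Real lsof output varies between versions, so the field positions may need adjusting.

```python
def find_deleted_open_files(lsof_text):
    """Return (command, name) pairs for entries that look deleted while open."""
    hits = []
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 5 or fields[0] == "COMMAND":
            continue
        is_del_type = fields[4] == "DEL"  # assumes TYPE is the 5th column
        is_marked_deleted = line.rstrip().endswith("(deleted)")
        if is_del_type or is_marked_deleted:
            # With a trailing "(deleted)" marker the name is the field before it.
            name = fields[-2] if is_marked_deleted else fields[-1]
            hits.append((fields[0], name))
    return hits


# Hypothetical excerpt of a saved /LSOF file.
SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
bash 101 root txt REG 3,6 512540 81701 /bin/bash (deleted)
init 1 root mem DEL 3,6 81800 /lib/libc-2.2.5.so
syslogd 88 root txt REG 3,6 27036 81950 /sbin/syslogd
"""

for command, name in find_deleted_open_files(SAMPLE):
    print(command, name)
```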

     
  • Theodore Ts'o

    Theodore Ts'o - 2002-11-01

    Logged In: YES
    user_id=628

    If you could try to trigger the messages, I'd really
    appreciate it, since it would help point out a 2.4.19 kernel
    bug. Once you can reproduce it, here's what I could really
    use, in order of importance:

    1) Run dmesg and get me any kernel messages that might
    refer to filesystem errors. If the filesystem is being
    remounted read-only, the kernel will tell you why it thinks
    such a step is necessary, and that's very, very useful
    information. Also be on the lookout for any IDE device
    driver errors signaling hardware errors reported by the disk
    drive. It sounds unlikely given the symptoms you've
    reported, but it's always something to look out for.

    2) 2.4.20-rc1 has been released, and that kernel has some
    ext3 fixups. So once you've tweaked your userspace into
    being able to replicate the filesystem corruption, please
    try building and booting into the 2.4.20-rc1 kernel, and try
    replicating the problem. If the problem has gone away, then
    we're done.

    3) Reboot into single user mode, but *before* you run fsck
    on the partition, run the program "e2image -r /dev/hdXXX - |
    bzip2 > /safe-place/to-store/big-file". This will create a
    compressed raw image file which I can use to figure out
    exactly what's happening. Please read the man page for
    e2image before you use this command and send me the results.
    It will tell you that it only saves the filesystem metadata,
    and no user data. The only sensitive information that I
    will see is the filenames in the directories, and of course
    I will promise to keep that confidential and private.
    
    Note that you will almost certainly need to mount another
    filesystem temporarily that is big enough to store the
    compressed e2image file, since (a) the root filesystem is
    almost certainly not big enough, and (b) you definitely
    don't want to do it on the filesystem that you're trying to
    dump, for obvious reasons.

    4) If for whatever reason you can't give me the e2image raw
    file, try running "e2fsck -n /dev/hdXXX > /tmp/transcript",
    and send me the result. Look for any inodes referenced in
    the e2fsck transcript, and use the debugfs program to stat
    any files that are mentioned with the INDEX_FL flag, and
    send me the results. E2fsck should tell you the pathname of
    the files in question, but if it doesn't, and only gives you
    an inode number, you can use debugfs's ncheck command to
    translate an inode number into a pathname. If you can send
    me the e2image file, though, all this won't be necessary,
    since I'll be able to carry out all of these experiments
    (and more) on my own.
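
Step 4 can be partly scripted. The sketch below pulls the inode numbers out of a saved e2fsck transcript and builds the matching debugfs requests ("stat <N>" for each inode, then one "ncheck" for the pathnames). The helper name and the sample transcript are hypothetical, and the regular expression assumes the message wording quoted in this report, with or without the word "flag".

```python
import re

# Matches the message quoted in this report; "flag" is optional because the
# exact wording may differ between e2fsprogs versions (an assumption).
INDEX_FL_MSG = re.compile(
    r"Inode (\d+) has INDEX_FL (?:flag )?set but is not a directory"
)


def debugfs_requests(transcript):
    """Build debugfs requests for every inode named in an INDEX_FL message."""
    inodes = sorted({int(m.group(1)) for m in INDEX_FL_MSG.finditer(transcript)})
    requests = ["stat <%d>" % n for n in inodes]
    if inodes:
        # ncheck accepts several inode numbers and prints their pathnames.
        requests.append("ncheck " + " ".join(str(n) for n in inodes))
    return requests


# Hypothetical transcript excerpt, as saved by "e2fsck -n".
SAMPLE = """\
Pass 1: Checking inodes, blocks, and sizes
Inode 81701 has INDEX_FL flag set but is not a directory.
Clear HTree index? no

Inode 81695 has INDEX_FL set but is not a directory.
Clear HTree index? no
"""

print("\n".join(debugfs_requests(SAMPLE)))
```

The printed lines can be saved to a file and passed to debugfs with its -f option, or run one at a time with -R. debugfs opens the filesystem read-only unless -w is given.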

    Thanks for offering to help try to track this down!

     
  • Anonymous

    Anonymous - 2002-11-11

    Logged In: YES
    user_id=153116

    Hi, I emailed the requested e2image files to your thunk.org
    address - just wanted to make sure you received them. I'm
    sure you're busy with other things, but I've been continuing
    to work on this, and I noticed something weird - the error
    can show up on a filesystem marked as "clean". For example:

    $ fsck -f -y /dev/hda6 (hda6 was cleanly unmounted)

    <snip ... lots of the same old HTREE messages>

    Pass 2: Checking directory structure
    Entry 'tputs.3x' in /usr/man/man3 (244852) has an incorrect
    filetype (was 7, should be 0).
    Fix? yes

    <snip>

    Pass 3A: Optimizing directories
    Optimizing directories: 81695 130630 244852 506526

    $ fsck /dev/hda6

    (e2fsck reports /dev/hda6 is clean)

    $ fsck -f /dev/hda6
    fsck 1.31 (08-Nov-2002)
    e2fsck 1.31 (08-Nov-2002)
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Setting filetype for entry 'tputs.3x' in /usr/man/man3
    (244852) to 7.
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information

    /dev/hda6: ***** FILE SYSTEM WAS MODIFIED *****
    /dev/hda6: 31817/897600 files (0.7% non-contiguous),
    301559/1792090 blocks

    -------------
    Note how fsck sets the type of tputs.3x (244852) to "0" the
    first time, and back to "7" the second time (it is a
    symlink, so I think 7 is correct). Also, why does 244852
    show up after the message about "optimizing directories"? Is
    that a bug? Does e2fsck think tputs.3x is a directory?

    FWIW, I have been unable to duplicate this problem under
    Slackware (on the SAME drive, same partition and same
    kernel), so hopefully that rules out a hardware problem (?).
    I'm going to try Redhat 8 on the same partition soon.

    I *think* the "deleted file" issue was a red-herring - it
    was only important in that it forced a fsck run. Now I'm
    manually forcing fsck to run and am seeing the problem
    without any deleted-while-open files present.
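
For reference on the filetype values above: ext2 directory entries store a small type code, where 0 means unknown, 2 is a directory and 7 is a symbolic link. A minimal sketch of the mapping from an inode mode to that code follows. The function name is hypothetical; the numeric codes are the EXT2_FT_* values from the ext2 on-disk format.

```python
import stat

# (mode predicate, EXT2_FT_* code) pairs from the ext2 on-disk format.
_FILETYPE_CHECKS = [
    (stat.S_ISREG, 1),   # EXT2_FT_REG_FILE
    (stat.S_ISDIR, 2),   # EXT2_FT_DIR
    (stat.S_ISCHR, 3),   # EXT2_FT_CHRDEV
    (stat.S_ISBLK, 4),   # EXT2_FT_BLKDEV
    (stat.S_ISFIFO, 5),  # EXT2_FT_FIFO
    (stat.S_ISSOCK, 6),  # EXT2_FT_SOCK
    (stat.S_ISLNK, 7),   # EXT2_FT_SYMLINK
]


def ext2_dirent_filetype(i_mode):
    """Return the directory-entry type code expected for an inode mode."""
    for predicate, code in _FILETYPE_CHECKS:
        if predicate(i_mode):
            return code
    return 0  # EXT2_FT_UNKNOWN


print(ext2_dirent_filetype(stat.S_IFLNK | 0o777))  # symlink
print(ext2_dirent_filetype(stat.S_IFDIR | 0o755))  # directory
```

Under this mapping a symlink such as tputs.3x should carry type 7, which agrees with the value set by the second fsck run.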

     
  • Anonymous

    Anonymous - 2002-11-11

    Logged In: YES
    user_id=153116

    OK, I think I found the real cause. I'm attaching several
    files showing what I did. The root cause appears to be that
    e2fsck is setting INDEX_FL (incorrectly, I think), and the
    kernel is propagating the flag to everything under those
    directories. See files for more.

     
  • Anonymous

    Anonymous - 2002-11-11

    e2fsck patch to show when INDEX_FL is set

     
  • Theodore Ts'o

    Theodore Ts'o - 2004-10-08
    • status: open --> closed-out-of-date