#698 e2fsck not detected ext4 filesystem error

open
None
9
2014-08-18
2012-10-31
No

Hello,
I am Jyothishree NK from LG Electronics.
This request concerns the e2fsck program in Android ICS.
ICS currently uses this version: E2FSPROGS_VERSION "1.41.11"
There is a problem detecting filesystem errors with the e2fsck program.

Issue: The ext4 filesystem error "JBD: Spotted dirty metadata buffer" is not detected by the e2fsck program.
Description: In the source we do the following.
1. Create the filesystem on the partition (if it does not already exist).
2. Check for filesystem errors using "exec /system/bin/e2fsck -p /dev/block/mmcblk0p20".
3. Mount the filesystem as "ext4 mount ext4 /dev/block/mmcblk0p20 /data nosuid nodev noatime barrier=1,data=ordered,noauto_da_alloc,errors=continue".

As per this logic, if there is any error in the filesystem, the "e2fsck -p" option should repair it automatically.
The filesystem is then mounted.
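
One way to make step 2 observable is to record the exit status of e2fsck before mounting. The sketch below decodes the status bits documented in the e2fsck(8) man page. The function names and the mount policy in `safe_to_mount` are assumptions for illustration and are not part of the Android init code.

```python
# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_STATUS_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def describe_e2fsck_status(status):
    """Return a human-readable description for each bit set in the status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_STATUS_BITS.items())
            if status & bit]


def safe_to_mount(status):
    """Hypothetical policy: mount only if nothing beyond 'errors corrected' is set."""
    return (status & ~1) == 0
```

Note that a status of 0 only means e2fsck found nothing to report in the checks it actually ran.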

As per the attached kernel logs, mount reported that an e2fsck check is required even though we ran e2fsck before mounting:
"[ 4.784468] EXT4-fs (mmcblk0p20): warning: mounting fs with errors, running e2fsck is recommended".
Mounting continues, but ext4 then finds filesystem errors internally:

"[ 7.251313] EXT4-fs error (device mmcblk0p20): ext4_lookup:1044: inode #2: comm port-bridge: deleted inode referenced: 12
[ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 14, 31867 blocks in bitmap, 32231 in gd
[ 7.390369] JBD: Spotted dirty metadata buffer (dev = mmcblk0p20, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 7.535184] warning: `rild' uses 32-bit capabilities (legacy support in use)
[ 7.739751] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 3, 32061 blocks in bitmap, 32088 in gd
[ 7.739904] JBD: Spotted dirty metadata buffer (dev = mmcblk0p20, blocknr = 0). There's a risk of filesystem corruption in case of system crash."
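
To compare the two counts in each ext4_mb_generate_buddy message, the log lines can be parsed mechanically. This is a minimal sketch; the regular expression is written against the message format shown above, and the function and field names are assumptions for illustration.

```python
import re

# Matches the ext4_mb_generate_buddy messages quoted in the log above.
BUDDY_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>\S+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<in_bitmap>\d+) blocks in bitmap, "
    r"(?P<in_gd>\d+) in gd"
)


def parse_buddy_errors(log_text):
    """Extract device, group and both counts from each matching log line."""
    results = []
    for match in BUDDY_ERROR.finditer(log_text):
        in_bitmap = int(match.group("in_bitmap"))
        in_gd = int(match.group("in_gd"))
        results.append({
            "device": match.group("device"),
            "group": int(match.group("group")),
            "in_bitmap": in_bitmap,
            "in_gd": in_gd,
            # Gap between the group descriptor count and the bitmap count.
            "difference": in_gd - in_bitmap,
        })
    return results
```

For the two messages above this gives a difference of 364 blocks for group 14 and 27 blocks for group 3.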

Because of this, the system displays an error.
Please let us know why e2fsck could not repair these filesystem errors.

Regards,
Jyothishree NK

Discussion

  • Jyothishree NK

    Jyothishree NK - 2012-10-31
    • priority: 5 --> 9
     
  • Jyothishree NK

    Jyothishree NK - 2012-10-31
    • assigned_to: nobody --> tytso
     
  • Theodore Ts'o

    Theodore Ts'o - 2012-10-31

    There are two possible reasons why e2fsck may not have been able to fix the problem.

    The first is that the corruption was in the in-memory version of the block in the buffer-cache, but it was never written out to disk (i.e., the metadata was corrupted in memory or as the data was transferred from the eMMC device to memory). The other possibility is that you're hitting a bug which was introduced in the 3.2 kernel. The fix, commit ID b0dd6b70f0f, landed in the 3.5 kernel, and was backported to 3.2.20 and 3.4.3.
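
The version ranges above can be turned into a quick check. This is only a sketch of the second possibility: it assumes plain dotted version strings, and it treats 3.3.x as unfixed because no backport to that series is mentioned here. The function names are hypothetical.

```python
def parse_version(text):
    """Turn '3.2.19' into (3, 2, 19); missing components default to 0."""
    parts = [int(p) for p in text.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def may_have_bitmap_bug(version_text):
    """True if the kernel falls in the range described above (3.2 up to the fix)."""
    version = parse_version(version_text)
    if version < (3, 2, 0):
        return False              # bug was introduced in 3.2
    if version[:2] == (3, 2):
        return version < (3, 2, 20)   # fix backported to 3.2.20
    if version[:2] == (3, 3):
        return True               # assumption: no backport to 3.3.x
    if version[:2] == (3, 4):
        return version < (3, 4, 3)    # fix backported to 3.4.3
    return False                  # 3.5 and later contain the fix
```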

    The error messages:
    [ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 14, 31867 blocks in bitmap, 32231 in gd
    [ 7.390369] JBD: Spotted dirty metadata buffer (dev = mmcblk0p20, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

    are consistent with this bug. However, I would not expect that bug to trigger this ext4 error message:

    [ 7.251313] EXT4-fs error (device mmcblk0p20): ext4_lookup:1044: inode #2: comm port-bridge: deleted inode referenced: 12

    ... which is leaving me a bit puzzled.

    What kernel version did you see this on?

    I'm quite confident that the apparent corruption in the block allocation bitmaps and inode allocation bitmaps is something that e2fsck would have easily fixed, so I suspect the file system image was in fact fine. The question then is why ext4 thought the file system was corrupted....

    BTW, if it's more convenient for you to continue this by e-mail, feel free to contact me at tytso@mit.edu (or if you need to send me something which might be covered by a LG/Google NDA, tytso@google.com --- although in general I and other kernel developers will abide by gentleman's NDA; if you tell us something should be kept confidential, in general we will honor it, or tell you before you disclose something to us why we might not be able to honor that gentleman's pledge).

     
  • Jyothishree NK

    Jyothishree NK - 2012-11-01

    Hi,
    An email has been sent along with the logs.
    We see this issue on kernel version 3.0.8.

    Regards,
    Jyothishree NK

     
  • Jyothishree NK

    Jyothishree NK - 2012-11-02

    Hi,

    Please let us know your opinion on the previous email.

    Regards,
    Jyothishree NK

     
  • Theodore Ts'o

    Theodore Ts'o - 2012-12-06

    I looked at this again, and I realized what's going on. E2fsck -p is a preen pass; it only does a full file system check if the file system is marked as containing an error. If you want to force a full file system check, you'll need to use e2fsck -fy instead of e2fsck -p. So if the file system is getting corrupted, but the kernel doesn't realize it before the system reboots, e2fsck will not do a full check.

    After a file system error such as:

    [ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 14, 31867 blocks in bitmap, 32231 in gd

    a subsequent e2fsck -p should force a full file system check.
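
The "marked as containing an error" condition lives in the s_state field of the superblock. The sketch below reads that field from a raw superblock and applies a simplified version of the preen decision. The offsets and flag values follow the ext2/ext4 on-disk layout; the function names are hypothetical, and the model ignores the mount-count and time-based checks that can also trigger a full check.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # superblock starts 1024 bytes into the device
SUPERBLOCK_SIZE = 1024
S_MAGIC_OFFSET = 56        # s_magic, followed by s_state at offset 58
EXT2_MAGIC = 0xEF53
EXT2_VALID_FS = 0x0001     # cleanly unmounted
EXT2_ERROR_FS = 0x0002     # kernel recorded an error


def read_superblock(path):
    """Read the raw primary superblock from a device or image file."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return dev.read(SUPERBLOCK_SIZE)


def read_state(superblock):
    """Return s_state after verifying the ext2/3/4 magic number."""
    magic, state = struct.unpack_from("<HH", superblock, S_MAGIC_OFFSET)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    return state


def preen_would_do_full_check(state):
    """Simplified model: full check if the error flag is set or the valid flag is clear."""
    return bool(state & EXT2_ERROR_FS) or not (state & EXT2_VALID_FS)
```

If the kernel never recorded an error before the reboot, s_state still reads as clean and this model predicts that the preen pass skips the full check.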

     
  • Theodore Ts'o

    Theodore Ts'o - 2012-12-06

    I'm going to guess by the way that what is happening is that the flash is lying to us, and is not guaranteeing that the data is pushed out to stable storage after a FLUSH CACHE (aka barrier) command, and so on a power failure, the file system is getting corrupted because the correctness of the journal recovery depends on the storage device guaranteeing that blocks written before the FLUSH CACHE are stably written to disk.

    I'm going to hope that flash isn't crappy enough such that blocks written long ago --- but which are in an erase block which might have been copied at the time of the power outage --- are getting corrupted on a power drop, but I've seen such crappy behaviour with MMC devices, and I've heard that eMMC devices can be even crappier since handset manufacturers are trying to save pennies per device (since when you are manufacturing millions of devices at a time, it adds up to real money). If that is the case, one possibility is to use e2fsck -fy after every reboot. But, (a) this will slow down the boot times, and (b) if the flash is that crappy, then there's the possibility of data blocks getting corrupted on a power drop, and nothing will save you there. All you can do is yell at the flash manufacturer, and ask them to provide you with devices that don't randomly lose data on a power drop.
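
The choice between the preen pass and a forced check comes down to which flags are passed. A minimal sketch follows, assuming the e2fsck path and device name from the original report; the helper names are hypothetical. The -p and -y options cannot be combined, so the forced variant uses -f -y on its own.

```python
import subprocess

E2FSCK = "/system/bin/e2fsck"   # path taken from the original report


def build_e2fsck_command(device, force_full_check):
    """Return the argument list for a preen pass or a forced full check."""
    if force_full_check:
        # -f: check even if the file system looks clean; -y: answer yes to all.
        return [E2FSCK, "-f", "-y", device]
    # -p: preen, repair only what can be fixed safely without questions.
    return [E2FSCK, "-p", device]


def run_e2fsck(device, force_full_check=False):
    """Run e2fsck and return its exit status without raising on non-zero."""
    command = build_e2fsck_command(device, force_full_check)
    return subprocess.run(command, check=False).returncode
```

Calling `run_e2fsck("/dev/block/mmcblk0p20", force_full_check=True)` on every boot corresponds to the e2fsck -fy option above, with the boot-time cost already noted.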