Ext2/Ext3/Ext4 Filesystems Utilities / Support Requests / #698 e2fsck not detected ext4 filesystem error

Jyothishree NK - 2012-10-31

priority: 5 --> 9

Jyothishree NK - 2012-10-31

assigned_to: nobody --> tytso

Theodore Ts'o - 2012-10-31

There are two possible reasons why e2fsck may not have been able to fix the problem.

The first is that the corruption was in the in-memory version of the block in the buffer-cache, but it was never written out to disk (i.e., the metadata was corrupted in memory or as the data was transferred from the eMMC device to memory). The other possibility is that you're hitting a bug which was introduced in the 3.2 kernel. The fix, commit ID b0dd6b70f0f, landed in the 3.5 kernel, and was backported to 3.2.20 and 3.4.3.
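As a rough aid for the second possibility, here is a minimal sketch that sorts a kernel release string against the versions named above (introduced in 3.2; fixed in 3.5, 3.2.20 and 3.4.3). The function names and return labels are hypothetical, and the treatment of 3.3.x as "unknown" is an assumption, since no backport to that series is mentioned. Vendor kernels may carry their own backports, so the result is only a hint.

```python
import re


def parse_kernel_version(release):
    """Turn a release string such as '3.2.20-rc1' into a 3-tuple of ints."""
    numbers = re.findall(r"\d+", release.split("-")[0])[:3]
    parts = [int(n) for n in numbers]
    while len(parts) < 3:
        parts.append(0)  # pad '3.5' to (3, 5, 0)
    return tuple(parts)


def classify_kernel(release):
    """Classify a kernel against the versions named in the comment above.

    Labels are illustrative only:
      'predates-bug' - older than 3.2, where the bug was introduced
      'affected'     - 3.2.x or 3.4.x older than the backported fix
      'has-fix'      - 3.5 or later, 3.2.20 or later, 3.4.3 or later
      'unknown'      - 3.3.x (assumption: no backport is mentioned)
    """
    version = parse_kernel_version(release)
    if version < (3, 2, 0):
        return "predates-bug"
    if version >= (3, 5, 0):
        return "has-fix"
    if version[:2] == (3, 2):
        return "has-fix" if version >= (3, 2, 20) else "affected"
    if version[:2] == (3, 4):
        return "has-fix" if version >= (3, 4, 3) else "affected"
    return "unknown"
```

On a running system this could be called as `classify_kernel(platform.release())` after `import platform`.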

The error messages:
[ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 14, 31867 blocks in bitmap, 32231 in gd
[ 7.390369] JBD: Spotted dirty metadata buffer (dev = mmcblk0p20, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

are consistent with this bug. However, I would not expect that bug to trigger this ext4 error message:

[ 7.251313] EXT4-fs error (device mmcblk0p20): ext4_lookup:1044: inode #2: comm port-bridge: deleted inode referenced: 12

... which is leaving me a bit puzzled.
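For anyone sifting through similar boot logs, the following is a small sketch that pulls the numbers out of an `ext4_mb_generate_buddy` message in the format quoted above. The regular expression is written against that one quoted line only, and the function and key names are hypothetical; other kernel versions may word the message differently.

```python
import re

# Pattern derived from the single log line quoted in this thread.
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) blocks in bitmap, (?P<gd>\d+) in gd"
)


def parse_buddy_mismatch(line):
    """Return the device, group and both counts, or None if the line does not match."""
    match = BUDDY_RE.search(line)
    if match is None:
        return None
    in_bitmap = int(match.group("bitmap"))
    in_gd = int(match.group("gd"))
    return {
        "device": match.group("dev"),
        "group": int(match.group("group")),
        "in_bitmap": in_bitmap,
        "in_gd": in_gd,
        # How far the group descriptor count is from the bitmap count.
        "difference": in_gd - in_bitmap,
    }


sample = (
    "[ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: "
    "group 14, 31867 blocks in bitmap, 32231 in gd"
)
report = parse_buddy_mismatch(sample)
```

For the line quoted above, the two counts differ by 364 blocks in group 14.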

What kernel version did you see this on?

I'm quite confident that the apparent corruption in the block allocation and inode allocation bitmaps is something that e2fsck would have easily fixed, so I suspect the file system image was in fact fine. The question then is why ext4 thought the file system was corrupted....

BTW, if it's more convenient for you to continue this over e-mail, feel free to contact me at tytso@mit.edu (or if you need to send me something which might be covered by a LG/Google NDA, tytso@google.com --- although in general I and other kernel developers will abide by a gentleman's NDA; if you tell us something should be kept confidential, in general we will honor it, or tell you before you disclose something to us why we might not be able to honor that gentleman's pledge).

Jyothishree NK - 2012-11-01

Hi,
An email has been sent along with the logs.
We see this issue on kernel version 3.0.8.

Regards,
Jyothishree NK

Jyothishree NK - 2012-11-02

Hi,

Please let us know your opinion on the previous email.

Regards,
Jyothishree NK

Theodore Ts'o - 2012-12-06

I looked at this again, and I realized what's going on. E2fsck -p is a preen pass; it only does a full file system check if the file system is marked as containing an error. If you want to force a full file system check, you'll need to use e2fsck -fy instead of e2fsck -p. So if the file system is getting corrupted, but the kernel doesn't realize it before the system reboots, e2fsck will not do a full check.

After a file system error such as:

[ 7.390184] EXT4-fs error (device mmcblk0p20): ext4_mb_generate_buddy:736: group 14, 31867 blocks in bitmap, 32231 in gd

a subsequent e2fsck -p should force a full file system check.
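To see whether the "contains an error" marker is actually set on a given image, one option is to read the state field from the superblock directly. The sketch below assumes the standard ext2/3/4 on-disk layout (superblock at byte offset 1024, `s_magic` at offset 56 and `s_state` at offset 58 within it, error bit 0x0002). The helper names are hypothetical, and e2fsck -p looks at more than this one flag, so treat it as a diagnostic aid rather than a model of e2fsck's decision.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # superblock starts 1024 bytes into the partition
MAGIC_OFFSET = 56          # s_magic, little-endian u16
EXT_MAGIC = 0xEF53
STATE_CLEAN = 0x0001       # cleanly unmounted
STATE_ERROR = 0x0002       # errors detected by the kernel


def read_state(image_bytes):
    """Return the s_state field from the first bytes of an ext2/3/4 image."""
    start = SUPERBLOCK_OFFSET + MAGIC_OFFSET
    if len(image_bytes) < start + 4:
        raise ValueError("image too short to contain a superblock")
    # s_magic is immediately followed by s_state.
    magic, state = struct.unpack_from("<HH", image_bytes, start)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3/ext4 superblock")
    return state


def error_flag_set(image_bytes):
    """True if the file system is marked as containing an error."""
    return bool(read_state(image_bytes) & STATE_ERROR)


def read_superblock_region(path):
    """Read enough of a device or image file to cover the superblock."""
    with open(path, "rb") as handle:
        return handle.read(SUPERBLOCK_OFFSET + 1024)
```

With sufficient privileges, `error_flag_set(read_superblock_region("/dev/mmcblk0p20"))` would report the flag for the partition named in the logs above.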

Theodore Ts'o - 2012-12-06

I'm going to guess by the way that what is happening is that the flash is lying to us, and is not guaranteeing that the data is pushed out to stable storage after a FLUSH CACHE (aka barrier) command, and so on a power failure, the file system is getting corrupted because the correctness of the journal recovery depends on the storage device guaranteeing that blocks written before the FLUSH CACHE are stably written to disk.

I'm going to hope that flash isn't crappy enough such that blocks written long ago --- but which are in an erase block which might have been copied at the time of the power outage --- are getting corrupted on a power drop, but I've seen such crappy behaviour with MMC devices, and I've heard that eMMC devices can be even crappier since handset manufacturers are trying to save pennies per device (since when you are manufacturing millions of devices at a time, it adds up to real money). If that is the case, one possibility is to use e2fsck -fy after every reboot. But, (a) this will slow down the boot times, and (b) if the flash is that crappy, then there's the possibility of data blocks getting corrupted on a power drop, and nothing will save you there. All you can do is yell at the flash manufacturer, and ask them to provide you with devices that don't randomly lose data on a power drop.
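If a forced check on every boot is tried, a boot script needs to run e2fsck -fy on the unmounted partition and interpret the result. Here is a minimal sketch of that, assuming the exit status bits documented in the e2fsck man page; the function names are hypothetical, and how the result feeds into the boot sequence is left open.

```python
import subprocess

# Exit status bits as documented in the e2fsck man page; the status is their sum.
EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}


def describe_exit_status(code):
    """Expand an e2fsck exit status into human-readable messages."""
    if code == 0:
        return ["no errors"]
    return [text for bit, text in EXIT_BITS.items() if code & bit]


def forced_check_argv(device):
    """Command line for a forced, non-interactive full check."""
    return ["e2fsck", "-f", "-y", device]


def run_forced_check(device):
    """Run the check (the file system must not be mounted) and describe the result."""
    result = subprocess.run(forced_check_argv(device))
    return describe_exit_status(result.returncode)
```

This only reports what e2fsck found in the metadata; as noted above, it cannot detect or repair corrupted data blocks.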
