One of our Oracle databases crashed while maintenance was being carried out on the application that connects to it.
As we have recently been getting "oracle logical block errors" for one of the dbf files on the SAN (EVA4000) partition (listed in fstab as sda1), I took the opportunity to fsck it while it was unmounted.
The application server's OS is RedHat Linux Server 5.
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/sda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: 38/6553600 files (36.8% non-contiguous), 2786445/13107199 blocks
Now to the question:
Can anybody please tell me (or direct me to documentation that defines) what "contains a file system with errors" covers?
What probably happened is that a disk buffer got corrupted, either in memory or while it was being transferred from disk to memory. The kernel noticed the problem, and so it set the EXT2_ERROR_FS flag in the superblock. If the filesystem isn't the one where /var is located, and/or your kernel wasn't configured to force a reboot or remount the filesystem read-only, there may be an entry in your system's log file describing the error the kernel found.
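To make the flag concrete, here is a minimal sketch of how that state bit could be read straight from the superblock. It assumes the standard ext2/ext3 on-disk layout (primary superblock 1024 bytes into the device, little-endian s_magic at byte 56 and s_state at byte 58 within it). The function names are made up for illustration; this is not how e2fsck itself is written.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # primary superblock starts 1024 bytes into the device
SUPERBLOCK_SIZE = 1024
EXT2_MAGIC = 0xEF53       # s_magic value shared by ext2 and ext3
EXT2_VALID_FS = 0x0001    # s_state bit: filesystem was unmounted cleanly
EXT2_ERROR_FS = 0x0002    # s_state bit: the kernel detected errors


def decode_state(superblock: bytes) -> dict:
    """Decode the s_state field from a raw superblock (hypothetical helper)."""
    # s_magic and s_state are consecutive little-endian 16-bit fields at offset 56.
    magic, state = struct.unpack_from("<HH", superblock, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    return {
        "clean": bool(state & EXT2_VALID_FS),
        "errors": bool(state & EXT2_ERROR_FS),
    }


def read_superblock(device_path: str) -> bytes:
    """Read the primary superblock from a device or image file (needs read access)."""
    with open(device_path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return dev.read(SUPERBLOCK_SIZE)
```

With read access to the device, decode_state(read_superblock("/dev/sda1")) would report "errors": True in the situation described above. The "Filesystem state" line printed by tune2fs -l shows the same field without any code.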
Thanks for taking the time to reply.
Oh, let me clarify one thing. The reason I suspect this is what happened (in case it wasn't obvious) is that e2fsck didn't find any errors. For EXT3_ERROR_FS to have been set, though, the kernel must have found some filesystem inconsistency that was so obvious that spotting it doesn't require looking at large portions of the filesystem at the same time.
So for example, if you are deleting a file, and the kernel is freeing blocks, and it finds that the blocks are already marked as freed (for example, if the block bitmap was corrupted while it was being transferred from disk to memory), it will log that fact, set EXT3_ERROR_FS, and either (a) continue, (b) panic, or (c) remount the filesystem read-only. (a) is the default, but some people have argued to me that (b) or (c) should be the default, since if the filesystem is damaged, forcing a reboot (and letting a hot standby take over), or remounting read-only (to keep the filesystem damage from spreading and causing more data loss), might be the better choice.
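Which of the three behaviours a given filesystem asks for is recorded in the superblock too. The sketch below assumes the standard ext2/ext3 layout, with s_errors as a little-endian 16-bit field at byte 60 of the superblock and the values 1, 2 and 3 meaning continue, remount read-only and panic. The helper name is hypothetical, and treating an unset or unknown value as "continue" is an assumption that follows the default described above.

```python
import struct

# s_errors values from the ext2/ext3 on-disk format.
ERROR_BEHAVIOUR = {
    1: "continue",    # (a) log the error and carry on
    2: "remount-ro",  # (c) remount the filesystem read-only
    3: "panic",       # (b) panic, which forces a reboot
}


def decode_error_behaviour(superblock: bytes) -> str:
    """Return the on-error behaviour stored in a raw superblock (hypothetical helper)."""
    # s_errors is a little-endian 16-bit field at offset 60 of the superblock.
    (s_errors,) = struct.unpack_from("<H", superblock, 60)
    # Assumption: an unset (0) or unknown value falls back to "continue".
    return ERROR_BEHAVIOUR.get(s_errors, "continue")
```

The strings match the arguments that tune2fs -e accepts and the values of the errors= mount option, so a filesystem could be switched to (c) with tune2fs -e remount-ro /dev/sda1. A mount option, if given, takes precedence over the value stored in the superblock.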