Thread: [Jfs-discussion] filesystem corruption
From: James C. <clo...@jh...> - 2005-09-04 16:23:48
I got some corruption with a post 2.6.13 kernel (shortly after the jfs
changes were merged).

I didn't notice the oops at first -- except that the:

    mount -n -o remount,rw

was stuck in text.lock (IIRC; text.something in any case).

In addition to the remount getting stuck, sync(1) and umount(8) also
got stuck, so I was forced to do an emergency sync/sync/umount/boot.

I dropped back to a kernel closer to 2.6.13 as released and that is
working fine, AFAICT.

The oops is below.

The symptom is mostly in the form of meta-data corruption for anything
that was changed under that kernel.  Several binaries ended up with
0666 rather than 0755 perms, as an example.  /etc/ld.so.conf was empty
(but easily recoverable as Gentoo frequently auto-gens it based on
what packages are installed; in fact that frequent auto-generation is
probably why it ended up empty).

I suspect the emerge process copies the files into the filesystem such
that they are opened with 666 perms and then chmod(2)ed to the perms
they had in the staging install tree, which suggests that the failure
is only with meta-data changes.

I still have that kernel installed, so I can do some more debugging if
helpful, but only minor stuff as I need to boot with init=/bin/bash to
keep the box usable....

,----
| [4294746.121000] BUG at fs/jfs/jfs_logmgr.c:1622 assert(list_empty(&log->cqueue))
| [4294746.121000] ------------[ cut here ]------------
| [4294746.121000] kernel BUG at fs/jfs/jfs_logmgr.c:1622!
| [4294746.121000] invalid operand: 0000 [#1]
| [4294746.121000] Modules linked in: i8k snd_pcm_oss snd_mixer_oss
|                  snd_maestro3 snd_ac97_codec
|                  snd_ac97_bus snd_pcm snd_timer
|                  snd soundcore snd_page_alloc
|                  uhci_hcd e100
| [4294746.121000] CPU: 0
| [4294746.121000] EIP: 0060:[<c0219f12>] Not tainted VLI
| [4294746.121000] EFLAGS: 00010286 (2.6.13-lug2)
| [4294746.121000] EIP is at jfs_flush_journal+0x1c2/0x250
| [4294746.121000] eax: 00000058 ebx: 000000c8 ecx: c04fb22c edx: c04fb22c
| [4294746.121000] esi: df8aa354 edi: df8aa2c0 ebp: debebeac esp: debebe3c
| [4294746.121000] ds: 007b es: 007b ss: 0068
| [4294746.121000] Process mount (pid: 2246, threadinfo=debea000 task=df75f040)
| [4294746.121000] Stack: c047124d c047167c 00000656 c0471663 df8aa2c0 df8aa32c dee7c000 00000000
| [4294746.121000]        00000002 c02362be c1721200 dee7c000 debebebc c01fe15d 00000000 00000001
| [4294746.121000]        c15601e0 ffffffff debebec8 00000000 00000000 00000000 00000000 00000000
| [4294746.121000] Call Trace:
| [4294746.121000] [<c01034ba>] show_stack+0x7a/0x90
| [4294746.121000] [<c0103639>] show_registers+0x149/0x1c0
| [4294746.121000] [<c010382b>] die+0xbb/0x140
| [4294746.121000] [<c0103931>] do_trap+0x81/0xc0
| [4294746.121000] [<c0103ba5>] do_invalid_op+0xa5/0xb0
| [4294746.121000] [<c0103133>] error_code+0x4f/0x54
| [4294746.121000] [<c0202430>] jfs_umount_rw+0x20/0x70
| [4294746.121000] [<c01fe49e>] jfs_remount+0x13e/0x170
| [4294746.121000] [<c015ff0b>] do_remount_sb+0xcb/0x150
| [4294746.121000] [<c0174c98>] do_remount+0x88/0xe0
| [4294746.121000] [<c01755bd>] do_mount+0x18d/0x1c0
| [4294746.121000] [<c0175928>] sys_mount+0x68/0xa0
| [4294746.121000] [<c0102ebf>] sysenter_past_esp+0x54/0x75
| [4294746.121000] Badness in do_exit at kernel/exit.c:787
| [4294746.180000] [<c01034e7>] dump_stack+0x17/0x20
| [4294746.183000] [<c011e603>] do_exit+0x353/0x360
| [4294746.185000] [<c01038a4>] die+0x134/0x140
| [4294746.186000] [<c0103931>] do_trap+0x81/0xc0
| [4294746.188000] [<c0103ba5>] do_invalid_op+0xa5/0xb0
| [4294746.190000] [<c0103133>] error_code+0x4f/0x54
| [4294746.191000] [<c0202430>] jfs_umount_rw+0x20/0x70
| [4294746.193000] [<c01fe49e>] jfs_remount+0x13e/0x170
| [4294746.195000] [<c015ff0b>] do_remount_sb+0xcb/0x150
| [4294746.196000] [<c0174c98>] do_remount+0x88/0xe0
| [4294746.198000] [<c01755bd>] do_mount+0x18d/0x1c0
| [4294746.200000] [<c0175928>] sys_mount+0x68/0xa0
| [4294746.201000] [<c0102ebf>] sysenter_past_esp+0x54/0x75
`----

-JimC
-- 
James H. Cloos, Jr. <cl...@jh...>
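The symptom described above (binaries left at 0666 instead of 0755) can be hunted for mechanically. What follows is only a rough sketch, not something from the thread: the function name `files_missing_exec` is made up for illustration, and it assumes that "regular file with no execute bit at all" is a good enough test for a directory that should hold only executables.

```python
import os
import stat


def files_missing_exec(root):
    """Return sorted paths of regular files under root with no execute bit set.

    Intended for directories that should contain only executables, where a
    lost chmod(2) would leave files at their creation mode (e.g. 0666).
    """
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & 0o111 == 0:  # no x bit for user, group or other
                missing.append(path)
    return sorted(missing)
```

Pointing it at a directory such as /usr/bin would list candidates to re-emerge; it cannot tell a damaged file from one that was never meant to be executable.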
From: Dave K. <sh...@au...> - 2005-09-04 17:26:19
On Sun, 2005-09-04 at 12:23 -0400, James Cloos wrote:
> I got some corruption with a post 2.6.13 kernel (shortly after the jfs
> changes were merged).

Is this a Linus kernel, or something different?  There has only been one
post-2.6.13 change to jfs in Linus' kernel, and that looks pretty
harmless.

> I didn't notice the oops at first -- except that the:
>
>     mount -n -o remount,rw
>
> was stuck in text.lock (IIRC; text.something in any case).
>
> In addition to the remount getting stuck, sync(1) and umount(8) also
> got stuck, so I was forced to do an emergency sync/sync/umount/boot.

From the look of the stack trace, the oops must have taken place during
the 'mount -n -o remount,ro /' in /etc/init.d/checkroot.  It probably
left something locked when it oopsed, causing subsequent operations to
hang.

> I dropped back to a kernel closer to 2.6.13 as released and that is
> working fine, AFAICT.
>
> The oops is below.
>
> The symptom is mostly in the form of meta-data corruption for anything
> that was changed under that kernel.  Several binaries ended up with
> 0666 rather than 0755 perms, as an example.  /etc/ld.so.conf was empty
> (but easily recoverable as Gentoo frequently auto-gens it based on
> what packages are installed; in fact that frequent auto-generation is
> probably why it ended up empty).
>
> I suspect the emerge process copies the files into the filesystem such
> that they are opened with 666 perms and then chmod(2)ed to the perms
> they had in the staging install tree, which suggests that the failure
> is only with meta-data changes.
>
> I still have that kernel installed, so I can do some more debugging if
> helpful, but only minor stuff as I need to boot with init=/bin/bash to
> keep the box usable....
>
> ,----
> | [4294746.121000] BUG at fs/jfs/jfs_logmgr.c:1622 assert(list_empty(&log->cqueue))
> | [4294746.121000] ------------[ cut here ]------------
> | [4294746.121000] kernel BUG at fs/jfs/jfs_logmgr.c:1622!

Hmm.  For some reason, jfs was unable to write everything to the
journal.  I don't know what could have triggered this in a recent
kernel.  Is the file system on an ide drive?  I don't see any recent
changes to ide, or anything else post-2.6.13 that would explain this.

I wonder if whatever caused this is a bug in 2.6.13, but it's just not
easily reproduced.  Did you try it more than once on the later kernel?
-- 
David Kleikamp
IBM Linux Technology Center
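Reading the call path out of an oops like the one quoted above is easy to script. This is a minimal sketch, assuming frames always look like `[<address>] symbol+0xoffset/0xsize` as they do in this report; the names `FRAME_RE` and `parse_frames` are illustrative and not part of any kernel tooling.

```python
import re

# Matches frames such as "[<c0202430>] jfs_umount_rw+0x20/0x70".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(text):
    """Return a list of (symbol, offset, size) tuples in the order printed."""
    return [
        (m.group(2), int(m.group(3), 16), int(m.group(4), 16))
        for m in FRAME_RE.finditer(text)
    ]


def jfs_frames(text):
    """Return only the symbols that belong to jfs, judged by name prefix."""
    return [sym for sym, _off, _size in parse_frames(text) if sym.startswith("jfs_")]
```

Run over the trace in this thread, the jfs entries are jfs_umount_rw and jfs_remount, which is the remount path discussed above. The EIP line is not matched because no `symbol+0x...` follows its address.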
From: James C. <clo...@jh...> - 2005-09-04 21:48:45
>>>>> "Dave" == Dave Kleikamp <sh...@au...> writes:

Dave> Is this a Linus kernel, or something different?  There has only
Dave> been one post-2.6.13 change to jfs in Linus' kernel, and that
Dave> looks pretty harmless.

Yes.  Via the hg repo, but that, AIUI, tracks the git repo real-time.

There've been a few changesets since I compiled, but mostly non-x86.

Dave> From the look of the stack trace, the oops must have taken place
Dave> during the 'mount -n -o remount,ro /' in /etc/init.d/checkroot.
Dave> It probably left something locked when it oopsed, causing
Dave> subsequent operations to hang.

Exactly.  Looking thru System.map, I see there are 237 entries that
match the glob '.text.lock.*'.  Given top(1)'s minimal column width
for wchan I don't know which lock mount was spinning on.

Dave> For some reason, jfs was unable to write everything to the
Dave> journal.  I don't know what could have triggered this in a
Dave> recent kernel.  Is the file system on an ide drive?  I don't see
Dave> any recent changes to ide, or anything else post-2.6.13 that
Dave> would explain this.

Laptop, ide.

Dave> I wonder if whatever caused this is a bug in 2.6.13, but it's
Dave> just not easily reproduced.  Did you try it more than once on
Dave> the later kernel?

The oops occurred every time I booted the later kernel w/o
init=/bin/bash.  With init(8) bypassed I could remount the filesystem
as often as I tried w/o an oops, but dev (udev) and sys were not
mounted when I did that test, only / and proc.

As I didn't write anything beyond mtab and mtab's lock files (by
forgetting -n in mount(8)'s args) I don't know whether any damage
would have occurred even w/o the oops.

I'm pretty sure I've fully recovered, by re-emerging as necessary,
although I may have lost some (non-critical) mail.

-JimC
-- 
James H. Cloos, Jr. <cl...@jh...>
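The System.map count mentioned above can be reproduced with a few lines. This is a sketch only: it assumes the usual three-column `address type symbol` layout of System.map, and `matching_symbols` is a made-up helper name.

```python
from fnmatch import fnmatchcase


def matching_symbols(system_map_lines, pattern):
    """Return symbol names from System.map lines that match a shell glob.

    Each well-formed line is expected to be "address type symbol";
    anything else is skipped.
    """
    found = []
    for line in system_map_lines:
        parts = line.split()
        if len(parts) == 3 and fnmatchcase(parts[2], pattern):
            found.append(parts[2])
    return found
```

With the real file, something like `len(matching_symbols(open("System.map"), ".text.lock.*"))` would give the number of lock sections. Listing the names rather than just counting them is what would help identify the truncated wchan value from top(1).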
From: James C. <clo...@jh...> - 2005-09-14 09:59:28
I've now upgraded to 2.6.14-rc1 and can still trigger the oops.

But I found out that it is dependent on which modules are loaded.

(I rebooted w/o running make modules_install w/o an oops, but when I
fixed that and rebooted again, I got the oops.)

My autoloaded modules are:

    e100 uhci-hcd snd_maestro3 snd_mixer_oss snd_pcm_oss i8k

plus their dependencies.  All of the dependencies are alsa;
I believe the full list of dependencies is:

    soundcore snd snd_seq_device snd_timer snd_seq snd_page_alloc snd_pcm
    snd_ac97_codec snd_rtctimer snd_seq_oss

I have to go back and try each one at a time, but I wanted to get this
out first.

There has been an alsa merge since -rc1, so if it is alsa that
triggers the difference I'll have to upgrade and test yet again....

But with all of those modules loaded the oops is reproducible, and w/o
them it (so far) is not.

Also, as long as I reboot right away and fsck, it seems that I can
avoid any data loss.  But if I let it run, anything written (or perhaps
anything written after the log gets full) is lost.

(As an example I discovered only after the first two notes, mozilla
lost its cookie permissions file, but other files that it updates
were OK.)

-JimC
-- 
James H. Cloos, Jr. <cl...@jh...>
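Trying the modules one at a time means loading each candidate together with whatever it depends on, and nothing else. A small sketch of that bookkeeping follows; it assumes the caller supplies the dependency table (for instance as read from modules.dep), and both `load_closure` and the placeholder module names in the usage note are hypothetical.

```python
def load_closure(module, deps):
    """Return module plus all its transitive dependencies, dependencies first.

    deps maps a module name to the list of modules it directly requires.
    The result is a valid load order for testing this one module alone.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, ()):
            visit(dep)
        order.append(name)  # appended only after everything it needs

    visit(module)
    return order
```

For a table like `{"mod_a": ["mod_b", "mod_c"], "mod_b": ["mod_c"]}`, testing mod_a alone means loading mod_c, mod_b, mod_a in that order. Repeating this per autoloaded module gives one boot test per candidate.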
From: Dave K. <sh...@au...> - 2005-09-14 13:31:12
On Wed, 2005-09-14 at 05:59 -0400, James Cloos wrote:
> I've now upgraded to 2.6.14-rc1 and can still trigger the oops.
>
> But I found out that it is dependent on which modules are loaded.

I was confused before.  I was thinking that the oops happened on the
boot side of the reboot, but looking back, I see that it is happening on
the shutdown side.  Is the hang happening during shutdown or boot?

> (I rebooted w/o running make modules_install w/o an oops, but when I
> fixed that and rebooted again, I got the oops.)
>
> My autoloaded modules are:
>
>     e100 uhci-hcd snd_maestro3 snd_mixer_oss snd_pcm_oss i8k
>
> plus their dependencies.  All of the dependencies are alsa;
> I believe the full list of dependencies is:
>
>     soundcore snd snd_seq_device snd_timer snd_seq snd_page_alloc snd_pcm
>     snd_ac97_codec snd_rtctimer snd_seq_oss
>
> I have to go back and try each one at a time, but I wanted to get this
> out first.
>
> There has been an alsa merge since -rc1, so if it is alsa that
> triggers the difference I'll have to upgrade and test yet again....
>
> But with all of those modules loaded the oops is reproducible, and w/o
> them it (so far) is not.

I don't really know why the modules would cause jfs to misbehave.  I'll
have to try booting from jfs on my laptop (which is also gentoo + kernel
2.6.14-rc1).  I don't run jfs on the root because I like to rebuild
jfs.ko and reload it often.

> Also, as long as I reboot right away and fsck, it seems that I can
> avoid any data loss.  But if I let it run, anything written (or perhaps
> anything written after the log gets full) is lost.
>
> (As an example I discovered only after the first two notes, mozilla
> lost its cookie permissions file, but other files that it updates
> were OK.)

This is odd.  Can you rebuild the kernel with CONFIG_JFS_DEBUG set, and
look for any suspicious dmesg output that may show up while the system
is running or right before the oops?

> -JimC
-- 
David Kleikamp
IBM Linux Technology Center
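Before rebooting into a debug kernel it is worth confirming that the option actually made it into the build configuration. The sketch below assumes the standard kernel .config syntax, where a set option reads `CONFIG_X=value` and an unset one reads `# CONFIG_X is not set`; `config_value` is an illustrative helper, not an existing tool.

```python
def config_value(config_text, option):
    """Return the value assigned to option in kernel .config text, or None.

    Commented lines (including "# CONFIG_X is not set") count as unset.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key == option:
            return value
    return None
```

Calling `config_value(open(".config").read(), "CONFIG_JFS_DEBUG")` in the kernel source tree should return `"y"` once the option is enabled.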
From: Dave K. <sh...@au...> - 2005-09-15 19:12:02
On Wed, 2005-09-14 at 08:31 -0500, Dave Kleikamp wrote:
> I don't really know why the modules would cause jfs to misbehave.  I'll
> have to try booting from jfs on my laptop (which is also gentoo + kernel
> 2.6.14-rc1).  I don't run jfs on the root because I like to rebuild
> jfs.ko and reload it often.

Well, not my laptop, but I have a desktop machine running gentoo +
2.6.14-rc1, booting from jfs, and it's not having any problems
rebooting.  It has an AC97 soundcard, but it currently has that code
compiled into the kernel, rather than as modules.  The only loaded
module is e100.
-- 
David Kleikamp
IBM Linux Technology Center
From: James C. <clo...@jh...> - 2005-09-19 08:28:06
The oops does occur at boot time; gentoo autoloads modules before it
starts running the /etc/runlevel/* scripts, including the one that
remounts / rw.

It turned out that the alsa modules cause the oops.  Specifically
loading maestro3 and its dependencies.

I believe there were some alsa patches that went in since I last
compiled, so I will try that later today.

I'll also post about this on the alsa list and see what develops.

-JimC
-- 
James H. Cloos, Jr. <cl...@jh...>
From: Dave K. <sh...@au...> - 2005-09-19 13:07:14
On Mon, 2005-09-19 at 04:27 -0400, James Cloos wrote:
> The oops does occur at boot time; gentoo autoloads modules before it
> starts running the /etc/runlevel/* scripts, including the one that
> remounts / rw.

Are you initially mounting / read-only, as in specifying the ro kernel
parameter in grub.conf?  You should be, since jfs won't mount a dirty
volume rw until fsck replays the journal.  And if you are, remounting
shouldn't require shutting down the journaling code.

> It turned out that the alsa modules cause the oops.  Specifically
> loading maestro3 and its dependencies.
>
> I believe there were some alsa patches that went in since I last
> compiled, so I will try that later today.
>
> I'll also post about this on the alsa list and see what develops.

Even though the alsa drivers appear to trigger the problem, it really
looks like it's something that needs to be fixed in jfs, if I can figure
out what's happening.

> -JimC

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center
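Whether the running kernel was actually booted with `ro` can be read back from /proc/cmdline rather than inferred from grub.conf. A minimal sketch, assuming only that kernel parameters are whitespace-separated tokens; the function name and the root device in the usage note are made up for illustration.

```python
def booted_with_ro(cmdline):
    """Return True if the kernel command line contains a standalone 'ro' token.

    Substrings do not count, so 'root=...' or 'rootfstype=...' alone
    will not be mistaken for the read-only flag.
    """
    return "ro" in cmdline.split()
```

For example, `booted_with_ro("root=/dev/hda3 ro init=/bin/bash")` is true, while a line carrying only `root=` and `rootfstype=` is not. On a live system the argument would be `open("/proc/cmdline").read()`.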
From: James C. <clo...@jh...> - 2005-09-23 11:23:27
>>>>> "Dave" == Dave Kleikamp <sh...@au...> writes:

Dave> Are you initially mounting / read-only, as in specifying the ro
Dave> kernel parameter in grub.conf?

Yes.  Always have done.  And I certainly do have to remount rw when
I try booting with init=/bin/bash added to the grub kernel line....

OTOH, the rc scripts do have code to remount ro; I don't /think/ they
are doing anything silly like remount rw then remount ro then rw.

Also proc, sys and udev are mounted before the remount,rw.  I know
the tmpfs used for /dev won't unmount at shutdown (it is busy); I
don't know whether that has an effect on the rw remount of /.

Dave> And if
Dave> you are, remounting shouldn't require shutting down the
Dave> journaling code.

Ok.

>> It turned out that the alsa modules cause the oops.  Specifically
>> loading maestro3 and its dependencies.

Dave> Even though the alsa drivers appear to trigger the problem, it
Dave> really looks like it's something that needs to be fixed in jfs,
Dave> if I can figure out what's happening.

Ok.  Let me know what I can do to help track that down.

-JimC
From: <Nic...@ho...> - 2005-09-23 13:56:14
We were planning to utilize JFS on SLES9 utilizing IBM's SDD on an IBM
DS6800 Storage array, until I read the readme file (1), which says:

    3.5 Correction to User's Guide

      o Supported Filesystem Statement

        In the current User's Guide, we make various statements regarding
        specific filesystem support.  For Linux 2.6 kernels (the SLES 9 and
        RHEL 4 distributions) SDD will only support the following
        filesystems:

          o ext2
          o ext3

        Please ensure that you do not run any other filesystems on your SDD
        vpath devices.

Maybe someone can help me understand why JFS would not be supported in
this type of configuration.

The last line mentions the vpath devices.  We are going to be
utilizing LVM, with vpath devices, but would not be formatting the
vpath devices directly.  Any idea if this may be supported?

(1): ftp://ftp.software.ibm.com/storage/subsystem/linux/1.6.0.1-8/rd_linux.2.6.txt
From: Dave K. <sh...@au...> - 2005-09-23 14:55:32
On Fri, 2005-09-23 at 09:59 -0400, Nic...@ho... wrote:
> We were planning to utilize JFS on SLES9 utilizing IBM's SDD on an IBM
> DS6800 Storage array, until I read the readme file (1), which says:
>
>     3.5 Correction to User's Guide
>
>       o Supported Filesystem Statement
>
>         In the current User's Guide, we make various statements regarding
>         specific filesystem support.  For Linux 2.6 kernels (the SLES 9 and
>         RHEL 4 distributions) SDD will only support the following
>         filesystems:
>
>           o ext2
>           o ext3
>
>         Please ensure that you do not run any other filesystems on your SDD
>         vpath devices.
>
> Maybe someone can help me understand why JFS would not be supported in
> this type of configuration.

In the RHEL 4 case, it makes sense, since those are the only native file
systems that Redhat supports.  They neither build the other major file
systems (jfs, xfs, reiserfs), nor are they eager to accept patches to
these file systems to keep their kernel source tree up to date.

I know of no technical issues why jfs wouldn't be supported on SLES 9.
Maybe it's an issue of not wanting to test too many configurations.

> The last line mentions the vpath devices.  We are going to be
> utilizing LVM, with vpath devices, but would not be formatting the
> vpath devices directly.  Any idea if this may be supported?
>
> (1): ftp://ftp.software.ibm.com/storage/subsystem/linux/1.6.0.1-8/rd_linux.2.6.txt
-- 
David Kleikamp
IBM Linux Technology Center
From: Christoph H. <hc...@in...> - 2005-09-23 16:19:57
On Fri, Sep 23, 2005 at 09:59:23AM -0400, Nic...@ho... wrote:
> We were planning to utilize JFS on SLES9 utilizing IBM's SDD on an IBM
> DS6800 Storage array, until I read the readme file (1), which says:

IBM's SDD is completely broken.  Please try to use the kernel device
mapper multipath code.

/me being grumpy that the morons at that ibm division even
forward-ported this code..
From: <Nic...@ho...> - 2005-09-23 17:27:30
Christoph,

Could you help me understand why IBM SDD is broken, versus using device
mapper multipath?

Thanks,
Nick

Christoph Hellwig <hc...@in...>
Sent by: jfs...@li...
09/23/2005 12:19 PM
To: Nic...@ho...
cc: jfs...@li..., jfs...@li...
Subject: Re: [Jfs-discussion] JFS on IBM Subsystem Device Driver (SDD) 1.6 SLES9

On Fri, Sep 23, 2005 at 09:59:23AM -0400, Nic...@ho... wrote:
> We were planning to utilize JFS on SLES9 utilizing IBM's SDD on an IBM
> DS6800 Storage array, until I read the readme file (1), which says:

IBM's SDD is completely broken.  Please try to use the kernel device
mapper multipath code.

/me being grumpy that the morons at that ibm division even
forward-ported this code..