Thread: [Jfs-discussion] jfs issues with power failure, fast reboot sequences and disk/fs unrelated kernel
From: Fiedler R. <Rom...@ai...> - 2011-03-31 09:02:26
|
Hello list,

Different observations with various kernels (stock hardy/lucid, so up to 2.6.32) made me wonder, if there are known issues with default jfs settings, general jfs stability or fsck functionality in case of fast/unclean jfs fs shutdown.

The symptoms of the problem are always similar, although triggered by different events (see thread topic):

* Normal reboot after shutdown or crash, mount works without problems, redo ok
* If detected by chance, you might find files, where e.g. ls reports "stale nfs lock", although file is normal jfs file (e.g. /var/log/dmesg.3.gz)
* When forcing a fsck, the "stale nfs lock" files will vanish, some data might reappear in /lost+found

Do you know of any open bugs related to that or could it be, that patches to jfs were not correctly backported to Ubuntu? Could it be due to misconfiguration? Are there guidelines, how to use jfs in production environment? There are some old posts, that related jfs failures to the schedulers used in kernel. Are there known bad combinations? Are there rootkits or exploits known, that cause a similar file system disruption?

Do you have trunk jfs modules for Ubuntu lucid kernel or even a whole trunk kernel with trunk jfs for testing? I could offer to try to reproduce on virtual machine.

Kind regards,
Roman

Jfs log extracted from system suffering problems after hard power down:

**Phase 0 - Replay Journal Log [xchkdsk.c:1871]
LOGREDO: Log already redone! [logredo.c:555]
logredo returned rc = 0 [xchkdsk.c:1903]
**Phase 1 - Check Blocks, Files/Directories, and Directory Entries [xchkdsk.c:1996]
File system object FF156982 has corrupt data (9). [fsckino.c:1977]
File system object FF156983 has corrupt data (9). [fsckino.c:1977]
**Phase 2 - Count links [xchkdsk.c:2087]
**Phase 3 - Duplicate Block Rescan and Directory Connectedness [xchkdsk.c:2120]
Directory entries for unallocated files have been detected. Will remove. [xchkdsk.c:748]
**Phase 4 - Report Problems [xchkdsk.c:2198]
File system object FF156982 is linked as: /var/log/ulog/old/pcap.log.13 [fsckino.c:336]
cannot repair the data format error(s) in this file. [xchkdsk.c:1202]
cannot repair FF156982. Will release. [xchkdsk.c:1244]
File system object FF156983 is linked as: /var/log/ulog/old/syslogemu.log.13 [fsckino.c:336]
cannot repair the data format error(s) in this file. [xchkdsk.c:1202]
cannot repair FF156983. Will release. [xchkdsk.c:1244]
File system object FF192674 is linked as: /var/log/dmesg.3.gz [fsckino.c:336]
The path(s) refer to an unallocated file. Will remove. [xchkdsk.c:1177]
File system object DF262359 is linked as: /var/log/ulog/old [fsckino.c:320]
**Phase 5 - Check Connectivity [xchkdsk.c:2230]
No paths were found for inode F192684. [fsckconn.c:311]
**Phase 6 - Perform Approved Corrections [xchkdsk.c:2259]
Superblock marked dirty because repairs are about to be written. [xchkdsk.c:2280]
No \lost+found directory found in the filesystem. [xchkdsk.c:2803]
Storage allocated to inode F156982 has been cleared. [xchkdsk.c:2582]
Storage allocated to inode F156983 has been cleared. [xchkdsk.c:2582]
Directory inode F172073 entry reference to inode F192674 removed. [xchkdsk.c:2700]
Directory inode F262359 entry reference to inode F156983 removed. [xchkdsk.c:2700]
Directory inode F262359 entry reference to inode F156982 removed. [xchkdsk.c:2700]
File inode 192684 has been reconnected to /lost+found/. [fsckdtre.c:3948]
1 file reconnected to /lost+found/. [fsckdtre.c:4007]
**Phase 7 - Rebuild File/Directory Allocation Maps [xchkdsk.c:2374]

DI Roman Fiedler
Safety & Security Department
Information Management & eHealth
AIT Austrian Institute of Technology GmbH
Reininghausstraße 13/1 | 8020 Graz | Austria
T +43(0) 316 586570-63 | M +43(0) 664 8561599 | F +43(0) 316 586570-12
rom...@ai...
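When triaging an fsck log like the one in this message, the objects that fsck gave up on and released can be pulled out mechanically. The following is only a minimal sketch, assuming the log was saved to a text file; the function name fsck_released_objects is made up for this illustration and is not part of jfsutils.

```shell
# List the IDs of file system objects that fsck.jfs could not repair and
# released, one per line. Reads a saved fsck log on stdin.
# Hypothetical helper for illustration; not part of jfsutils.
fsck_released_objects() {
    # Match only the "cannot repair <ID>. Will release" messages, not the
    # "cannot repair the data format error(s)" lines.
    grep -o 'cannot repair [A-Z]*[0-9][0-9]*\. Will release' |
        awk '{ print $3 }' |
        sed 's/\.$//'
}
```

Called as `fsck_released_objects < fsck.log`, it prints each released object ID on its own line; for the log in this message those are FF156982 and FF156983.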
From: Christian K. <li...@ne...> - 2011-04-05 06:24:05
|
On Thu, 31 Mar 2011 at 11:02, Fiedler Roman wrote:
> Different observations with various kernels (stock hardy/lucid, so
> up to 2.6.32) made me wonder, if there are known issues with default jfs
> settings, general jfs stability or fsck functionality in case of
> fast/unclean jfs fs shutdown.

Well, there's https://bugzilla.kernel.org and of course the bugtracker on SF and also the mailing list archive for "known issues".

> * Normal reboot after shutdown or crash, mount works without problems, redo ok
> * If detected by chance, you might find files, where e.g. ls reports "stale nfs
> lock", although file is normal jfs file (e.g. /var/log/dmesg.3.gz)

Is NFS involved at all? If not, there was at least another post similar to this: http://www.mail-archive.com/jfs...@li.../msg01636.html

> Do you have trunk jfs modules for Ubuntu lucid kernel or even a whole trunk kernel
> with trunk jfs for testing? I could offer to try to reproduce on virtual

JFS is in maintenance-mode, development has stopped for a while now, so ../fs/jfs from the Ubuntu kernel should be equal to the latest vanilla one.

> Jfs log extracted from system suffering problems after hard power down:

This is with jfsutils-1.1.15, right? Also, do you see any errors in your syslog for the disks in question?

Christian.
--
BOFH excuse #30: positron router malfunction
From: Fiedler R. <Rom...@ai...> - 2011-04-06 15:50:49
|
> -----Original Message-----
> From: Christian Kujau [mailto:li...@ne...]
> Sent: Tuesday, 5 April 2011 08:06
> To: Fiedler Roman
> Cc: jfs...@li...
> Subject: Re: [Jfs-discussion] jfs issues with power failure, fast reboot
> sequences and disk/fs unrelated kernel freezes
>
> On Thu, 31 Mar 2011 at 11:02, Fiedler Roman wrote:
> > Different observations with various kernels (stock hardy/lucid, so
> > up to 2.6.32) made me wonder, if there are known issues with default jfs
> > settings, general jfs stability or fsck functionality in case of
> > fast/unclean jfs fs shutdown.
>
> Well, there's https://bugzilla.kernel.org and of course the bugtracker on
> SF and also the mailing list archive for "known issues".

I searched the issues, but haven't found a real lead in any one direction yet.

> > * Normal reboot after shutdown or crash, mount works without problems, redo ok
> > * If detected by chance, you might find files, where e.g. ls reports "stale nfs
> > lock", although file is normal jfs file (e.g. /var/log/dmesg.3.gz)
>
> Is NFS involved at all? If not, there was at least another post similar to
> this: http://www.mail-archive.com/jfs-dis...@li.../msg01636.html

The symptoms reported in that message and those observed in different events with Ubuntu lucid are nearly identical and can be summed up as:

* Rapid reboot or hard halt
* No errors with fsck at normal bootup
* Tools report "stale nfs lock", ls looks exactly like the one posted in the message
* Forced fsck fixes errors, but file is lost (sometimes file parts appear in lost+found)

I think it is quite likely that the reporter observed the same problem.

> > Do you have trunk jfs modules for Ubuntu lucid kernel or even a whole trunk kernel
> > with trunk jfs for testing? I could offer to try to reproduce on virtual
>
> JFS is in maintenance-mode, development has stopped for a while now, so
> ../fs/jfs from the Ubuntu kernel should be equal to the latest vanilla
> one.

OK, so Ubuntu should be up to date. I created a virtual instance to try to reproduce it with a 20MB jfs filesystem, but no success so far. I could reproduce the root node corruption reported in another post, but that trashed only the /usr directory and did not lead to stale nfs locks so far.

> > Jfs log extracted from system suffering problems after hard power down:
>
> This is with jfsutils-1.1.15, right?

No, Ubuntu comes with

ii  jfsutils  1.1.12-2.1  utilities for managing the JFS filesystem

> Also, do you see any errors in your syslog for the disks in question?

No, everything seems normal, but I might search the logs again for any anomalies.

Roman
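A small test file system like the 20MB one mentioned here could be prepared roughly as follows. This is a sketch under assumptions, not the setup actually used in this thread: the function names and the image path are hypothetical, and whether mkfs.jfs accepts a regular file directly (instead of a loop device) is an assumption.

```shell
# Create a blank image file of the given size in MiB (default 20).
# Hypothetical helper; path and size are illustrative.
create_blank_image() {
    image=$1
    size_mb=${2:-20}
    dd if=/dev/zero of="$image" bs=1048576 count="$size_mb" 2>/dev/null
}

# Format the image as JFS; -q suppresses the confirmation prompt.
# Requires jfsutils. Formatting a regular file rather than a block device
# is an assumption; attach a loop device first if it is refused.
format_jfs_image() {
    mkfs.jfs -q "$1"
}
```

A typical sequence would be `create_blank_image /tmp/jfs-test.img 20` followed by `format_jfs_image /tmp/jfs-test.img`; attaching the image with losetup and mounting it then requires root.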
From: Christian K. <li...@ne...> - 2011-04-06 16:45:01
|
On Wed, 6 Apr 2011 at 17:50, Fiedler Roman wrote:
> No ubuntu comes with
> ii jfsutils 1.1.12-2.1 utilities for managing the JFS filesystem

Well, can you upgrade then? jfsutils has received quite a few bugfixes since 1.1.12. Not sure if your symptoms are covered, but it's worth a try..?

Christian.
--
BOFH excuse #120: we just switched to FDDI.
From: Fiedler R. <Rom...@ai...> - 2011-04-08 08:39:57
|
> -----Original Message-----
> From: Christian Kujau [mailto:li...@ne...]
> Sent: Wednesday, 6 April 2011 18:45
> To: Fiedler Roman
> Cc: jfs...@li...
> Subject: Re: AW: [Jfs-discussion] jfs issues with power failure, fast reboot
> sequences and disk/fs unrelated kernel freezes
>
> On Wed, 6 Apr 2011 at 17:50, Fiedler Roman wrote:
> > No ubuntu comes with
> > ii jfsutils 1.1.12-2.1 utilities for managing the JFS filesystem
>
> Well, can you upgrade then? jfsutils has received quite a few bugfixes
> since 1.1.12. Not sure if your symptoms are covered, but it's worth a
> try..?

I finally found a reproducer that triggers the fault within approx. 3 reboots on my machine. The normal fsck 1.1.12 does not report any errors, but an inode file with "stale NFS lock" is added to /lost+found. Running the 1.1.15 fsck in normal mode on this fs, already checked with 1.1.12, does not repair the fault. Both versions fix it when run in forced mode.

I also managed to reproduce it after replacing fsck.jfs system-wide with 1.1.15 and managed to capture an example on a 20MB test image. The following two commands report the fault before mounting it when using a loop device.

    if ! fsck.jfs "${loopDev}" || ! jfs_fsck -n "${loopDev}"; then
        echo "Fsck failed!" >&2
        exit 1
    fi

Since the first fsck works, the volume can be mounted. Result:

    ./usr:
    total 0
    drwx------ 3 root root 8 2011-04-08 07:34 .
    drwxr-xr-x 3 root root 8 2011-04-08 07:34 ..
    ?????????? ? ? ? ? ? bin

I will try to reproduce it also without a loop device, to see if this affects the outcome. Should I send you (or someone else interested in debugging) the disk image off-list? I also have an upstart-based example that triggers test reboots via sysrequest; it should be suitable to reproduce it on your systems.

Roman

> Christian.
> --
> BOFH excuse #120:
>
> we just switched to FDDI.
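Entries like the "??????????" line above are names that a directory listing returns but that cannot be stat'ed. A mounted tree could be scanned for them along these lines. This is only a rough sketch: find_unstatable is a hypothetical helper written for this illustration, and it does not handle file names containing newlines.

```shell
# Print every directory entry below the given root whose name is listed by
# its parent directory but which cannot be stat'ed (shown by ls -l as
# "??????????"). Hypothetical helper; file names with newlines are not
# handled correctly.
find_unstatable() {
    find "$1" -type d -print 2>/dev/null | while IFS= read -r dir; do
        ls -A "$dir" 2>/dev/null | while IFS= read -r name; do
            path=$dir/$name
            # -e follows symlinks, -L accepts dangling symlinks, so only
            # entries for which lstat itself fails are reported.
            if ! [ -e "$path" ] && ! [ -L "$path" ]; then
                printf '%s\n' "$path"
            fi
        done
    done
}
```

On a healthy file system `find_unstatable /mnt/test` prints nothing; any path it does print is a candidate for the fault described in this thread.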
From: Fiedler R. <Rom...@ai...> - 2011-04-08 08:54:39
|
Update: Next reproducer run reproduced problem with fsck 1.1.15 on root partition /dev/sda1, seems not dependent on cryptoloop or loop devices at all. |
From: Dave K. <dav...@or...> - 2011-04-08 12:38:21
|
Sorry, I've been unresponsive. I'll carve out some time to look at this.

On 04/08/2011 03:40 AM, Fiedler Roman wrote:
>> -----Original Message-----
>> From: Christian Kujau [mailto:li...@ne...]
>> Sent: Wednesday, 6 April 2011 18:45
>> To: Fiedler Roman
>> Cc: jfs...@li...
>> Subject: Re: AW: [Jfs-discussion] jfs issues with power failure, fast reboot
>> sequences and disk/fs unrelated kernel freezes
>>
>> On Wed, 6 Apr 2011 at 17:50, Fiedler Roman wrote:
>>> No ubuntu comes with
>>> ii jfsutils 1.1.12-2.1 utilities for managing the JFS filesystem
>>
>> Well, can you upgrade then? jfsutils has received quite a few bugfixes
>> since 1.1.12. Not sure if your symptoms are covered, but it's worth a
>> try..?
>
> I finally found a reproducer, that triggers the fault with approx ~3 reboots on my machine. The normal fsck 1.1.12 does not report any errors, but inode file with "stale NFS lock" is added to /lost+found. Running the 1.1.15 fsck on this already 1.1.12 checked fs in normal mode does not repair the fault. Both fix it when running them in forced mode.
>
> I also managed to reproduce it after replacing fsck.jfs system-wide with 1.1.15 and managed to capture an example on a 20MB test image. Following two commands report the fault before mounting it when using a loop device.
>
>     if ! fsck.jfs "${loopDev}" || ! jfs_fsck -n "${loopDev}"; then
>         echo "Fsck failed!" >&2
>         exit 1
>     fi
>
> Since the first fsck works, the volume can be mounted. Result:
>
>     ./usr:
>     total 0
>     drwx------ 3 root root 8 2011-04-08 07:34 .
>     drwxr-xr-x 3 root root 8 2011-04-08 07:34 ..
>     ?????????? ? ? ? ? ? bin
>
> I will try to reproduce it also without loop device, to see if this affects the outcome. Should I send you (or someone else interested in debugging) the disk image off-list? I have also upstart-based example that triggers test reboots via sysrequest, should be suitable to reproduce it on your systems,

Please send me the disk image. Are you actively doing anything on the affected file system when you trigger the reboot?

> Roman
>
>> Christian.
>> --
From: Fiedler R. <Rom...@ai...> - 2011-04-08 13:04:11
|
> -----Original Message-----
> From: Dave Kleikamp [mailto:dav...@or...]
> Sent: Friday, 8 April 2011 14:38
> To: Fiedler Roman
> Cc: Christian Kujau; jfs...@li...
> Subject: Re: [Jfs-discussion] jfs issues with power failure, fast reboot
> sequences and disk/fs unrelated kernel freezes
...
> > I will try to reproduce it also without loop device, to see if this
> > affects the outcome. Should I send you (or someone else interested in
> > debugging) the disk image off-list? I have also upstart-based example
> > that triggers test reboots via sysrequest, should be suitable to
> > reproduce it on your systems,
>
> Please send me the disk image. Are you actively doing anything on the
> affected file system when you trigger the reboot?

I'll send you the image off-list.

Yes, I was expanding a tar onto the disk when triggering the hard reboot (without shutdown). This seems to increase the rate of failure. But a normal shutdown might also cause similar problems when the ACPI powerdown is done quickly, perhaps before all caches were written to disk successfully. I have had two occurrences where a normal shutdown of a system with disk encryption caused data corruption in the dpkg state list, which was not modified during that session. Probably the last modifications of syslog et al. before the unmounts did not make it to the disk cleanly.

By the way, since this can be reproduced at least on Ubuntu lucid, I filed a bug report there to have a first identifier: https://bugs.launchpad.net/ubuntu/+source/jfsutils/+bug/754495

Roman
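One cycle of the procedure described here (unpack a tar archive, then reboot hard while it is still writing) might look roughly like the sketch below. It is not the reproducer attached to the bug report: the function name, the DRY_RUN switch and the delay are hypothetical. By default it only prints the commands; with DRY_RUN=0 it needs root and should only ever be run on a disposable test machine.

```shell
# Print (default) or execute the steps of one crash cycle: unpack a tar
# archive onto the file system under test and force an immediate reboot
# while the unpack is still running.
# Hypothetical sketch, not the reproducer from the bug report.
# Set DRY_RUN=0 and run as root only on a disposable test machine.
crash_during_untar() {
    archive=$1
    target=$2
    delay=${3:-5}
    if [ "${DRY_RUN:-1}" != 0 ]; then
        printf '%s\n' \
            "tar -xf $archive -C $target &" \
            "sleep $delay" \
            "echo 1 > /proc/sys/kernel/sysrq" \
            "echo b > /proc/sysrq-trigger"
        return 0
    fi
    tar -xf "$archive" -C "$target" &
    sleep "$delay"
    echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
    echo b > /proc/sysrq-trigger      # reboot at once: no sync, no unmount
}
```

Writing `b` to /proc/sysrq-trigger reboots the machine immediately without syncing or unmounting, which approximates the hard reboot without shutdown described in this message.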
From: Fiedler R. <Rom...@ai...> - 2011-04-22 12:37:44
|
Hello List,

Has someone already tried the reproducer, also added to Launchpad bug report?

https://bugs.launchpad.net/ubuntu/+source/jfsutils/+bug/754495

Did it work causing dataloss with just a few reboots or did it not? With this setup in a vbox virtual machine, jfs is much more vulnerable to this than ext3.

Kind regards,
Roman
From: Dave K. <dav...@or...> - 2011-04-22 12:47:15
|
On 04/22/2011 07:37 AM, Fiedler Roman wrote:
> Hello List,
>
> Has someone already tried the reproducer, also added to Launchpad bug
> report?

Sorry. I haven't done anything with this yet. I'll try to get to it soon.

> https://bugs.launchpad.net/ubuntu/+source/jfsutils/+bug/754495
>
> Did it work causing dataloss with just a few reboots or did it not?
> With this setup in a vbox virtual machine, jfs is much more
> vulnerable to this than ext3.
>
> Kind regards, Roman

Shaggy
From: Fiedler R. <Rom...@ai...> - 2011-05-05 16:15:30
|
Hi,

While looking into another issue, I stumbled over the Ubuntu shutdown scripts in lucid and found that the standard shutdown does not issue a sync in S60umountroot. Does the remount,ro trigger a sync automatically, or is the sync missing?

Kind regards,
Roman
From: Dave K. <dav...@or...> - 2011-05-05 16:51:09
|
On 05/05/2011 11:15 AM, Fiedler Roman wrote:
> Hi,
>
> It happened that I stumbled over the ubuntu shutdown scripts in lucid
> while looking on another issue and found out, that standard shutdown
> does not issue a sync in S60umountroot. Does the remount,ro trigger a
> sync automatically or is the sync missing?

Looking at the code, all of the metadata gets written, but I'm not seeing anything that guarantees that the file data makes it to disk. I need to see if any other file systems explicitly sync the data. I'm not sure if this is a bug in jfs or not at this point.

>
> Kind regards, Roman

Shaggy
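Until it is clear whether remount,ro flushes file data on jfs, a shutdown step could issue the sync explicitly before the remount. The sketch below only illustrates that ordering; it is not the content of Ubuntu's S60umountroot script, and the function name and DRY_RUN switch are hypothetical.

```shell
# Flush dirty data explicitly, then remount the file system read-only.
# Prints the commands unless DRY_RUN=0. Hypothetical sketch; this is not
# the content of Ubuntu's S60umountroot script.
sync_then_remount_ro() {
    mountpoint=${1:-/}
    if [ "${DRY_RUN:-1}" != 0 ]; then
        printf '%s\n' "sync" "mount -o remount,ro $mountpoint"
        return 0
    fi
    sync
    mount -o remount,ro "$mountpoint"
}
```

Whether the extra sync actually changes the outcome on jfs is exactly the open question in this message, so this is a workaround to test rather than a confirmed fix.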
From: Fiedler R. <Rom...@ai...> - 2011-04-22 12:58:15
|
> -----Original Message-----
> From: Dave Kleikamp [mailto:dav...@or...]
> Sent: Friday, 22 April 2011 14:47
>
> On 04/22/2011 07:37 AM, Fiedler Roman wrote:
> > Hello List,
> >
> > Has someone already tried the reproducer, also added to Launchpad bug
> > report?
>
> Sorry. I haven't done anything with this yet. I'll try to get to it soon.

No problem. I was just curious. I had one more event on a lucid machine where the Xorg log file entry got corrupted during a normal reboot sequence, so that X startup caused jfs to become read-only. So I'm just interested in whether someone else can reproduce it as well, or whether it is just some error in using the filesystem and fsck on my side.

Apart from that, I started to change the automatic deployment default from jfs to ext3/ext4 to get good statistical data on whether the ext variants are really not affected that much.

Roman

> > https://bugs.launchpad.net/ubuntu/+source/jfsutils/+bug/754495
> >
> > Did it work causing dataloss with just a few reboots or did it not?
> > With this setup in a vbox virtual machine, jfs is much more
> > vulnerable to this than ext3.
> >
> > Kind regards, Roman
>
> Shaggy
From: Fiedler R. <Rom...@ai...> - 2011-05-05 16:38:11
|
> -----Original Message-----
> From: Sandon Van Ness [mailto:sa...@va...]
>
> Pretty darn sure that the linux kernel itself does a sync before a
> reboot/powerdown as the last things I always see in the serial console
> is syncing to sda/sdb/etc right before Power down (when its not even
> doing shut down scripts anymore). It does a sync even with a reboot -f
> (which bypasses the normal shutdown process).

OK, so when running jfs on a native partition, this should write the data down to the physical media. Could it be a problem with md or lvm between jfs and the physical device?

> On 05/05/2011 09:15 AM, Fiedler Roman wrote:
> > Hi,
> >
> > It happened that I stumbled over the ubuntu shutdown scripts in lucid
> > while looking on another issue and found out, that standard shutdown does
> > not issue a sync in S60umountroot. Does the remount,ro trigger a sync
> > automatically or is the sync missing?
> >
> > Kind regards,
> > Roman
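One way to check from user space whether data is still waiting to be written before a shutdown is to look at the Dirty and Writeback counters in /proc/meminfo. The helper below is a sketch for that; pending_writeback_kb is a hypothetical name, and the counters only describe the page cache, so they say nothing about a drive's own write cache or about what md or lvm do underneath.

```shell
# Print the amount of dirty plus in-flight page cache data in kB, as
# reported by the Dirty: and Writeback: lines of /proc/meminfo (Linux).
# The optional argument lets the function read a saved copy instead.
# Hypothetical helper for illustration.
pending_writeback_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { total += $2 }
         END { print total + 0 }' "${1:-/proc/meminfo}"
}
```

Running it before and after a sync in a test shutdown script would show whether the sync reduced the pending amount; a value that stays high after the sync would point at the layers above the disk rather than at the disk itself.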