Thread: [Jfs-discussion] first experiences with JFS
Brought to you by:
blaschke-oss,
shaggyk
From: <pg...@jf...> - 2005-09-14 22:41:34
|
Having had some time and feeling patient I have recently gone thru a few days of Linux IO subsystem and filesystem analysis and testing, as summarized (with lots of numbers and references) here: http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050906 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050907 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050908 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050909 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050910 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050911 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050912 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050913 http://WWW.sabi.co.UK/Notes/swhwAnno05.html#050914 and as a result of such tests I have decided to switch my filesystems to JFS to see how much performance degrades with time and files get deleted/added/rewritten. The good news is that JFS seemed to me one of the two (with 'ext3' with 1KiB blocks) most desirable choices, and it has turned out to have some unexpected boons too. The bad news is that I have already suffered from several crashes and one bizarre performance problem... My setup consists of an Athlon Xp 2000+, 512MB, 2x80GB and 2x160GB hard discs, running a mainline 2.6.13 kernel, with 1.1.18 'jfsprogs'. The incidents so far: * Some of my tests were tree traversals, that generate a flood of inode updates because, which hit the journal hard. So I wondered what would the timings be with '-o noatime', unfortunately I got a crash because of that. * When converting from 'ext3' to JFS file systems, I did this by copying things around, and I got a couple of lockups. It may be that these were related to high buffer cache traffic (I was doing a large 'dd' between partitions at one time) and races thereof. * When restoring a '.tar.bz2' held on a 'vfat' file system to a newly formatted 'jfs' one I got a dtree corruption, with no device errors. I 'fsck'ed it to fix that and redid the restore and it did not happen again. There was again a 'dd' between two partitions running at the same time. * Making a file system with a 30MiB log instead of the default 32MiB makes reading it with 'tar' over twice as slow. This for the same partition on the same hard disc with the same content freshly loaded (it was so strange I checked several times). All which leads to think that not many people have used non default log sizes, or used JFS with FAT32 or massive 'dd'ing, or with 'noatime'... :-) Some more context and some data... I was in multiuser but not GUI mode when the incidents above happened, with only a few d=E6mons running. The output of 'jfs=5Ffsck' after the =ABDT=5FGETPAGE: dtree page corrupt=BB errors: ---------------------------------------------------------------- jfs=5Ffsck version 1.1.8, 03-May-2005 processing started: 9/14/2005 18.52.55 Using default parameter: -p The current device is: /dev/hdb11 Block size in bytes: 4096 Filesystem size in blocks: 1028152 **Phase 0 - Replay Journal Log **Phase 1 - Check Blocks, Files/Directories, and Directory Entries **Phase 2 - Count links Incorrect link counts have been detected. Will correct. **Phase 3 - Duplicate Block Rescan and Directory Connectedness **Phase 4 - Report Problems File system object DF20499 is linked as: /var cannot repair the data format error(s) in this directory. cannot repair DF20499. Will release. File system object DF20512 is linked as: /dev/ida cannot repair DF20512. Will release. **Phase 5 - Check Connectivity **Phase 6 - Perform Approved Corrections 768 files reconnected to /lost+found/. **Phase 7 - Rebuild File/Directory Allocation Maps **Phase 8 - Rebuild Disk Allocation Maps 4112608 kilobytes total disk space. 35465 kilobytes in 13747 directories. 2756001 kilobytes in 135907 user files. 0 kilobytes in extended attributes 100356 kilobytes reserved for system use. 1291716 kilobytes are available for use. Filesystem is clean. ---------------------------------------------------------------- The one ''oops'' that got logged (it happened twice): ---------------------------------------------------------------- Unable to handle kernel paging request at virtual address cc05b9a4 printing eip: c0251f5d *pde =3D 00030067 *pte =3D 0c05b000 Oops: 0000 [#1] DEBUG=5FPAGEALLOC Modules linked in: binfmt=5Fmisc snd=5Fcmipci snd=5Fopl3=5Flib snd=5Fhw= dep snd=5Fseq=5Foss snd=5Fseq=5Fmidi snd=5Fseq=5Fmidi=5Fevent snd=5Fseq= snd=5Fvia82xx gameport snd=5Fac97=5Fcodec snd=5Fpcm=5Foss snd=5Fmixer=5F= oss snd=5Fpcm snd=5Ftimer snd=5Fpage=5Falloc snd=5Fmpu401=5Fuart snd=5F= rawmidi snd=5Fseq=5Fdevice snd soundcore 3c59x mii parport=5Fpc lp parp= ort video thermal processor fan container button battery ac it87 eeprom= i2c=5Fsensor i2c=5Fisa i2c=5Fdev i2c=5Fcore ntfs nls=5Fiso8859=5F1 nls= =5Fcp437 sg sr=5Fmod ide=5Fscsi scsi=5Fmod 8250 serial=5Fcore nvram rtc= CPU: 0 EIP: 0060:[txUpdateMap+333/656] Not tainted VLI EFLAGS: 00010246 (2.6.13p)=20 EIP is at txUpdateMap+0x14d/0x290 eax: cc05b97c ebx: e0996990 ecx: e08366c8 edx: 00000900 esi: 00000001 edi: e0996980 ebp: dfdc7f48 esp: dfdc7f10 ds: 007b es: 007b ss: 0068 Process jfsCommit (pid: 139, threadinfo=3Ddfdc7000 task=3Dc15725d0) Stack: e084be30 0000060c dfdc7f48 c024f181 00000000 00000040 d94596fc d= befc2fc=20 00000202 00000000 00000000 dc64d160 e08366c8 e08366c8 dfdc7f74 c= 02529b2=20 e08366c8 00000286 e0861514 dfdc7fe4 00000000 0000007b 0000007b d= c64d160=20 Call Trace: [show=5Fstack+127/160] show=5Fstack+0x7f/0xa0 [show=5Fregisters+343/448] show=5Fregisters+0x157/0x1c0 [die+332/688] die+0x14c/0x2b0 [do=5Fpage=5Ffault+921/1791] do=5Fpage=5Ffault+0x399/0x6ff [error=5Fcode+79/84] error=5Fcode+0x4f/0x54 [txLazyCommit+34/688] txLazyCommit+0x22/0x2b0 [jfs=5Flazycommit+844/1200] jfs=5Flazycommit+0x34c/0x4b0 [kernel=5Fthread=5Fhelper+5/16] kernel=5Fthread=5Fhelper+0x5/0x10 Code: f6 47 04 02 0f 85 4f 01 00 00 8d 5f 10 0f b6 43 03 85 c0 74 4d 89= c6 8d b4 26 00 00 00 00 f6 43 04 f0 0f 85 16 01 00 00 8b 47 0c <0f> b7= 40 28 25 00 f0 00 00 3d 00 40 00 00 0f 84 ef 00 00 00 8b=20 ---------------------------------------------------------------- Unable to handle kernel paging request at virtual address c2cc8804 printing eip: c0251f5d *pde =3D 0000b067 *pte =3D 02cc8000 Oops: 0000 [#1] DEBUG=5FPAGEALLOC Modules linked in: videodev loop binfmt=5Fmisc snd=5Fcmipci snd=5Fopl3=5F= lib snd=5Fhwdep snd=5Fseq=5Foss snd=5Fseq=5Fmidi snd=5Fseq=5Fmidi=5Feve= nt snd=5Fseq snd=5Fvia82xx gameport snd=5Fac97=5Fcodec snd=5Fpcm=5Foss = snd=5Fmixer=5Foss snd=5Fpcm snd=5Ftimer snd=5Fpage=5Falloc snd=5Fmpu401= =5Fuart snd=5Frawmidi snd=5Fseq=5Fdevice snd soundcore 3c59x mii parpor= t=5Fpc lp parport video thermal processor fan container button battery = ac it87 eeprom i2c=5Fsensor i2c=5Fisa i2c=5Fdev i2c=5Fcore ntfs nls=5Fi= so8859=5F1 nls=5Fcp437 sg sr=5Fmod ide=5Fscsi scsi=5Fmod 8250 serial=5F= core nvram rtc CPU: 0 EIP: 0060:[txUpdateMap+333/656] Not tainted VLI EFLAGS: 00010246 (2.6.13p)=20 EIP is at txUpdateMap+0x14d/0x290 eax: c2cc87dc ebx: e08b6b10 ecx: e0828700 edx: 00000900 esi: 00000001 edi: e08b6b00 ebp: dfdc7f48 esp: dfdc7f10 ds: 007b es: 007b ss: 0068 Process jfsCommit (pid: 139, threadinfo=3Ddfdc7000 task=3Dc15725d0) Stack: e0879094 00000b85 dfdc7f48 c024f181 00000000 00000040 db26571c d= 9c48cfc=20 00000206 00000000 00000000 d3ea84c0 e0828700 e0828700 dfdc7f74 c= 02529b2=20 e0828700 00000000 b2f1fb80 00989dee c04d0c60 c157270c dfdc7000 d= 3ea84c0=20 Call Trace: [show=5Fstack+127/160] show=5Fstack+0x7f/0xa0 [show=5Fregisters+343/448] show=5Fregisters+0x157/0x1c0 [die+332/688] die+0x14c/0x2b0 [do=5Fpage=5Ffault+921/1791] do=5Fpage=5Ffault+0x399/0x6ff [error=5Fcode+79/84] error=5Fcode+0x4f/0x54 [txLazyCommit+34/688] txLazyCommit+0x22/0x2b0 [jfs=5Flazycommit+844/1200] jfs=5Flazycommit+0x34c/0x4b0 [kernel=5Fthread=5Fhelper+5/16] kernel=5Fthread=5Fhelper+0x5/0x10 Code: f6 47 04 02 0f 85 4f 01 00 00 8d 5f 10 0f b6 43 03 85 c0 74 4d 89= c6 8d b4 26 00 00 00 00 f6 43 04 f0 0f 85 16 01 00 00 8b 47 0c <0f> b7= 40 28 25 00 f0 00 00 3d 00 40 00 00 0f 84 ef 00 00 00 8b=20 ---------------------------------------------------------------- Sorry for the relative lack of details, I hope that there is enough to start an investigation. |
From: Per J. <pe...@co...> - 2005-09-15 12:41:26
|
Peter Grandi wrote: > All which leads to think that not many people have used non > default log sizes, or used JFS with FAT32 or massive 'dd'ing, or > with 'noatime'... :-) Nope, haven't tried any of the above - for copying filesystems around, I tend to use rsync instead of dd. /Per Jessen, Z=FCrich |
From: <pg...@jf...> - 2005-09-16 14:30:10
|
[ ... ] >> All which leads to think that not many people have used non >> default log sizes, or used JFS with FAT32 or massive 'dd'ing, >> or with 'noatime'... :-) per> Nope, haven't tried any of the above Not surprising... :-). per> - for copying filesystems around, I tend to use rsync per> instead of dd. RSYNC is a bit slow though, especially same machine. I'll list some faster alternatives in case someone reading this may benefit... Most of the time I do partition image (disc-to-disc) backups, and actually using something a bit faster than 'dd', roughly equivalent to a double 'dd' like: dd if=/dev/hda6 bs=4k | dd bs=4k of=/dev/hdc6 for example. This gives me something like 25MiB/s sustained, which is not bad (around 15MiB sustained over FireWire); it also elegantly bypasses encryption layers etc, which are often slow. For full file-by-file copy I prefer the classic double 'tar': (cd /mnt/hda6 && tar -cS --one -b 8 -f - .) \ | (cd /mnt/hdc6 && tar -xS -p -b 8 -f -) with some GNU 'tar' specific options, and which usually gives me (on filesystems that are not too used0 something like 15-20MiB/s on ATA. After either an image or 'tar' backup one can use RSYNC, especially in '-c' (checksum) mode to verify the data. Incremental/differential backups can then be done if preferred using something like http://WWW.rsnapshot.com/ which is based on RSYNC. Sorry if I have just preached to the choir here :-). |
From: Per J. <pe...@co...> - 2005-09-17 11:02:07
|
Peter Grandi wrote: > per> - for copying filesystems around, I tend to use rsync > per> instead of dd. >=20 > RSYNC is a bit slow though, especially same machine. =20 I agree - I tend to go for rsync for large amounts of copying, when being able to restart can be really handy. > Incremental/differential backups can then be done if preferred > using something like http://WWW.rsnapshot.com/ which is based on > RSYNC. http://www.rsnapshot.org - and yep, I use this too.=20 /Per Jessen, Z=FCrich |
From: Dave K. <sh...@au...> - 2005-09-15 13:40:15
|
I'm kind of buried in work right now, but your problems appear serious, so I'm trying to respond as quick as I can. On Wed, 2005-09-14 at 23:41 +0100, Peter Grandi wrote: > The bad news is that I have already suffered from several > crashes and one bizarre performance problem... My setup > consists of an Athlon Xp 2000+, 512MB, 2x80GB and 2x160GB hard > discs, running a mainline 2.6.13 kernel, with 1.1.18 'jfsprogs'. >=20 > The incidents so far: >=20 > * Some of my tests were tree traversals, that generate a flood of > inode updates because, which hit the journal hard. So I wondered > what would the timings be with '-o noatime', unfortunately I > got a crash because of that. noatime is used a lot, I don't think noatime was a direct cause of the problem, but it might have affected memory reuse (as inodes should be easier to reclaim if they aren't being marked dirty). More on the crash below. > * When converting from 'ext3' to JFS file systems, I did this by > copying things around, and I got a couple of lockups. It may > be that these were related to high buffer cache traffic (I was > doing a large 'dd' between partitions at one time) and races > thereof. No idea what could be happening here. If you could capture a stack trace of the processes, it may give me a clue what's going on. 'echo t > /proc/sysrq-trigger' should dump the stack traces to the syslog. > * When restoring a '.tar.bz2' held on a 'vfat' file system to a > newly formatted 'jfs' one I got a dtree corruption, with no > device errors. I 'fsck'ed it to fix that and redid the restore > and it did not happen again. There was again a 'dd' between > two partitions running at the same time. I had another recent report of dtree corruption that I wasn't sure of the cause. I suspected it might be related to a case-insensitivity bug, but that couldn't be the case here. (I highly doubt that you ran mkfs -O to create the partition.) > * Making a file system with a 30MiB log instead of the default > 32MiB makes reading it with 'tar' over twice as slow. This for > the same partition on the same hard disc with the same content > freshly loaded (it was so strange I checked several times). You're right that this is strange. If you are running with noatime, the journal shouldn't be a factor at all when reading the volume. This one really puzzles me. > All which leads to think that not many people have used non > default log sizes, or used JFS with FAT32 or massive 'dd'ing, or > with 'noatime'... :-) Again I think that noatime is pretty common. I use it. > Some more context and some data... I was in multiuser but not > GUI mode when the incidents above happened, with only a few > d=E6mons running. >=20 > The output of 'jfs_fsck' after the =ABDT_GETPAGE: dtree page > corrupt=BB errors: If you see this again, run jfs_fsck with the -v flag. That may give me a better idea of the nature of the dtree corruption. >=20 > The one ''oops'' that got logged (it happened twice): >=20 > ---------------------------------------------------------------- > Unable to handle kernel paging request at virtual address cc05b9a4 > printing eip: > c0251f5d > *pde =3D 00030067 > *pte =3D 0c05b000 > Oops: 0000 [#1] > DEBUG_PAGEALLOC > Modules linked in: binfmt_misc snd_cmipci snd_opl3_lib snd_hwdep snd_se= q_oss snd_seq_midi snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac= 97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_m= pu401_uart snd_rawmidi snd_seq_device snd soundcore 3c59x mii parport_pc = lp parport video thermal processor fan container button battery ac it87 e= eprom i2c_sensor i2c_isa i2c_dev i2c_core ntfs nls_iso8859_1 nls_cp437 sg= sr_mod ide_scsi scsi_mod 8250 serial_core nvram rtc > CPU: 0 > EIP: 0060:[txUpdateMap+333/656] Not tainted VLI > EFLAGS: 00010246 (2.6.13p)=20 > EIP is at txUpdateMap+0x14d/0x290 > eax: cc05b97c ebx: e0996990 ecx: e08366c8 edx: 00000900 > esi: 00000001 edi: e0996980 ebp: dfdc7f48 esp: dfdc7f10 > ds: 007b es: 007b ss: 0068 > Process jfsCommit (pid: 139, threadinfo=3Ddfdc7000 task=3Dc15725d0) > Stack: e084be30 0000060c dfdc7f48 c024f181 00000000 00000040 d94596fc d= befc2fc=20 > 00000202 00000000 00000000 dc64d160 e08366c8 e08366c8 dfdc7f74 c= 02529b2=20 > e08366c8 00000286 e0861514 dfdc7fe4 00000000 0000007b 0000007b d= c64d160=20 > Call Trace: > [show_stack+127/160] show_stack+0x7f/0xa0 > [show_registers+343/448] show_registers+0x157/0x1c0 > [die+332/688] die+0x14c/0x2b0 > [do_page_fault+921/1791] do_page_fault+0x399/0x6ff > [error_code+79/84] error_code+0x4f/0x54 > [txLazyCommit+34/688] txLazyCommit+0x22/0x2b0 > [jfs_lazycommit+844/1200] jfs_lazycommit+0x34c/0x4b0 > [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 > Code: f6 47 04 02 0f 85 4f 01 00 00 8d 5f 10 0f b6 43 03 85 c0 74 4d 89= c6 8d b4 26 00 00 00 00 f6 43 04 f0 0f 85 16 01 00 00 8b 47 0c <0f> b7 4= 0 28 25 00 f0 00 00 3d 00 40 00 00 0f 84 ef 00 00 00 8b=20 This was enough to tell me what's going on. txUpdateMap should not be accessing tlck->ip, since it may no longer be valid. I think DEBUG_PAGEALLOC helped uncover this bug. It shouldn't be too hard to fix, but it isn't trivial either. I'll try to get a patch to you soon. > Sorry for the relative lack of details, I hope that there is > enough to start an investigation. Thanks this was helpful. It was enough to discover one bug, and hopefully we can make some progress on the other ones. Thanks, Shaggy --=20 David Kleikamp IBM Linux Technology Center |
From: Dave K. <sh...@au...> - 2005-09-16 04:16:27
|
On Thu, 2005-09-15 at 08:38 -0500, Dave Kleikamp wrote: > This was enough to tell me what's going on. txUpdateMap should not be > accessing tlck->ip, since it may no longer be valid. I think > DEBUG_PAGEALLOC helped uncover this bug. It shouldn't be too hard to > fix, but it isn't trivial either. I'll try to get a patch to you soon. Here's the patch. It turned out to be pretty easy. This should fix the oops. ================================================ JFS: don't dereference tlck->ip from txUpdateMap The inode pointer may no longer be valid Signed-off-by: Dave Kleikamp <sh...@au...> diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c --- a/fs/jfs/jfs_txnmgr.c +++ b/fs/jfs/jfs_txnmgr.c @@ -725,6 +725,9 @@ struct tlock *txLock(tid_t tid, struct i else tlck->flag = tlckINODELOCK; + if (S_ISDIR(ip->i_mode)) + tlck->flag |= tlckDIRECTORY; + tlck->type = 0; /* bind the tlock and the page */ @@ -1009,6 +1012,8 @@ struct tlock *txMaplock(tid_t tid, struc /* bind the tlock and the object */ tlck->flag = tlckINODELOCK; + if (S_ISDIR(ip->i_mode)) + tlck->flag |= tlckDIRECTORY; tlck->ip = ip; tlck->mp = NULL; @@ -1077,6 +1082,8 @@ struct linelock *txLinelock(struct linel linelock->flag = tlckLINELOCK; linelock->maxcnt = TLOCKLONG; linelock->index = 0; + if (tlck->flag & tlckDIRECTORY) + linelock->flag |= tlckDIRECTORY; /* append linelock after tlock */ linelock->next = tlock->next; @@ -2358,7 +2365,7 @@ static void txUpdateMap(struct tblock * */ else { /* (maplock->flag & mlckFREE) */ - if (S_ISDIR(tlck->ip->i_mode)) + if (tlck->flag & tlckDIRECTORY) txFreeMap(ipimap, maplock, tblk, COMMIT_PWMAP); else diff --git a/fs/jfs/jfs_txnmgr.h b/fs/jfs/jfs_txnmgr.h --- a/fs/jfs/jfs_txnmgr.h +++ b/fs/jfs/jfs_txnmgr.h @@ -122,6 +122,7 @@ extern struct tlock *TxLock; /* transact #define tlckLOG 0x0800 /* updateMap state */ #define tlckUPDATEMAP 0x0080 +#define tlckDIRECTORY 0x0040 /* freeLock state */ #define tlckFREELOCK 0x0008 #define tlckWRITEPAGE 0x0004 |
From: <pg...@jf...> - 2005-09-18 21:10:30
|
>>> On Thu, 15 Sep 2005 08:38:16 -0500, Dave Kleikamp >>> <sh...@au...> said: [ ... ] >> * Making a file system with a 30MiB log instead of the default >> 32MiB makes reading it with 'tar' over twice as slow. This >> for the same partition on the same hard disc with the same >> content freshly loaded (it was so strange I checked several >> times). shaggy> You're right that this is strange. If you are running shaggy> with noatime, the journal shouldn't be a factor at all shaggy> when reading the volume. This one really puzzles me. I wasn't running with 'noatime', but not to worry. I think it was due to some ATA chipset driver issue, because I have tried to reproduce it later and it did not happen again. However from various tests it looks that when reloading a filesystem from fresh the size of the log seems to have a _small_ (e.g. 10%) influence on how ''fast'' rereading the contents back is. If the log is longer perhaps allocations are more contiguous or whatever... [ ... ] |
From: <pg...@jf...> - 2005-09-16 14:30:05
|
>>> On Thu, 15 Sep 2005 08:38:16 -0500, Dave Kleikamp >>> <sh...@au...> said: shaggy> I'm kind of buried in work right now, but your problems shaggy> appear serious, so I'm trying to respond as quick as I shaggy> can. Not that serious, things mostly work (and fairly well), I just get the rather occasional crash. Real corruption would be more serious. No need to hurry, it is also just my home PC :-). >> * Some of my tests were tree traversals, [ ... ] the timings >> be with '-o noatime', unfortunately I got a crash because >> of that. shaggy> noatime is used a lot, I don't think noatime was a shaggy> direct cause of the problem, [ ... ] I concur. I have tried to reproduce it subsequently but did not happen again, but then I now remember I did this on an empty filesystem being freshly loaded, so perhaps it is a corner case. I'll do another test. >> * When converting from 'ext3' to JFS file systems, I did this >> by copying things around, and I got a couple of lockups. It >> may be that these were related to high buffer cache traffic >> (I was doing a large 'dd' between partitions at one time) >> and races thereof. shaggy> No idea what could be happening here. If you could shaggy> capture a stack trace of the processes, it may give me a shaggy> clue what's going on. 'echo t > /proc/sysrq-trigger' shaggy> should dump the stack traces to the syslog. Looks like a semi-hard lockup, but I will try to see what I can do, and perhaps run a kernel debugger. Semi-hard means I can ping the machine from a laptop, but nothing else works, and the screen and keyboard are dead. [ ... ] shaggy> I had another recent report of dtree corruption [ ... ] shaggy> (I highly doubt that you ran mkfs -O to create the shaggy> partition.) I didn't so it should not be that. BTW, that '-O' is a really useful and just about unique feature of JFS, and perhaps it should be mentioned more prominently. http://OSS.SGI.com/projects/xfs/papers/ukuug2003.pdf =ABFUTURE * Case insensitive support. Big speedup for Samba=BB http://lists.Samba.org/archive/samba/2004-August/090489.html =AB> I'm wondering what issues / problems I might run into if > I'm using samba 3 on linux with the IBM JFS filesystem > installed to be case-insensitive. I've had no luck > finding anyone who says they've tried it. It should be really fast :-). Just set case sensitive =3D yes in the [global] section and let the filesystem sort it out :-).=BB >> * Making a file system with a 30MiB log instead of the >> default 32MiB makes reading it with 'tar' over twice as >> slow. This for the same partition on the same hard disc >> with the same content freshly loaded (it was so strange I >> checked several times). shaggy> You're right that this is strange. If you are running shaggy> with noatime, the journal shouldn't be a factor at all shaggy> when reading the volume. This one really puzzles me. But I stopped using 'noatime' after the first attempt, as it had caused trouble. BTW I should add I used 'jfs=5Fdebugfs' 'su p' and the output was nearly identical, that is except for the size and related values for the log. I'll try to reproduce it. [ ... dtree trouble ... ] shaggy> If you see this again, run jfs=5Ffsck with the -v flag. shaggy> That may give me a better idea of the nature of the shaggy> dtree corruption. OK, that sounds good. [ ... ] shaggy> I think DEBUG=5FPAGEALLOC helped uncover this bug. Ha, as to this I have almost all kernel selfchecking options enabled all the time. The slowdown is not large and I feel a lot better even if it is just my home PC :-). Just about the only selfcheck I have disabled is the slab allocator selfchecks, because while in 2.4 they did not cost a lot, in 2.6 they are very slow. Oh well. Would it help and not be too intrusive for my daily use if I enabled the JFS debugging=3F |
From: <pg...@jf...> - 2005-09-16 14:39:02
|
>>> On Fri, 16 Sep 2005 15:16:15 +0100, >>> pg...@jf... (Peter Grandi) said: [ ... ] pg_jfs> BTW, that '-O' is a really useful and just about unique pg_jfs> feature of JFS, and perhaps it should be mentioned more pg_jfs> prominently. In case anybody on this list wondered what is the big deal with '-O' and Samba, there is a clear discussion here: http://Samba.org/ftp/tridge/talks/vfs.pdf in particular page/slide 7. [ ... ] |
From: Dave K. <sh...@au...> - 2005-09-16 14:46:54
|
On Fri, 2005-09-16 at 15:16 +0100, Peter Grandi wrote: > Would it help and not be too intrusive for my daily use if I > enabled the JFS debugging? Please do enable it. It is not very intrusive. About the only case where it can be significant is if the file system fills up. It may do a lot of printks when allocations fail. -- David Kleikamp IBM Linux Technology Center |
From: Jon N. <jne...@ja...> - 2005-09-16 15:17:50
|
On Fri, 16 Sep 2005, Peter Grandi wrote: > >>> On Thu, 15 Sep 2005 08:38:16 -0500, Dave Kleikamp > >>> <sh...@au...> said: > > shaggy> noatime is used a lot, I don't think noatime was a > shaggy> direct cause of the problem, [ ... ] > > I concur. I have tried to reproduce it subsequently but did not > happen again, but then I now remember I did this on an empty > filesystem being freshly loaded, so perhaps it is a corner case. > I'll do another test. On one machine with 11 JFS filesystems they /all/ have noatime. The filesystems go back to SuSE 9.1 and I ran 9.2 and I run 9.3. I'll be running 10.0 even though they diss JFS. My other machines also use JFS almost exclusively (I usually use ext2 or 3 for /boot). I've noticed quite a few more bug reports than usual for JFS since 2.6.13 came out. -- Jon Nelson <jne...@ja...> |
From: <pg...@jf...> - 2005-09-16 14:58:58
|
>>> On Fri, 16 Sep 2005 15:16:15 +0100, >>> pg...@jf... (Peter Grandi) said: [ ... ] >>> * Some of my tests were tree traversals, [ ... ] the timings >>> be with '-o noatime', unfortunately I got a crash because >>> of that. shaggy> noatime is used a lot, I don't think noatime was a shaggy> direct cause of the problem, [ ... ] pg_jfs> I concur. I have tried to reproduce it subsequently but pg_jfs> did not happen again, but then I now remember I did this pg_jfs> on an empty filesystem being freshly loaded, so perhaps pg_jfs> it is a corner case. I'll do another test. Spoke too soon, it happened again, and I got a backtrace (I haven't applied the patch you just sent me yet). It does not look directly JFS related. I was doing tests with various log sizes (strange results still), with and without 'noatime' and with the 'cfq' or the 'anticipatory' elevators, having switched to 'anticipatory' a bit earlier, and post-boot elevator switching seems a bit buggy. But it is curious that it happened as I specified 'noatime' and not without. ---------------------------------------------------------------- Unable to handle kernel NULL pointer dereference at virtual address 0000000c printing eip: c02f2ecf *pde = 00000000 Oops: 0000 [#1] DEBUG_PAGEALLOC Modules linked in: videodev loop binfmt_misc snd_cmipci snd_opl3_lib snd_hwdep snd_seq_oss snd_seq_midi snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 3c59x mii parport_pc lp parport video thermal processor fan container button battery ac it87 eeprom i2c_sensor i2c_isa i2c_dev i2c_core ntfs nls_iso8859_1 nls_cp850 sg sr_mod ide_scsi scsi_mod 8250 serial_core nvram rtc CPU: 0 EIP: 0060:[as_queue_empty+15/64] Not tainted VLI EFLAGS: 00210086 (2.6.13p) EIP is at as_queue_empty+0xf/0x40 eax: 0000000c ebx: c05e0c58 ecx: c02f2ec0 edx: 00000000 esi: c05e0ee0 edi: c05e0ee0 ebp: c3436cbc esp: c3436cbc ds: 007b es: 007b ss: 0068 Process mount (pid: 20241, threadinfo=c3436000 task=daeccb10) Stack: c3436cc8 c02e7b16 dfdf47d8 c3436d00 c0310842 dfdf47d8 c3436ce4 00000000 d7e9435c c11ba940 00200086 dfdf45fc c3436dc8 c14027e0 dfdf45fc c3436dc8 c14027e0 c3436d10 c0310cd6 dfdef1e0 ffffffff c3436d20 c02ea649 dfdf45fc Call Trace: [show_stack+127/160] show_stack+0x7f/0xa0 [show_registers+343/448] show_registers+0x157/0x1c0 [die+332/688] die+0x14c/0x2b0 [do_page_fault+921/1791] do_page_fault+0x399/0x6ff [error_code+79/84] error_code+0x4f/0x54 [elv_queue_empty+38/48] elv_queue_empty+0x26/0x30 [ide_do_request+130/1264] ide_do_request+0x82/0x4f0 [do_ide_request+38/48] do_ide_request+0x26/0x30 [__generic_unplug_device+57/64] __generic_unplug_device+0x39/0x40 [generic_unplug_device+124/352] generic_unplug_device+0x7c/0x160 [blk_backing_dev_unplug+41/48] blk_backing_dev_unplug+0x29/0x30 [block_sync_page+62/80] block_sync_page+0x3e/0x50 [sync_page+70/80] sync_page+0x46/0x50 [__wait_on_bit_lock+93/112] __wait_on_bit_lock+0x5d/0x70 [__lock_page+132/144] __lock_page+0x84/0x90 [do_generic_mapping_read+875/1088] do_generic_mapping_read+0x36b/0x440 [__generic_file_aio_read+445/496] __generic_file_aio_read+0x1bd/0x1f0 [generic_file_read+170/208] generic_file_read+0xaa/0xd0 [vfs_read+213/432] vfs_read+0xd5/0x1b0 [sys_read+75/128] sys_read+0x4b/0x80 [syscall_call+7/11] syscall_call+0x7/0xb Code: b8 70 99 47 c0 89 44 24 04 e8 9e 0b e3 ff e8 d9 1c e1 ff e9 fb fe ff ff 8d 74 26 00 55 89 e5 8b 45 08 8b 40 0c 8b 50 04 8d 42 0c <39> 42 0c 75 1c 8d 42 14 39 42 14 75 14 8b 42 2c 39 00 74 11 8d drivers/block/ll_rw_blk.c:1475: spin_lock(drivers/ide/ide.c:c052ad08) already locked by drivers/block/ll_rw_blk.c/1475 ---------------------------------------------------------------- |
From: Dave K. <sh...@au...> - 2005-09-16 15:42:17
|
On Fri, 2005-09-16 at 15:58 +0100, Peter Grandi wrote: > >>> On Fri, 16 Sep 2005 15:16:15 +0100, > >>> pg...@jf... (Peter Grandi) said: > > [ ... ] > > >>> * Some of my tests were tree traversals, [ ... ] the timings > >>> be with '-o noatime', unfortunately I got a crash because > >>> of that. > > shaggy> noatime is used a lot, I don't think noatime was a > shaggy> direct cause of the problem, [ ... ] > > pg_jfs> I concur. I have tried to reproduce it subsequently but > pg_jfs> did not happen again, but then I now remember I did this > pg_jfs> on an empty filesystem being freshly loaded, so perhaps > pg_jfs> it is a corner case. I'll do another test. > > Spoke too soon, it happened again, and I got a backtrace (I > haven't applied the patch you just sent me yet). It does not > look directly JFS related. No, I googled and found a similar oops reported on a -ck kernel with reiserfs: http://bhhdoa.org.au/pipermail/ck/2005-August/004136.html I haven't read the whole thread, so I don't know what's known about it yet. > I was doing tests with various log sizes (strange results > still), with and without 'noatime' and with the 'cfq' or the > 'anticipatory' elevators, having switched to 'anticipatory' a > bit earlier, and post-boot elevator switching seems a bit buggy. > > But it is curious that it happened as I specified 'noatime' and > not without. Try my patch and avoid anticipatory, and I think that you may be okay. > ---------------------------------------------------------------- > Unable to handle kernel NULL pointer dereference at virtual address 0000000c > printing eip: > c02f2ecf > *pde = 00000000 > Oops: 0000 [#1] > DEBUG_PAGEALLOC > Modules linked in: videodev loop binfmt_misc snd_cmipci snd_opl3_lib snd_hwdep snd_seq_oss snd_seq_midi snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 3c59x mii parport_pc lp parport video thermal processor fan container button battery ac it87 eeprom i2c_sensor i2c_isa i2c_dev i2c_core ntfs nls_iso8859_1 nls_cp850 sg sr_mod ide_scsi scsi_mod 8250 serial_core nvram rtc > CPU: 0 > EIP: 0060:[as_queue_empty+15/64] Not tainted VLI > EFLAGS: 00210086 (2.6.13p) > EIP is at as_queue_empty+0xf/0x40 > eax: 0000000c ebx: c05e0c58 ecx: c02f2ec0 edx: 00000000 > esi: c05e0ee0 edi: c05e0ee0 ebp: c3436cbc esp: c3436cbc > ds: 007b es: 007b ss: 0068 > Process mount (pid: 20241, threadinfo=c3436000 task=daeccb10) > Stack: c3436cc8 c02e7b16 dfdf47d8 c3436d00 c0310842 dfdf47d8 c3436ce4 00000000 > d7e9435c c11ba940 00200086 dfdf45fc c3436dc8 c14027e0 dfdf45fc c3436dc8 > c14027e0 c3436d10 c0310cd6 dfdef1e0 ffffffff c3436d20 c02ea649 dfdf45fc > Call Trace: > [show_stack+127/160] show_stack+0x7f/0xa0 > [show_registers+343/448] show_registers+0x157/0x1c0 > [die+332/688] die+0x14c/0x2b0 > [do_page_fault+921/1791] do_page_fault+0x399/0x6ff > [error_code+79/84] error_code+0x4f/0x54 > [elv_queue_empty+38/48] elv_queue_empty+0x26/0x30 > [ide_do_request+130/1264] ide_do_request+0x82/0x4f0 > [do_ide_request+38/48] do_ide_request+0x26/0x30 > [__generic_unplug_device+57/64] __generic_unplug_device+0x39/0x40 > [generic_unplug_device+124/352] generic_unplug_device+0x7c/0x160 > [blk_backing_dev_unplug+41/48] blk_backing_dev_unplug+0x29/0x30 > [block_sync_page+62/80] block_sync_page+0x3e/0x50 > [sync_page+70/80] sync_page+0x46/0x50 > [__wait_on_bit_lock+93/112] __wait_on_bit_lock+0x5d/0x70 > [__lock_page+132/144] __lock_page+0x84/0x90 > [do_generic_mapping_read+875/1088] do_generic_mapping_read+0x36b/0x440 > [__generic_file_aio_read+445/496] __generic_file_aio_read+0x1bd/0x1f0 > [generic_file_read+170/208] generic_file_read+0xaa/0xd0 > [vfs_read+213/432] vfs_read+0xd5/0x1b0 > [sys_read+75/128] sys_read+0x4b/0x80 > [syscall_call+7/11] syscall_call+0x7/0xb > Code: b8 70 99 47 c0 89 44 24 04 e8 9e 0b e3 ff e8 d9 1c e1 ff e9 fb fe ff ff 8d 74 26 00 55 89 e5 8b 45 08 8b 40 0c 8b 50 04 8d 42 0c <39> 42 0c 75 1c 8d 42 14 39 42 14 75 14 8b 42 2c 39 00 74 11 8d > drivers/block/ll_rw_blk.c:1475: spin_lock(drivers/ide/ide.c:c052ad08) already locked by drivers/block/ll_rw_blk.c/1475 > ---------------------------------------------------------------- David Kleikamp IBM Linux Technology Center |
From: <pg...@jf...> - 2005-09-18 21:10:33
|
>>> On Fri, 16 Sep 2005 10:40:30 -0500, Dave Kleikamp >>> <sh...@au...> said: [ ... crash on mounting JFS partition with '-o noatime' ... ] >> Spoke too soon, it happened again, and I got a backtrace (I >> haven't applied the patch you just sent me yet). It does not >> look directly JFS related. shaggy> No, I googled and found a similar oops reported on a -ck shaggy> kernel with reiserfs: [ ... ] I haven't read the whole shaggy> thread, so I don't know what's known about it yet. Looks a bit different as to where it arises. However it looks indeed like it is somewhere else. [ ... ] shaggy> Try my patch and avoid anticipatory, and I think that shaggy> you may be okay. Well, the patch and staying clear of 'anticipatory' seem to be doing fine, so far so good. I have had some other problems with 'ancipatory', so I am suspicious of it too. For interactive use I think that 'cfq' is better anyhow, so that's my default. Unfortunately I am trying to do some simple file system driver performance tests, and 'anticipatory' is the kernel default elevator, and thus presumably the one that matters to most people. I am considering doing the test with 'noop', but that's not fair either, as it seems some file systems are designed assuming some sort of elevator sorting does happen. |