Thread: [Jfs-discussion] 2.4.20 + JFS 1.1.2 stops responding
From: <ad...@ae...> - 2003-04-16 17:48:23
Reproducible problem: vanilla 2.4.20 with JFS, and 2.4.20 + JFS 1.1.2, stop responding after several hours (fewer than 8) of heavy file activity. By "stop responding" I mean I cannot enter any commands at the prompt or establish new login sessions. There are no errors on the console or in the logs after a reboot. While in this state the machine still answers ping and I can switch between virtual consoles, but no commands can be entered at the prompt. SysRq and Ctrl-Alt-Del don't work either.

This has been observed on a Dell PowerEdge 1650 single-processor server and a Dell Inspiron 8000 notebook. I have not been able to reproduce it with ext2 (at least not yet).

I can reproduce the problem by starting a couple of instances of this script in different directories on a JFS filesystem:

#!/bin/bash
# $RANDOM is a bash extension; under a strictly POSIX /bin/sh it is empty
while true
do
    i=0
    while [ $i -lt 10000 ]
    do
        # file size of 0..255 KB, since $RANDOM ranges over 0..32767
        myrand=`expr $RANDOM / 128`
        echo c$i
        dd if=/dev/zero of=$i.tdh bs=1k count=$myrand > /dev/null 2>&1
        i=`expr $i + 1`
    done
    find . -name "*.tdh" -exec rm {} \;
done

It appears the hang may be in the "find" command, since the last thing printed to the console is always c9999. After a reboot I ran jfs_fsck and didn't see anything unusual; the exit code was 1.

Any ideas?

Thanks.
--
Thomas Hays
ad...@ae...
From: Dave K. <sh...@au...> - 2003-04-16 18:10:17
On Wednesday 16 April 2003 09:48, ad...@ae... wrote:
> Reproducible problem where vanilla 2.4.20 with JFS and 2.4.20 + JFS
> 1.1.2 stop responding after several hours (<8) of heavy file
> activity. By stops responding I mean I cannot enter any commands at
> the prompt or establish new login sessions. There are no errors to
> the console or in the logs after a reboot. The machine will still
> ping and switch between virtual consoles while in this state, but no
> commands can be entered at the prompt. SysRq and ctrl-alt-del don't
> work as well.

Hmm. It would help to get a stack trace, but if SysRq isn't responding, that's not so easy. I'll try to reproduce this myself.

> It appears the hangup may be at the "find" command since the last
> thing printed to the console is always c9999. After a reboot I ran
> jfs_fsck and didn't see anything unusual, the exit code was 1.

jfs_fsck probably just replayed the journal. jfs_fsck -n checks the whole filesystem in read-only mode. If it reports any problems, please let me know; you can then fix them with jfs_fsck -f.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
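Why an exit code of 1 need not indicate damage: assuming jfs_fsck follows the generic fsck(8) convention, its exit status is a sum of bit flags, and a successful journal replay is reported with the same "errors corrected" bit (1). The C sketch below decodes such a status. The bit values are those of fsck(8); that jfs_fsck uses them identically is an assumption to check against the jfs_fsck man page for your jfsutils version, and the helper names are hypothetical, not part of jfsutils.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Bit values from the generic fsck(8) exit-status convention.
 * ASSUMPTION: jfs_fsck encodes its exit status the same way.
 */
enum fsck_status_bits {
    FSCK_CORRECTED   = 1,   /* errors corrected (a journal replay counts too) */
    FSCK_REBOOT      = 2,   /* errors corrected; reboot if the fs was mounted */
    FSCK_UNCORRECTED = 4,   /* errors left uncorrected */
    FSCK_OPERATIONAL = 8,   /* operational error */
    FSCK_USAGE       = 16,  /* usage or syntax error */
    FSCK_CANCELLED   = 32,  /* checking cancelled by user request */
    FSCK_LIBRARY     = 128  /* shared library error */
};

/* Hypothetical helper: true when only the "corrected" bit is set,
 * which is what a plain journal replay looks like. */
int fsck_only_corrected(int status)
{
    assert(status >= 0);
    return status == FSCK_CORRECTED;
}

/* Hypothetical helper: true when the checker found problems it did not
 * fix, e.g. a read-only (-n) run that detected damage. */
int fsck_left_errors(int status)
{
    assert(status >= 0);
    return (status & FSCK_UNCORRECTED) != 0;
}

/* Print one line for every bit set in an fsck exit status. */
void describe_fsck_status(int status)
{
    static const struct {
        int bit;
        const char *text;
    } bits[] = {
        { FSCK_CORRECTED,   "errors corrected (or journal replayed)" },
        { FSCK_REBOOT,      "errors corrected, reboot needed if mounted" },
        { FSCK_UNCORRECTED, "errors left uncorrected" },
        { FSCK_OPERATIONAL, "operational error" },
        { FSCK_USAGE,       "usage or syntax error" },
        { FSCK_CANCELLED,   "checking cancelled by user request" },
        { FSCK_LIBRARY,     "shared library error" },
    };

    if (status == 0) {
        puts("no errors");
        return;
    }
    for (size_t i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        if (status & bits[i].bit)
            puts(bits[i].text);
    }
}
```

Under this reading, the exit code of 1 from the default run is consistent with a journal replay alone, while a nonzero FSCK_UNCORRECTED bit from a jfs_fsck -n run would be the thing worth reporting.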
From: Dave K. <sh...@au...> - 2003-04-24 22:38:44
I haven't been able to reproduce this, but I haven't run it for more than a couple of hours. I'm wondering if this is caused by the same bug I fixed last week; I've been running with the patch for it. The patch is below. You may want to check whether it fixes the problem.

Thanks,
Dave

On Wednesday 16 April 2003 09:48, ad...@ae... wrote:
> Reproducible problem where vanilla 2.4.20 with JFS and 2.4.20 + JFS
> 1.1.2 stop responding after several hours (<8) of heavy file
> activity. By stops responding I mean I cannot enter any commands at
> the prompt or establish new login sessions. There are no errors to
> the console or in the logs after a reboot. The machine will still
> ping and switch between virtual consoles while in this state, but no
> commands can be entered at the prompt. SysRq and ctrl-alt-del don't
> work as well.

Index: linux24/fs/jfs/jfs_txnmgr.c
===================================================================
RCS file: /usr/cvs/jfs/linux24/fs/jfs/jfs_txnmgr.c,v
retrieving revision 1.54
diff -u -r1.54 jfs_txnmgr.c
--- linux24/fs/jfs/jfs_txnmgr.c	24 Mar 2003 21:03:08 -0000	1.54
+++ linux24/fs/jfs/jfs_txnmgr.c	17 Apr 2003 19:37:37 -0000
@@ -1227,8 +1227,21 @@
 		 * Ensure that inode isn't reused before
 		 * lazy commit thread finishes processing
 		 */
-		if (tblk->xflag & (COMMIT_CREATE | COMMIT_DELETE))
+		if (tblk->xflag & (COMMIT_CREATE | COMMIT_DELETE)) {
 			atomic_inc(&tblk->ip->i_count);
+			/*
+			 * Avoid a rare deadlock
+			 *
+			 * If the inode is locked, we may be blocked in
+			 * jfs_commit_inode.  If so, we don't want the
+			 * lazy_commit thread doing the last iput() on the inode
+			 * since that may block on the locked inode.  Instead,
+			 * commit the transaction synchronously, so the last iput
+			 * will be done by the calling thread (or later)
+			 */
+			if (tblk->ip->i_state & I_LOCK)
+				tblk->xflag &= ~COMMIT_LAZY;
+		}
 
 		ASSERT((!(tblk->xflag & COMMIT_DELETE)) ||
 		       ((tblk->ip->i_nlink == 0) &&
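The decision the patch adds can be modelled in isolation. This is a sketch only: the structs and the numeric flag values are hypothetical stand-ins (the real COMMIT_* and I_LOCK definitions come from the JFS and kernel headers and differ), and a plain counter replaces atomic_inc(). Only the control flow mirrors the hunk above: pin the inode for create/delete commits and, when the inode is locked, clear COMMIT_LAZY so the commit runs synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL flag values; the kernel's real constants differ. */
#define COMMIT_CREATE 0x1u
#define COMMIT_DELETE 0x2u
#define COMMIT_LAZY   0x4u
#define I_LOCK        0x8u

/* Simplified stand-in for struct inode. */
struct fake_inode {
    unsigned state;  /* plays the role of i_state */
    int count;       /* plays the role of i_count */
};

/* Simplified stand-in for the transaction block. */
struct fake_tblock {
    unsigned xflag;
    struct fake_inode *ip;
};

/*
 * Models the patched block: take an extra inode reference for
 * create/delete commits so the inode isn't reused before the commit is
 * processed; if the inode is locked, drop COMMIT_LAZY so the final iput()
 * is not left to the lazy_commit thread.
 */
void pin_inode_for_commit(struct fake_tblock *tblk)
{
    assert(tblk != NULL && tblk->ip != NULL);
    if (tblk->xflag & (COMMIT_CREATE | COMMIT_DELETE)) {
        tblk->ip->count++;  /* stands in for atomic_inc(&ip->i_count) */
        if (tblk->ip->state & I_LOCK)
            tblk->xflag &= ~COMMIT_LAZY;  /* commit synchronously */
    }
}

/* True if the transaction would still be handed to the lazy commit thread. */
bool commits_lazily(const struct fake_tblock *tblk)
{
    return (tblk->xflag & COMMIT_LAZY) != 0;
}
```

Clearing COMMIT_LAZY only when I_LOCK is set keeps the common case asynchronous and pays for a synchronous commit only in the situation the patch comment identifies as able to deadlock.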