Thread: [Jfs-discussion] Strange problem with JFS
From: Jon N. <jne...@ja...> - 2004-12-17 00:54:53
I encountered something very strange last night, similar to other reports of starvation (at least different reports by John Goerzen and others, back to June and July).

I've got 3x 80G IDE drives in a raid5 array (150G or so usable). On top of that I've got LVM. I was copying from one LV (on top of a normal physical volume) to an LV that is on top of the raid. I was copying an entire installation (about 16G worth), using just cp -a. While the cp was in progress, the system was almost utterly unusable. Even an ls in a directory with < 20 entries would take 20+ minutes to complete. This is on a Duron 750 with 1 GB of RAM. Anything involving disk activity was utterly deadlocked, although the system /itself/ was quite responsive. If I suspended the cp (with control-z), within a few seconds the system "came to life" and everything was normal. This is utterly repeatable. The kernel is the 2.6.9 kernel on a SuSE 9.2 Pro installation.

Later, I tried something: in the "source" directory, I ran:

tar cf - | tar --diff -C /oldstuff

and the system was totally usable - it would appear that writing (both large and small files, some very small, some very large, and everything in between) is the instigator. On my other machine, with a notebook drive no less, I tried shuffling a single 300MB file around and had no problems (very nearly the same kernel as the other machine).

Does anybody have any ideas? All filesystems were mounted with noatime, except the notebook's.

-- Jon Nelson <jne...@ja...>
From: John G. <jgo...@co...> - 2004-12-17 01:18:02
On Thu, Dec 16, 2004 at 03:54:46PM -0600, Jon Nelson wrote:
> I encountered something very strange last night, similar to other reports
> of starvation (at least different reports by John Goerzen and others,
> back to June and July).

Interesting you should mention that, as I've seen a few mild cases myself.

Let's compare notes...

> I've got 3x 80G IDE drives in a raid5 array (150G or so usable).
> On top of that I've got LVM.

No RAID going on here, but I do use LVM everywhere I possibly can.

[ snip ]

> Even an ls in a directory with < 20 entries would take 20+ minutes to
> complete. This is on a Duron 750 with 1 GB of RAM. Anything involving

Very similar to what I saw, but much more extreme.

> and the system was totally usable - it would appear that writing (both

I noticed my problems during runs of updatedb (for locate). That process basically reads a bunch of directories, probably stats all the files, but never actually opens the files. (I think it runs find.)

I wonder if this has something to do with directories or inodes.

I haven't lately noticed it with updatedb runs, but I have noticed it periodically when copying, or building ISO images or things like that.

-- John

(BTW, how's Boa going along these days? I think it was probably about 6 years ago I first contacted you about it.)
From: Jon N. <jne...@ja...> - 2004-12-17 01:55:39
On Thu, 16 Dec 2004, John Goerzen wrote:
> On Thu, Dec 16, 2004 at 03:54:46PM -0600, Jon Nelson wrote:
>> I encountered something very strange last night, similar to other reports
>> of starvation (at least different reports by John Goerzen and others,
>> back to June and July).
>
> Interesting you should mention that, as I've seen a few mild cases
> myself.
>
> Let's compare notes...
>
>> I've got 3x 80G IDE drives in a raid5 array (150G or so usable).
>> On top of that I've got LVM.
>
> No RAID going on here, but I do use LVM everywhere I possibly can.
>
> [ snip ]
>
>> Even an ls in a directory with < 20 entries would take 20+ minutes to
>> complete. This is on a Duron 750 with 1 GB of RAM. Anything involving
>
> Very similar to what I saw, but much more extreme.
>
>> and the system was totally usable - it would appear that writing (both
>
> I noticed my problems during runs of updatedb (for locate). That
> process basically reads a bunch of directories, probably stats all the
> files, but never actually opens the files. (I think it runs find.)
>
> I wonder if this has something to do with directories or inodes.

Hmm. That /is/ interesting. One of the things I copied over was my squid cache. It's not huge or anything (1.8G and ~100K files), but it /was/ part of the same "process" (cp -a) that copied over .flac files and some other very large files. I wonder if the kernel attaches to processes some list of "files it has done something with", and the longer that list, the more expensive life gets. What's odd is how /one/ process was seemingly able to exclude others from accessing the disk, but only while running (suspended, things went back to normal).

> (BTW, how's Boa going along these days? I think it was probably about 6
> years ago I first contacted you about it.)

It's going very well. For about 3 years now I've been telling coworkers that 1.0 is "just around the corner". Yeah, right. AFAIK, Boa is the oldest web server in use (predates Apache, you know) and still hasn't hit 1.0.

-- Jon Nelson <jne...@ja...>
From: Sonny R. <so...@bu...> - 2004-12-17 22:20:14
On Thu, Dec 16, 2004 at 04:17:53PM -0600, John Goerzen wrote:
<snip>
> I noticed my problems during runs of updatedb (for locate). That
> process basically reads a bunch of directories, probably stats all the
> files, but never actually opens the files. (I think it runs find.)
>
> I wonder if this has something to do with directories or inodes.
>
> I haven't lately noticed it with updatedb runs, but I have noticed it
> periodically when copying, or building ISO images or things like that.

Yes, again this sounds similar to a problem someone reported a week or two ago, where he had a directory with 500k files and was running find on it. The symptoms will be that kswapd is running a lot and system responsiveness will be very poor. Also, it's certainly possible that some of your directories will have very few files, but what's important is the total number of inodes in memory, not any particular directory.

You should look at the number of inodes in core at this point; this will give you an idea of whether this is the problem or not. Do this by using either "slabtop" or by looking at /proc/slabinfo, checking the number of slabs used for dentries and inodes.

If this is indeed the problem, one stopgap measure on > 2.6.7 kernels is to alter the value of /proc/sys/vm/vfs_cache_pressure upwards. I would try values with different orders of magnitude, i.e. 1000, 10000, 100000, etc. This should cause kswapd to be more aggressive in eliminating slab objects, which means it will hold the dcache for less time overall.

Sonny
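The checks Sonny describes can be run directly from a shell. A minimal sketch, assuming a 2.6-era kernel with the usual slab cache names; the tuning values are only starting points, not recommendations:

    # Count in-core JFS inodes and dentries via the slab allocator
    grep -E '^(jfs_ip|dentry_cache|inode_cache)' /proc/slabinfo

    # Or watch them interactively, sorted by cache size
    slabtop -s c

    # Stopgap on > 2.6.7 kernels: the default is 100; larger values make
    # kswapd reclaim dentry/inode slab objects more aggressively
    cat /proc/sys/vm/vfs_cache_pressure
    echo 1000 > /proc/sys/vm/vfs_cache_pressure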
From: Jon N. <jne...@ja...> - 2004-12-18 06:08:55
On Fri, 17 Dec 2004, Sonny Rao wrote:
> On Thu, Dec 16, 2004 at 04:17:53PM -0600, John Goerzen wrote:
> <snip>
>> I noticed my problems during runs of updatedb (for locate). That
>> process basically reads a bunch of directories, probably stats all the
>> files, but never actually opens the files. (I think it runs find.)
>>
>> I wonder if this has something to do with directories or inodes.
>>
>> I haven't lately noticed it with updatedb runs, but I have noticed it
>> periodically when copying, or building ISO images or things like that.
>
> Yes, again this sounds similar to a problem someone reported a week or
> two ago, where he had a directory with 500k files and was running find
> on it. The symptoms will be that kswapd is running a lot and system
> responsiveness will be very poor. Also, it's certainly possible that
> some of your directories will have very few files, but what's important
> is the total number of inodes in memory, not any particular directory.
>
> You should look at the number of inodes in core at this point; this
> will give you an idea of whether this is the problem or not. Do this
> by using either "slabtop" or by looking at /proc/slabinfo, checking
> the number of slabs used for dentries and inodes.
>
> If this is indeed the problem, one stopgap measure on > 2.6.7 kernels
> is to alter the value of /proc/sys/vm/vfs_cache_pressure upwards. I
> would try values with different orders of magnitude, i.e. 1000, 10000,
> 100000, etc. This should cause kswapd to be more aggressive in
> eliminating slab objects, which means it will hold the dcache for less
> time overall.

Well, I'm not really sure what I'm looking for, but I did this:

fstest -n 50 -f 10000 -s 2048

Then, in another terminal, I ran this:

find . | wc -l

and from slabtop, sorted on cache size:

49396 49391  99%  0.81K  12349  4  49396K  jfs_ip
46452 46228  99%  0.28K   3318 14  13272K  radix_tree_node
13244 10512  79%  0.14K    473 28   1892K  dentry_cache
 9375  7382  78%  0.05K    125 75    500K  buffer_head

At this time the disk light is (for the last 5 minutes or more now) completely inactive (except for a very occasional blip).

Looks like I cannot kill fstest or find (kill -9 has no effect). Both processes are in 'D+' state.

I got fstest (I'm pretty sure mine is old) at:

http://samba.org/ftp/unpacked/junkcode/

What else can I do?

-- Jon Nelson <jne...@ja...>
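For anyone reproducing this, a hedged one-liner to watch the relevant slab caches grow while fstest and find run (same cache names as above; run it in a third terminal):

    watch -n 1 'grep -E "jfs_ip|dentry_cache|radix_tree_node" /proc/slabinfo'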
From: Jon N. <jne...@ja...> - 2004-12-18 06:10:22
A small addendum - the load on this box is now 54, but it's totally usable. This is a very low-end laptop with a Celeron 1.6GHz.

-- Jon Nelson <jne...@ja...>
From: Sonny R. <so...@bu...> - 2004-12-18 23:10:23
On Fri, Dec 17, 2004 at 09:08:45PM -0600, Jon Nelson wrote:
> On Fri, 17 Dec 2004, Sonny Rao wrote:
>> On Thu, Dec 16, 2004 at 04:17:53PM -0600, John Goerzen wrote:
>> <snip>
>>> I noticed my problems during runs of updatedb (for locate). That
>>> process basically reads a bunch of directories, probably stats all the
>>> files, but never actually opens the files. (I think it runs find.)
>>>
>>> I wonder if this has something to do with directories or inodes.
>>>
>>> I haven't lately noticed it with updatedb runs, but I have noticed it
>>> periodically when copying, or building ISO images or things like that.
>>
>> Yes, again this sounds similar to a problem someone reported a week or
>> two ago, where he had a directory with 500k files and was running find
>> on it. The symptoms will be that kswapd is running a lot and system
>> responsiveness will be very poor. Also, it's certainly possible that
>> some of your directories will have very few files, but what's important
>> is the total number of inodes in memory, not any particular directory.
>>
>> You should look at the number of inodes in core at this point; this
>> will give you an idea of whether this is the problem or not. Do this
>> by using either "slabtop" or by looking at /proc/slabinfo, checking
>> the number of slabs used for dentries and inodes.
>>
>> If this is indeed the problem, one stopgap measure on > 2.6.7 kernels
>> is to alter the value of /proc/sys/vm/vfs_cache_pressure upwards. I
>> would try values with different orders of magnitude, i.e. 1000, 10000,
>> 100000, etc. This should cause kswapd to be more aggressive in
>> eliminating slab objects, which means it will hold the dcache for less
>> time overall.
>
> Well, I'm not really sure what I'm looking for, but I did this:
>
> fstest -n 50 -f 10000 -s 2048
>
> Then, in another terminal, I ran this:
>
> find . | wc -l
>
> and from slabtop, sorted on cache size:
>
> 49396 49391  99%  0.81K  12349  4  49396K  jfs_ip
> 46452 46228  99%  0.28K   3318 14  13272K  radix_tree_node
> 13244 10512  79%  0.14K    473 28   1892K  dentry_cache
>  9375  7382  78%  0.05K    125 75    500K  buffer_head
>
> At this time the disk light is (for the last 5 minutes or more now)
> completely inactive (except for a very occasional blip).
>
> Looks like I cannot kill fstest or find (kill -9 has no effect).
> Both processes are in 'D+' state.
>
> I got fstest (I'm pretty sure mine is old) at:
>
> http://samba.org/ftp/unpacked/junkcode/
>
> What else can I do?

Hmm, okay, so that basically means you have 49396 JFS inodes in memory, which is not an extremely large number IMO. The 7th column shows that you are using almost 50 MB of kernel memory on these inodes, which isn't a lot considering this machine has 1GB of RAM. So this certainly doesn't seem like the same issue as the other fellow.

So now I have two more questions: Is JFS compiled with stats turned on? Does this happen on other filesystems besides JFS?

If it is compiled with stats turned on, the contents of the files in /proc/fs/jfs/ before and after you start the tests would be interesting.

Sonny
From: Jon N. <jne...@ja...> - 2004-12-18 23:16:13
On Sat, 18 Dec 2004, Sonny Rao wrote:
> On Fri, Dec 17, 2004 at 09:08:45PM -0600, Jon Nelson wrote:
>> On Fri, 17 Dec 2004, Sonny Rao wrote:
>> <snip>
>>
>> Well, I'm not really sure what I'm looking for, but I did this:
>>
>> fstest -n 50 -f 10000 -s 2048
>>
>> Then, in another terminal, I ran this:
>>
>> find . | wc -l
>>
>> and from slabtop, sorted on cache size:
>>
>> 49396 49391  99%  0.81K  12349  4  49396K  jfs_ip
>> 46452 46228  99%  0.28K   3318 14  13272K  radix_tree_node
>> 13244 10512  79%  0.14K    473 28   1892K  dentry_cache
>>  9375  7382  78%  0.05K    125 75    500K  buffer_head
>>
>> At this time the disk light is (for the last 5 minutes or more now)
>> completely inactive (except for a very occasional blip).
>>
>> Looks like I cannot kill fstest or find (kill -9 has no effect).
>> Both processes are in 'D+' state.
>>
>> I got fstest (I'm pretty sure mine is old) at:
>>
>> http://samba.org/ftp/unpacked/junkcode/
>>
>> What else can I do?
>
> Hmm, okay, so that basically means you have 49396 JFS inodes in memory,
> which is not an extremely large number IMO. The 7th column shows that
> you are using almost 50 MB of kernel memory on these inodes, which
> isn't a lot considering this machine has 1GB of RAM. So this
> certainly doesn't seem like the same issue as the other fellow.

Actually, the test above was on my laptop, with only 384 MB of RAM and 512 MB of swap. *The first time I ran the test* I saw the # of inodes go above 100000 (100K), but then it hovered around 47K, always consuming about 50MB. I was not able to reproduce with reiserfs, xfs, or ext3 (all noatime).

> So now I have two more questions: Is JFS compiled with stats turned
> on? Does this happen on other filesystems besides JFS?

I have no idea. How do I find out? This is just the stock SuSE 9.2 kernel. (See above re: other filesystems.)

> If it is compiled with stats turned on, the contents of the files in
> /proc/fs/jfs/ before and after you start the tests would be
> interesting.

Well, I have /proc/fs/jfs and there are 4 files there. I had to reboot the laptop this morning (even though "load" was about 50, the machine felt totally normal; however, the locked filesystem was interfering with normal operations, and 'sync' of course hung forever). I'll let you know in a followup what /proc/fs/jfs says before and after testing.

fstest is very easy to run - why not give it a shot yourself? Follow my earlier instructions and you, too, may be able to reproduce the problem locally.

-- Jon Nelson <jne...@ja...>
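A small sketch of how such before/after snapshots can be captured, assuming the kernel was built with JFS statistics so the files below exist; the output format deliberately matches the "Contents of:" blocks in the follow-up message:

    # Dump every JFS statistics file with a header
    for f in /proc/fs/jfs/*; do
        echo "==================================="
        echo "Contents of: $f"
        echo "==================================="
        cat "$f"
    done > jfs-stats-$(date +%s).txt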
From: Jon N. <jne...@ja...> - 2004-12-18 23:47:34
On Sat, 18 Dec 2004, Jon Nelson wrote:
> On Sat, 18 Dec 2004, Sonny Rao wrote:
...
>>> Well, I'm not really sure what I'm looking for, but I did this:
>>>
>>> fstest -n 50 -f 10000 -s 2048
>>>
>>> Then, in another terminal, I ran this:
>>>
>>> find . | wc -l

Boom! Disk light goes out; can't do ANYTHING with that filesystem, instantly.

>>> I got fstest (I'm pretty sure mine is old) at:
>>>
>>> http://samba.org/ftp/unpacked/junkcode/

Also, from slabtop at the current moment:

10696 10695  99%  0.81K  2674  4  10696K  jfs_ip

In this order: the contents of the files after boot, after cleaning the FS (due to the hanging), and then just a few seconds after it's hung (I can hang it almost instantly, 100% reproducible).

**** After boot ****

===================================
Contents of: /proc/fs/jfs/lmstats
===================================
JFS Logmgr stats
================
commits = 0
writes submitted = 2
writes completed = 2
full pages submitted = 0
partial pages submitted = 0
===================================

===================================
Contents of: /proc/fs/jfs/mpstat
===================================
JFS Metapage statistics
=======================
page allocations = 12
page frees = 12
lock waits = 0
===================================

===================================
Contents of: /proc/fs/jfs/txstats
===================================
JFS TxStats
===========
calls to txBegin = 0
txBegin blocked by sync barrier = 0
txBegin blocked by tlocks low = 0
txBegin blocked by no free tid = 0
calls to txBeginAnon = 0
txBeginAnon blocked by sync barrier = 0
txBeginAnon blocked by tlocks low = 0
calls to txLockAlloc = 0
tLockAlloc blocked by no free lock = 0
===================================

===================================
Contents of: /proc/fs/jfs/xtstat
===================================
JFS Xtree statistics
====================
searches = 5
fast searches = 0
splits = 0
===================================

**** After clean but before test ****

===================================
Contents of: /proc/fs/jfs/lmstats
===================================
JFS Logmgr stats
================
commits = 96584
writes submitted = 34132
writes completed = 34132
full pages submitted = 30797
partial pages submitted = 2373
===================================

===================================
Contents of: /proc/fs/jfs/mpstat
===================================
JFS Metapage statistics
=======================
page allocations = 1482201
page frees = 1482193
lock waits = 1
===================================

===================================
Contents of: /proc/fs/jfs/txstats
===================================
JFS TxStats
===========
calls to txBegin = 96584
txBegin blocked by sync barrier = 0
txBegin blocked by tlocks low = 37
txBegin blocked by no free tid = 0
calls to txBeginAnon = 0
txBeginAnon blocked by sync barrier = 0
txBeginAnon blocked by tlocks low = 0
calls to txLockAlloc = 1937690
tLockAlloc blocked by no free lock = 0
===================================

===================================
Contents of: /proc/fs/jfs/xtstat
===================================
JFS Xtree statistics
====================
searches = 749112
fast searches = 74271
splits = 0
===================================

**** While hung ****

===================================
Contents of: /proc/fs/jfs/lmstats
===================================
JFS Logmgr stats
================
commits = 116220
writes submitted = 38720
writes completed = 38720
full pages submitted = 34252
partial pages submitted = 3203
===================================

===================================
Contents of: /proc/fs/jfs/mpstat
===================================
JFS Metapage statistics
=======================
page allocations = 1548192
page frees = 1547972
lock waits = 14
===================================

===================================
Contents of: /proc/fs/jfs/txstats
===================================
JFS TxStats
===========
calls to txBegin = 116224
txBegin blocked by sync barrier = 0
txBegin blocked by tlocks low = 48
txBegin blocked by no free tid = 12
calls to txBeginAnon = 10645
txBeginAnon blocked by sync barrier = 0
txBeginAnon blocked by tlocks low = 73
calls to txLockAlloc = 2131739
tLockAlloc blocked by no free lock = 5
===================================

===================================
Contents of: /proc/fs/jfs/xtstat
===================================
JFS Xtree statistics
====================
searches = 789286
fast searches = 94616
splits = 0
===================================

-- Jon Nelson <jne...@ja...>
From: Jon N. <jne...@ja...> - 2004-12-18 23:49:45
I should note that if I crack open another terminal and strace the find, this is what I get:

open(".", O_RDONLY|O_LARGEFILE)         = 5
fchdir(5)                               = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
chdir(".")                              = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40252000
write(1, ".\n", 2)                      = 2
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(6, F_SETFD, FD_CLOEXEC)         = 0
getdents(6,
^^^^^ hangs here.

-- Jon Nelson <jne...@ja...>
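When a process hangs in getdents() like this, a couple of generic commands (nothing JFS-specific; a sketch) confirm the uninterruptible sleep and show roughly which kernel symbol each task is waiting in:

    # List tasks in uninterruptible (D) sleep with their kernel wait channel
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

    # The same wait channel for a single process (here: the most recent find)
    cat /proc/$(pgrep -n find)/wchan; echo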
From: Sonny R. <so...@bu...> - 2004-12-20 11:29:13
On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
> I should note that if I crack open another terminal and strace the find,
> this is what I get:
>
> open(".", O_RDONLY|O_LARGEFILE)         = 5
> fchdir(5)                               = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> chdir(".")                              = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40252000
> write(1, ".\n", 2)                      = 2
> open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
> fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fcntl64(6, F_SETFD, FD_CLOEXEC)         = 0
> getdents(6,
> ^^^^^ hangs here.

Yeah, there appears to be a deadlock somewhere in the directory handling code. I find that it's not always reproducible on my laptop, depending on how much memory is free, other stuff going on, etc. But I was able to trigger it once by simply running fstest and a find concurrently, then killing fstest and running ls on the directory where fstest was running.

Here's what the backtraces look like:

fstest        D C0348640     0  5093      1          5094  5092 (NOTLB)
f07f9f14 00000082 f07f7870 c0348640 fffffff5 f13a992c c015eeca f13a992c
00000001 f07f9f70 f242ad10 00000000 23063c80 000f42cd f07f7a18 f13a9994
00000292 f07f8000 f07f7870 c026d7d7 f13a999c 00000001 f07f7870 c0118714
Call Trace:
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c0162261>] .text.lock.namei+0x109/0x168
 [<c01068f1>] error_code+0x2d/0x38
 [<c01181fd>] schedule_tail+0x41/0x4d
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  5094      1          5095  5093 (NOTLB)
f07fbf14 00000082 f07f72e0 c0348640 fffffff5 f13a992c c015eeca f13a992c
00000001 f07fbf70 f242ad10 00000000 23063c80 000f42cd f07f7488 f13a9994
00000292 f07fa000 f07f72e0 c026d7d7 f13a999c 00000001 f07f72e0 c0118714
Call Trace:
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c0162261>] .text.lock.namei+0x109/0x168
 [<c01068f1>] error_code+0x2d/0x38
 [<c01181fd>] schedule_tail+0x41/0x4d
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

<snip>

find          D C0348AE8     0  5115   4192                     (NOTLB)
f2233f24 00000082 f1526330 c0348ae8 000f42ce 0000fe04 00000000 00000000
0744ab6b 000f42ce f1526330 000f4240 074f6600 000f42ce f1526ff8 f13a9994
00000292 f2232000 f1526e50 c026d7d7 f13a999c 00000001 f1526e50 c0118714
Call Trace:
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c016393f>] .text.lock.readdir+0x5/0x16
 [<c0163900>] sys_getdents64+0x71/0xab
 [<c01637a4>] filldir64+0x0/0xeb
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

ls            D C0348640     0  5117   4191                     (NOTLB)
f0581f24 00000082 f15268c0 c0348640 00000000 f15bf354 f15af400 00000000
f14b8140 f14b8160 f747c1b0 000f4240 70291b00 000f42d1 f1526a68 f13a9994
00000292 f0580000 f15268c0 c026d7d7 f13a999c 00000001 f15268c0 c0118714
Call Trace:
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c016393f>] .text.lock.readdir+0x5/0x16
 [<c0163900>] sys_getdents64+0x71/0xab
 [<c01637a4>] filldir64+0x0/0xeb
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

So someone is holding the readdir semaphore here, but I can't tell which thread. find and ls are blocking on that. The jfsCommit and jfsSync threads are also blocking on something, but I haven't been able to get the backtraces for those yet. All of the fstest threads seem to be stuck in some error path.

I'll follow up if I find more.

Sonny
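Backtraces like the ones above can be collected without a debugger on a stock 2.6 kernel; a sketch, assuming the magic SysRq interface is available:

    # Enable magic SysRq, then dump every task's backtrace to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > backtraces.txt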
From: Sonny R. <so...@bu...> - 2004-12-20 12:48:55
On Mon, Dec 20, 2004 at 03:20:58AM -0500, Sonny Rao wrote:
> On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
>> I should note that if I crack open another terminal and strace the find,
>> this is what I get:
>>
>> <snip - strace output shown earlier in the thread>
>>
>> getdents(6,
>> ^^^^^ hangs here.
>
> Yeah, there appears to be a deadlock somewhere in the directory
> handling code. I find that it's not always reproducible on my laptop,
> depending on how much memory is free, other stuff going on, etc. But
> I was able to trigger it once by simply running fstest and a find
> concurrently, then killing fstest and running ls on the directory
> where fstest was running.

This message is mostly for Shaggy, but here it is anyway (kinda long):

Ok, I have some suspicious-looking backtraces now, where some threads are sleeping in jfs_unlink (which holds the directory i_sem, I think) while others hold the i_sem (from sys_mkdir). Overall, of the 100 fstest threads, here are the numbers for what syscall each was running:

unlink      -  4
pwrite64    - 24
close       -  7
open/create -  3
mkdir       -  1
unknown     - 60 (error path, blocked on the namei semaphore - probably JFS namei.c?)

Here are some more backtraces:

pdflush writing back inodes:

pdflush       D C0348640     0    59      3            61    58 (L-TLB)
c1717e2c 00000046 dfeb0b90 c0348640 c1717e30 dcaf8a44 c1717e20 00000000
00000010 c1717f34 db127260 00000000 e19f81c0 000f424a dfeb0d38 dcaf8800
00000282 c1716000 dfeb0b90 c026d7d7 dcaf8808 00000001 dfeb0b90 c0118714
Call Trace:
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<e096360a>] .text.lock.inode+0x5/0x9b [jfs]
 [<e0962e4c>] jfs_write_inode+0x4c/0x60 [jfs]
 [<c0170e43>] write_inode+0x46/0x48
 [<c017102b>] __sync_single_inode+0x1e6/0x202
 [<c0171263>] sync_sb_inodes+0x169/0x282
 [<c0171462>] writeback_inodes+0xe6/0xf2
 [<c01393c8>] wb_kupdate+0x99/0x10e
 [<c0139d77>] __pdflush+0xcb/0x1c4
 [<c0139e70>] pdflush+0x0/0x2c
 [<c0139e98>] pdflush+0x28/0x2c
 [<c013932f>] wb_kupdate+0x0/0x10e
 [<c0139e70>] pdflush+0x0/0x2c
 [<c012d579>] kthread+0xa5/0xab
 [<c012d4d4>] kthread+0x0/0xab
 [<c010425d>] kernel_thread_helper+0x5/0xb

jfs kernel daemon threads:

jfsIO         S C0348640     0  1801      1          1802  1795 (L-TLB)
deeadf90 00000046 dfaed770 c0348640 dfb619c0 00000050 00000000 df2cdd40
df2cdd40 df75d5c0 d809d780 00000000 0e3514c0 000f424b dfaed918 deeac000
deeac000 deeadfc0 ffffe000 e097faa5 df2cdd40 dfaed770 00000000 dfaed770
Call Trace:
 [<e097faa5>] jfsIOWait+0x125/0x160 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0105d26>] ret_from_fork+0x6/0x14
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097f980>] jfsIOWait+0x0/0x160 [jfs]
 [<c010425d>] kernel_thread_helper+0x5/0xb

jfsCommit     D C0348640     0  1802      1          1803  1801 (L-TLB)
de429ee0 00000046 dee18800 c0348640 df7dc000 dec76234 00004000 e09357e4
de428000 00270284 dbe060f0 00000000 f6e6e500 000f424a dee189a8 dec7607c
de429ef4 dee18800 dec76234 c026e3cd deeaa000 00001bcc dec76080 dec76080
Call Trace:
 [<c026e3cd>] rwsem_down_read_failed+0x8f/0x17c
 [<e097085d>] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [<e09824de>] txUpdateMap+0xae/0x250 [jfs]
 [<e09808d7>] txEnd+0xd7/0x140 [jfs]
 [<e0982d00>] txLazyCommit+0x20/0xe0 [jfs]
 [<e0982f78>] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0105d26>] ret_from_fork+0x6/0x14
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0982dc0>] jfs_lazycommit+0x0/0x1e0 [jfs]
 [<c010425d>] kernel_thread_helper+0x5/0xb

jfsSync       D C0348640     0  1803      1          2174  1802 (L-TLB)
de94bdd8 00000046 dee198b0 c0348640 def1d72c 00000008 c01f1c3a def1d72c
d808f600 000f424b dbe060f0 00000000 12707480 000f424b dee19a58 de94a000
de94bde8 c0118714 de94be08 e097fe87 00000000 00000000 00000000 dee198b0
Call Trace:
 [<c01f1c3a>] generic_make_request+0x15e/0x1de
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097fe87>] txLockAlloc+0xa7/0x160 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0980bb5>] txLock+0x275/0x470 [jfs]
 [<e097f5b1>] lbmStartIO+0xb1/0xe0 [jfs]
 [<e097f46f>] lbmWrite+0x10f/0x150 [jfs]
 [<e097c99e>] __get_metapage+0x5e/0x3d0 [jfs]
 [<e097dcbd>] lmGCwrite+0xdd/0xf0 [jfs]
 [<e096ce50>] diWrite+0x190/0x5f0 [jfs]
 [<e0981473>] txCommit+0x1b3/0x320 [jfs]
 [<e098083b>] txEnd+0x3b/0x140 [jfs]
 [<e098351f>] jfs_sync+0x21f/0x2c0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0105d26>] ret_from_fork+0x6/0x14
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0983300>] jfs_sync+0x0/0x2c0 [jfs]
 [<c010425d>] kernel_thread_helper+0x5/0xb

(some of the 100) fstest threads:

fstest        D C0348640     0  4565   4561          4566  4564 (NOTLB)
db6a7e8c 00000082 d9acfa30 c0348640 da11a6e0 c0167a32 da11a6e0 c0356cb0
db6a7eac 00000000 dbe07730 00000000 d8255c00 000f424a d9acfbd8 db6a6000
00000000 db6a7e9c db6a7f70 e098061f df75d5c0 db6a6000 00000000 d9acfa30
Call Trace:
 [<c0167a32>] dput+0x22/0x212
 [<e098061f>] txBegin+0x36f/0x3a0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0963f4a>] jfs_unlink+0x6a/0x420 [jfs]
 [<c01609a1>] vfs_unlink+0x102/0x1d5
 [<c0160b3c>] sys_unlink+0xc8/0x13a
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4567   4561          4568  4566 (NOTLB)
dc447f14 00000082 d9ace3f0 c0348640 fffffff5 dce97c68 c015eeca dce97c68
00000001 dc447f70 d9912230 000f4240 753f0940 000f4242 d9ace598 dce97cd0
00000292 dc446000 d9ace3f0 c026d7d7 dce97cd8 00000001 d9ace3f0 c0118714
Call Trace:
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c0162261>] .text.lock.namei+0x109/0x168
 [<c01502da>] filp_close+0x59/0x86
 [<c015036a>] sys_close+0x63/0x96
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4572   4561          4573  4571 (NOTLB)
da275c00 00000082 d9940d10 c0348640 da094c68 e0966a0e da094c68 00000000
00000000 da275c08 d9978640 00000000 e0314bc0 000f424a d9940eb8 da275c08
df75d5c0 da274000 df75d500 e09807b4 00000000 d9940d10 c0118714 db19bc14
Call Trace:
 [<e0966a0e>] xtLookup+0xde/0x1e0 [jfs]
 [<e09807b4>] txBeginAnon+0x164/0x1b0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097be66>] extAlloc+0x46/0x3d0 [jfs]
 [<c011875e>] __wake_up_common+0x38/0x57
 [<c011880e>] __wake_up_locked+0x2a/0x2e
 [<e09631a5>] jfs_get_blocks+0x2a5/0x330 [jfs]
 [<e0963268>] jfs_get_block+0x38/0x40 [jfs]
 [<c0154d24>] nobh_prepare_write+0xed/0x3eb
 [<c0137ee4>] buffered_rmqueue+0xf8/0x1c5
 [<c019d013>] radix_tree_node_alloc+0x1f/0x5a
 [<c019d2a1>] radix_tree_insert+0xe5/0x105
 [<c0133b82>] add_to_page_cache+0x69/0xaf
 [<c0135bfa>] generic_file_aio_write_nolock+0x3af/0xb3e
 [<e0963230>] jfs_get_block+0x0/0x40 [jfs]
 [<e0963874>] jfs_create+0x1d4/0x2a0 [jfs]
 [<c01363fd>] generic_file_write_nolock+0x74/0x8c
 [<c015f731>] may_open+0x59/0x253
 [<c014ff6b>] dentry_open+0x109/0x222
 [<c013650a>] generic_file_write+0x55/0x6c
 [<c01364b5>] generic_file_write+0x0/0x6c
 [<c0150ca1>] vfs_write+0xd0/0x135
 [<c0150f1a>] sys_pwrite64+0x88/0x8c
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4579   4561          4580  4578 (NOTLB)
dc12dcac 00000086 db126cd0 c0348640 dee7f1c0 dc12dcc0 c026dc40 db1261b0
c0348640 00000001 00000000 00000000 04225880 000f424b db126e78 dc12c000
debfc754 debfc784 dc12dcd4 e097c77e 00000000 db126cd0 c0118714 00000000
Call Trace:
 [<c026dc40>] schedule+0x294/0x485
 [<e097c77e>] __lock_metapage+0x8e/0xd0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097ca69>] __get_metapage+0x129/0x3d0 [jfs]
 [<e097cedc>] release_metapage+0x7c/0x320 [jfs]
 [<e096ce04>] diWrite+0x144/0x5f0 [jfs]
 [<e0981473>] txCommit+0x1b3/0x320 [jfs]
 [<e096b02f>] xtInitRoot+0x2f/0x80 [jfs]
 [<e096392b>] jfs_create+0x28b/0x2a0 [jfs]
 [<c015dce7>] permission+0x67/0x79
 [<c015dce7>] permission+0x67/0x79
 [<c015f64b>] vfs_create+0xb9/0x146
 [<c015feba>] open_namei+0x58f/0x5de
 [<c014fe3c>] filp_open+0x3e/0x64
 [<c01500bb>] get_unused_fd+0x37/0xd9
 [<c015022e>] sys_open+0x5b/0x8b
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4577   4561          4578  4576 (NOTLB)
d991dc64 00000086 db1261b0 c0348640 d991dc54 00000282 da053608 00000003
00000001 00000000 db126cd0 00000000 12707480 000f424b db126358 d991c000
d991dc74 c0118714 d991dc94 e097fe87 dbe060f0 db1261b0 00000000 db1261b0
Call Trace:
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097fe87>] txLockAlloc+0xa7/0x160 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0980bb5>] txLock+0x275/0x470 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097c99e>] __get_metapage+0x5e/0x3d0 [jfs]
 [<e097cedc>] release_metapage+0x7c/0x320 [jfs]
 [<e096ce50>] diWrite+0x190/0x5f0 [jfs]
 [<e0981473>] txCommit+0x1b3/0x320 [jfs]
 [<e096b02f>] xtInitRoot+0x2f/0x80 [jfs]
 [<e096392b>] jfs_create+0x28b/0x2a0 [jfs]
 [<c015dce7>] permission+0x67/0x79
 [<c015dce7>] permission+0x67/0x79
 [<c015f64b>] vfs_create+0xb9/0x146
 [<c015feba>] open_namei+0x58f/0x5de
 [<c014fe3c>] filp_open+0x3e/0x64
 [<c01500bb>] get_unused_fd+0x37/0xd9
 [<c015022e>] sys_open+0x5b/0x8b
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4598   4561          4599  4597 (NOTLB)
dbc11d88 00000086 dbd57770 c0348640 00000000 00000000 00000000 00000000
00000000 df1ed83c d9912d50 00000000 d8dc7700 000f424a dbd57918 df7dd244
00000286 dbc10000 dbd57770 c026d7d7 df7dd24c 00000001 dbd57770 c0118714
Call Trace:
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<e0970663>] .text.lock.jfs_imap+0x205/0x422 [jfs]
 [<c0169722>] alloc_inode+0xf2/0x19b
 [<e097bccb>] ialloc+0x5b/0x1b0 [jfs]
 [<e096399d>] jfs_mkdir+0x5d/0x2f0 [jfs]
 [<c0117c03>] recalc_task_prio+0x93/0x188
 [<c015dce7>] permission+0x67/0x79
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c015de3d>] cached_lookup+0x23/0x85
 [<c015dce7>] permission+0x67/0x79
 [<c0160379>] vfs_mkdir+0xad/0x137
 [<c01604ba>] sys_mkdir+0xb7/0xf6
 [<c01181fd>] schedule_tail+0x41/0x4d
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  4599   4561          4600  4598 (NOTLB)
dc31bf14 00000086 dbd571e0 c0348640 fffffff5 dce97c68 c015eeca dce97c68
00000001 dc31bf70 d9940780 00000000 d8dc7700 000f424a dbd57388 dce97cd0
00000282 dc31a000 dbd571e0 c026d7d7 dce97cd8 00000001 dbd571e0 c0118714
Call Trace:
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<c0162261>] .text.lock.namei+0x109/0x168
 [<c01068f1>] error_code+0x2d/0x38
 [<c01181fd>] schedule_tail+0x41/0x4d
 [<c0105dfd>] sysenter_past_esp+0x52/0x71
From: Jon N. <jne...@ja...> - 2004-12-23 17:34:20
On Mon, 20 Dec 2004, Sonny Rao wrote:
> This message is mostly for Shaggy, but here it is anyway (kinda long):
>
> Ok, I have some suspicious-looking backtraces now, where some threads
> are sleeping in jfs_unlink (which holds the directory i_sem, I think)
> while others hold the i_sem (from sys_mkdir).

Well, it's more or less out of my hands now. I wonder if this relates to my performance problems as well - I suspect it does, although a fix for this bug may not improve performance.

Thanks!

-- Jon Nelson <jne...@ja...>
From: Jon N. <jne...@ja...> - 2005-01-08 03:41:15
Out of curiosity, has any progress been made on this item?

On Thu, 23 Dec 2004, Jon Nelson wrote:
> On Mon, 20 Dec 2004, Sonny Rao wrote:
>> This message is mostly for Shaggy, but here it is anyway (kinda long):
>>
>> Ok, I have some suspicious-looking backtraces now, where some threads
>> are sleeping in jfs_unlink (which holds the directory i_sem, I think)
>> while others hold the i_sem (from sys_mkdir).

-- Jon Nelson <jne...@ja...>
From: Sonny R. <so...@bu...> - 2005-01-11 07:58:11
On Thu, Dec 23, 2004 at 08:34:12AM -0600, Jon Nelson wrote:
> On Mon, 20 Dec 2004, Sonny Rao wrote:
>> This message is mostly for Shaggy, but here it is anyway (kinda long):
>>
>> Ok, I have some suspicious-looking backtraces now, where some threads
>> are sleeping in jfs_unlink (which holds the directory i_sem, I think)
>> while others hold the i_sem (from sys_mkdir).
>
> Well, it's more or less out of my hands now. I wonder if this relates to
> my performance problems as well - I suspect it does, although a fix for
> this bug may not improve performance.

If you are running out of tlocks frequently, then you're in luck. There are module load-time parameters for the number of tx blocks and tx locks; just up the values and see if performance increases.

Sonny
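A sketch of what that looks like in practice, assuming JFS is built as a module. The parameter names nTxBlock and nTxLock match the jfs_txnmgr.c variables discussed later in the thread, but the values here are arbitrary examples and the mount point is a placeholder:

    umount /mnt/jfs                          # placeholder mount point
    rmmod jfs
    modprobe jfs nTxBlock=1024 nTxLock=8192  # enlarge the transaction pools
    mount /mnt/jfs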
From: Sonny R. <so...@bu...> - 2004-12-19 22:18:53
On Sat, Dec 18, 2004 at 02:47:27PM -0600, Jon Nelson wrote:
> On Sat, 18 Dec 2004, Jon Nelson wrote:
>> On Sat, 18 Dec 2004, Sonny Rao wrote:
> ...
>>>> Well, I'm not really sure what I'm looking for, but I did this:
>>>>
>>>> fstest -n 50 -f 10000 -s 2048
>>>>
>>>> Then, in another terminal, I ran this:
>>>>
>>>> find . | wc -l
>
> Boom! Disk light goes out; can't do ANYTHING with that filesystem,
> instantly.
>
>>>> I got fstest (I'm pretty sure mine is old) at:
>>>>
>>>> http://samba.org/ftp/unpacked/junkcode/
>
> Also, from slabtop at the current moment:
>
> 10696 10695  99%  0.81K  2674  4  10696K  jfs_ip
>
> In this order: the contents of the files after boot, after cleaning the
> FS (due to the hanging), and then just a few seconds after it's hung (I
> can hang it almost instantly, 100% reproducible).

<snip>

Hmm, okay, I had thought it might be caused by a lack of available transaction blocks or locks, but that doesn't seem to be the case based on the data you provided.

I am able to reproduce the problem on a laptop I have with me as well; I'm running a 2.6.8.1 kernel.

It looks like all of the processes are stuck waiting on a semaphore somewhere in namei (.text.lock.namei), and upon reboot I have a bit of filesystem corruption.

Shaggy will have to look into this some more. Thanks for the report.

Sonny
From: Dave K. <sh...@au...> - 2005-01-10 19:59:55
Sorry it's taken me so long to look into this, and thanks for all your help with this bug.

On Sun, 2004-12-19 at 14:10 -0500, Sonny Rao wrote:
> Hmm, okay, I had thought it might be caused by a lack of available
> transaction blocks or locks, but that doesn't seem to be the case
> based on the data you provided.

Actually, we do appear to be out of transaction locks (tlocks).

jfsSync       D C0348640     0  1803      1          2174  1802 (L-TLB)
de94bdd8 00000046 dee198b0 c0348640 def1d72c 00000008 c01f1c3a def1d72c
d808f600 000f424b dbe060f0 00000000 12707480 000f424b dee19a58 de94a000
de94bde8 c0118714 de94be08 e097fe87 00000000 00000000 00000000 dee198b0
Call Trace:
 [<c01f1c3a>] generic_make_request+0x15e/0x1de
 [<c0118714>] default_wake_function+0x0/0x12
 [<e097fe87>] txLockAlloc+0xa7/0x160 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0980bb5>] txLock+0x275/0x470 [jfs]
 [<e097f5b1>] lbmStartIO+0xb1/0xe0 [jfs]
 [<e097f46f>] lbmWrite+0x10f/0x150 [jfs]
 [<e097c99e>] __get_metapage+0x5e/0x3d0 [jfs]
 [<e097dcbd>] lmGCwrite+0xdd/0xf0 [jfs]
 [<e096ce50>] diWrite+0x190/0x5f0 [jfs]
 [<e0981473>] txCommit+0x1b3/0x320 [jfs]
 [<e098083b>] txEnd+0x3b/0x140 [jfs]
 [<e098351f>] jfs_sync+0x21f/0x2c0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0105d26>] ret_from_fork+0x6/0x14
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0983300>] jfs_sync+0x0/0x2c0 [jfs]
 [<c010425d>] kernel_thread_helper+0x5/0xb

This is the thread that is trying to free some tlocks when they start getting low. I think generic_make_request is just stack noise. We really don't want txLockAlloc to block here; we try really hard not to exhaust all of the tlocks so that this thread can make some progress.

I think the real problem, though, is the jfsCommit thread. If it were not blocked, I suspect that there would be a lot of tlocks freed.

jfsCommit     D C0348640     0  1802      1          1803  1801 (L-TLB)
de429ee0 00000046 dee18800 c0348640 df7dc000 dec76234 00004000 e09357e4
de428000 00270284 dbe060f0 00000000 f6e6e500 000f424a dee189a8 dec7607c
de429ef4 dee18800 dec76234 c026e3cd deeaa000 00001bcc dec76080 dec76080
Call Trace:
 [<c026e3cd>] rwsem_down_read_failed+0x8f/0x17c
 [<e097085d>] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [<e09824de>] txUpdateMap+0xae/0x250 [jfs]
 [<e09808d7>] txEnd+0xd7/0x140 [jfs]
 [<e0982d00>] txLazyCommit+0x20/0xe0 [jfs]
 [<e0982f78>] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [<c0118714>] default_wake_function+0x0/0x12
 [<c0105d26>] ret_from_fork+0x6/0x14
 [<c0118714>] default_wake_function+0x0/0x12
 [<e0982dc0>] jfs_lazycommit+0x0/0x1e0 [jfs]
 [<c010425d>] kernel_thread_helper+0x5/0xb

This thread must be in diUpdatePMap, trying to acquire ipimap's rdwrlock: IREAD_LOCK(). The funny thing is that the only place a write lock is taken on this inode is in diNewIAG, which is only called under diAlloc, which I don't see in any stacks.

I can't find any error path that would leave the lock taken, so I can't account for why this thread would be blocked. I'm sure you would have noticed if a thread oopsed in this path, leaving the lock locked.

> I am able to reproduce the problem on a laptop I have with me as well;
> I'm running a 2.6.8.1 kernel.
>
> It looks like all of the processes are stuck waiting on a semaphore
> somewhere in namei (.text.lock.namei), and upon reboot I have a bit of
> filesystem corruption.

I'd be interested in any info from jfs_fsck as to the nature of the filesystem corruption.

> Shaggy will have to look into this some more. Thanks for the report.

Thanks for all your help.

Shaggy
-- David Kleikamp
IBM Linux Technology Center
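A hedged example of gathering that information with jfsutils; -n should report problems without repairing anything, the device name is a placeholder, and the filesystem must be unmounted first:

    jfs_fsck -n -v /dev/hda3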
From: Dave K. <sh...@au...> - 2005-01-10 20:13:19
On Mon, 2005-01-10 at 10:59 -0600, Dave Kleikamp wrote:
> This thread must be in diUpdatePMap, trying to acquire ipimap's
> rdwrlock: IREAD_LOCK(). The funny thing is that the only place a write
> lock is taken on this inode is in diNewIAG, which is only called under
> diAlloc, which I don't see in any stacks.
>
> I can't find any error path that would leave the lock taken, so I can't
> account for why this thread would be blocked. I'm sure you would have
> noticed if a thread oopsed in this path, leaving the lock locked.

I found the deadlock. This thread is the missing link:

fstest        D C0348640     0  4598   4561          4599  4597 (NOTLB)
dbc11d88 00000086 dbd57770 c0348640 00000000 00000000 00000000 00000000
00000000 df1ed83c d9912d50 00000000 d8dc7700 000f424a dbd57918 df7dd244
00000286 dbc10000 dbd57770 c026d7d7 df7dd24c 00000001 dbd57770 c0118714
Call Trace:
 [<c026d7d7>] __down+0x8b/0xfd
 [<c0118714>] default_wake_function+0x0/0x12
 [<c026d984>] __down_failed+0x8/0xc
 [<e0970663>] .text.lock.jfs_imap+0x205/0x422 [jfs]
 [<c0169722>] alloc_inode+0xf2/0x19b
 [<e097bccb>] ialloc+0x5b/0x1b0 [jfs]
 [<e096399d>] jfs_mkdir+0x5d/0x2f0 [jfs]
 [<c0117c03>] recalc_task_prio+0x93/0x188
 [<c015dce7>] permission+0x67/0x79
 [<c015eeca>] link_path_walk+0xccf/0xdb2
 [<c015de3d>] cached_lookup+0x23/0x85
 [<c015dce7>] permission+0x67/0x79
 [<c0160379>] vfs_mkdir+0xad/0x137
 [<c01604ba>] sys_mkdir+0xb7/0xf6
 [<c01181fd>] schedule_tail+0x41/0x4d
 [<c0105dfd>] sysenter_past_esp+0x52/0x71

The compiler does a lot of inlining, but I do believe this is in diNewIAG. The thread is doing a down(&JFS_IP(ipimap)->commit_sem) while holding the rdwr_lock (which is a recent change). I will have to see how hard this is to fix.

-- David Kleikamp
IBM Linux Technology Center
From: Dave K. <sh...@au...> - 2005-01-10 22:39:45
Does this patch help?

===== fs/jfs/jfs_imap.c 1.37 vs edited =====
--- 1.37/fs/jfs/jfs_imap.c	2004-12-07 15:38:08 -06:00
+++ edited/fs/jfs/jfs_imap.c	2005-01-10 13:37:12 -06:00
@@ -2532,15 +2532,17 @@
 	 * to include a new iag.
 	 */
 
+	/* Must grab commit_sem before getting write lock on ipimap */
+	down(&JFS_IP(ipimap)->commit_sem);
+
 	/* acquire inode map lock */
 	IWRITE_LOCK(ipimap);
 
 	if (ipimap->i_size >> L2PSIZE != imap->im_nextiag + 1) {
-		IWRITE_UNLOCK(ipimap);
-		IAGFREE_UNLOCK(imap);
 		jfs_error(imap->im_ipimap->i_sb,
 			  "diNewIAG: ipimap->i_size is wrong");
-		return -EIO;
+		rc = -EIO;
+		goto unlock;
 	}
 
@@ -2551,11 +2553,8 @@
 	 * number limit.
 	 */
 	if (iagno > (MAXIAGS - 1)) {
-		/* release the inode map lock */
-		IWRITE_UNLOCK(ipimap);
-
 		rc = -ENOSPC;
-		goto out;
+		goto unlock;
 	}
 
 	/*
@@ -2566,12 +2565,8 @@
 
 	/* Allocate extent for new iag page */
 	xlen = sbi->nbperpage;
-	if ((rc = dbAlloc(ipimap, 0, (s64) xlen, &xaddr))) {
-		/* release the inode map lock */
-		IWRITE_UNLOCK(ipimap);
-
-		goto out;
-	}
+	if ((rc = dbAlloc(ipimap, 0, (s64) xlen, &xaddr)))
+		goto unlock;
 
 	/* assign a buffer for the page */
 	mp = get_metapage(ipimap, xaddr, PSIZE, 1);
@@ -2581,11 +2576,8 @@
 		 */
 		dbFree(ipimap, xaddr, (s64) xlen);
 
-		/* release the inode map lock */
-		IWRITE_UNLOCK(ipimap);
-
 		rc = -EIO;
-		goto out;
+		goto unlock;
 	}
 
 	iagp = (struct iag *) mp->data;
@@ -2617,22 +2609,17 @@
 	 * addressing structure pointing to the new iag page;
 	 */
 	tid = txBegin(sb, COMMIT_FORCE);
-	down(&JFS_IP(ipimap)->commit_sem);
 
 	/* update the inode map addressing structure to point to it */
 	if ((rc = xtInsert(tid, ipimap, 0, blkno, xlen, &xaddr, 0))) {
 		txEnd(tid);
-		up(&JFS_IP(ipimap)->commit_sem);
 
 		/* Free the blocks allocated for the iag since it was
 		 * not successfully added to the inode map
 		 */
 		dbFree(ipimap, xaddr, (s64) xlen);
 
-		/* release the inode map lock */
-		IWRITE_UNLOCK(ipimap);
-
-		goto out;
+		goto unlock;
 	}
 
 	/* update the inode map's inode to reflect the extension */
@@ -2660,11 +2647,6 @@
 	 */
 	imap->im_freeiag = iagno;
 
-	/* Until we have logredo working, we want the imap inode &
-	 * control page to be up to date.
-	 */
-	diSync(ipimap);
-
	/* release the inode map lock */
	IWRITE_UNLOCK(ipimap);
 }
@@ -2688,11 +2670,17 @@
 	*iagnop = iagno;
 	*mpp = mp;
 
+	goto out;
+
+      unlock:
+	IWRITE_UNLOCK(ipimap);
+	up(&JFS_IP(ipimap)->commit_sem);
+
       out:
 	/* release the iag free lock */
 	IAGFREE_UNLOCK(imap);
 
-	return (rc);
+	return rc;
 }
 
 /*
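A quick sketch of trying the patch against a local tree (the paths and patch file name are placeholders; the diff is against the 1.37 revision of jfs_imap.c, so some fuzz is possible elsewhere):

    cd /usr/src/linux-2.6.9                      # wherever the affected tree lives
    patch -p1 --dry-run < ~/jfs-diNewIAG.patch   # check that it applies cleanly
    patch -p1 < ~/jfs-diNewIAG.patch
    make modules modules_install                 # rebuild and reinstall jfs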
From: Jon N. <jne...@ja...> - 2005-01-11 06:24:51
On Mon, 10 Jan 2005, Dave Kleikamp wrote:
> Does this patch help?

I'll let you know as soon as I can. Incidentally, I'm only copying the list now, unless people want to be specifically included.

However, I've encountered another bug. I needed to enlarge one of my volumes. I ran:

lvresize --size=+1G /dev/....

Then I ran:

mount -o remount,resize /....

Everything went smoothly. df showed the size increase. Shortly thereafter, I got *ONE* message from a userland process about it being out of space (er, ENOSPC), and then the rest were "read-only file system" (EROFS). /var/log/messages had but one message:

ERROR: (device dm-3): diAllocIno: nfreeinos = 0, but iag on freelist

What happened?

-- Jon Nelson <jne...@ja...>
From: Sonny R. <so...@bu...> - 2005-01-11 03:13:48
Attachments:
jfs-deadlock2.out
jfs-deadlock2-txstats.out
On Mon, Jan 10, 2005 at 01:39:34PM -0600, Dave Kleikamp wrote:
> Does this patch help?

It doesn't appear to help. It looks like pdflush is trying to write back inodes and is blocking waiting for a tlock in txLockAlloc again; maybe this is keeping everyone else from completing their transactions? Also, you can clearly see one of the fstest threads in the diNewIAG path waiting on a tlock.

Also, I'm doing this on a 2.6.8.1 kernel - would that make any difference?

I've attached the full backtrace output and txstats output.

Sonny
From: Dave K. <sh...@au...> - 2005-01-11 17:20:16
On Mon, 2005-01-10 at 19:03 -0500, Sonny Rao wrote:
> On Mon, Jan 10, 2005 at 01:39:34PM -0600, Dave Kleikamp wrote:
>> Does this patch help?
>
> It doesn't appear to help. It looks like pdflush is trying to
> write back inodes and is blocking waiting for a tlock in txLockAlloc
> again; maybe this is keeping everyone else from completing their
> transactions? Also, you can clearly see one of the fstest threads in
> the diNewIAG path waiting on a tlock.

Yeah, that seems to be the trouble. We're not doing a good enough job of preventing tlock starvation.

I also noticed that several of the static functions called by diAlloc do show up in this latest stack trace, so I believe I was mistaken about the cause of the earlier deadlock. I now think that the thread in diAlloc was trying to grab the AG_LOCK, and never made it down into diNewIAG. I'm afraid there may still be two different problems that need to be figured out.

> Also, I'm doing this on a 2.6.8.1 kernel - would that make any difference?

I don't think so.

> I've attached the full backtrace output and txstats output.
>
> Sonny

-- David Kleikamp
IBM Linux Technology Center
From: Dave K. <sh...@au...> - 2005-01-11 19:03:11
On Tue, 2005-01-11 at 08:20 -0600, Dave Kleikamp wrote:
> I also noticed that several of the static functions called by diAlloc do
> show up in this latest stack trace, so I believe I was mistaken about
> the cause of the earlier deadlock. I now think that the thread in
> diAlloc was trying to grab the AG_LOCK, and never made it down into
> diNewIAG. I'm afraid there may still be two different problems that
> need to be figured out.

No, I think it's the same problem. I only saw a subset of the blocked threads in the earlier case, and I believe there probably was a thread blocked in diNewIAG.

This patch simply blocks new transactions earlier, when the tlocks are starting to get scarce. When I implemented the multiple commit threads, I split TxLockVHWM out of TxLockHWM, which was 80% of nTxLock. So this patch puts TxLockVHWM back to the old value of TxLockHWM. Maybe I was a little too bold waiting until 90% of the tlocks were in use before blocking new transactions. (The jfsSync thread is awakened when TxLockHWM is hit, and transactions are actually blocked when TxLockVHWM is hit.)

===== fs/jfs/jfs_txnmgr.c 1.64 vs edited =====
--- 1.64/fs/jfs/jfs_txnmgr.c	2004-11-01 07:59:47 -06:00
+++ edited/fs/jfs/jfs_txnmgr.c	2005-01-11 09:42:49 -06:00
@@ -266,8 +266,8 @@
 	 * tid = 0 is reserved.
 	 */
 	TxLockLWM = (nTxLock * 4) / 10;
-	TxLockHWM = (nTxLock * 8) / 10;
-	TxLockVHWM = (nTxLock * 9) / 10;
+	TxLockHWM = (nTxLock * 7) / 10;
+	TxLockVHWM = (nTxLock * 8) / 10;
 
 	size = sizeof(struct tblock) * nTxBlock;
 	TxBlock = (struct tblock *) vmalloc(size);
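To make the change concrete, a small sketch of the watermark arithmetic; the nTxLock value below is only an example, not necessarily the default:

    nTxLock=4096   # example pool size; the real value is a build/module setting
    echo "LWM  = $((nTxLock * 4 / 10))"   # low watermark (unchanged)
    echo "HWM  = $((nTxLock * 7 / 10))"   # was 8/10: jfsSync is woken here
    echo "VHWM = $((nTxLock * 8 / 10))"   # was 9/10: new transactions block here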
From: Dave K. <sh...@au...> - 2005-01-11 19:06:16
I think you can throw away the first patch. I no longer believe that the code can deadlock due to the commit_sem.

-- David Kleikamp
IBM Linux Technology Center
From: Sonny R. <so...@bu...> - 2005-01-11 19:08:41
On Tue, Jan 11, 2005 at 10:03:03AM -0600, Dave Kleikamp wrote:
> On Tue, 2005-01-11 at 08:20 -0600, Dave Kleikamp wrote:
>> I also noticed that several of the static functions called by diAlloc do
>> show up in this latest stack trace, so I believe I was mistaken about
>> the cause of the earlier deadlock. I now think that the thread in
>> diAlloc was trying to grab the AG_LOCK, and never made it down into
>> diNewIAG. I'm afraid there may still be two different problems that
>> need to be figured out.
>
> No, I think it's the same problem. I only saw a subset of the blocked
> threads in the earlier case, and I believe there probably was a thread
> blocked in diNewIAG.

Yeah, I looked through the earlier backtrace file and found a thread in diNewIAG. I'll try the patch.

Sonny