From: Franco B. <fr...@ro...> - 2004-06-30 01:25:00
|
I'm still seeing cases where I have up to 8 threads all waiting for
locks, with no CPU being consumed.

I managed to attach with the debugger to a few: some seemed to be in
destroy_node, others were in wait for thread restart, or something like
it. I can't be more precise as the debugger was acting strangely and
wouldn't let me quit or reattach.

We are running fuse in a production environment, so it's not possible to
say exactly what was happening at the time of the hang.

Any ideas?

I'm using the latest CVS code.
From: Miklos S. <msz...@in...> - 2004-06-30 08:42:09
|
> I'm still seeing cases where I have up to 8 threads all waiting for
> locks, no CPU is being consumed.

OK. At least this shouldn't be a spinlock problem like the last one,
because that would show a 100% CPU load.

> I managed to attach with the debugger to a few and some seemed to be in
> destroy_node, others were in wait for thread restart - or something. I
> can't be more precise as the debugger was acting strangely and wouldn't
> let me quit or reattach.

In theory gdb can attach to the process and then switch between
threads. But I'm not sure how well that works in practice.

> We are running fuse in a production environment so it's not possible to
> say exactly what was happening at the time of the hang.

Well, you're very brave to run CVS code in production :)

> Any ideas?

Not for the moment. But I'll be looking for possible deadlock causes.
You should also audit your own code, if you are doing any kind of locking.

> Under what circumstances does fuse spawn a new worker thread? I ask
> because on some machines where fuse is running there don't appear to be
> any extra worker threads, and yet on others which are running pretty
> much the same sort of work, I see many threads.

A new worker thread is started when a request is read from the kernel
and there are no more free worker threads. This guarantees that there
will always be one free worker. However, there is an upper limit (10)
on the number of workers.

> Worker threads never shutdown, do they?

Only on exit.

> I wanted to trap HUP signals so that I can reread a config file, but I
> see that fuse defines a handler for HUP which just exits.
>
> I tried removing HUP from the fuse handler and in the process discovered
> that the exit_handler routine was never being called anyway...

It works for me. Sending a HUP to a FUSE process nicely kills it,
like Ctrl-C on a terminal.

Miklos
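The worker policy described in the reply above (start another worker when a request is read and no worker is left free, up to a limit of 10) can be sketched as plain bookkeeping. This is only an illustration under assumptions: `worker_pool`, `pool_request_read`, `pool_request_done` and `workers_after_burst` are hypothetical names, no threads are actually started, and the real fuse_mt.c code may be organised quite differently.

```c
#include <pthread.h>

#define MAX_WORKERS 10  /* upper limit quoted in the message above */

/* Hypothetical bookkeeping structure, not the FUSE library's own. */
struct worker_pool {
    pthread_mutex_t lock;
    int num_workers;  /* worker threads that exist */
    int num_idle;     /* workers currently waiting to read a request */
};

void pool_init(struct worker_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->num_workers = 1;  /* start with a single worker */
    p->num_idle = 1;
}

/* Called by a worker right after it has read a request.
 * Returns 1 if the caller should start one more worker thread. */
int pool_request_read(struct worker_pool *p)
{
    int spawn = 0;

    pthread_mutex_lock(&p->lock);
    p->num_idle--;  /* this worker is now busy */
    if (p->num_idle == 0 && p->num_workers < MAX_WORKERS) {
        p->num_workers++;
        p->num_idle++;  /* the new worker counts as free */
        spawn = 1;
    }
    pthread_mutex_unlock(&p->lock);
    return spawn;
}

/* Called by a worker when it has finished processing its request. */
void pool_request_done(struct worker_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->num_idle++;
    pthread_mutex_unlock(&p->lock);
}

/* Simulate `requests` requests arriving while none complete, and
 * return how many workers exist afterwards. */
int workers_after_burst(int requests)
{
    struct worker_pool p;
    int i;

    pool_init(&p);
    /* A request can only be read while some worker is still free. */
    for (i = 0; i < requests && p.num_idle > 0; i++)
        pool_request_read(&p);
    pthread_mutex_destroy(&p.lock);
    return p.num_workers;
}
```

Under this model a machine that never has overlapping requests stays at very few workers, while one that sees bursts of slow requests grows to the cap, which would be consistent with the difference between machines described in the question.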
From: Valient G. <vg...@po...> - 2004-06-30 09:59:22
|
I'm also seeing locking problems, although not resulting in deadlock, so
not the same problem. I've tested with FUSE 1.2 + hide_node changes
backported, and with FUSE CVS.

In my case, I believe the problem may be due to race conditions for
FUSE's internal node state. The result is either fuse aborting from
get_node, or else passing a request through to the user filesystem with
an old name after renaming a file.

Some paths (I'm looking at the rename path and open paths) go through
multiple lock / unlock phases, but there doesn't seem to be anything
preventing the internal node state from being changed between one set of
locks and another.

Note that I didn't see the locking problems at first. It was running
fine on my system, but another user of my filesystem reported the
problems to me. It may be that my system is faster than his, or that
the threading implementations differ, so the window of opportunity for
a race condition was smaller. But when I run my filesystem under
valgrind, which slows down all operations by quite a bit, I also see
the problems.

I've been trying per-node locks across multi-step operations, but I'm
still testing. I just wanted to report that I've also been seeing
thread-locking issues.

Valient

On Wed, 2004-06-30 at 10:42, Miklos Szeredi wrote:
> > I'm still seeing cases where I have up to 8 threads all waiting for
> > locks, no CPU is being consumed.
>
> OK. At least this shouldn't be a spinlock problem like the last one,
> because that would show a 100% CPU load.
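To illustrate the idea of a per-node lock held across a multi-step operation, here is a minimal sketch. It is not the FUSE node table: `struct node`, `node_rename` and `node_open` are hypothetical, and the two "steps" are reduced to copying the name and bumping a counter. The point is only that rename takes the same lock, so it cannot run between step 1 and step 2.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NAME_MAX_LEN 64

/* Hypothetical node record; the real FUSE structures differ. */
struct node {
    pthread_mutex_t lock;     /* per-node lock */
    char name[NAME_MAX_LEN];  /* current name, changed by rename */
    int open_count;           /* number of opens recorded */
};

void node_init(struct node *n, const char *name)
{
    pthread_mutex_init(&n->lock, NULL);
    snprintf(n->name, sizeof(n->name), "%s", name);
    n->open_count = 0;
}

/* Rename takes the per-node lock, so it is serialised against
 * every node_open() on the same node. */
void node_rename(struct node *n, const char *new_name)
{
    pthread_mutex_lock(&n->lock);
    snprintf(n->name, sizeof(n->name), "%s", new_name);
    pthread_mutex_unlock(&n->lock);
}

/* Two-step open. The lock is held across both steps instead of
 * being dropped and retaken in between, so the name copied into
 * path_out is the name the node had when the open was counted. */
void node_open(struct node *n, char *path_out, size_t path_size)
{
    pthread_mutex_lock(&n->lock);
    snprintf(path_out, path_size, "%s", n->name);  /* step 1: resolve name */
    n->open_count++;                               /* step 2: record open */
    pthread_mutex_unlock(&n->lock);
}
```

The trade-off of this design is that anything slow done while the lock is held blocks every other operation on that node, so a real implementation would need to decide carefully what runs inside the locked region.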
From: Franco B. <fr...@cl...> - 2004-06-30 13:06:01
|
On Wed, 2004-06-30 at 16:42, Miklos Szeredi wrote:
> > I tried removing HUP from the fuse handler and in the process discovered
> > that the exit_handler routine was never being called anyway...
>
> It works for me. sending a HUP to a FUSE process nicely kills it,
> like Ctrl-C on a terminal.

OK, my silly mistake. I kept rebuilding fuse with print statements in
the handler but missed the fact that the program was running with the
shared library I'd installed some time back in /usr/local/lib. I can see
that the handler is indeed being called.

Sorry for being so stupid!
From: Miklos S. <msz...@in...> - 2004-06-30 10:34:29
|
> I'm also seeing locking problems, although not resulting in deadlock, so
> not the same problem. I've tested with FUSE 1.2 + hide_node changes
> backported, and with FUSE CVS.
>
> In my case, I believe the problem may be due to race conditions for
> FUSE's internal node state. The result is either fuse aborting from
> get_node, or else passing a request through to the user filesystem with
> an old name after renaming a file.

OK. I've tried to figure these cases out, but obviously haven't fully
succeeded.

> Some paths (I'm looking at the rename path and open paths) go through
> multiple lock / unlock phases, but there doesn't seem to be anything
> preventing the internal node state from being changed between one set of
> locks and another.

I believed that the VFS would prevent these, but looking at it now, I
see that's not the case. Does this simple patch fix the problem for
you?

Thanks,
Miklos

Index: kernel/file.c
===================================================================
RCS file: /cvsroot/avf/fuse/kernel/file.c,v
retrieving revision 1.20.2.1
diff -u -r1.20.2.1 file.c
--- kernel/file.c	22 Jun 2004 09:14:46 -0000	1.20.2.1
+++ kernel/file.c	30 Jun 2004 10:30:29 -0000
@@ -53,7 +53,9 @@
 	in.numargs = 1;
 	in.args[0].size = sizeof(inarg);
 	in.args[0].value = &inarg;
+	down(&inode->i_sem);
 	request_send(fc, &in, &out);
+	up(&inode->i_sem);
 	if(!out.h.error && !(fc->flags & FUSE_KERNEL_CACHE)) {
 #ifdef KERNEL_2_6
 		invalidate_inode_pages(inode->i_mapping);
From: Miklos S. <msz...@in...> - 2004-06-30 11:17:48
|
> > Some paths (I'm looking at the rename path and open paths) go through
> > multiple lock / unlock phases, but there doesn't seem to be anything
> > preventing the internal node state from being changed between one set of
> > locks and another.
>
> I was in the belief that the VFS will prevent these, but looking at
> it, now I see that's not the case. Does this simple patch fix the
> problem for you?

The same has to be done with release, but that's trickier because
release, unlike open, is not synchronous.

Anyway, I fixed it in CVS. Could you try it, please?

Thanks,
Miklos
From: Franco B. <fr...@ro...> - 2004-07-01 08:10:16
|
I had another instance of this problem happen today. This time I noticed
that the main program had disappeared, leaving behind the orphaned worker
threads.

I suspect that it might have been a file open operation that caused the
hang; there was certainly a production job that had hung while opening,
or shortly after opening, a new file.

On Wed, 2004-06-30 at 09:22, Franco Broi wrote:
> I'm still seeing cases where I have up to 8 threads all waiting for
> locks, no CPU is being consumed.
>
> I managed to attach with the debugger to a few and some seemed to be in
> destroy_node, others were in wait for thread restart - or something. I
> can't be more precise as the debugger was acting strangely and wouldn't
> let me quit or reattach.
>
> We are running fuse in a production environment so it's not possible to
> say exactly what was happening at the time of the hang.
>
> Any ideas?
>
> I'm using latest CVS code.
>
>
>
> -------------------------------------------------------
> This SF.Net email sponsored by Black Hat Briefings & Training.
> Attend Black Hat Briefings & Training, Las Vegas July 24-29 -
> digital self defense, top technical experts, no vendor pitches,
> unmatched networking opportunities. Visit www.blackhat.com
> _______________________________________________
> Avf-fuse-dev mailing list
> Avf...@li...
> https://lists.sourceforge.net/lists/listinfo/avf-fuse-dev
From: Miklos S. <msz...@in...> - 2004-07-01 08:20:42
|
> I had another instance of this problem happen today. This time I noticed
> that the main program had disappeared leaving behind the orphaned worker
> threads.
>
> I suspect that it might have been a file open operation that caused the
> hang, there was certainly a production job that had hung while opening,
> or shortly after opening a new file.

OK. Valient discovered a bug in the open/release paths that could
cause an abort. I thought that abort would exit the whole program, but
it's possible that only the main thread exits (???). If this is the
case, then that bug could explain the hang.

The bug is hopefully fixed in CVS, so if you upgrade and these
problems go away, then this was the case. If you still get the hang
after an upgrade, then we'll continue the hunt.

Miklos
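The question of whether abort() in one thread takes down the whole program can be checked with a small experiment. The sketch below is an assumption-laden test, not anything from FUSE: `aborting_worker` and `signal_that_killed_child` are hypothetical names. On a POSIX-conforming threads implementation, abort() raises SIGABRT and the default action terminates the entire process; the threading library in use on the 2004 systems discussed here may have behaved differently, so the result on a modern system says nothing definite about that environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker thread body: abort immediately, as a failed internal
 * consistency check would. */
static void *aborting_worker(void *arg)
{
    (void)arg;
    abort();
    return NULL;  /* not reached */
}

/* Fork a child that starts one aborting worker while its main thread
 * sleeps. Returns the number of the signal that terminated the child,
 * 0 if the child exited normally, or -1 on error. */
int signal_that_killed_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        pthread_t tid;

        if (pthread_create(&tid, NULL, aborting_worker, NULL) != 0)
            _exit(2);
        /* If abort() ended only the worker, the main thread would
         * survive this sleep and the child would exit normally. */
        sleep(5);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A return value equal to SIGABRT means the whole child process was terminated by the worker's abort(); a return value of 0 would mean the main thread outlived it.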
From: Franco B. <fr...@ro...> - 2004-07-01 09:39:16
|
I'm doing a kernel upgrade on the cluster this weekend; I'll let you know
next week if the new fuse module fixes the problem.

Thanks.

PS. How about making the -x option work for any user? This is how I run
it. Also, would it be possible for you to remove your HUP handler so
that I can define my own without having to change the fuse code?

On Thu, 2004-07-01 at 16:20, Miklos Szeredi wrote:
> > I had another instance of this problem happen today. This time I noticed
> > that the main program had disappeared leaving behind the orphaned worker
> > threads.
> >
> > I suspect that it might have been a file open operation that caused the
> > hang, there was certainly a production job that had hung while opening,
> > or shortly after opening a new file.
>
> OK. Valient discovered a bug in the open/release paths, that could
> cause an abort. I thought that abort would exit the whole program, but
> it's possible that only the main thread exits (???). If this is the
> case, then that bug could explain the hang.
>
> The bug is hopefully fixed in CVS, so if you upgrade, and these
> problems go away, then this was the case. If you still get the hang
> after an upgrade, then we'll continue the hunt.
>
> Miklos
From: Valient G. <vg...@po...> - 2004-07-01 08:36:28
|
On Thu, 2004-07-01 at 10:20, Miklos Szeredi wrote:
> > I suspect that it might have been a file open operation that caused the
> > hang, there was certainly a production job that had hung while opening,
> > or shortly after opening a new file.
>
> OK. Valient discovered a bug in the open/release paths, that could
> cause an abort. I thought that abort would exit the whole program, but
> it's possible that only the main thread exits (???). If this is the
> case, then that bug could explain the hang.

Well, I only suspected a problem. I just updated to the latest CVS
and I can no longer reproduce the problem, so it is looking good so
far.

thanks,
Valient
From: Valient G. <vg...@po...> - 2004-07-01 18:26:23
|
On Thu, 2004-07-01 at 10:36, Valient Gough wrote:
> On Thu, 2004-07-01 at 10:20, Miklos Szeredi wrote:
> > > I suspect that it might have been a file open operation that caused the
> > > hang, there was certainly a production job that had hung while opening,
> > > or shortly after opening a new file.
> >
> > OK. Valient discovered a bug in the open/release paths, that could
> > cause an abort. I thought that abort would exit the whole program, but
> > it's possible that only the main thread exits (???). If this is the
> > case, then that bug could explain the hang.
>
> Well, I only suspected a problem. I just updated to the latest CVS
> and I can no longer reproduce the problem, so it is looking good so
> far.

Well, just a minute ago my FS spat out "fuse internal error: inode 488
not found" and died.
That's not much to go on, I realize. I don't have a reproducible case,
because it seems much harder to provoke than before the latest locking
changes. It happened during an email send from Evolution, which seems
to be hard on a filesystem. It was just after sending an email, when
Evolution does at least one rename-over-open operation, which is fairly
new code.

By the way, have you seen fsx-linux (a Google search brings it right
up)? It bills itself as a "File system exerciser", and it causes some
trouble. On my Reiserfs partition it runs fine, but on a FUSE-based
filesystem it causes everything using the filesystem to lock up,
requiring a system reset. So you might find it useful for provoking
undesired behavior in testing.

Valient
From: Miklos S. <msz...@in...> - 2004-07-02 06:05:37
|
> Well, just a minute ago my FS spat out "fuse internal error: inode 488
> not found" and died..
> That's not much to go on, I realize.. I don't have a reproducible case,
> because it seems much harder to provoke than before the latest locking
> changes..

So there's still some race. I'll look into it.

> By the way, have you seen fsx-linux (google search brings it right up).
> It bills itself as "File system exerciser", and it causes some
> troubles. Under my Reiserfs partition it runs fine, but under a FUSE
> based filesystem it causes everything using the filesystem to lock up
> requiring a system reset...

Cool! I'll try it.

Thanks,
Miklos
From: Miklos S. <msz...@in...> - 2004-07-02 15:15:24
|
> Well, just a minute ago my FS spat out "fuse internal error: inode 488
> not found" and died..
> That's not much to go on, I realize.. I don't have a reproducible case,
> because it seems much harder to provoke than before the latest locking
> changes..

A stack trace would be very helpful. Could you enable core file
generation before starting the FS (ulimit -c unlimited)? Then, if this
ever happens again, we can look at the core file to see what happened.

> By the way, have you seen fsx-linux (google search brings it right up).
> It bills itself as "File system exerciser", and it causes some
> troubles. Under my Reiserfs partition it runs fine, but under a FUSE
> based filesystem it causes everything using the filesystem to lock up
> requiring a system reset...

I was able to reproduce this :). And it's independent of the
rename/open races, since fsx-linux is only reading and writing a single
file.

Miklos
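If changing the shell environment before starting the filesystem is inconvenient, the same limit can be raised from inside the program. A minimal sketch, assuming a hypothetical helper named `enable_core_dumps` called early in main(): it raises the soft core-file limit to the hard limit, which matches `ulimit -c unlimited` only when the hard limit is itself unlimited.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the
 * hard limit so that a crash can leave a core file behind.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) == -1)
        return -1;

    /* An unprivileged process may raise its soft limit only as far
     * as the hard limit. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Whether a core file is actually written still depends on the system's core-dump configuration and on the working directory being writable.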
From: Miklos S. <msz...@in...> - 2004-07-02 16:41:22
|
> PS. How about making the -x option work for any user?

OK. I've added a module option (user_allow_other) which, if set to
non-zero, will let non-root users use the -x option. The default is
not to allow this. This is because I think the -x option could be
used to gain access to other users' private information by abusing the
belief that filesystems can be trusted: they are normally mountable by
root only, and have permission checking rules enforced by the kernel,
neither of which holds for a FUSE filesystem.

I may be a bit too paranoid, but it's better to be safe than sorry ;).

> This is how I run it, plus would it be possible for you to remove
> your HUP handler so that I can define my own without having to
> change the fuse code??

Now the signal handlers are only installed if the current handler is
the default one. So if you install a handler for HUP, INT, TERM or
PIPE before calling fuse_main(), that handler will not be overwritten.

Is this OK?

Miklos
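The "install only if the current handler is the default" behaviour can be expressed with sigaction(), which can query the current disposition without changing it. The sketch below is an illustration under assumptions, not the code committed to FUSE: `set_handler_if_default`, `library_exit_handler` and `app_reload_handler` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t exit_requested;
static volatile sig_atomic_t reload_requested;

/* Stand-in for a library handler that asks the main loop to exit. */
void library_exit_handler(int signum)
{
    (void)signum;
    exit_requested = 1;
}

/* Stand-in for an application handler that asks for a config reload. */
void app_reload_handler(int signum)
{
    (void)signum;
    reload_requested = 1;
}

/* Install `handler` for `signum` only if the signal still has its
 * default disposition. Returns 1 if the handler was installed, 0 if
 * an existing disposition was left in place, -1 on error. */
int set_handler_if_default(int signum, void (*handler)(int))
{
    struct sigaction old_sa;
    struct sigaction new_sa;

    /* A NULL new action only queries the current disposition. */
    if (sigaction(signum, NULL, &old_sa) == -1)
        return -1;

    /* A handler (or SIG_IGN) is already set: do not overwrite it. */
    if (old_sa.sa_handler != SIG_DFL)
        return 0;

    new_sa.sa_handler = handler;
    sigemptyset(&new_sa.sa_mask);
    new_sa.sa_flags = 0;
    if (sigaction(signum, &new_sa, NULL) == -1)
        return -1;
    return 1;
}
```

With this arrangement an application that calls `set_handler_if_default(SIGHUP, app_reload_handler)` (or plain sigaction) before the library runs keeps its own handler, because the library's later attempt returns 0. Note that this sketch also leaves an ignored signal (SIG_IGN) untouched, which is a design choice of the sketch rather than something stated in the thread.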
From: Franco B. <fr...@ro...> - 2004-07-03 11:39:39
|
On Sat, 2004-07-03 at 00:41, Miklos Szeredi wrote:
> > PS. How about making the -x option work for any user?
>
> OK. I've added a module option (user_allow_other) which if set to
> non-zero will let non-root users use the -x option. The default is
> not to allow this. This is because I think the -x option could be
> used to gain access to other users' private information by abusing the
> belief that filesystems can be trusted: they are normally mountable by
> root only, and have permission checking rules enforced by the kernel,
> none of which holds for a FUSE filesystem.
>
> I may be a bit too paranoid, but it's better to be safe than sorry ;).
>
> > This is how I run it, plus would it be possible for you to remove
> > your HUP handler so that I can define my own without having to
> > change the fuse code??
>
> Now the signal handlers are only installed if the current is the
> default handler. So if you install a handler for HUP, INT, TERM or
> PIPE before calling fuse_main() that handler will not be overwritten.
>
> Is this OK?

Perfect.

Thanks.

>
> Miklos
From: Franco B. <fr...@ro...> - 2004-07-05 09:00:22
|
Had a few more fuse hangups, did some traces:

#0  0x42028d69 in sigsuspend () from /lib/i686/libc.so.6
#1  0x40040108 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
#2  0x40042480 in __pthread_alt_lock () from /lib/i686/libpthread.so.0
#3  0x4003ef87 in pthread_mutex_lock () from /lib/i686/libpthread.so.0
#4  0x40014db7 in destroy_node (f=0x8053de0, ino=891, version=-4) at fuse.c:294
#5  0x40016fb5 in __fuse_read_cmd (f=0x8053d58) at fuse.c:1478
#6  0x40017445 in do_work (data=0x806f5b0) at fuse_mt.c:44
#7  0x4003e941 in pthread_start_thread () from /lib/i686/libpthread.so.0

#0  0x42028d69 in sigsuspend () from /lib/i686/libc.so.6
#1  0x40040108 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
#2  0x40042480 in __pthread_alt_lock () from /lib/i686/libpthread.so.0
#3  0x4003ef87 in pthread_mutex_lock () from /lib/i686/libpthread.so.0
#4  0x40014db7 in destroy_node (f=0x8053de0, ino=272, version=-4) at fuse.c:294
#5  0x40016fb5 in __fuse_read_cmd (f=0x8053d58) at fuse.c:1478
#6  0x40017445 in do_work (data=0x806f5b0) at fuse_mt.c:44
#7  0x4003e941 in pthread_start_thread () from /lib/i686/libpthread.so.0

This is with all the latest fixes.

On Sat, 2004-07-03 at 19:36, Franco Broi wrote:
> On Sat, 2004-07-03 at 00:41, Miklos Szeredi wrote:
> > > PS. How about making the -x option work for any user?
> >
> > OK. I've added a module option (user_allow_other) which if set to
> > non-zero will let non-root users use the -x option. The default is
> > not to allow this. This is because I think the -x option could be
> > used to gain access to other users' private information by abusing the
> > belief that filesystems can be trusted: they are normally mountable by
> > root only, and have permission checking rules enforced by the kernel,
> > none of which holds for a FUSE filesystem.
> >
> > I may be a bit too paranoid, but it's better to be safe than sorry ;).
> >
> > > This is how I run it, plus would it be possible for you to remove
> > > your HUP handler so that I can define my own without having to
> > > change the fuse code??
> >
> > Now the signal handlers are only installed if the current is the
> > default handler. So if you install a handler for HUP, INT, TERM or
> > PIPE before calling fuse_main() that handler will not be overwritten.
> >
> > Is this OK?
>
> Perfect.
>
> Thanks.
>
> >
> > Miklos
From: Miklos S. <mi...@sz...> - 2004-07-12 16:31:41
|
> Had a few more fuse hangups, did some traces:

Thanks. I found a bug which is probably the one that hit you, so you
might try another upgrade. The bug came in with the open refcounting
code, so the stable version is not affected.

BTW, I'm now testing FUSE with LTP and fsx-linux, and have found and
fixed quite a few bugs. There's still one under 2.6 kernels which I
can't put my finger on, but I'm now getting very close :)

Miklos