From: Franco B. <fr...@ro...> - 2004-08-25 04:57:08
|
Is anybody else seeing fuse abort get_node? I've been trying for weeks to figure out what could be wrong but it's proving very difficult to pin point the problem. The traceback points to the do_forget call in __fuse_read_cmd and I can see that struct fuse *f is fine in __fuse_read_cmd but changes address after the call to do_forget. It's almost impossible to say under what circumstances it fails, it sometimes works fine for many days and other times fails almost immediately after I start the filesystem. Any help, hints or tips on tracing the problem would be much appreciated. Franco. #5 0x40016ff1 in __fuse_read_cmd (f=0x8056ca0) at fuse.c:1483 1483 do_forget(f, in, (struct fuse_forget_in *) inarg); (gdb) p f $3 = (struct fuse *) 0x8056ca0 (gdb) p *f $4 = {flags = 0, fd = 3, op = {getattr = 0x80498c0 <sfs_getattr>, readlink = 0x80499f0 <sfs_readlink>, getdir = 0x8049a6c <sfs_getdir>, mknod = 0x8049b98 <sfs_mknod>, mkdir = 0x8049cd8 <sfs_mkdir>, unlink = 0x8049d8c <sfs_unlink>, rmdir = 0x8049e14 <sfs_rmdir>, symlink = 0x8049ec0 <sfs_symlink>, rename = 0x8049f5c <sfs_rename>, link = 0x804a0e8 <sfs_link>, chmod = 0x804a130 <sfs_chmod>, chown = 0x804a194 <sfs_chown>, truncate = 0x804a1fc <sfs_truncate>, utime = 0x804a268 <sfs_utime>, open = 0x804a2cc <sfs_open>, read = 0x804a5c4 <sfs_read>, write = 0x804a7f8 <sfs_write>, statfs = 0x804a938 <sfs_statfs>, flush = 0, release = 0x804ab34 <sfs_release>, fsync = 0x804acf4 <sfs_fsync>, setxattr = 0, getxattr = 0, listxattr = 0, removexattr = 0}, name_table = 0x8056d68, name_table_size = 14057, ino_table = 0x8064910, ino_table_size = 14057, ctr = 1456, generation = 0, hidectr = 0, lock = { __m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 1091033532, __spinlock = 0}}, numworker = 4, numavail = 2, getcontext = 0x400175bc <mt_getcontext>, context = {uid = 0, gid = 0}, context_key = 0, exited = 0, majorver = 3, minorver = 1} (gdb) p *cmd $14 = {buf = 0x8090a70 "", buflen = 24} (gdb) p *(struct fuse_in_header *) cmd->buf $26 = {unique = 0, opcode = FUSE_FORGET, ino = 1241, uid = 0, gid = 0} (gdb) p inarg $29 = (void *) 0x8090a84 (gdb) down #4 0x40014dc2 in destroy_node (f=0x40047e20, ino=1241, version=0) at fuse.c:289 289 node = get_node(f, ino); |
From: Terje O. <os...@gm...> - 2004-09-07 23:30:01
|
I have just found the mailing list archives and a post by Franco Broi caught my attention. I am having the exact same problem, "Abort in get_node". I am reproducing it time and time again, and every time it"s caused by a FORGET on the same inode. I"m still reading through all the responses to Franco Broi. Hopefully you guys have solved this already. If not, I have alot more data on the subject. Terje On Tue, 7 Sep 2004 11:24:24 -0700, Terje Oseberg <oseberg@gm...> wrote: > I have 2 systems. One SMP and the other not. I am running FUSE 1.3. > > On the SMP system, FUSE is working great. I can create over 500 FUSE > mounts and beat on them for days without failure. > > On the non SMP version, with the FUSE library running in multithreaded > mode, I"m getting two FORGET messages for the same inode. The second > FORGET message causes the FUSE library to core dump with an abort() > call in get_node(). > > Here"s a stack dump: > #1 0xb74c4a09 in raise () from /lib/tls/libc.so.6 > #2 0xb74c6235 in abort () from /lib/tls/libc.so.6 > #3 0x0804a2ad in get_node (f=0x804eba8, ino=2286) at fuse.c:93 > #4 0x0804aa16 in destroy_node (f=0x804eba8, ino=2286, version=23036) > at fuse.c:293 > #5 0x0804b0b3 in do_forget (f=0x804eba8, in=0x80708d0, arg=0x80708e4) > at fuse.c:476 > #6 0x0804c285 in __fuse_read_cmd (f=0x804eba8) at fuse.c:1065 > #7 0x0804c647 in do_work (data=0x806a3d8) at fuse_mt.c:40 > #8 0xb75d8dac in start_thread () from /lib/tls/libpthread.so.0 > #9 0xb75779ea in clone () from /lib/tls/libc.so.6 > > I"ve added some printf()"s to the fuse library and discovered that > abort() is being called because of a FORGET message from the kernel > through the /proc/fs/fuse/dev file on an inode that has previously > received a FORGET message. > > I"m currently a bit new to filesystems and don"t fully know what the > FORGET messages are for. I assume they"re for the kernel to control > the caching of inodes in the filesystem code. > > When I use the library in single threaded mode, the problem goes away. > I don"t understand because it"s looking like a problem with the kernel > making bogus requests of the FUSE library, but running the FUSE > library in single threaded mode shouldn"t effect the kernel messages > being sent. > > On the single CPU system, this problem occurs somewhere between 5 and > 20 mounts. Below 5 mounts, there seems to be no problems. > > Has anyone seen any similar behavior? Does anyone have any explanation > as to why this might be happening? Can anyone give me any advice on > how I should proceed to debug this? > > I will be spending the day trying to figure out the cause of this problem. |
From: Franco B. <fr...@ro...> - 2004-08-25 06:50:35
|
Just done a little more debugging on the core file and discovered that the entry in the inode table for 1241 is NULL, so doesn't this mean that the node has already been destroyed? (gdb) p f->ino_table[1241] $54 = (struct node *) 0x0 (gdb) p *f->ino_table[1242] $57 = {name_next = 0x0, ino_next = 0x0, ino = 1242, generation = 0, parent = 55, name = 0x8072de8 "XG91-246_2.shot_db", mode = 32768, rdev = 0, version = 98757, open_count = 0, is_hidden = 0} (gdb) p *f->ino_table[1240] $58 = {name_next = 0x0, ino_next = 0x0, ino = 1240, generation = 0, parent = 55, name = 0x41103c28 "XG91-252.rcvr_db", mode = 32768, rdev = 0, version = 98755, open_count = 0, is_hidden = 0} On Wed, 2004-08-25 at 12:56, Franco Broi wrote: > Is anybody else seeing fuse abort get_node? I've been trying for weeks > to figure out what could be wrong but it's proving very difficult to pin > point the problem. > > The traceback points to the do_forget call in __fuse_read_cmd and I can > see that struct fuse *f is fine in __fuse_read_cmd but changes address > after the call to do_forget. > > It's almost impossible to say under what circumstances it fails, it > sometimes works fine for many days and other times fails almost > immediately after I start the filesystem. > > Any help, hints or tips on tracing the problem would be much > appreciated. > > Franco. > > #5 0x40016ff1 in __fuse_read_cmd (f=0x8056ca0) at fuse.c:1483 > 1483 do_forget(f, in, (struct fuse_forget_in *) inarg); > (gdb) p f > $3 = (struct fuse *) 0x8056ca0 > (gdb) p *f > $4 = {flags = 0, fd = 3, op = {getattr = 0x80498c0 <sfs_getattr>, > readlink = 0x80499f0 <sfs_readlink>, getdir = 0x8049a6c > <sfs_getdir>, > mknod = 0x8049b98 <sfs_mknod>, mkdir = 0x8049cd8 <sfs_mkdir>, > unlink = 0x8049d8c <sfs_unlink>, rmdir = 0x8049e14 <sfs_rmdir>, > symlink = 0x8049ec0 <sfs_symlink>, rename = 0x8049f5c <sfs_rename>, > link = 0x804a0e8 <sfs_link>, chmod = 0x804a130 <sfs_chmod>, > chown = 0x804a194 <sfs_chown>, truncate = 0x804a1fc <sfs_truncate>, > utime = 0x804a268 <sfs_utime>, open = 0x804a2cc <sfs_open>, > read = 0x804a5c4 <sfs_read>, write = 0x804a7f8 <sfs_write>, > statfs = 0x804a938 <sfs_statfs>, flush = 0, > release = 0x804ab34 <sfs_release>, fsync = 0x804acf4 <sfs_fsync>, > setxattr = 0, getxattr = 0, listxattr = 0, removexattr = 0}, > name_table = 0x8056d68, name_table_size = 14057, ino_table = > 0x8064910, > ino_table_size = 14057, ctr = 1456, generation = 0, hidectr = 0, lock > = { > __m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, > __m_lock = {__status = 1091033532, __spinlock = 0}}, numworker = 4, > numavail = 2, getcontext = 0x400175bc <mt_getcontext>, context = {uid > = 0, > gid = 0}, context_key = 0, exited = 0, majorver = 3, minorver = 1} > > > (gdb) p *cmd > $14 = {buf = 0x8090a70 "", buflen = 24} > > (gdb) p *(struct fuse_in_header *) cmd->buf > $26 = {unique = 0, opcode = FUSE_FORGET, ino = 1241, uid = 0, gid = 0} > > (gdb) p inarg > $29 = (void *) 0x8090a84 > > > (gdb) down > #4 0x40014dc2 in destroy_node (f=0x40047e20, ino=1241, version=0) > at fuse.c:289 > 289 node = get_node(f, ino); > > > |
From: Franco B. <fr...@ro...> - 2004-08-31 09:46:51
|
Is it significant that the value of unique is always 0 for the do_forget which causes the abort in get_node? (He said clutching at straws!) I've updated to the latest CVS code and everything seems fine except for these occasional aborts. They always appear to be the result of the hash table entry being zero for the offending inode number, so either it was never allocated, or it's being cleared twice - why? Or should I say how? What could I be possibly doing to could cause this to happen? I've run the code with Electric-fence and it passed (valgrind wouldn't work for me on RH 9), although I've never managed to get it to fail in any of my tests, it only ever fails when I unleash 30 odd users spread across 24 machines each running 3 fuse filesystems. I now doubt that memory corruption is the problem, the funny value for the fuse struct f in destroy_node is probably just a red herring, as Miklos pointed out it's probably a result of optimisation which also removes the call to do_forget from the traceback. So what are the possible scenarios? do_forget is being called for an inode which has already been forgotten - is this possible? The entry in the hash table is being zeroed by some wayward code, but leaving the entries either side untouched - seems unlikely. Something called destroy_node when it shouldn't have. I should forget about pretending to be a programmer and go and grow sunflowers. On Wed, 2004-08-25 at 14:49, Franco Broi wrote: > Just done a little more debugging on the core file and discovered that > the entry in the inode table for 1241 is NULL, so doesn't this mean that > the node has already been destroyed? > > > (gdb) p f->ino_table[1241] > $54 = (struct node *) 0x0 > > (gdb) p *f->ino_table[1242] > $57 = {name_next = 0x0, ino_next = 0x0, ino = 1242, generation = 0, > parent = 55, name = 0x8072de8 "XG91-246_2.shot_db", mode = 32768, rdev > = 0, > version = 98757, open_count = 0, is_hidden = 0} > > (gdb) p *f->ino_table[1240] > $58 = {name_next = 0x0, ino_next = 0x0, ino = 1240, generation = 0, > parent = 55, name = 0x41103c28 "XG91-252.rcvr_db", mode = 32768, rdev > = 0, > version = 98755, open_count = 0, is_hidden = 0} > > > On Wed, 2004-08-25 at 12:56, Franco Broi wrote: > > Is anybody else seeing fuse abort get_node? I've been trying for weeks > > to figure out what could be wrong but it's proving very difficult to pin > > point the problem. > > > > The traceback points to the do_forget call in __fuse_read_cmd and I can > > see that struct fuse *f is fine in __fuse_read_cmd but changes address > > after the call to do_forget. > > > > It's almost impossible to say under what circumstances it fails, it > > sometimes works fine for many days and other times fails almost > > immediately after I start the filesystem. > > > > Any help, hints or tips on tracing the problem would be much > > appreciated. > > > > Franco. > > > > #5 0x40016ff1 in __fuse_read_cmd (f=0x8056ca0) at fuse.c:1483 > > 1483 do_forget(f, in, (struct fuse_forget_in *) inarg); > > (gdb) p f > > $3 = (struct fuse *) 0x8056ca0 > > (gdb) p *f > > $4 = {flags = 0, fd = 3, op = {getattr = 0x80498c0 <sfs_getattr>, > > readlink = 0x80499f0 <sfs_readlink>, getdir = 0x8049a6c > > <sfs_getdir>, > > mknod = 0x8049b98 <sfs_mknod>, mkdir = 0x8049cd8 <sfs_mkdir>, > > unlink = 0x8049d8c <sfs_unlink>, rmdir = 0x8049e14 <sfs_rmdir>, > > symlink = 0x8049ec0 <sfs_symlink>, rename = 0x8049f5c <sfs_rename>, > > link = 0x804a0e8 <sfs_link>, chmod = 0x804a130 <sfs_chmod>, > > chown = 0x804a194 <sfs_chown>, truncate = 0x804a1fc <sfs_truncate>, > > utime = 0x804a268 <sfs_utime>, open = 0x804a2cc <sfs_open>, > > read = 0x804a5c4 <sfs_read>, write = 0x804a7f8 <sfs_write>, > > statfs = 0x804a938 <sfs_statfs>, flush = 0, > > release = 0x804ab34 <sfs_release>, fsync = 0x804acf4 <sfs_fsync>, > > setxattr = 0, getxattr = 0, listxattr = 0, removexattr = 0}, > > name_table = 0x8056d68, name_table_size = 14057, ino_table = > > 0x8064910, > > ino_table_size = 14057, ctr = 1456, generation = 0, hidectr = 0, lock > > = { > > __m_reserved = 0, __m_count = 0, __m_owner = 0x0, __m_kind = 0, > > __m_lock = {__status = 1091033532, __spinlock = 0}}, numworker = 4, > > numavail = 2, getcontext = 0x400175bc <mt_getcontext>, context = {uid > > = 0, > > gid = 0}, context_key = 0, exited = 0, majorver = 3, minorver = 1} > > > > > > (gdb) p *cmd > > $14 = {buf = 0x8090a70 "", buflen = 24} > > > > (gdb) p *(struct fuse_in_header *) cmd->buf > > $26 = {unique = 0, opcode = FUSE_FORGET, ino = 1241, uid = 0, gid = 0} > > > > (gdb) p inarg > > $29 = (void *) 0x8090a84 > > > > > > (gdb) down > > #4 0x40014dc2 in destroy_node (f=0x40047e20, ino=1241, version=0) > > at fuse.c:289 > > 289 node = get_node(f, ino); > > > > > > |
From: Miklos S. <mi...@sz...> - 2004-08-25 08:56:34
|
> Is anybody else seeing fuse abort get_node? I've been trying for weeks > to figure out what could be wrong but it's proving very difficult to pin > point the problem. > > The traceback points to the do_forget call in __fuse_read_cmd and I can > see that struct fuse *f is fine in __fuse_read_cmd but changes address > after the call to do_forget. It's possible that you are seeing the address change because the library is compiled with optimization. You can turn that off by hacking the Makefiles or configure.in. That would help you debug the problem. Also you could try running the program under valgrind, a very useful debugging tool. > Just done a little more debugging on the core file and discovered that > the entry in the inode table for 1241 is NULL, so doesn't this mean that > the node has already been destroyed? Yes it does. Miklos |
From: Franco B. <fr...@cl...> - 2004-08-25 12:21:22
|
On Wed, 2004-08-25 at 16:56, Miklos Szeredi wrote: > > Just done a little more debugging on the core file and discovered that > > the entry in the inode table for 1241 is NULL, so doesn't this mean that > > the node has already been destroyed? > > Yes it does. > So how can this happen? And more importantly who's fault is it? valgrind looks very interesting I'll definitely give it a go. |
From: Miklos S. <mi...@sz...> - 2004-08-25 12:30:44
|
> So how can this happen? And more importantly who's fault is it? The node in removed from the hash chain in unhash_node(). This is called from destroy_node(), which in turn is called from do_forget(), and from lookup/create operations which were interrupted. So the only "normal" place where an inode is unhashed is the FORGET method, and it sounds quite unlikely that there would be two FORGETs for the same inode. So my guess is that this is a side effect of some other error, and valgrind may shed some light on this. Miklos |
From: Franco B. <fr...@cl...> - 2004-08-25 14:02:30
|
On Wed, 2004-08-25 at 20:30, Miklos Szeredi wrote: > > So how can this happen? And more importantly who's fault is it? > > The node in removed from the hash chain in unhash_node(). This is > called from destroy_node(), which in turn is called from do_forget(), > and from lookup/create operations which were interrupted. So the only > "normal" place where an inode is unhashed is the FORGET method, and it > sounds quite unlikely that there would be two FORGETs for the same > inode. So my guess is that this is a side effect of some other error, > and valgrind may shed some light on this. valgrind doesn't appear to work when fuse is running multi-threaded, just locks up. Didn't turn up anything unexpected when running single threaded. > > Miklos > > > ------------------------------------------------------- > SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media > 100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33 > Save 50% off Retail on Ink & Toner - Free Shipping and Free Gift. > http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285 > _______________________________________________ > Avf-fuse-dev mailing list > Avf...@li... > https://lists.sourceforge.net/lists/listinfo/avf-fuse-dev |
From: Miklos S. <mi...@sz...> - 2004-08-25 16:47:29
|
> valgrind doesn't appear to work when fuse is running multi-threaded, > just locks up. Didn't turn up anything unexpected when running single > threaded. Version 2.1.2 of valgrind (debian package) works fine for me. Miklos |
From: Valient G. <vg...@po...> - 2004-08-25 17:07:48
|
On Wed, 2004-08-25 at 18:47, Miklos Szeredi wrote: > > valgrind doesn't appear to work when fuse is running multi-threaded, > > just locks up. Didn't turn up anything unexpected when running single > > threaded. > > Version 2.1.2 of valgrind (debian package) works fine for me. I use version 2.1.2 also, and it works fine for me debugging multi-threaded fuse based filesystems (on a SuSE 9.1 based box). However, in my experience, a program under valgrind is much less likely to trigger a race condition (probably because the program runs slower)... The 'helgrind' tool is supposed to check for potential race conditions, but I haven't had much luck with that, so I just use the memcheck or addrcheck tools. Valient |
From: Franco B. <fr...@cl...> - 2004-08-25 23:55:47
|
I was running 2.0.0, so I tried 2.1.2, doesn't run at all! ==10468== valgrind: vg_libpthread.c:2643 (__fork): Assertion `__fork_ptr != ((void *)0)' failed. ==10468== Please report this bug at: valgrind.kde.org On Thu, 2004-08-26 at 01:07, Valient Gough wrote: > On Wed, 2004-08-25 at 18:47, Miklos Szeredi wrote: > > > valgrind doesn't appear to work when fuse is running multi-threaded, > > > just locks up. Didn't turn up anything unexpected when running single > > > threaded. > > > > Version 2.1.2 of valgrind (debian package) works fine for me. > > I use version 2.1.2 also, and it works fine for me debugging > multi-threaded fuse based filesystems (on a SuSE 9.1 based box). > > However, in my experience, a program under valgrind is much less > likely to trigger a race condition (probably because the program runs > slower)... The 'helgrind' tool is supposed to check for potential > race conditions, but I haven't had much luck with that, so I just use > the memcheck or addrcheck tools. > > Valient > |
From: Miklos S. <mi...@sz...> - 2004-08-31 11:57:16
|
> Is it significant that the value of unique is always 0 for the > do_forget which causes the abort in get_node? (He said clutching at > straws!) It's zero because there is no reply message to a FORGET. > I've updated to the latest CVS code and everything seems fine except for > these occasional aborts. They always appear to be the result of the hash > table entry being zero for the offending inode number, so either it was > never allocated, or it's being cleared twice - why? Or should I say how? Did you try making a trace with '-d'. That would help checking if it was caused by a double FORGET. > What could I be possibly doing to could cause this to happen? I have no idea. It could certainly be a FUSE bug, that is only triggered in your case. > do_forget is being called for an inode which has already been forgotten > - is this possible? Not in theory. But tracing the operations should reveal if this is the case. > The entry in the hash table is being zeroed by some wayward code, but > leaving the entries either side untouched - seems unlikely. Hmm. > Something called destroy_node when it shouldn't have. Destroy node is only called from forget, and from failed creation operations. So again a trace should show if this is the case. > I should forget about pretending to be a programmer and go and grow > sunflowers. Oh, how well I know the feeling! But in the end a lot can be learned from these experiences. Miklos |
From: Franco B. <fr...@cl...> - 2004-08-31 12:20:46
|
> > Did you try making a trace with '-d'. That would help checking if it > was caused by a double FORGET. I've tried running with -d but the logfile is enormous, don't forget that it can take days of heavy use before the thing fails. It also really badly effects performance and since I've only ever seen failures during production it's not really practical. I'm tempted to make the error non fatal - what do you think would happen? It might be worth it just to see how it fails! |
From: Miklos S. <mi...@sz...> - 2004-08-31 12:29:54
|
> I've tried running with -d but the logfile is enormous, don't forget > that it can take days of heavy use before the thing fails. It also > really badly effects performance and since I've only ever seen failures > during production it's not really practical. You could still do it as a last resort. And after a failure you only need to look at the end of the trace, since that's where the interesting things should happen. BTW have you tried any automatic filesystem tests? LTP (http://ltp.sourceforge.net/) has good filesystem tests that stress the FS to it's limits. It helped me to resolve quite a few bugs, so it could also help you make this problem more reproducible. > I'm tempted to make the error non fatal - what do you think would > happen? It might be worth it just to see how it fails! You could do that, but it's not a solution obviously. Miklos |
From: Franco B. <fr...@cl...> - 2004-08-31 14:26:22
|
On Tue, 2004-08-31 at 20:29, Miklos Szeredi wrote: > > I've tried running with -d but the logfile is enormous, don't forget > > that it can take days of heavy use before the thing fails. It also > > really badly effects performance and since I've only ever seen failures > > during production it's not really practical. > > You could still do it as a last resort. And after a failure you only > need to look at the end of the trace, since that's where the > interesting things should happen. Might look into redirecting the printfs into a ring buffer in memory and only save the last 100 messages... > > BTW have you tried any automatic filesystem tests? LTP > (http://ltp.sourceforge.net/) has good filesystem tests that stress > the FS to it's limits. It helped me to resolve quite a few bugs, so > it could also help you make this problem more reproducible. I tried running a couple of copies of the ltp fs tests on my machine at home, only left it running for an hour but no sign of any problems. > > > I'm tempted to make the error non fatal - what do you think would > > happen? It might be worth it just to see how it fails! > > You could do that, but it's not a solution obviously. > > Miklos > > > ------------------------------------------------------- > This SF.Net email is sponsored by BEA Weblogic Workshop > FREE Java Enterprise J2EE developer tools! > Get your free copy of BEA WebLogic Workshop 8.1 today. > http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click > _______________________________________________ > Avf-fuse-dev mailing list > Avf...@li... > https://lists.sourceforge.net/lists/listinfo/avf-fuse-dev |
From: Miklos S. <mi...@sz...> - 2004-08-31 14:39:02
|
> Might look into redirecting the printfs into a ring buffer in memory and > only save the last 100 messages... The whole history might be interesting though. E.g. if you want to see whether a FORGET with the same inode number ever happened. So if you have a couple gigabytes of free disk space, I think it's worth it to save the whole log. I've been saving logs of LTP sessions and haven't found it such a big performance problem. Miklos |