From: Dom L. <dom...@gm...> - 2007-05-24 05:56:37
|
For a long time, I've been struggling to debug an application that uses FUSE. The problem is that GDB consistently crashes, just freezing, and providing no information about what is going on. Outside of GDB the application runs just fine (99.9% of the time, at least!)

I first encountered this problem a long time ago (when FUSE 2.5.x was the current version), and found that if I used an older version of FUSE (a 2.4.x version, I think), GDB worked fine. Obviously, I should have investigated and reported the problem back then, but this workaround was good enough to allow me to make progress, so, regrettably, I didn't pursue the problem further.

Since FUSE 2.6.x came out, we modified our application to use the newer 2.6 interface. The problem now, of course, is that I can't simply revert to an old version of FUSE for debugging. :-(

I'm currently using a vanilla 2.6.20 kernel on Fedora Core 6, FUSE 2.6.5, and GDB 6.5.13, but I don't think this is very relevant. All the FUSE 2.6.x versions behave the same way.

When I first encountered the GDB-FUSE interaction (just over a year ago, I think), I was using an older kernel, older GDB, older FUSE, older Fedora release. At the time, I found, serendipitously, that I could debug the application on a co-worker's Ubuntu machine. It occurred to me that GDB might work on that machine because of hardware differences, different kernel version/options, different system libraries etc., and I eventually found that the only relevant difference was an older version of FUSE (2.4.?, not sure exactly which revision).

Anyway, the problem of not being able to use GDB has become so painful that I've finally resolved to do something about the problem! I know that I need to try to understand exactly which FUSE revision broke GDB, try to trim code from our application to understand what parts are necessary to repeat the crash, investigate if the bug manifests itself on other platforms etc. That is very time-consuming, so I've been avoiding it. :-) I figured that by posting a message to this mailing-list, I could force myself to get started and to go through the process properly!

But before I embark, does anyone have any advice? For example, are there any known GDB-FUSE interactions? (I've searched the mailing-list, and FAQ, and can't see any relevant messages.) Any suggestions about where to start? I'd *really* appreciate some help on this!

Thanks in advance,
-- Dominick

BTW, I have tried to see if I could debug GDB with GDB, but (apart from being highly confusing) this doesn't work either. The whole thing just freezes in exactly the same way, and has to be killed manually.
|
From: Miklos S. <mi...@sz...> - 2007-05-25 14:10:47
|
> For a long time, I've been struggling to debug an application that
> uses FUSE. The problem is that GDB consistently crashes, just
> freezing, and providing no information about what is going on.
> Outside of GDB the application runs just fine (99.9% of the time, at
> least!)
>
> I first encountered this problem a long time ago (when FUSE 2.5.x
> was the current version), and found that if I used an older version
> of FUSE (a 2.4.x version, I think), GDB worked fine. Obviously, I
> should have investigated and reported the problem back then, but
> this workaround was good enough to allow me to make progress, so,
> regrettably, I didn't pursue the problem further.
>
> Since FUSE 2.6.x came out, we modified our application to use the
> newer 2.6 interface. The problem now, of course, is that I can't
> simply revert to an old version of FUSE for debugging. :-(

Can you try fuse-2.7-rc1? 2.5 and 2.6 did some ugly things with signals on exit, and that may have been the cause of the bad interaction. OTOH gdb _should_ handle that sort of thing.

A fuse filesystem really doesn't do anything special, it uses perfectly ordinary read and write calls to communicate with the kernel.

Miklos
|
From: Dom L. <dom...@gm...> - 2007-05-31 09:03:56
|
On 5/25/07, Miklos Szeredi <mi...@sz...> wrote:
> Can you try fuse-2.7-rc1? 2.5 and 2.6 did some ugly things with
> signals on exit, and that may have been the cause of the bad
> interaction. OTOH gdb _should_ handle that sort of thing.

Okay, I tried fuse-2.7-rc1 (including kernel module), and it doesn't make any difference.

I've taken my application, and trimmed it down to the point where it does nothing but try to stat() a non-existent file on a FUSE-based filesystem. Similarly, I've taken the filesystem, and eviscerated it so that it doesn't really do a damn thing!

What I'm left with is a small C++ program that runs fine on its own, but crashes when run under GDB. Do you (or anyone out there) think you can compile the code and confirm that you can reproduce the behavior?

Here's a link to a tar archive: (7 kB)
http://www.yousendit.com/download/UVJoOU1RMm1sUjgwTVE9PQ

Thanks,
-- Dom

BTW, my slightly foggy recollection is that I first started encountering trouble with GDB when FUSE had transitioned to version 2.5, and that I was able to continue debugging using an old version of FUSE. I was just experimenting with old versions to see if I could pinpoint exactly where the change occurred. I went all the way back to 2.2.1, but the GDB problem persists. However, this was only with the userspace parts of FUSE, as the old kernel modules won't compile with a kernel newer than 2.6.17 (due to the readv/writev -> aio_read/write change). So I need to build a 2.6.17 kernel, and test again under that.
|
From: Roger W. <ro...@fi...> - 2007-05-31 11:37:51
|
Dom Layfield wrote:
> On 5/25/07, Miklos Szeredi <mi...@sz...> wrote:
>
>> Can you try fuse-2.7-rc1? 2.5 and 2.6 did some ugly things with
>> signals on exit, and that may have been the cause of the bad
>> interaction. OTOH gdb _should_ handle that sort of thing.
>>
>
> Okay, I tried fuse-2.7-rc1 (including kernel module), and it doesn't
> make any difference.
>
> I've taken my application, and trimmed it down to the point where it
> does nothing but try to stat() a non-existent file on a FUSE-based
> filesystem. Similarly, I've taken the filesystem, and eviscerated it
> so that it doesn't really do a damn thing!
>
> What I'm left with is a small C++ program that runs fine on its own,
> but crashes when run under GDB. Do you (or anyone out there) think
> you can compile the code and confirm that you can reproduce the
> behavior?
>

I can reproduce this on a stock (ish) CentOS 4.3 build. Here's the relevant sysrq output:

--------------------------
gdb S 00000000 1696 12207 12074 12208 (NOTLB)
c9529f28 00200082 00000000 00000000 00000000 00000000 00000000 00000000
c9529f30 c9529f30 c9529f30 c1407de0 00000000 00005c7e d2152f9e 0000341c
c0320a80 d4c540b0 d4c5421c c9529f70 00000000 d4c54154 fffffe00 d4c540b0
Call Trace:
[<c0125721>] do_wait+0x26e/0x449
[<c011e71b>] default_wake_function+0x0/0xc
[<c011e71b>] default_wake_function+0x0/0xc
[<c012598f>] sys_wait4+0x27/0x2a
[<c01259a5>] sys_waitpid+0x13/0x17
[<c02d2797>] syscall_call+0x7/0xb

gdb_test S 00000009 1696 12208 12207 12214 (NOTLB)
dc18ecf4 00200082 c01f943c 00000009 b0b548e6 d4c55130 0b548e6b 00000001
d4c55130 00000000 c1408740 c1407de0 00000000 000078b9 d2117048 0000341c
d4c55130 ca9ae030 ca9ae19c 00000000 c01201bc dbbcf380 dcdc0474 dc18ed18
Call Trace:
[<c01f943c>] add_entropy_words+0x53/0x145
[<c01201bc>] prepare_to_wait+0x12/0x4c
[<e0cf5194>] fuse_get_req+0x90/0xfb [fuse]
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<c0120291>] autoremove_wake_function+0x0/0x2d
[<e0cf6bf8>] fuse_lookup+0x42/0x1fd [fuse]
[<c011e7a1>] __wake_up+0x29/0x3c
[<c011e75d>] __wake_up_common+0x36/0x51
[<c02d0daa>] __cond_resched+0x14/0x39
[<c016eec6>] d_alloc+0x175/0x17d
[<c0165a29>] real_lookup+0x6e/0xd2
[<c0165c46>] do_lookup+0x56/0x8f
[<c0166487>] __link_path_walk+0x808/0xbb5
[<c0166877>] link_path_walk+0x43/0xbe
[<c011cbf2>] recalc_task_prio+0x128/0x133
[<c0166c0c>] path_lookup+0x14b/0x17f
[<c0166d54>] __user_walk+0x21/0x51
[<c0162009>] vfs_stat+0x14/0x3a
[<c011cbf2>] recalc_task_prio+0x128/0x133
[<c02d06c9>] schedule+0x83d/0x8d3
[<c0162612>] sys_stat64+0xf/0x23
[<c0171e29>] dnotify_parent+0x1b/0x6e
[<c02d2797>] syscall_call+0x7/0xb

gdb_test t C02D06C9 3332 12213 12207 12214 (NOTLB)
c44aaec4 00200082 c9529f9c c02d06c9 c44aaed4 d4c540b0 00000001 b7535df4
d4c540b0 00000000 c1408740 c1407de0 00000000 00002b10 d214d320 0000341c
d4c540b0 d4c55130 d4c5529c 00000000 c44aa000 c44aa000 00000005 00000005
Call Trace:
[<c02d06c9>] schedule+0x83d/0x8d3
[<c012c3fe>] ptrace_stop+0xa0/0xee
[<c012c82c>] get_signal_to_deliver+0x142/0x346
[<c0105bb8>] do_signal+0x55/0xd9
[<c011d1a3>] try_to_wake_up+0x281/0x28c
[<c012b263>] signal_wake_up+0x1e/0x2c
[<c012b707>] specific_send_sig_info+0x9f/0xa6
[<c012b786>] force_sig_info+0x78/0x7f
[<c0106d98>] do_int3+0x7f/0xcf
[<c0105c64>] do_notify_resume+0x28/0x38
[<c02d27e2>] work_notifysig+0x13/0x15

gdb_test t 00000000 3332 12214 12207 12213 12208 (NOTLB)
c90d3ec4 00200082 00000000 00000000 00000000 d4c540b0 00000000 00000000
d4c540b0 00000000 c1408740 c1407de0 00000000 00001709 d2144d53 0000341c
d4c540b0 c386c730 c386c89c 00000000 c90d3000 c90d3000 00000013 00000013
Call Trace:
[<c012c3fe>] ptrace_stop+0xa0/0xee
[<c012c82c>] get_signal_to_deliver+0x142/0x346
[<c0105bb8>] do_signal+0x55/0xd9
[<c0105c64>] do_notify_resume+0x28/0x38
[<c02d27e2>] work_notifysig+0x13/0x15
-----------------------

Roger
|
From: Miklos S. <mi...@sz...> - 2007-05-31 19:22:12
|
> On 5/25/07, Miklos Szeredi <mi...@sz...> wrote:
> > Can you try fuse-2.7-rc1? 2.5 and 2.6 did some ugly things with
> > signals on exit, and that may have been the cause of the bad
> > interaction. OTOH gdb _should_ handle that sort of thing.
>
> Okay, I tried fuse-2.7-rc1 (including kernel module), and it doesn't
> make any difference.
>
> I've taken my application, and trimmed it down to the point where it
> does nothing but try to stat() a non-existent file on a FUSE-based
> filesystem. Similarly, I've taken the filesystem, and eviscerated it
> so that it doesn't really do a damn thing!
>
> What I'm left with is a small C++ program that runs fine on its own,
> but crashes when run under GDB. Do you (or anyone out there) think
> you can compile the code and confirm that you can reproduce the
> behavior?
>
> Here's a link to a tar archive: (7 kB)
> http://www.yousendit.com/download/UVJoOU1RMm1sUjgwTVE9PQ

Aha, this helps. The thing that trips gdb is that it's a _single_ process which is:

  - providing the filesystem AND
  - accessing the same filesystem

And that _can_ cause weird behavior like this. The reason is that gdb wants to send a SIGSTOP to the process for some reason, and the signal will stop the filesystem provider thread, and then proceed to wait for the filesystem accessor thread to finish the stat() syscall. But that syscall will never finish obviously.

Why do you need these two things to be in the same process?

Miklos
|
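The dependency described above can be modelled without FUSE. What follows is only a sketch under stated assumptions: two pipes stand in for the kernel's channel to the filesystem, `provider_main()` and `access_via_provider()` are hypothetical names, and a stopped provider thread is approximated by simply not starting it. It is not the reproducer from the tar archive.

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for the kernel/FUSE channel: one pipe each way. */
struct channel {
    int req[2];   /* accessor -> provider */
    int rep[2];   /* provider -> accessor */
};

/* "Provider" thread: answers a single request, then exits. */
void *provider_main(void *arg)
{
    struct channel *ch = arg;
    char byte;

    if (read(ch->req[0], &byte, 1) == 1) {
        ssize_t n = write(ch->rep[1], &byte, 1);
        (void)n;   /* nothing useful to do if the reply cannot be sent */
    }
    return NULL;
}

/*
 * "Accessor": sends one request and waits up to timeout_ms for the reply.
 * provider_running == 0 models a provider thread that has been stopped.
 * Returns 1 if the reply arrived, 0 if the wait timed out, -1 on error.
 */
int access_via_provider(int provider_running, int timeout_ms)
{
    struct channel ch;
    pthread_t tid;
    char byte = 'x';
    int answered = -1;

    if (pipe(ch.req) != 0)
        return -1;
    if (pipe(ch.rep) != 0) {
        close(ch.req[0]);
        close(ch.req[1]);
        return -1;
    }

    if (provider_running &&
        pthread_create(&tid, NULL, provider_main, &ch) != 0) {
        provider_running = 0;              /* no thread to join below */
    } else if (write(ch.req[1], &byte, 1) == 1) {
        struct pollfd pfd = { .fd = ch.rep[0], .events = POLLIN };
        int n = poll(&pfd, 1, timeout_ms);
        if (n >= 0)
            answered = (n > 0);
    }

    close(ch.req[1]);                      /* EOF unblocks an idle provider */
    if (provider_running)
        pthread_join(tid, NULL);
    close(ch.req[0]);
    close(ch.rep[0]);
    close(ch.rep[1]);
    return answered;
}
```

Called from a `main()` and built with `-pthread`, `access_via_provider(1, 1000)` gets its reply, while `access_via_provider(0, 100)` only returns because of the timeout. A real stat() on a FUSE mount has no such timeout, which is why the accessor thread would wait indefinitely once the provider thread is stopped.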
From: Dom L. <dom...@gm...> - 2007-05-31 21:53:26
|
On 5/31/07, Miklos Szeredi <mi...@sz...> wrote:
> Aha, this helps. The thing that trips gdb is that it's a _single_
> process which is:
>
> - providing the filesystem AND
> - accessing the same filesystem
>
> And that _can_ cause weird behavior like this. The reason is that gdb
> wants to send a SIGSTOP to the process for some reason, and the signal
> will stop the filesystem provider thread, and then proceed to wait for
> the filesystem accessor thread to finish the stat() syscall. But that
> syscall will never finish obviously.
>
> Why do you need these two things to be in the same process?

Good question. I'm not sure I have a valid answer, but let me explain why I did things that way. Firstly, conceptually, the filesystem and the program accessing the filesystem are one thing -- neither is useful without the other -- and I want to start and stop them together. Secondly, I want to be able to start *hundreds* of filesystems + accessor program pairs (simulating a whole network of these things) and need to maintain the association between each filesystem and its accessor program/thread.

I'm a little confused by your explanation. In the sample code, I create a second thread that runs the FUSE filesystem (fuse_loop_mt()), so the filesystem is being "provided" by a different thread. And as far as I'm aware, Linux threads are processes. So they're not the same process. (Although they are both on the same branch of the process tree.)

But I *think* I understand what you're saying. Let me restate the situation, and you can tell me if I'm in the right ballpark.

1. GDB can't interrupt a thread halfway through a system-call. Rather, it has to wait for the system-call to complete.
2. When GDB stops a program, it stops all threads belonging to that program (the entire process tree).
3. So if GDB tries to stop my test program when the accessor thread is calling stat(), it stops the thread that handles the filesystem, and waits for the system-call to complete, but the call never returns, since the thread that would handle it has been stopped by GDB. Hence the entire thing freezes.

Is that about right? If so, then it seems I'm really in trouble! My network simulator would require a complete redesign. :-(

BTW, can you think of any reason why this behavior would have changed in different FUSE versions? As I said before, there was a time when I found I could debug my simulator just fine.
|
From: Miklos S. <mi...@sz...> - 2007-06-01 09:43:40
|
> > Why do you need these two things to be in the same process?
>
> Good question. I'm not sure I have a valid answer, but let me explain
> why I did things that way. Firstly, conceptually, the filesystem and
> the program accessing the filesystem are one thing -- neither is useful
> without the other -- and I want to start and stop them together.
> Secondly, I want to be able to start *hundreds* of filesystems +
> accessor program pairs (simulating a whole network of these things)
> and need to maintain the association between each filesystem and its
> accessor program/thread.

You have many choices to automate this. Using fork() instead of pthread_create() will almost certainly do what you want, except you'll have two separate processes show up in the process list, and the memory won't be shared between them (but you probably don't want that anyway). To make sure they do exit together, install a signal handler in the parent which makes sure the child will also stop.

> I'm a little confused by your explanation. In the sample code, I
> create a second thread that runs the FUSE filesystem (fuse_loop_mt()),
> so the filesystem is being "provided" by a different thread. And as
> far as I'm aware, linux threads are processes.

Well, that's not exactly right. In the kernel they are called "tasks", and in Linux those do correspond exactly to threads. But a process can be made of more than one task/thread. There was some change in how threads were implemented, so in the old days they looked more like processes, each having a separate entry in 'ps'. Nowadays processes and threads are more cleanly separated.

> So they're not the
> same process. (Although they are both on the same branch of the
> process tree.)
>
> But I *think* I understand what you're saying. Let me restate the
> situation, and you can tell me if I'm in the right ballpark.
>
> 1. GDB can't interrupt a thread halfway through a system-call.
> Rather, it has to wait for the system-call to complete.
> 2. When GDB stops a program, it stops all threads belonging to that
> program (the entire process tree).
> 3. So if GDB tries to stop my test program when the accessor thread
> is calling stat(), it stops the thread that handles the filesystem,
> and waits for the system-call to complete, but the call never returns,
> since the thread that would handle it has been stopped by GDB. Hence
> the entire thing freezes.

Exactly. I'm not sure of the details, but I think gdb just sends one STOP signal to the whole process, and then waits for it to stop. The kernel will consider the process stopped when all the constituent threads have stopped. But I'm mostly guessing here.

> Is that about right? If so, then it seems I'm really in trouble! My
> network simulator would require a complete redesign. :-(

Hmm. Next question: if this is a simulator, why involve fuse and the kernel in it at all? Or are the filesystem calls deep in some library that you don't want to change?

> BTW, can you think of any reason why this behavior would have changed
> in different FUSE versions? As I said before, there was a time when I
> found I could debug my simulator just fine.

We still don't know why gdb wants to stop the process, and I'm not sure how we could find that out.

Miklos
|
From: Dom L. <dom...@gm...> - 2007-06-01 19:14:36
|
On 6/1/07, Miklos Szeredi <mi...@sz...> wrote:
> Hmm. Next question: if this is a simulator, why involve fuse and the
> kernel in it at all? Or are the filesystem calls deep in some library
> that you don't want to change?

Well, that's because I want to share as much code as possible between the network simulator, and the code that will run on individual nodes in an actual, deployed network. If you don't do things that way, you end up with divergent codebases for simulator and hardware, and also, you are never very sure how close to the real thing your simulation is. (The simulator also has a "fake" filesystem mode, which doesn't involve FUSE at all, but that doesn't help when I'm trying to run the debugger on the hardware code.) Debugging distributed applications is exceedingly painful, so an accurate simulator is invaluable.

> > BTW, can you think of any reason why this behavior would have changed
> > in different FUSE versions? As I said before, there was a time when I
> > found I could debug my simulator just fine.
>
> We still don't know why gdb wants to stop the process, and I'm not
> sure how we could find that out.

Indeed. But if this mechanism (of gdb+application freezing) is real, then even if it can be avoided in this particular example, I'm going to keep encountering it. If one thread in a simulation of several hundred threads segfaults, then GDB will stop all threads, and there is a high probability that at least one thread will be in the middle of a FUSE-handled system-call, and hence the whole thing will lock up.

Sigh. Back to the drawing-board, I guess...
|