From: Rob L. <ro...@la...> - 2006-03-11 18:01:30
|
On linux-2.6.16-rc5 I'm occasionally getting a hang starting up (with rootfstype=hostfs), right after the initializing VFS line. This time when I killed it, I got the following error (repeated twice): remove_umid_dir - actually_do_remove failed with err = -2 I'm guessing that's where it's hanging? Rob -- Never bet against the cheap plastic solution. |
From: Blaisorblade <bla...@ya...> - 2006-03-12 20:23:30
|
On Saturday 11 March 2006 19:01, Rob Landley wrote: > On linux-2.6.16-rc5 I'm occasionally getting a hang starting up (with > rootfstype=hostfs), right after the initializing VFS line. This time when > I killed it, I got the following error (repeated twice): > > remove_umid_dir - actually_do_remove failed with err = -2 > I'm guessing that's where it's hanging? I guess not. That's a shutdown function - when UML exits it should remove ~/.uml/<umid> and contained files. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |
From: Jeff D. <jd...@ad...> - 2006-03-14 20:21:46
|
On Sat, Mar 11, 2006 at 01:01:41PM -0500, Rob Landley wrote: > On linux-2.6.16-rc5 I'm occasionally getting a hang starting up (with > rootfstype=hostfs), right after the initializing VFS line. I'm in the process of beating on the newer externfs-based hostfs and humfs, so I'm not too concerned with bugs in the old hostfs. Having said that, if you can get a stack from the hang, it might turn out to be an easily diagnosed and fixed bug. > This time when I > killed it, I got the following error (repeated twice): > > remove_umid_dir - actually_do_remove failed with err = -2 There's some confusion sometimes during shutdown when UML is forcibly killed. I haven't really looked at this, but it's harmless. Jeff |
From: Rob L. <ro...@la...> - 2006-03-14 21:31:37
|
On Tuesday 14 March 2006 3:22 pm, Jeff Dike wrote: > On Sat, Mar 11, 2006 at 01:01:41PM -0500, Rob Landley wrote: > > On linux-2.6.16-rc5 I'm occasionally getting a hang starting up (with > > rootfstype=hostfs), right after the initializing VFS line. > > I'm in the process of beating on the newer externfs-based hostfs and humfs, > so I'm not too concerned with bugs in the old hostfs. Having said that, > if you can get a stack from the hang, it might turn out to be an easily > diagnosed and fixed bug. If it's in code that's going away I can live with it. Let me know when the new hostfs is ready for wider testing and I'll give it a whirl... Rob -- Never bet against the cheap plastic solution. |
From: Werner A. <wa...@al...> - 2006-03-16 14:23:10
|
Jeff Dike wrote: > Having said that, > if you can get a stack from the hang, it might turn out to be an easily > diagnosed and fixed bug. Seems that arch/um/kernel/sigio_user.c:write_sigio_thread begins its poll before there are any fds in current_poll, so update_thread caused by a add_sigio_fd shortly thereafter hangs while waiting for a response. A sleep(1) at the beginning of write_sigio_thread "fixes" this :-) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Blaisorblade <bla...@ya...> - 2006-03-17 19:38:08
Attachments:
poll-mustn_t-hang-0-fds.patch
|
On Thursday 16 March 2006 15:22, Werner Almesberger wrote: > Jeff Dike wrote: > > Having said that, > > if you can get a stack from the hang, it might turn out to be an easily > > diagnosed and fixed bug. > > Seems that arch/um/kernel/sigio_user.c:write_sigio_thread begins > its poll before there are any fds in current_poll Ah, and you mean it hangs... > , so update_thread > caused by a add_sigio_fd shortly thereafter hangs while waiting for > a response. Clearly > A sleep(1) at the beginning of write_sigio_thread "fixes" this :-) Er... there's only one thing which seems unclear - poll(NULL, 0, -1) shouldn't fail with -EFAULT instead of hanging? Yep, it should. I didn't believe what you said, so. But then I went to check the implementation. And, indeed, it seems that Linux isn't up to this :-). So: 1) I'll try to fix poll(2) to return -EINVAL. Dunno whether anyone will say "no, the app is stupid, it deserves no error", but hope not (with "try" I refer to this). Attached patch should do this. 2) write_sigio_thread should do a "down" on a semaphore/mutex and the first update_thread should "up" it. As usually, this semaphore would be indeed implemented as a pipe. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade |
From: Werner A. <wa...@al...> - 2006-03-17 21:27:39
|
Blaisorblade wrote: > 1) I'll try to fix poll(2) to return -EINVAL. Dunno whether anyone will say > "no, the app is stupid, it deserves no error", but hope not (with "try" I > refer to this). Attached patch should do this. I think the behaviour of "poll" with nfds = 0 is correct as it is. At least POSIX doesn't say anything to the extent that "nfds" couldn't be zero: http://www.opengroup.org/onlinepubs/009695399/functions/poll.html Furthermore, this is one of several ways to implement a "sleep" function with sub-second granularity, so existing applications may already depend on "poll" accepting nfds = 0. (Of course, they would probably be better off using "nanosleep".) > 2) write_sigio_thread should do a "down" on a semaphore/mutex and the first > update_thread should "up" it. Yup :-) Thanks, - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Blaisorblade <bla...@ya...> - 2006-03-17 22:15:03
|
On Friday 17 March 2006 22:27, Werner Almesberger wrote: > Blaisorblade wrote: > > 1) I'll try to fix poll(2) to return -EINVAL. Dunno whether anyone will > > say "no, the app is stupid, it deserves no error", but hope not (with > > "try" I refer to this). Attached patch should do this. > > I think the behaviour of "poll" with nfds = 0 is correct as it is. > At least POSIX doesn't say anything to the extent that "nfds" > couldn't be zero: > http://www.opengroup.org/onlinepubs/009695399/functions/poll.html I don't think that "hanging" is a correct behaviour anyway... unless the application *wanted* to wait for a signal, but this is a clumsy way to do this (I don't think there's any difference from pause()). > Furthermore, this is one of several ways to implement a "sleep" > function with sub-second granularity > , so existing applications > may already depend on "poll" accepting nfds = 0. (Of course, > they would probably be better off using "nanosleep".) This is a very smart note, however usleep(3) exists since 4.3 BSD so this is a bit unlikely; however you're surely right on this. > > 2) write_sigio_thread should do a "down" on a semaphore/mutex and the > > first update_thread should "up" it. > > Yup :-) > > Thanks, > - Werner -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |
From: Werner A. <wa...@al...> - 2006-03-17 23:54:42
|
Blaisorblade wrote: > however usleep(3) exists since 4.3 BSD so this is a > bit unlikely; Oh, there must be a million of passably valid reasons why someone may still do it this way, e.g., - because they don't know of the more suitable function - because they don't trust the more suitable function for some reason - because they're using some restricted set of functions, e.g., from a portable library - because the other function isn't available on all platforms they care about, while the "poll" functionality (perhaps through some abstraction that may also use "select") is - because they use a signal to get out of the initial "hang" - because the semantics of their code actually say "I/O or timeout" - because that's what the code that has passed all those reviews and certifications does, and nobody is allowed to touch it. Ever. etc. Just because this valid if unexpected behaviour was a little inconvenient for us isn't a good reason for breaking it. If a safe, easy, and incredibly boring life was the top item on our agenda, we should start with getting rid of concurrency ;-) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Werner A. <wa...@al...> - 2006-03-20 10:20:55
|
Blaisorblade wrote: > 2) write_sigio_thread should do a "down" on a semaphore/mutex and the first > update_thread should "up" it. As usually, this semaphore would be indeed > implemented as a pipe. Be the way, in case a proper fix is too involved for 2.6.16.x, here's a quick and dirty work-around that should have no ill side-effects: This hack works around a race condition that can make UML hang on startup. If the race condition is encountered, this change limits the time of the hang to one second. Besides that, it can cause a small performance penalty by waking up an otherwise idle "poll" every second. This is intended purely as a readily applicable band aid and should not be considered a long-term fix. - Werner ---------------------------------- cut here ----------------------------------- Signed-off-by: Werner Almesberger <we...@al...> --- linux-2.6.16.orig/arch/um/kernel/sigio_user.c 2006-03-20 02:53:29.%N -0300 +++ linux-2.6.16/arch/um/kernel/sigio_user.c 2006-03-20 06:27:11.%N -0300 @@ -184,7 +184,7 @@ signal(SIGWINCH, SIG_IGN); fds = ¤t_poll; while(1){ - n = poll(fds->poll, fds->used, -1); + n = poll(fds->poll, fds->used, 1000); if(n < 0){ if(errno == EINTR) continue; printk("write_sigio_thread : poll returned %d, " -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Jeff D. <jd...@ad...> - 2006-03-21 01:00:48
|
On Fri, Mar 17, 2006 at 08:36:58PM +0100, Blaisorblade wrote: > 2) write_sigio_thread should do a "down" on a semaphore/mutex and the first > update_thread should "up" it. As usually, this semaphore would be indeed > implemented as a pipe. Is there anything wrong with this: Index: linux-2.6.15/arch/um/os-Linux/sigio.c =================================================================== --- linux-2.6.15.orig/arch/um/os-Linux/sigio.c 2006-03-20 19:48:51.000000000 -0500 +++ linux-2.6.15/arch/um/os-Linux/sigio.c 2006-03-20 19:49:55.000000000 -0500 @@ -271,6 +271,10 @@ void write_sigio_workaround(void) if(write_sigio_pid != -1) goto out_unlock; + current_poll = ((struct pollfds) { .poll = p, + .used = 1, + .size = 1 }); + write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, CLONE_FILES | CLONE_VM, &stack, 0); @@ -284,10 +288,6 @@ void write_sigio_workaround(void) memcpy(write_sigio_fds, l_write_sigio_fds, sizeof(l_write_sigio_fds)); memcpy(sigio_private, l_sigio_private, sizeof(l_sigio_private)); - current_poll = ((struct pollfds) { .poll = p, - .used = 1, - .size = 1 }); - sigio_unlock(); return; That seems to me to fix the basic problem - the thread is using data that hasn't been set up yet, and I don't see that moving it up breaks anything. Jeff |
From: Blaisorblade <bla...@ya...> - 2006-03-21 01:21:57
|
On Tuesday 21 March 2006 02:00, Jeff Dike wrote: > On Fri, Mar 17, 2006 at 08:36:58PM +0100, Blaisorblade wrote: > > 2) write_sigio_thread should do a "down" on a semaphore/mutex and the > > first update_thread should "up" it. As usually, this semaphore would be > > indeed implemented as a pipe. > Is there anything wrong with this: > Index: linux-2.6.15/arch/um/os-Linux/sigio.c > =================================================================== > --- linux-2.6.15.orig/arch/um/os-Linux/sigio.c 2006-03-20 > 19:48:51.000000000 -0500 +++ > linux-2.6.15/arch/um/os-Linux/sigio.c 2006-03-20 19:49:55.000000000 -0500 > @@ -271,6 +271,10 @@ void write_sigio_workaround(void) > if(write_sigio_pid != -1) > goto out_unlock; > > + current_poll = ((struct pollfds) { .poll = p, > + .used = 1, > + .size = 1 }); > + > write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, > CLONE_FILES | CLONE_VM, &stack, 0); > > @@ -284,10 +288,6 @@ void write_sigio_workaround(void) > memcpy(write_sigio_fds, l_write_sigio_fds, sizeof(l_write_sigio_fds)); > memcpy(sigio_private, l_sigio_private, sizeof(l_sigio_private)); > > - current_poll = ((struct pollfds) { .poll = p, > - .used = 1, > - .size = 1 }); > - > sigio_unlock(); > return; > > That seems to me to fix the basic problem - the thread is using data > that hasn't been set up yet, and I don't see that moving it up breaks > anything. > Jeff Ok, this is conceptually correct, and looking at 2.6.14 code this bug wasn't there, so _I_ introduced it, while cleaning it up. Sorry. Beyond, my idea of adding the semaphore was incorrect (it would work only if write_sigio_workaround did the up()). However, while I overlooked this, you overlooked what I did: *) you must also move the setting of write_sigio_fds and sigio_private above, or we'll get the same problem again in different places of write_sigio_thread. *) you must update the exit path. I've taken care so that only if everything succeed the result is stored, and to hold the lock for the minimum time needed (particularly not while doing any allocation). I'm not sure whether the first property is needed, but it would be conservative to do so to preserve coherence of data structures. In particular: if we store an fd != -1, some function may later close it (say when the console is closed); if we closed it in the exit path, then we're closing an unrelated fd and getting a bug. So, if the thread startup fails, you _must_ clear again current_poll, write_sigio_fds and sigio_private. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |
From: Jeff D. <jd...@ad...> - 2006-03-23 18:45:09
|
On Tue, Mar 21, 2006 at 02:21:40AM +0100, Blaisorblade wrote: > However, while I overlooked this, you overlooked what I did: > *) you must also move the setting of write_sigio_fds and sigio_private above, > or we'll get the same problem again in different places of > write_sigio_thread. Fixed. > *) you must update the exit path. I've taken care so that only if everything > succeed the result is stored, and to hold the lock for the minimum time > needed (particularly not while doing any allocation). I'm not sure whether > the first property is needed, but it would be conservative to do so to > preserve coherence of data structures. > > In particular: if we store an fd != -1, some function may later close it (say > when the console is closed); if we closed it in the exit path, then we're > closing an unrelated fd and getting a bug. > > So, if the thread startup fails, you _must_ clear again current_poll, > write_sigio_fds and sigio_private. OK, this is all done, I think. What do you think about this? Index: linux-2.6.15/arch/um/os-Linux/sigio.c =================================================================== --- linux-2.6.15.orig/arch/um/os-Linux/sigio.c 2006-03-23 12:55:27.000000000 -0500 +++ linux-2.6.15/arch/um/os-Linux/sigio.c 2006-03-23 13:41:11.000000000 -0500 @@ -269,42 +269,41 @@ void write_sigio_workaround(void) /* Did we race? Don't try to optimize this, please, it's not so likely * to happen, and no more than once at the boot. */ if(write_sigio_pid != -1) - goto out_unlock; + goto out_free; - write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, - CLONE_FILES | CLONE_VM, &stack, 0); - - if (write_sigio_pid < 0) - goto out_clear; + current_poll = ((struct pollfds) { .poll = p, + .used = 1, + .size = 1 }); if (write_sigio_irq(l_write_sigio_fds[0])) - goto out_kill; + goto out_free; - /* Success, finally. */ memcpy(write_sigio_fds, l_write_sigio_fds, sizeof(l_write_sigio_fds)); memcpy(sigio_private, l_sigio_private, sizeof(l_sigio_private)); - current_poll = ((struct pollfds) { .poll = p, - .used = 1, - .size = 1 }); + write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, + CLONE_FILES | CLONE_VM, &stack, 0); - sigio_unlock(); - return; + if (write_sigio_pid < 0) + goto out_clear; - out_kill: - l_write_sigio_pid = write_sigio_pid; - write_sigio_pid = -1; sigio_unlock(); - /* Going to call waitpid, avoid holding the lock. */ - os_kill_process(l_write_sigio_pid, 1); - goto out_free; + return; out_clear: write_sigio_pid = -1; - out_unlock: - sigio_unlock(); + current_poll = ((struct pollfds) { .poll = NULL, + .size = 0, + .used = 0 }); + write_sigio_fds[0] = -1; + write_sigio_fds[1] = -1; + + sigio_private[0] = -1; + sigio_private[1] = -1; + out_free: kfree(p); + sigio_unlock(); out_close2: close(l_sigio_private[0]); close(l_sigio_private[1]); |
From: Werner A. <wa...@al...> - 2006-03-23 19:06:05
|
Jeff Dike wrote: > Index: linux-2.6.15/arch/um/os-Linux/sigio.c Hmm, is this a link ? % find linux-2.6.15.5/arch/um -name sigio.c % The patch applies cleanly to arch/um/kernel/sigio_user.c of 2.6.16, though, and indeed seems to solve the problem. Thanks, - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Jeff D. <jd...@ad...> - 2006-03-23 20:12:01
|
On Thu, Mar 23, 2006 at 04:05:33PM -0300, Werner Almesberger wrote: > Hmm, is this a link ? > > % find linux-2.6.15.5/arch/um -name sigio.c > % Nope, it's just later in my patchset than one which moves the file. > The patch applies cleanly to arch/um/kernel/sigio_user.c of 2.6.16, > though, and indeed seems to solve the problem. Yup, and that's the original location. Jeff |
From: Werner A. <wa...@al...> - 2006-03-23 21:23:23
|
Jeff Dike wrote: > Nope, it's just later in my patchset than one which moves the file. Ah, mysteries of the world, easily explained :-) Do you plan to submit this for 2.6.16.x ? - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina wa...@al... / /_http://www.almesberger.net/____________________________________________/ |
From: Jeff D. <jd...@ad...> - 2006-03-23 21:42:22
|
On Thu, Mar 23, 2006 at 06:23:11PM -0300, Werner Almesberger wrote: > Ah, mysteries of the world, easily explained :-) :-) > Do you plan to submit this for 2.6.16.x ? Haven't thought about that yet. It would certainly be a candidate. Jeff |
From: Blaisorblade <bla...@ya...> - 2006-03-24 00:38:17
Attachments:
uml-fix-sigio-startup-race-hang-fix.patch
|
On Thursday 23 March 2006 19:45, Jeff Dike wrote: > On Tue, Mar 21, 2006 at 02:21:40AM +0100, Blaisorblade wrote: > > However, while I overlooked this, you overlooked what I did: > > *) you must also move the setting of write_sigio_fds and sigio_private > > above, or we'll get the same problem again in different places of > > write_sigio_thread. > OK, this is all done, I think. > What do you think about this? Yep, it seems it looks good. But 1 bug still and two suggestions. See also attached patch (it misses one of the suggestions because it's late and I'm going to sleep). > Index: linux-2.6.15/arch/um/os-Linux/sigio.c > =================================================================== > --- linux-2.6.15.orig/arch/um/os-Linux/sigio.c 2006-03-23 > 12:55:27.000000000 -0500 +++ > linux-2.6.15/arch/um/os-Linux/sigio.c 2006-03-23 13:41:11.000000000 -0500 > @@ -269,42 +269,41 @@ void write_sigio_workaround(void) > /* Did we race? Don't try to optimize this, please, it's not so likely > * to happen, and no more than once at the boot. */ > if(write_sigio_pid != -1) > - goto out_unlock; > + goto out_free; > > - write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, > - CLONE_FILES | CLONE_VM, &stack, 0); > - > - if (write_sigio_pid < 0) > - goto out_clear; > + current_poll = ((struct pollfds) { .poll = p, > + .used = 1, > + .size = 1 }); > > if (write_sigio_irq(l_write_sigio_fds[0])) > - goto out_kill; > + goto out_free; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Here you're not clearing current_poll! <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< > - /* Success, finally. */ > memcpy(write_sigio_fds, l_write_sigio_fds, sizeof(l_write_sigio_fds)); > memcpy(sigio_private, l_sigio_private, sizeof(l_sigio_private)); > > - current_poll = ((struct pollfds) { .poll = p, > - .used = 1, > - .size = 1 }); > + write_sigio_pid = run_helper_thread(write_sigio_thread, NULL, > + CLONE_FILES | CLONE_VM, &stack, 0); > > - sigio_unlock(); > - return; > + if (write_sigio_pid < 0) > + goto out_clear; > > - out_kill: > - l_write_sigio_pid = write_sigio_pid; > - write_sigio_pid = -1; > sigio_unlock(); > - /* Going to call waitpid, avoid holding the lock. */ > - os_kill_process(l_write_sigio_pid, 1); > - goto out_free; > + return; > > out_clear: > write_sigio_pid = -1; > - out_unlock: > - sigio_unlock(); > + current_poll = ((struct pollfds) { .poll = NULL, > + .size = 0, > + .used = 0 }); > + write_sigio_fds[0] = -1; > + write_sigio_fds[1] = -1; > + > + sigio_private[0] = -1; > + sigio_private[1] = -1; > + If possible (there should be such array constructors) this shouldn't be redundant / duplicated with the initial declarations. Like write_sigio_fds = (int[]) {-1,-1} and having a common macro like #define SIGIO_FDS_INIT {-1,-1} static int write_sigio_fds[2] = SIGIO_FDS_INIT; ... write_sigio_fds = (int[]) SIGIO_FDS_INIT (we _will_ get one copy wrong, if we can :-)). If sending to 2.6.16 this can be kept separate or they could maybe complain "ah, no extra changes, just the fix!". Btw, this one is not attached. > out_free: > kfree(p); > + sigio_unlock(); p is a local variable, so sigio_unlock can be done before kfree(p). > out_close2: > close(l_sigio_private[0]); > close(l_sigio_private[1]); -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade |
From: Jeff D. <jd...@ad...> - 2006-03-24 02:26:16
Attachments:
sigio-race
|
On Fri, Mar 24, 2006 at 01:37:54AM +0100, Blaisorblade wrote: > Yep, it seems it looks good. But 1 bug still and two suggestions. > See also attached patch (it misses one of the suggestions because it's late > and I'm going to sleep). I merged that patch. > If possible (there should be such array constructors) this shouldn't be > redundant / duplicated with the initial declarations. > > Like write_sigio_fds = (int[]) {-1,-1} and having a common macro like > > #define SIGIO_FDS_INIT {-1,-1} > > static int write_sigio_fds[2] = SIGIO_FDS_INIT; > ... > write_sigio_fds = (int[]) SIGIO_FDS_INIT I tried this - I couldn't get anything like that to compile. I always got an "incompatible types in assignment" error. Below is the latest version. Jeff |