Menu

#3974 test exec-9.7 fails randomly

obsolete: 8.6a0
pending-works-for-me
7
2011-09-23
2008-04-12
Anonymous
No

With the Tcl HEAD from today with Ubuntu 7.0.4. I ran make test 5 times and found this once:

==== exec-9.7 commands returning errors FAILED
==== Contents of test case:

list [catch {exec [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg 1>
&2 ; \"$path(sleep)\" 1" | [interpreter] "$path(sh)" -c "\"$path(echo)\" error
msg 1>&2 ; \"$path(sleep)\" 1"} msg] $msg

---- Result was:
1 {error msg}
---- Result should have been (exact matching):
1 {error msg
error msg}
==== exec-9.7 FAILED

Discussion

  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Donal, two questions:

    (1) Scary. What about prio 9 ?

    (2)Isn't the test case itself risking a race ? Indeed, it spawns in parallel two instances of

    sh -c {echo error msg >&2;sleep 1}

    In this setup, descriptor 2 of both children is the err pipe to Tcl.
    Of course, those are small writes which are supposed to be atomic; but in principle nothing prevents the OS from doing a task switch in the middle. The result seen in $msg could then be any "crossover" like "errorerror msmsg
    g
    "
    I would have found much safer to have & child doing
    echo error msg >&2;sleep 1
    and the other one doing
    sleep 1;echo error msg >&2

    OK, now investigating, looking for the order of EOF and waitpid in the child-reaping process.
    Has this changed recently ?

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    To the OP: could you please send an strace -f -tt of the thing ?
    Thanks.

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902
    Originator: NO

    Individual writes *should* be atomic; the OS should serialize them.

     
  • Donal K. Fellows

    • labels: 105657 --> 24. Channel Commands
    • priority: 5 --> 8
    • assigned_to: dkf --> andreas_kupries
     
  • Daniel A. Steffen

    Logged In: YES
    user_id=90580
    Originator: NO

    this test has been failing intermittently on OSX for me for years, particularly frequently on multi-processor boxes... since I added the two sleeps ~1 month ago I have only seen it failing rarely.

    While I was fiddling with it I noticed that if the two messages are made distinct, I would frequently see the order of the messages reversed from what was expected (on a dual-core box)...

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Hmmm, I'm wondering, doesn't all this look like an OS bug ?
    Couldn't we try to reproduce without Tcl ?

    Indeed, what we observe is similar to what happens when two processes write to the same file through two separately opened file descriptors; while the Tcl code properly uses a single fd that is dup2'ed to the 2's of all children. But since this apparently happens (only?) on bi-processor machines, it could also be interpreted as: buggy kernel code, failing to lock the SMP master lock when updating the seek pointer of the same undelying kernel-fd. The two processors, running in true parallelism, start with local copies of the seek pointer (both 0), and write the same bytes in parallel instead of sequence, hence the seek pointer ends up at the end of just one message.

    All this is wild speculation. I hope I'm wrong, and that SMP has been around for long enough that such trivial master lock issues are no longer thinkable. But...

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902
    Originator: NO

    Could it be that it's actually dependent on whether the error descriptor has the correct flags? (I suspect that O_APPEND is needed when going to a file, for example.)

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    That couldn't explain the "race" nature...
    Or are you saying that yes, there's a kernel bug, and predicting that O_APPEND could drag it under the rug ?

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902
    Originator: NO

    <sarcasm> A bug in the OS with SMP and appending? Who'd have thought it? </sarcasm>

    All we can really try for is doing the right thing ourselves; working around everyone else's bugs is something we should try to avoid (except on Windows, where the platform is fundamentally borkenated anyway...)

     
  • Donal K. Fellows

    • priority: 8 --> 7
     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Well, I'd like to be sure first:

    Please, "Nobody", whoever posted that report, come back !

    We definitely need somebody able to reproduce this, *and* an strace -tt -f.
    Only this will allow us to firmly blame the OS.

    Now regarding the possible workarounds, things are open.
    One possibility would be to verify (that's a personal prediction) that the bug doesn't happen on pipes, and modify the test case to use a pipe instead of Tcl's internal temporary file:

    set errpipe [open "|cat > $tmpfile" w]
    the_original_exec_command_line 2>@ $errpipe

    But let do the strace investigation first !

     
  • Alexandre Ferrieux

    Logged In: YES
    user_id=496139
    Originator: NO

    Well, I'd like to be sure first:

    Please, "Nobody", whoever posted that report, come back !

    We definitely need somebody able to reproduce this, *and* an strace -tt -f.
    Only this will allow us to firmly blame the OS.

    Now regarding the possible workarounds, things are open.
    One possibility would be to verify (that's a personal prediction) that the bug doesn't happen on pipes, and modify the test case to use a pipe instead of Tcl's internal temporary file:

    set errpipe [open "|cat > $tmpfile" w]
    the_original_exec_command_line 2>@ $errpipe

    But let do the strace investigation first !

     
  • Alexandre Ferrieux

    Timing out on request for followups.

     
  • Alexandre Ferrieux

    • status: open --> pending-works-for-me