With the Tcl HEAD from today with Ubuntu 7.0.4. I ran make test 5 times and found this once:
==== exec-9.7 commands returning errors FAILED
==== Contents of test case:
list [catch {exec [interpreter] "$path(sh)" -c "\"$path(echo)\" error msg 1>
&2 ; \"$path(sleep)\" 1" | [interpreter] "$path(sh)" -c "\"$path(echo)\" error
msg 1>&2 ; \"$path(sleep)\" 1"} msg] $msg
---- Result was:
1 {error msg}
---- Result should have been (exact matching):
1 {error msg
error msg}
==== exec-9.7 FAILED
Logged In: YES
user_id=496139
Originator: NO
Donal, two questions:
(1) Scary. What about prio 9 ?
(2)Isn't the test case itself risking a race ? Indeed, it spawns in parallel two instances of
sh -c {echo error msg >&2;sleep 1}
In this setup, descriptor 2 of both children is the err pipe to Tcl.
Of course, those are small writes which are supposed to be atomic; but in principle nothing prevents the OS from doing a task switch in the middle. The result seen in $msg could then be any "crossover" like "errorerror msmsg
g
"
I would have found much safer to have & child doing
echo error msg >&2;sleep 1
and the other one doing
sleep 1;echo error msg >&2
OK, now investigating, looking for the order of EOF and waitpid in the child-reaping process.
Has this changed recently ?
Logged In: YES
user_id=496139
Originator: NO
To the OP: could you please send an strace -f -tt of the thing ?
Thanks.
Logged In: YES
user_id=79902
Originator: NO
Individual writes *should* be atomic; the OS should serialize them.
Logged In: YES
user_id=90580
Originator: NO
this test has been failing intermittently on OSX for me for years, particularly frequently on multi-processor boxes... since I added the two sleeps ~1 month ago I have only seen it failing rarely.
While I was fiddling with it I noticed that if the two messages are made distinct, I would frequently see the order of the messages reversed from what was expected (on a dual-core box)...
Logged In: YES
user_id=496139
Originator: NO
Hmmm, I'm wondering, doesn't all this look like an OS bug ?
Couldn't we try to reproduce without Tcl ?
Indeed, what we observe is similar to what happens when two processes write to the same file through two separately opened file descriptors; while the Tcl code properly uses a single fd that is dup2'ed to the 2's of all children. But since this apparently happens (only?) on bi-processor machines, it could also be interpreted as: buggy kernel code, failing to lock the SMP master lock when updating the seek pointer of the same undelying kernel-fd. The two processors, running in true parallelism, start with local copies of the seek pointer (both 0), and write the same bytes in parallel instead of sequence, hence the seek pointer ends up at the end of just one message.
All this is wild speculation. I hope I'm wrong, and that SMP has been around for long enough that such trivial master lock issues are no longer thinkable. But...
Logged In: YES
user_id=79902
Originator: NO
Could it be that it's actually dependent on whether the error descriptor has the correct flags? (I suspect that O_APPEND is needed when going to a file, for example.)
Logged In: YES
user_id=496139
Originator: NO
That couldn't explain the "race" nature...
Or are you saying that yes, there's a kernel bug, and predicting that O_APPEND could drag it under the rug ?
Logged In: YES
user_id=496139
Originator: NO
Oh, these SMP-seek issues seem to have been active not so long ago:
http://osdir.com/ml/file-systems.pvfs2.users/2006-02/msg00003.html
Logged In: YES
user_id=79902
Originator: NO
<sarcasm> A bug in the OS with SMP and appending? Who'd have thought it? </sarcasm>
All we can really try for is doing the right thing ourselves; working around everyone else's bugs is something we should try to avoid (except on Windows, where the platform is fundamentally borkenated anyway...)
Logged In: YES
user_id=496139
Originator: NO
Well, I'd like to be sure first:
Please, "Nobody", whoever posted that report, come back !
We definitely need somebody able to reproduce this, *and* an strace -tt -f.
Only this will allow us to firmly blame the OS.
Now regarding the possible workarounds, things are open.
One possibility would be to verify (that's a personal prediction) that the bug doesn't happen on pipes, and modify the test case to use a pipe instead of Tcl's internal temporary file:
set errpipe [open "|cat > $tmpfile" w]
the_original_exec_command_line 2>@ $errpipe
But let do the strace investigation first !
Logged In: YES
user_id=496139
Originator: NO
Well, I'd like to be sure first:
Please, "Nobody", whoever posted that report, come back !
We definitely need somebody able to reproduce this, *and* an strace -tt -f.
Only this will allow us to firmly blame the OS.
Now regarding the possible workarounds, things are open.
One possibility would be to verify (that's a personal prediction) that the bug doesn't happen on pipes, and modify the test case to use a pipe instead of Tcl's internal temporary file:
set errpipe [open "|cat > $tmpfile" w]
the_original_exec_command_line 2>@ $errpipe
But let do the strace investigation first !
Timing out on request for followups.