Sometime after test chan-io-73.1
passes, an error aborts the evaluation
of chanio.test:
---- chan-io-73.1 start
++++ chan-io-73.1 PASSED
Test file error: error flushing "file5": broken pipe
while executing
"chan flush $f"
The error message is mysterious
since it blames a [chan flush]
command, and there are no visible
[chan flush] commands in the part
of chanio.test that follows test
chan-io-73.1. Perhaps its in the
body of some proc, or in some trace
handler, but in those cases I'd expect
a longer stack trace revealing that
context as well.
Based on the timing only, I suspect
the TIP 332 commit is to blame.
the error appears to arise
not during evaluation of
chanio.test, but in the
command [close $pipeFd] at
line 2767 of tcltest.tcl
in the tcltest package.
The error message is still broken.
The fact that only chanio.test
suffers, indicates there's some
interaction with it going on.
Attempts to learn more with the
-singleproc 1 option led to more
confusion. There's some nastiness
here where tricky testing of things
like std channels and their flush
patterns (?) are getting messed
with while the testing harness
needs them functioning reliably.
Maybe?
some indications the
original [chan flush]
error happens in test
chan-io-29.31. Don't know
why it's escaping to stderr
at all, let alone at unpredictable
times.
Even weirder:
$ make test TESTFLAGS="-file chanio.test -singleproc 1 -verbose bpste -skip chan-io-29.31"
...
---- chan-io-29.35 start
could not find interpreter "y"
while executing
"interp delete y"
++++ chan-io-29.35 PASSED
...
Some partial stack trace just
thrown out there on some visible
channel, but with no test failure,
nor any test file error reported.
I can't get a handle on things, but
indications are there's some significant
troubles in the recent commit.
FWIW, disabling test chan-io-28.6 (half-close) makes the problem disappear.
Just committed this change to avoid polluting others' lives.
But I'm not pulling it under the rug :-)
Investigation ongoing...
Strace investigation shows that the original "broken pipe" occurs in chanio-29.27.
Strangely, the error mentions "file5" while the EPIPE occurs on descriptor 4:
9927 10:34:27.076443 write(4, "output\n", 7) = -1 EPIPE (Broken pipe)
9927 10:34:27.076472 --- SIGPIPE (Broken pipe) @ 0 (0) ---
...
9927 10:34:27.077022 write(1, "++++ chan-io-29.27 PASSED\n", 26) = 26
So I suspect incorrect cleanup in the channel system during half-close.
For this I'd like Andreas' help since he wrote the generic layer of #332.
Homing in on the 6-legged animal:
set f [open "|echo" r+]
gets $f
after 1000
puts $f yay
close $f w
==> should flush and report EPIPE
==> but doesn't complain
close $f
==> barfs "bad file number"
Shouldn't be long now :)
OK confirmed, it's the same bug as mentioned in the patch entry, re the crash on [close w] of an unflushed channel: currently the Flush is ill-positioned within the tearoff sequence, occurring too late.
Will try to refactor alone in the absence of Andreas, but help welcome.
Just committed code doing a specific flush in the half-close-w pathway (because the normal flush is inside UnregisterChannel which must not be called for the "first half").
Andreas can you verify I'm doing it right, because I don't grok the subtleties of flagging blocks READY nor Tcl_Flush vs FlushChannel etc.
This doesn't solve the bug obtained when re-enabling chanio-28.6 though.
My investigation so far seems to indicate we are witnessing a "ghost error status" in a stale channel-related structure.
By this I mean, the _message_ is from an innocuous, previously caught error (a [chan flush] in a previous test), while possibly the TCL_ERROR status is fresh and is the core of the bug.
The error goes to the child tcltest's stderr because it occurs within a vwait (hence a bgerror or default C-level background error handler must be just spitting to stderr).
The vwait is: vwait [namespace which -variable counter] (near the end of chanio-29.31).
New data point: I have played a bit with chanio.test, re-enabling 28.6 and skipping 28.7..29.30.
In 29.31 where the offending vwait is, I added:
proc myhan args {exec echo "BGERROR:$args >>> $::errorInfo <<<" > /dev/tty}
interp bgerror {} [namespace which myhan]
this has the nice property of spitting out synchronously (which I find pretty hard to do in tcltest):
---- chan-io-29.31 start
BGERROR:{} {-code 0 -level 1} >>> can't read "ConstraintInitializer(asyncPipeChan)": no such element in array
while executing
"return $ConstraintInitializer($constraint)"
(procedure "ConstraintInitializer" line 5)
invoked from within
"ConstraintInitializer $n2" <<<
++++ chan-io-29.31 PASSED
I notice an empty direct error message and a ghost errorInfo...FWIW
While this won't solve the underlying
bug, if tcltest is getting confused
due to writes to stderr from background
error reporting, then setting an
[interp bgerror] handler to avoid that
during those tests should at least be
an improvement while the real bug is
tackled.
You should not expect the $::errorInfo
seen in an [interp bgerror] handler to
be in sync. [interp bgerror] handlers
are supposed to pull errorinfo data
out of the second dict argument.
The weird thing is the indication that
the handler is being called when there's
no exception for it to report. Arguments
indicate a TCL_OK return code and an
empty result.
dropping prio just a tiny bit
since this no longer prevents
a(n apparent) clean test suite run.
Bugs should still get fixed
of course, but if it's hard
enough to take a while fixing,
I don't think we need to
block 8.6b1 (which no one expects
to be bug free) waiting for
the fix.
Put the script from 'Homing in on ...' into a script hc1.tcl.
Running it with the rewrite as explained in the mail ...
andreask@gila:~/workbench/HalfClose/build/tcl86> ./tclsh hc1.tcl
broken pipe
while executing
"close $f w"
(file "hc1.tcl" line 5)
Which is the expected result. Next now to test the chan-io-28.* cases.
Ok, fixed a missing initialization and a place where a flag name was not properly modified, and the two chan-io-28.[67] test cases pass as well.
What I still have is
++++ chan-io-73.1 PASSED
Test file error: can't read "ConstraintInitializer(asyncPipeChan)": no such element in array
while executing
"return $ConstraintInitializer($constraint)"
(procedure "ConstraintInitializer" line 5)
invoked from within
"ConstraintInitializer $n2"
error flushing "file5": broken pipe
while executing
"chan flush $f"
Running with -singleproc 1 this vanishes ... Checking. Yes, this is written to stderr by something ...
Ok, we have _two_ places writing to stderr.
One, during chan-io-28.7 we get
Test file error: can't read "ConstraintInitializer(asyncPipeChan)": no such element in array
while executing
"return $ConstraintInitializer($constraint)"
(procedure "ConstraintInitializer" line 5)
invoked from within
"ConstraintInitializer $n2"
And two,
error flushing "file5": broken pipe
while executing
"chan flush $f"
comes from io-29.31
I ran the chanio file with -singleproc 1 -verbose bpste, then captured stdout and stderr in one file to get the proper order.
So, even with the rewrite this ghost thing is still present. Digging on ...
The first output to stderr is something which points into tcltest ...
Note: Using tclsh as shell in test cases is not good, it uses whatever it finds in the PATH.
Changed to [interpreter], which is a command to get the current executable.
Commit what you have and I will
look into the background error
troubles. I suspect you are
expressing concern about normal
background handling behavior.
FWIW, the background reporting is not mysterious; [interp bgerror] actually keeps it under control instead of confusing us by going to stderr and being delayed to the end of the test. So in my bg handler (in 29.31) I used the [testpanic] trick you gave me on the chat, and got a core with an explorable stack frame. Andreas/Don maybe you can do the same and understand what exactly is the cause of this ghost background error ?
background error handling is
done asynchronously! This may
be confusing or unexpected, but
it is not a bug. I have not
taken time to confirm, but I suspect
all this talk of "ghost errors"
is normal functioning of an
asynchronous reporting mechanism.
Instrumenting the test case ...
~~~~~~~~~~~~~~~~~~~~~~~
---- chan-io-28.7 start
MAKE/ECHO
OPEN/PIPE/ECHO
GETS/PORT
OPEN/SOCK
PUTS/HEY
PIPE/CLOSE
SETUP/LOOP
LOOP
can't read "ConstraintInitializer(asyncPipeChan)": no such element in array
while executing
"return $ConstraintInitializer($constraint)"
(procedure "ConstraintInitializer" line 5)
invoked from within
"ConstraintInitializer $n2"
CLOSE/SOCK
CLOSE/PIPE
DONE
++++ chan-io-28.7 PASSED
~~~~~~~~~~~~~~~~~~~~~
This is something written during the vwait event loop.
Ah. The error message is by 'chan-io-28.3', which is a skipped test. And is was skipped because of missing constraint asyncPipeChan. And the 28.7 test case somehow writes this to stderr ... Hrrm.
Note: Just running test 28.[367] generates the output. Running [67] generates a different output about a missing interpreter "z". Running 7 alone is ok.
Note: Looking at 28.[67] I see that the 'after 1000 ...' events are not canceled!! This can (well, with murphy, will) cause breakage in any other test coming later and using vwait on the same variable. Fixed that. Doesn't fix the isssue at hand however.
Oi. I should read SF more often. I missed all the discussion regarding bgerror and such.
Ok. I will undo my instrumentation, leave in the after cancel's I added and then
commit the fixes.
Committed my changes.
Thanks. First off, chanio-28.7 still fails on Windows (mingw): nothing is received back from the server.
Ok, this is now going into places (Windows, Windows Pipes) I have no real knowledge off. I could try tomorrow, when I am back in the office, at a Win system, however no promises there. tclguy and davygrvy are likely better suited.