I downloaded the latest Tcl cvs head from the activestate ftp site. The following tests in http11.test failed. I last tested at the end of May and don't recall seeing these failures.
http11.test
==== http-2.6 -channel,encoding gzip,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding gzip}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close gzip {} 121845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close gzip {} 0
==== http-2.6 FAILED
==== http-2.7 -channel,encoding deflate,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding deflate}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close deflate {} 121
845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close deflate {} 0
==== http-2.7 FAILED
==== http-2.8 -channel,encoding compress,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding compress}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close compress {} 12
1845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close compress {} 0
==== http-2.8 FAILED
these are new files. For some reason we are having trouble writing to a file via -channel so the crc and the size checks are wrong. Reproduced on my solaris10 box.
Just testing today on a 64-bit system:
$ uname -a
Linux <REDACTED>.nist.gov 2.6.18-128.1.6.el5 #1 SMP Wed Apr 1 09:10:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
and I see tests http-2.7 and http-2.8 reliably fail there
just as they do in the original report here.
No failure of http-2.6 though.
I can confirm the same failures on netbsd 5.0, on both 32 bits (i386 port) and 64 bits (amd64 port).
I think bug 2780687 is a duplicate if this.
It is definately some interaction with the 'zlib push' and the fcopy we use in the http package when the -channel option has been provided. The only tests that fail use both -channel and a compression algorithm which leads to a 'zlib push'. Test http11-2.9 is the same as 2.[678] but doesn't use any compression (hence does not 'zlib push' and it never fails.
Recognizing this I've added some puts commands into http::CopyDone and arranged to retry when we see an [eof]. On [eof] I set a flag in the state array and try one more time. This causes the tests to pass now and in the following output we can see that 2.7 had an early [eof] then continued. I suspect that given more data we might see additional [eof] later so a single retry is probably not robust.
The channel termination can't be due to the data being sent - first, -eofchar is {} here and secondly the three failing tests use different compression algorithms so the data stream will be different in each case.
Tests began at Thu Jun 25 22:03:25 BST 2009
http11.test
== CopyDone: count:8192 total: 8192 err:''
== Eof '1'
== Retry on Eof
== CopyDone: count:8192 total: 16384 err:''
== CopyDone: count:8192 total: 24576 err:''
== CopyDone: count:8192 total: 32768 err:''
== CopyDone: count:8192 total: 40960 err:''
== CopyDone: count:8192 total: 49152 err:''
== CopyDone: count:8192 total: 57344 err:''
== CopyDone: count:8192 total: 65536 err:''
== CopyDone: count:8192 total: 73728 err:''
== CopyDone: count:8192 total: 81920 err:''
== CopyDone: count:8192 total: 90112 err:''
== CopyDone: count:8192 total: 98304 err:''
== CopyDone: count:8192 total: 106496 err:''
== CopyDone: count:8192 total: 114688 err:''
== CopyDone: count:8192 total: 122880 err:''
== CopyDone: count:7157 total: 130037 err:''
== Eof '1'
++++ http11-2.7 PASSED
== CopyDone: count:8192 total: 8192 err:''
== CopyDone: count:8192 total: 16384 err:''
== CopyDone: count:8192 total: 24576 err:''
== CopyDone: count:8192 total: 32768 err:''
== CopyDone: count:8192 total: 40960 err:''
== CopyDone: count:8192 total: 49152 err:''
== CopyDone: count:8192 total: 57344 err:''
== CopyDone: count:8192 total: 65536 err:''
== CopyDone: count:8192 total: 73728 err:''
== CopyDone: count:8192 total: 81920 err:''
== CopyDone: count:8192 total: 90112 err:''
== CopyDone: count:8192 total: 98304 err:''
== CopyDone: count:8192 total: 106496 err:''
== CopyDone: count:8192 total: 114688 err:''
== CopyDone: count:8192 total: 122880 err:''
== CopyDone: count:7157 total: 130037 err:''
== Eof '1'
== Retry on Eof
== CopyDone: count:0 total: 130037 err:''
== Eof '1'
++++ http11-2.9 PASSED
Note that today, June 26, 2009, I downloaded the tcl cvs head from the activestate ftp site, configured and built it on my solaris 8 machine using Sun's C compiler, and here's what I see for these tests:
http11.test
==== http11-2.6 -channel,encoding gzip,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding gzip}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close gzip {} 121845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close gzip {} 0
==== http11-2.6 FAILED
==== http11-2.7 -channel,encoding deflate,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding deflate}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close deflate {} 121
845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close deflate {} 0
==== http11-2.7 FAILED
==== http11-2.8 -channel,encoding compress,non-chunked FAILED
==== Contents of test case:
set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -ti
meout 5000 -channel $chan -headers {accept-encoding compress}]
http::wait $tok
seek $chan 0
set data [read $chan]
list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok
connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[
file size testdoc.html]-[file size testfile.tmp]}]
---- Result was:
ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close compress {} 12
1845
---- Result should have been (exact matching):
ok {HTTP/1.1 200 OK} ok close compress {} 0
==== http11-2.8 FAILED
Just a note that this continues to be an issue, 5 months later. on Solaris 8.
FWIW, I see these tests fail much
more reliably on the dgp-refactor-branch
than on HEAD.
See new observations in (unrelated) 2946474.
Basically, (1) "strace -f -o /dev/null make test" reproduces with probability>90%, and (2) the problem occurs even with a blocking flush of the gzip channel. At this point only a single block of 435 gzipped bytes, corresponding to 8192 (!) uncompressed bytes, is written.
The sequence of events leading to that state is timing-dependent in this case, though the final step (flushing and truncating) is not.
New progress. A bit of debug puts in httpd11.tcl around
puts -nonewline $chan [zlib $encoding $data]
show that [string length $data] is 130037,
while [string length [zlib $encoding $data]] is 435 !!! (emphasis, not triple factorial)
So this looks like a plain bug in [zlib].
Now to find the reason for the timing-dependent trigger...
Random hypothesis: since this tiny httpd is coro-based, yielding to the event loop, timing affects the exact order of stack activation. Could a bug there corrupt the big byte array ?
blocking variant of httpd11.tcl
Hypothesis proven wrong. The attached patch to httpd11.tcl turns it from overlapped, coro-based, to blocking, one-client-at-a-time, and still displays the bug.
Hypothesis proven wrong. The attached patch to httpd11.tcl turns it from overlapped, coro-based, to blocking, one-client-at-a-time, and still displays the bug.
Just figured the 435 bytes are the valid result of compressing with gzip the highly repetitiver 130k input file :} So, was barking up the wrong tree. Now to look on the receive side.
Further investigation shows that the problem arises when a single
read/recv() contains both uncompressed and compressed data . Clearly in
that case the buffered zlib'd input is not properly passed down the stack.
Reassigning to Andreas for (1) an explanation of how this is supposed to work in the general case and (2) a review of how the zlib stacked channel (mis)handles it.
Alexandre,
can you tell me which of the failing test cases you investigated more deeply and are referencing in your comment ? Also, can you send me a patch of any changes you made to the sources for your investigation (i.e. additional output and such), if any.
For comparison look at tclIORTrans.c. That is the 'generic' transformation, i.e. the foundation for the 'chan push ' (TIP 230).
Sure: http11-2.6 "-channel,encoding gzip,non-chunked" is the one I looked more into, though with the strace slowdown the three 2.[678] fail equally. Clearly the compression method doesn't matter.
Ah, for traces I didn't change anything to the sources, just tcpdump/strace.
Now attaching a much simpler test case: no HTTP, no socket, just a standalone [chan pipe].
I'm sending the usual uncompressed/compressed mix, and the bottom line is that only the first 4096 bytes of decompressed data are received (out of 5000 send), while the compressed size is 41.
I wonder why it is 4096 here and 8192 in the original test (maybe socket vs pipe ?)
Anyway hope this simpler test helps.
much simpler failing case
Regarding "that case the buffered zlib'd input is not properly passed down the stack."
You are looking into the wrong direction. This is about _input_. Input walks _up_ the stack.
And while a transformation is able to flush data waiting in its state down the stack on chan pup, thus flushing _write_ buffers, the same is not possible for data waiting _read_ buffers/state. They have nowhere to go. Reading is demand-driven from the core I/O layer, the driver cannot announce 'hey here is some data to splace into your input buffer queue.
This means data waiting to be read at the time 'chan pop' is done (or C equivalent) is _lost_.
And this seems to me to be case here. Sometimes, the 'close' is done early, before all the decompressed data has been read out of the zlib state.
I have seen the http-2.6 test failing and succeeding here now. More fail, but sometimes it is ok. And for that case my trace of the zlib transform internals shows 16 calls of ZlibTransformInput after the initial call of same which pulled the data from the base channel. In the failing case none of these additional calls happen, only the ZlibTransformClose immediately after the initial call pulling data ... It really looks like a race to me where the close is done early, causing the pending data to be lost per the explanation above.