This report covers two issues regarding the usage of the TLS package with Tcl:
- empty reads when channels are readable according to [fileevent];
- data transport that stalls in specific cases.
Please see the attached file "tls_issues.tar.gz", which holds a detailed description of the issues as well as scripts that exercise and identify them at the script level.
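To make the first issue concrete, here is a minimal sketch of a readable handler that tolerates an empty read instead of treating it as an error or as end of stream. It is not taken from the attached scripts: the proc name onReadable and the ::state array are hypothetical, and the handler only relies on the standard [read], [eof], and [fileevent] commands.

```tcl
# Sketch only. onReadable and ::state are illustrative names,
# not part of tls_issues.tar.gz.
proc onReadable {chan blockSize} {
    set data [read $chan $blockSize]
    if {[string length $data] > 0} {
        # Normal case: collect whatever was delivered.
        append ::state($chan,data) $data
        return
    }
    if {[eof $chan]} {
        # Genuine end of stream: stop listening and close.
        fileevent $chan readable {}
        close $chan
        set ::state($chan,done) 1
        return
    }
    # Empty read without EOF: the readable event fired but no data
    # was delivered. Count it and wait for the next event.
    incr ::state($chan,emptyReads)
}

# Typical wiring (assumes $chan is an already opened channel):
#   fconfigure $chan -blocking 0 -translation binary
#   fileevent $chan readable [list onReadable $chan 4096]
```

Counting the empty reads rather than ignoring them keeps the symptom visible while the transfer carries on.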
The exercisable setup was isolated from a more realistic, compound setup that used HTTP.
Thanks also to Alexandre Ferrieux for his help, especially with the strace inspection.
We (Alex and I) feel that we have taken this as far as we efficiently can.
Since I can't tell which of the two (Tcl or TLS) can be held responsible, this report has been cross-filed at the Tcl tracker as a reference.
Logged In: YES
user_id=496139
Originator: NO
Erik, maybe you could bump the prio here too. I cannot, having no status in TLS :-}
Logged In: YES
user_id=113903
Originator: YES
OK, set priority to 8 (= the same priority as was assigned to the corresponding issue in the Tcl tracker).
The sample file used thus far to exercise "the choking channel" has received company ...
The tarball "more_stalling_gifs.tar.gz" holds four more of them.
Each of these gif images gets stuck at a different byte offset (that is, the notifier does not respond once the last bytes have arrived).
When the exercise for any of these files is repeated, the byte offset where the stalling occurs is always exactly the same.
All tests were performed with the scripts already attached here, on Linux with Tcl 8.5.7.
Found another gif that got stuck. Added as "stalling_6.gif".
The stalling reads issue was explored at the script level. The exploration revealed a consistent pattern that is determined by the following three variables: the length of the byte sequence sent to the recipient, the buffer size of the channel, and the block size by which reads are performed from the channel. Additionally, a still unexplained constant of 16384 bytes plays a role in the pattern. The rules by which this constant and these variables interact to produce the observed pattern are described in detail in section B of the report; see the attached file report.pdf.
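For readers without the tarball at hand, the three variables can be pictured with a small harness like the one below. This is a sketch under stated assumptions, not the script from the distribution: runTransfer, serveOnce, and drain are hypothetical names, and plain TCP on the loopback interface stands in for the TLS channel, so the sketch shows only where the three knobs sit and will not reproduce the stall. Reproducing it would mean swapping [socket] for tls::socket with suitable certificates on both ends.

```tcl
# Sketch only: all proc names are illustrative, and plain TCP
# replaces the TLS channel used in the report.
proc serveOnce {length chan addr port} {
    # Server side: send $length bytes, then close.
    fconfigure $chan -translation binary
    puts -nonewline $chan [string repeat x $length]
    close $chan
}

proc drain {chan blockSize length} {
    # Client side: one sized read per readable event.
    set data [read $chan $blockSize]
    incr ::xfer(received) [string length $data]
    if {$::xfer(received) >= $length} {
        set ::xfer(result) complete
    } elseif {[eof $chan]} {
        set ::xfer(result) truncated
    }
}

proc runTransfer {length bufferSize blockSize {timeoutMs 2000}} {
    set ::xfer(received) 0
    set srv [socket -server [list serveOnce $length] -myaddr 127.0.0.1 0]
    set port [lindex [fconfigure $srv -sockname] 2]
    set chan [socket 127.0.0.1 $port]
    # Variable 2: channel buffer size. Variable 3: read block size.
    fconfigure $chan -blocking 0 -translation binary -buffersize $bufferSize
    fileevent $chan readable [list drain $chan $blockSize $length]
    # If no further event arrives in time, report the transfer as stalled.
    set timer [after $timeoutMs [list set ::xfer(result) stalled]]
    vwait ::xfer(result)
    after cancel $timer
    catch {close $chan}
    close $srv
    return [list $::xfer(result) $::xfer(received)]
}
```

A call such as [runTransfer 20000 4096 1000] returns a two-element list: the outcome (complete, truncated, or stalled) and the number of bytes received, which is the byte offset of interest when a stall occurs.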
With these results, the problem domain has been narrowed down in terms of programming constructs at the script level. This shows that the problem is much more generic than the set of gif files that have exhibited the problem thus far.
Whether stalling will occur, and exactly at which position in the byte sequence, has proven to be predictable to a very large degree, provided that a read size is specified for the read commands. The predictability breaks down only when (non-empty) short reads occur.
As long as the issue hasn't been fixed, the pattern rules can be taken advantage of to work around the misbehaviour. Work-arounds, as well as other implications, are provided in section C of the report.
The results of the script-level exploration, combined with a superficial inspection of the C code of the TLS extension, led to the following proposed explanation of the pattern at the C level (see section E. in the report for details and reservations):
*Stalling occurs upon those specific occasions that a successful, script-level read command caused an internal buffer to become filled up or emptied exactly.*
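If that explanation holds, the leftover bytes sit in a buffer that the notifier no longer reports on, so one generic way to keep a transfer moving is to poll the channel on a timer in addition to [fileevent]. The sketch below is not one of the work-arounds from section C of the report, and it assumes that a further [read] would actually return the stalled bytes, which this thread does not confirm. The name pollRead and its arguments are hypothetical.

```tcl
# Sketch only: pollRead is an illustrative name. It invokes a read
# handler periodically, independent of any readable event.
proc pollRead {chan handler intervalMs} {
    # Stop polling once the handler has closed the channel.
    if {[lsearch -exact [file channels] $chan] < 0} {
        return
    }
    # Run the handler script at global level, as fileevent would.
    uplevel #0 $handler
    after $intervalMs [list pollRead $chan $handler $intervalMs]
}

# Typical wiring (assumes a handler proc such as "onReadable"):
#   fileevent $chan readable [list onReadable $chan 4096]
#   pollRead $chan [list onReadable $chan 4096] 50
```

Polling costs some latency and CPU, so it is a stopgap for as long as the cause is unknown rather than a fix.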
Along with this report comes an updated distribution tls_issues.2.tar.gz which holds the tools to verify the general outcomes of this exploration. May they provide directions for any further search for the cause of the misbehaviour.
Removed the gif files that up to now were the only witnesses of stalling behaviour. With the outcomes of the latest exploration (see the file report.pdf), you can now easily create and exercise your own stalling byte sequences in a much more generic and versatile way.
Testing environment, belonging to report.pdf
Added corrected distribution tls_issues.3.tar.gz
Excellent report and tools!
Now trying 'exercise 11 10 10' as suggested in the report, I see a random outcome: sometimes stalling, sometimes not. I also get both outcomes with strace, and the comparison shows that the main difference is that in the non-stalling case there are several recv() calls which yield EAGAIN. None in the stalling case... FWIW.
A few more observations. Still trying to make the minimal [exercise 11 10 10] a sure-fire reproduction, I played with priorities. Nicing the client and boosting (as root) the server brings the stall rate to 80%. Adding a third process busy-looping at normal priority brings the rate to 100%.
Given the fact that slowing down the client relative to the server helps, it is likely that single-stepping in gdb in the client could still be done in the stalling case (after removing the timeout).
I'm saying this in the hope that somebody who knows TLS will do exactly that ;-)