From: <apn...@ya...> - 2022-12-27 09:36:16
+1

If there is a decoding error on a read or gets, I think it is completely appropriate, and desirable, to immediately raise an error and not bother preserving or returning any data already decoded on that read. Any further reads should be disallowed. I see little use for an application to process partially read data on an ill-formed input stream.

/Ashok

-----Original Message-----
From: Rolf Ade <tcl...@po...>
Sent: Tuesday, December 27, 2022 7:00 AM
To: tcl...@li...
Subject: Re: [TCLCORE] More on I/O with Tcl 9

Some of the new and long-desired Tcl 9 features necessarily require, under certain circumstances, a new behaviour from familiar and often-used commands such as "read" or "gets".

In Tcl 8 (and before) the "read" command "reads all of the data from channelID up to the end of the file", as the "read" man page describes the behaviour. You typically did:

    set fd [open "some.file"]
    fconfigure $fd -encoding utf-8
    set data [read $fd]
    close $fd

Tcl 9 adds the feature "-strictencoding" to channels. If you want to use this feature, the behaviour of the "read" command has to change - it somehow has to report that, for example, a UTF-8 encoding error has occurred in the data read from the channel.

With the current Tcl 9 development version (trunk), if an encoding error happens while the "read" command reads data from a channel, the command returns the data read so far without any obvious sign of error. (Since ticket https://core.tcl-lang.org/tcl/info/b8f575aa2398b0e4 you can decide from [eof $fd] what happens.) Only the next read from the same channel will raise a Tcl error.

This behaviour of "read" in Tcl 9 surprises me. Up until now a Tcl core command was either able to do what it was asked for and returned TCL_OK (and a result), or it raised TCL_ERROR. The "read" command in Tcl 9 does something in-between. If it cannot read all data from a channel because of an encoding error - it can't do what it was asked for because of an error - it returns TCL_OK and the data read up to this point.
Only the next read from that channel will raise an error.

I'm not sure which TIP announced this new "read" behaviour, although I've checked those which seemed to be related. Perhaps someone can help me with a link? TIP 633 (https://core.tcl-lang.org/tips/doc/trunk/tip/633.md), for example, talks about handling encoding errors by throwing "an error on the corresponding commands". But I can't find the current trunk behaviour described there. Is this seen as an implementation detail, not worth discussing?

The behaviour seems unnecessarily laborious to me, because what you will then have to do every time is:

    set fd [open "some.file"]
    fconfigure $fd -encoding utf-8 -strictencoding 1
    set data ""
    while {![eof $fd]} {
        if {[catch {append data [read $fd]}]} {
            # Handle error
        }
    }
    close $fd

Instead, I suggest simply raising TCL_ERROR as soon as an encoding error is detected on a channel which was configured with -strictencoding 1. That would simplify the above to:

    set fd [open "some.file"]
    fconfigure $fd -encoding utf-8 -strictencoding 1
    if {[catch {set data [read $fd]}]} {
        # Handle error
    }
    close $fd

I take it as given that the current Tcl channel system is unable to return a character position together with the I/O error code. It is true that if you follow my proposal you will have only the byte position of the error (per [tell $fd]) in the error handling code, while the current behaviour also provides the character position (per [string length $data]). But does this justify putting the burden of this boilerplate on everybody in every case where -strictencoding 1 is used?

Apparently, the "gets" command in Tcl 9 will/shall work like the "read" command. At the moment it does not work in such situations (it hangs), see https://core.tcl-lang.org/tcl/info/154ed7ce564a7b4c. I understand that "gets" will simply return the data read so far on an encoding error, and only the next [gets] will raise an error.
The typical usage pattern of "gets" is a loop like this:

    set chan [open "some.file.txt"]
    while {[gets $chan line] >= 0} {
        # process $line
    }
    close $chan

With the new "gets" behaviour there's a good chance that some input checking code will raise an error while processing $line, because of the short read caused by the encoding error (e.g. you import CSV data and $line does not have the expected number of columns). The resulting error message would not point at the real cause.

As is the case with "read", it seems better to me if "gets" raised an error immediately. Again, I was unable to find a TIP which announced this new script-level behaviour.

Every other language I'm aware of with a similar feature raises an error right away in such a situation. Of course this is no argument, but it shows that others had a similar language design problem and decided differently than current trunk. Tcl is free to do things its own way. I don't see why it does here. And I can't find an explanation in the TIPs either.

rolf
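
As an illustration of the remark above about other languages, Python is one example where a whole-stream read on a strictly decoded text stream fails at once instead of handing back the data decoded so far. The sketch below is only an illustration under that assumption. The helper names read_strict and try_read are hypothetical, and the input is an in-memory byte string rather than a file.

```python
import io


def read_strict(raw: bytes) -> str:
    """Decode a whole byte stream as strict UTF-8 with a single read()."""
    # errors="strict" makes the text layer raise on the first invalid sequence.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="strict")
    return stream.read()


def try_read(raw: bytes):
    """Return (text, None) on success, or (None, error) on a decoding failure."""
    try:
        return read_strict(raw), None
    except UnicodeDecodeError as exc:
        # No partially decoded text is available here, only the exception.
        return None, exc


if __name__ == "__main__":
    print(try_read("grüße".encode("utf-8")))  # well-formed input
    print(try_read(b"abc\xffdef"))            # 0xff is never valid in UTF-8
```

The caller handles the failure in one place, which corresponds to the single [catch] around [read] in the simplified Tcl form proposed above.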