From: <apn...@ya...> - 2023-02-03 03:56:12
The idea has some merit but I have a couple of concerns with the approach
below.
At first glance it tackles a different problem than what is being discussed.
It addresses configuration of what is to be considered an invalid byte
sequence. It does not address how a sequence considered invalid is to be
handled (map to U+FFFD, map to lossless, map to numeric equivalent etc.).
Now one could add those as additional dictionary options/keys, but that
increases complexity from a user perspective (what does "strict 1 surrogates
0 invalid 1" etc. mean?). Moreover, in the vast majority of cases the user /
application does not care where the error stems from (the exception being
the needmoredata case, which is a separate category discussed elsewhere). It
feels like over-generalization to me.
Second, and possibly more important, I foresee considerable implementation
complexity in the encoders to handle this fine-grained, "tunable"
configuration. Particularly so since there is currently no mechanism to pass
this down into the encoder "call chains", and adding one would entail API
changes. Of course, I might be wrong and a prototype implementation could
immediately refute this "implementability" concern.
/Ashok
From: Peter Da Silva <pet...@fl...>
Sent: Thursday, February 2, 2023 10:49 PM
To: Poor Yorick <org...@po...>; Tcl Core List
<tcl...@li...>
Subject: Re: [TCLCORE] Unicode in Tcl 9 - a commentary and critique
I really like this idea. It also adds the option of turning flags off (e.g.
{strict 0}).
The value of "-encoding" could be a dictionary:
chan configure $chan -encoding {name utf-8 strict 1 surrogates 0 ...}
If the number of items in the list is odd, "name" could be implied:
chan configure $chan -encoding {utf-8 strict 1 surrogates 0 ...}
chan configure $chan -encoding utf-8
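
As a rough illustration of the odd/even rule above, the three forms could be
normalized to a single dictionary along these lines. This is only a sketch in
script-level Tcl: the proc name normalizeEncodingSpec is hypothetical, and
the keys strict and surrogates are taken from the proposal, not from any
existing Tcl option.

```tcl
# Hypothetical helper (not part of Tcl): turn a proposed -encoding value
# into a dict that always carries an explicit "name" key.
proc normalizeEncodingSpec {spec} {
    # An odd number of elements means the leading element is the implied name.
    if {[llength $spec] % 2 == 1} {
        set spec [linsert $spec 0 name]
    }
    # An even-length value must already be a dict containing "name".
    if {![dict exists $spec name]} {
        error "encoding spec lacks a name: $spec"
    }
    return $spec
}

# All three spellings from the proposal reduce to a dict keyed by "name".
puts [normalizeEncodingSpec utf-8]
puts [normalizeEncodingSpec {utf-8 strict 1 surrogates 0}]
puts [normalizeEncodingSpec {name utf-8 strict 1 surrogates 0}]
```

Note that this covers only the parsing of the option value; it says nothing
about how the flags would reach the encoders themselves.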