From: <apn...@ya...> - 2025-05-21 02:40:57
All,

We need to make a collective decision regarding the default encoding issues on Windows. In the last meeting, I promised to write up a summary of the status and considerations for the problem being addressed by TIPs 716 and 718. Here it is, along with the choices we have as to the next step. Although specific to Windows, the basic issue does not need any deep understanding of Windows, so I hope folks will read through and opine on the matter without copping out that they do not have Windows knowledge :-).

For those who have not read 716 or 718 (particularly the Background section in 716) or did not find the explanations clear, here is an example scenario that the two TIPs are independently addressing. I hope it will shed some light.

A client library, linked to a Tcl extension, shares data (config info, db content etc.) with another program such as a server. The encoding used for the data is the one returned by the Windows GetACP() call, which returns the user's code page. Note that neither the library nor the server side is aware of Tcl's existence in any form, except for the library being linked to a Tcl extension loaded into tclsh.

This scenario worked fine in Tcl 8.6 because both the client and server ends saw the same encoding value returned from GetACP().

Tcl 9.0 broke the scenario because of the addition of the activeCodePage entry in the executable manifest for tclsh and wish. The presence of this setting means all calls to GetACP() from that process, including those from the loaded client library, will return the UTF-8 code page irrespective of the user setting. The breakage may be because the library now encodes using utf-8 while the server component encodes using the user's real code page. Or it may be because the library cannot handle non-DBCS multibyte encodings. It does not really matter. The primary point is that the issue cannot be fixed by modifying or recompiling the extension.
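To make the first failure mode concrete, here is a small sketch of the mismatch. It is only a model, not code from either TIP: cp1252 is an assumed stand-in for the user's code page, and client_send / server_receive are hypothetical names for the library and server ends, each of which encodes or decodes with whatever its own process would get back from GetACP().

```python
# Assumed stand-ins, not taken from the TIPs: cp1252 plays the user's code
# page; utf-8 is what a process carrying the activeCodePage manifest sees.
USER_CODE_PAGE = "cp1252"
MANIFEST_CODE_PAGE = "utf-8"


def client_send(text, code_page):
    """Hypothetical client library: encode with its process's code page."""
    return text.encode(code_page)


def server_receive(data, code_page):
    """Hypothetical server: decode with its own process's code page."""
    return data.decode(code_page, errors="replace")


message = "café"

# Tcl 8.6: both ends see the user's code page, so the text round-trips.
same = server_receive(client_send(message, USER_CODE_PAGE), USER_CODE_PAGE)

# Tcl 9.0: the client (loaded into tclsh) sees utf-8, while the server
# still sees the user's code page, so non-ASCII text is misread.
mixed = server_receive(client_send(message, MANIFEST_CODE_PAGE), USER_CODE_PAGE)

print(same)   # café
print(mixed)  # cafÃ©
```

Pure ASCII data survives either way, which is why the problem only shows up once non-ASCII content is exchanged.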
Moving forward, we can deal with this problem in one of several ways:

(1) Remove the activeCodePage entry from the manifest with no other changes. This will revert 9.x behavior on Windows to that of 8.6 (and also make it consistent with non-Windows platforms, since user settings are not ignored on them).
(2) Ignore the issue. In scenarios where this is an issue, users can continue to use 8.6 or build tclsh/wish themselves after removing the activeCodePage entry from the manifest.
(3) TIP 716 as detailed below.
(4) TIP 718 as detailed below.

Going with (1) means that 9.0.2 will not be compatible with 9.0.1 with respect to default encodings. For example, non-ASCII files written with writeFile in 9.0.1 will not be readable using readFile in 9.0.2 and vice versa. This will be a pain point for users. Further, if UTF-8 is made the default for Tcl on all platforms at some point in the future, as is possible, there will be a whole other round of incompatibility headaches. So as much as I wish the UTF-8 defaulting had never been made without discussion, 9.0 is already out there, so I am not in favour of (1).

Option (2) may be a candidate for discussion. We can just tell users in this situation to stick to 8.6 or build their own Tcl, with the understanding that it will have similar compatibility issues with respect to "standard" Tcl as in (1) above. Not our problem, so to speak. But also to be considered is the fact that TPC/HammerDB is currently one of the most popular Tcl applications based on public download numbers. Similar situations may arise with other large applications. I am therefore skeptical that this is a viable solution.

Details behind options (3) and (4) are in the respective TIPs. Leaving aside the minor "reconcilable" differences with respect to implementation of [encoding user] or [exec -encoding], here is the crucial difference between the two.

* TIP 716 removes the manifest entry from tclsh and wish. GetACP() then returns the user's code page setting from the registry, allowing HammerDB-like applications to run. At the same time, in order to keep compatibility with 9.0.0/1, it hardcodes utf-8 as the default encoding returned by [encoding system] et al. In my opinion, if we indeed wanted to force utf-8 onto users, this is the way it should have been done in the first place.

* TIP 718 deals with the problem by proposing to ship two Tcl shells: the current tclsh and a second one, tclshc, which has the manifest entry removed. The tclsh shell will behave identically to the one in 9.0.0/1. tclshc will behave substantially the same as under TIP 716 and can be used by HammerDB etc. which do not work with tclsh.

The primary hesitation I have with 718 is this dual shell approach and the potential for confusion. Somehow a user needs to know to use tclsh for all scripts. Except (for example) when accessing DB2. And X, Y and Z as the case may be. How are they to make that determination? Will a scripted application author now have to tell the user which Tcl shell to use? In fairness, most scripts will work with either. Still, this is a point of potential confusion. It is also the case that extension writers will now have to test their extensions with both variations (not that they will!).

The advantage of tclsh (but not tclshc) in 718 over 716 (I am paraphrasing Jan here from the TIP 718 rationale, so he can clarify if I got it wrong) is that extensions that make use of the Windows ANSI API will automatically get UTF-8 support, thereby supporting the entire Unicode range. With TIP 716, as with 8.6, the ANSI APIs will work only as long as the characters are supported by the user's code page.
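As a rough illustration of that last limitation, the sketch below models text passing through a narrow-character (ANSI) interface as an encode/decode round trip through the process code page. This is an approximation under stated assumptions: cp1252 is an assumed user code page, ansi_roundtrip is a hypothetical helper, and the "?" substitution only mimics the lossy conversion, not the exact Windows behavior.

```python
USER_CODE_PAGE = "cp1252"  # assumed user code page, for illustration only


def ansi_roundtrip(text, code_page):
    """Model text passing through a narrow-character (ANSI) interface.

    Characters the code page cannot represent are replaced with '?',
    approximating a lossy conversion to the process code page.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


western = "naïve"
japanese = "日本語"

# User's code page (8.6, TIP 716, tclshc): only covered characters survive.
print(ansi_roundtrip(western, USER_CODE_PAGE))   # naïve
print(ansi_roundtrip(japanese, USER_CODE_PAGE))  # ???

# UTF-8 as the process code page (tclsh in 9.0.0/1 and TIP 718).
print(ansi_roundtrip(japanese, "utf-8"))         # 日本語
```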
My counterpoint to this argument is that (a) extensions should not be using Windows ANSI APIs in the first place, they should use the Unicode APIs, (b) since 8.6 did not support this utf-8+ANSI API combination, such extensions are likely rare, or old and in need of an update anyway, and (c) the "automatic" support is likely overstated, as not all APIs and libraries support UTF-8 even when it is set as the code page (see TIP 716).

The question to answer is then whether this perceived benefit of 718 outweighs the downside of shipping two tclsh variations.

A lesser issue is that 718 does not include the -encoding option to exec, as I believe Jan thinks it should be a separate TIP. That is not difficult to resolve if we agree the option is needed.

The decision for the community to make now is which of the four options listed above makes sense moving forward. It is important that we make a decision for 9.0.2, as the longer we wait, the more 9.0.0/1 applications are going to be out there that might face this issue.

/Ashok