From: <apn...@ya...> - 2023-02-01 11:14:29
I’ve updated the document at Unicode in Tcl 9 (magicsplat.com) <https://www.magicsplat.com/tcl9/tcl9unicode.html> and included the fossil commit id for reference.

I agree that whether -strict is the default or not is a secondary question. The primary question to be answered is whether the combination of default behaviour, -strict and -nocomplain covers the required error handling behaviors. My answer is no.

When invalid byte sequences are encountered, at least the following behaviors are possibly desired in theory:

1. Treat as an error (either by raising an exception or via the -failindex mechanism)
2. Replace with an encoding-specific character in the target encoding (U+FFFD, question mark etc.)
3. Replace with a lossless internal representation (specific use cases: filenames, environment vars, system APIs)
4. Replace with a user-defined character
5. Replace with the numerically equivalent code point (Tcl 8 behavior and current default)
6. Discard the byte(s) (seen as an option in Python etc.)

As per the Unicode standard, options 1 and 2 are conformant. Option 3 is semi-blessed (as in recommended for specific use cases, as discussed in the write-up).

Tcl 9 implements option 1 (-strict) and option 5 (implicit default, albeit with some caveats for out-of-range values). I believe it is important to support (2) and (3); the former because applications expect it, the latter because it allows for correct operation when interacting with the system (see the write-up). (5) and (6) are in my opinion broken behaviors, but let us assume (5) at least is mandated for Tcl 8 compatibility.

Now the point of discussion may be this: if you think the standard-conformant (2) and (3) are not useful, now or in the future, then that becomes the point of debate. The argument is then whether (1) and (5) suffice for all time to come. However, if you agree that (2) and (3) are useful, or that other behaviors may be desirable in the future, the discussion becomes how best to add them in 9.0 or 9.1.
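
For readers who want to see the six behaviors side by side, here is a rough sketch in Python rather than Tcl, since option 6 above already points at Python. It is only an analogy under stated assumptions: `strict`, `replace`, `surrogateescape` and `ignore` are Python's built-in codec error handlers, and `surrogateescape` is just one possible lossless scheme for (3), not necessarily what Tcl would adopt. The handler names `user_char` and `numeric_fallback` are hypothetical, registered here purely to illustrate (4) and (5).

```python
import codecs

# A stray Latin-1 byte (0xE9) in the middle of otherwise valid UTF-8.
raw = b"caf\xe9!"


def register_fixed_char(name, ch):
    """Hypothetical handler for (4): substitute a caller-chosen character."""
    def handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            return ch, exc.end
        raise exc
    codecs.register_error(name, handler)


def register_numeric_fallback(name):
    """Hypothetical handler for (5): map each invalid byte to the code
    point with the same numeric value."""
    def handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return "".join(chr(b) for b in bad), exc.end
        raise exc
    codecs.register_error(name, handler)


register_fixed_char("user_char", "?")
register_numeric_fallback("numeric_fallback")

results = {}

# (1) Treat as an error; record the offending byte range.
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    results["1 error"] = (exc.start, exc.end)

# (2) Replace with U+FFFD.
results["2 replace"] = raw.decode("utf-8", errors="replace")
# (3) Lossless: the bad byte is carried as a lone surrogate.
results["3 lossless"] = raw.decode("utf-8", errors="surrogateescape")
# (4) Replace with a user-defined character.
results["4 user char"] = raw.decode("utf-8", errors="user_char")
# (5) Numerically equivalent code point (0xE9 becomes U+00E9).
results["5 numeric"] = raw.decode("utf-8", errors="numeric_fallback")
# (6) Discard the invalid byte.
results["6 discard"] = raw.decode("utf-8", errors="ignore")

# Only the lossless form can reproduce the original bytes.
round_trip = results["3 lossless"].encode("utf-8", errors="surrogateescape")

for label, value in results.items():
    print(label, ascii(value))
print("round trip ok:", round_trip == raw)
```

Note that in this sketch the behavior is picked by a single `errors` value, so the handlers are mutually exclusive by construction, and only (3) round-trips back to the original bytes.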
In the current -strict/-nocomplain model, one would likely have to add -replace (2), -lossless (3), -discard (6) etc. and the equivalent -encodinglossless 0/1, -encodingreplace 0/1 etc. to fconfigure. These are obviously mutually exclusive. Having multiple mutually exclusive options is confusing and not good design.

Following TIP 654, the model if not the specifics, we would instead have

  -profile strict, fconfigure -encodingprofile strict (1)
  -profile replace, fconfigure -encodingprofile replace (2)
  -profile lossless, fconfigure -encodingprofile lossless (3)
  -profile \UXXXX, fconfigure -encodingprofile \UXXXX (4) (meh, not sure I like that)
  -profile tcl8, fconfigure -encodingprofile tcl8 (5)
  -profile discard, fconfigure -encodingprofile discard (6)

etc., which I think is a much cleaner, more extensible interface.

/Ashok

From: Jan Nijtmans <jan...@gm...>

Since those 2 commits fix inconsistencies in the use of -strict, it would be useful to have Ashok's document updated, checking whether all inconsistencies reported regarding the use of "-strict" are gone now. It doesn't make sense to start a discussion on making "-strict" the default in Tcl 9.0 if there's still a discussion on what -strict should do.

One thing is for sure: when using '-strict' (without -failindex), an exception should be thrown for any 'illegal' bytes or code points. I don't want to discuss 'illegal': that's different for every encoding (although it should be clear for utf-8/-16/-32). Not throwing an exception when using -strict and encountering 'illegal' bytes or code points is a bug. Please report it (unless there's already a ticket for it), and, even better, provide a test case and/or patch.

Do we have an agreement on what '-strict' is supposed to do?

See also: https://core.tcl-lang.org/tips/doc/trunk/tip/346.md

Regards,
Jan Nijtmans