From: SourceForge.net <no...@so...> - 2012-03-07 14:34:49
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by dkf
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 44. UTF-8 Strings
Group: current: 8.5.11
Status: Open
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Donal K. Fellows (dkf)
Assigned to: Jan Nijtmans (nijtmans)
Summary: BOM in Unicode

Initial Comment:
I was reading about the problems that some people are having with Tcl
scripts on Windows due to that platform's insistence on putting a
byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're
stuck with it.) For reference:
https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30

I was wondering if the most effective way of dealing with this would be to
make Tcl treat a stray BOM as whitespace for the purpose of script parsing?
I don't know exactly how practical this is, but it would make the most
pressing part of the problem Go Away.

----------------------------------------------------------------------

>Comment By: Donal K. Fellows (dkf)
Date: 2012-03-07 06:34

Message:
But then we'd need to deal with the problem of how to send a Tcl_Obj
through the channel API, and that's an API that may cross thread boundaries
(which Tcl_Obj values _must not_ due to the way their memory is managed)
and it's going to be hard to make it all work with source potentially
getting data out of a VFS.

Find something else to optimize. Something easy.

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-07 06:09

Message:
Because of the Tcl_GetEncoding/Tcl_FreeEncoding pair. The corresponding
part of Tcl_SetChannelOption could be extracted or extended into something
like "Tcl_SetChannelEncoding"/"Tcl_GetChannelEncoding", or something like
"Tcl_SetChannelObjOption".
The idea here would be to use the function "Tcl_GetEncodingFromObj"...

I'm an optimizer ad infinitum :)

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-03-07 05:54

Message:
What's the problem with “encodingName” being a “const char *”? That's the
type of the argument to Tcl_SetChannelOption…

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-06 06:31

Message:
A pity that the parameter "encodingName" of the function "Tcl_FSEvalFileEx"
is not a Tcl_Obj.

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-06 06:23

Message:
Commit of another solution (1) with auto recognition (without fixed
cpBomTable).
See http://core.tcl.tk/tcl/info/8da0451f94

Tests:
source.test: Total 31 Passed 31 Skipped 0 Failed 0

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-02 02:05

Message:
The solution with cpBomTable is good, but theoretically the parameter
'encodingName' could be another single-byte encoding such as iso8859-X,
etc. So the prepared array 'cpBomTable' would have to grow, and it would
no longer be practical.

I see 2 solutions here:
1) read the first 4 bytes as binary; if they are not a BOM, convert them
according to the given 'encodingName', set the channel encoding to
'encodingName', and read further.
2) read the first character as utf-8; if it is not a BOM, seek to the
start, set the channel encoding to 'encodingName', and read further.

I'm now trying solution number 1.
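
The head check in option 1 above can be sketched in plain C, independent of
the Tcl channel API. This is only an illustration: the HeadDecision type and
the decide_from_head name are hypothetical, and it assumes that a UTF-8 BOM
means the rest of the file is decoded as utf-8, which the comment implies but
does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: which encoding to decode with, and how many
 * leading bytes of the already-read head to discard. */
typedef struct {
    const char *encoding;
    size_t skip;
} HeadDecision;

/* Inspect the first bytes of a script that were read in binary mode.
 * head/len: up to 4 bytes already read; requested: caller's encodingName.
 * No seek is needed: the caller decodes head[skip..len) itself. */
static HeadDecision decide_from_head(const unsigned char *head, size_t len,
                                     const char *requested)
{
    static const unsigned char utf8_bom[3] = { 0xEF, 0xBB, 0xBF };
    HeadDecision d;

    assert(head != NULL || len == 0);
    if (len >= 3 && memcmp(head, utf8_bom, 3) == 0) {
        /* BOM found: drop it and treat the remainder as UTF-8. */
        d.encoding = "utf-8";
        d.skip = 3;
    } else {
        /* No BOM: every byte read so far belongs to the script. */
        d.encoding = requested;
        d.skip = 0;
    }
    return d;
}
```

The bytes after `skip` still have to be converted with the chosen encoding
before the channel encoding is set for the rest of the read; that step is
where the real Tcl code would be needed.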
----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-03-01 14:49

Message:
Re-opening, because one situation is not handled yet, which can cause
problems.

On Windows, normally the system encoding is cp1252 (actually cp1250-1258).
The previous part handles only the situation where the encoding is set to
utf-8 explicitly. However, the BOM indicates that the remainder should be
handled as utf-8, regardless of the system encoding.

Therefore, I created a new branch bug-3466099, meant as an experiment
(again). What it does:
If the system encoding is cp125[0-8] or identity and the file starts with
a BOM, the BOM is skipped and the encoding is automatically set to utf-8
while reading the remainder of the file.

This experiment is committed in branch bug-3466099. Remarks more than
welcome. Test cases are source-2.8 up to source-2.17 for the different
system encodings.

Specific question: is there any other encoding commonly encountered on
Windows which should be handled the same way?

Regards,
Jan Nijtmans

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-29 14:48

Message:
Committed to core-8-4-branch, core-8-5-branch and trunk.

The situation as described in the above reference, where the Tcl script
file started with a BOM, now works as expected.

I don't think that putting a BOM at the start of a UTF-8 file is wrong, see
http://unicode.org/faq/utf_bom.html

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is
transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM
will be whatever the Unicode character U+FEFF is converted into by that
transformation format. In that form, the BOM serves to indicate both that
it is a Unicode file, and which of the formats it is in.

Now Tcl conforms to that, which it never did.
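
The FAQ answer quoted above can be checked by hand: converting U+FEFF with
each transformation format yields the familiar BOM byte sequences. The
following is a minimal sketch for illustration; the helper names utf8_encode
and utf16_encode_bmp are made up here and are not part of Tcl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (at most U+10FFFF; surrogates are not rejected)
 * as UTF-8. Returns the number of bytes written to out (1 to 4). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Serialize one 16-bit code unit as UTF-16 in the given byte order. */
static void utf16_encode_bmp(uint16_t unit, int big_endian,
                             unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);

    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

For U+FEFF, utf8_encode writes the three bytes EF BB BF, and
utf16_encode_bmp writes FE FF (big-endian) or FF FE (little-endian).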
----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-20 07:14

Message:
Thanks! Yes, I agree with your changes.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-02-20 05:16

Message:
Your test passes and I think it is correctly testing this feature. (I made
the test clearer so that it has the file contents set up in the test body;
that's clearer if the test fails.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-19 07:27

Message:
New attempt in bug-3466099 branch (threw away the old one).

Tcl_FSEvalFileEx now throws away the BOM when it is the first character in
the stream. If the encoding is set correctly (e.g. to UTF-8) this will
work on Unix and Windows. Added test case source-2.7 to prove that.

Advantage: no seek is needed, unlike the previous implementation. So it is
harmless for Tcl 8.4/8.5 as well.

This could be improved by adding a new encoding named "", which is almost
the same as the system encoding. The only difference is that, if the first
character of the stream is a BOM in any of the UTF-8 or UTF-16 forms, it
will switch to that encoding; otherwise it will behave exactly like the
system encoding. I'll leave that for some other day (and for Tcl 8.6 only).

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-09 07:07

Message:
>I think that when Tcl_FSEvalFileEx()
>receives a non-NULL value of encodingName,
>that request ought to be honored.

Agreed, but it's a little bit trickier. If the user explicitly specifies
"utf-8" or "unicode", that should be honored too.

Actually, I am thinking about splitting the functionality between
Tcl_FSEvalFileEx and the encoding machinery.
Currently the encoding "" is synonymous with the system encoding.
We could also create an additional encoding with the name "", which reads
the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to
the corresponding encoding (utf-8, utf-16, ....); otherwise go on using the
default system encoding. Then the only thing left to be done in
Tcl_FSEvalFileEx is to strip the BOM.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2012-01-09 06:12

Message:
I think that when Tcl_FSEvalFileEx() receives a non-NULL value of
encodingName, that request ought to be honored.

When the caller hasn't made an explicit request, then I can see some value
in using BOM contents as a way to make a better guess than blindly using
the system encoding.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 15:32

Message:
First attempt implemented in branch bug-3466099.

Donal, do you see any negative effects of this? The disadvantage is that
any stream which does not contain a BOM will need to seek to the start, and
be read again in (possibly) another encoding... Still, I think this is the
way I would go.

Any feedback is highly appreciated!

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-12-28 10:13

Message:
Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx
assume that the file's contents are supposed to be a script and so do a bit
more magic than normal.

(Theoretically, we also ought to think about doing progressive evaluation
of "large" files, say over 1MB. That's for another time.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 01:26

Message:
I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM
as first character (in any of the forms allowed by Unicode), it would
switch the encoding accordingly.
Then it would work with UTF-16 as well, in both little- and big-endian
formats. It will be about the same amount of work.

----------------------------------------------------------------------
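
Switching the encoding according to the BOM form, as proposed in the
2011-12-28 comment, could be driven by a small signature table. This is a
sketch under assumptions, not the code from the bug-3466099 branch: the
BomForm type, the match_bom function and the encoding labels are
illustrative (the labels are not necessarily names that Tcl's encoding
system accepts), and the UTF-32 entries go beyond what the thread discusses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One BOM signature: an illustrative encoding label plus its bytes. */
typedef struct {
    const char *encoding;
    unsigned char bytes[4];
    size_t len;
} BomForm;

/* Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
 * the UTF-16LE BOM (FF FE), so it has to be tried before it. */
static const BomForm bom_forms[] = {
    { "utf-32le", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
    { "utf-32be", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
    { "utf-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
    { "utf-16le", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
    { "utf-16be", { 0xFE, 0xFF, 0x00, 0x00 }, 2 }
};

/* Return the matching signature for the first len bytes of a file,
 * or NULL when the data does not start with a known BOM. */
static const BomForm *match_bom(const unsigned char *head, size_t len)
{
    size_t i;

    assert(head != NULL || len == 0);
    for (i = 0; i < sizeof bom_forms / sizeof bom_forms[0]; i++) {
        if (len >= bom_forms[i].len
                && memcmp(head, bom_forms[i].bytes, bom_forms[i].len) == 0) {
            return &bom_forms[i];
        }
    }
    return NULL;
}
```

A caller would skip `len` bytes of the match and decode the rest with the
matched encoding, falling back to the requested or system encoding on NULL.
One known ambiguity: a UTF-16LE file whose first character after the BOM is
U+0000 is indistinguishable from UTF-32LE, which should not occur in a
real script.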