From: SourceForge.net <no...@so...> - 2011-12-27 17:31:20
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Tracker Item Submitted) made by dkf
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.
Category: 45. Parsing and Eval
Group: current: 8.5.11
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Donal K. Fellows (dkf)
Assigned to: Jan Nijtmans (nijtmans)
Summary: BOM in Unicode

Initial Comment:
I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference:
https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30

I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away.

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2011-12-28 09:26:48
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by nijtmans

>Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 01:26

Message:
I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM as the first character (in any of the forms allowed by Unicode), it would switch the encoding accordingly. Then it would work with UTF-16 as well, in both little- and big-endian formats. It will be about the same amount of work.

----------------------------------------------------------------------
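To make the BOM-sniffing proposal above concrete, here is a minimal sketch in C. It is not code from the bug-3466099 branch: DetectBom is a hypothetical helper, the returned encoding labels are illustrative rather than confirmed Tcl encoding names, and the inclusion of the 4-byte UTF-32 signatures is an assumption (the thread itself only names UTF-8 and UTF-16).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Tcl API: inspect the first bytes of a
 * file and report which Unicode encoding its BOM announces, if any.
 * Returns an illustrative encoding label, or NULL when no BOM is present.
 * *bomLen receives the number of bytes the BOM occupies (0 if none). */
const char *
DetectBom(const unsigned char *buf, size_t len, size_t *bomLen)
{
    /* Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with
     * the UTF-16LE BOM (FF FE), so it has to be tested before it. */
    static const struct {
        const char *name;
        unsigned char bytes[4];
        size_t size;
    } boms[] = {
        {"utf-32le", {0xFF, 0xFE, 0x00, 0x00}, 4},
        {"utf-32be", {0x00, 0x00, 0xFE, 0xFF}, 4},
        {"utf-8",    {0xEF, 0xBB, 0xBF, 0x00}, 3},
        {"utf-16le", {0xFF, 0xFE, 0x00, 0x00}, 2},
        {"utf-16be", {0xFE, 0xFF, 0x00, 0x00}, 2},
    };
    size_t i;

    assert(buf != NULL && bomLen != NULL);
    for (i = 0; i < sizeof(boms) / sizeof(boms[0]); i++) {
        if (len >= boms[i].size
                && memcmp(buf, boms[i].bytes, boms[i].size) == 0) {
            *bomLen = boms[i].size;
            return boms[i].name;
        }
    }
    *bomLen = 0;
    return NULL;
}
```

One known ambiguity of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. A script file is unlikely to start that way, but a real implementation would have to decide how to treat it.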
From: SourceForge.net <no...@so...> - 2011-12-28 18:13:29
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by dkf

>Comment By: Donal K. Fellows (dkf)
Date: 2011-12-28 10:13

Message:
Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx assume that the file's contents are supposed to be a script and so do a bit more magic than normal. (Theoretically, we also ought to think about doing progressive evaluation of "large" files, say over 1MB. That's for another time.)

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2011-12-28 23:32:31
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by nijtmans

>Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 15:32

Message:
First attempt implemented in branch bug-3466099

Donal, do you see any negative effects of this? The disadvantage is that any stream which does not contain a BOM will need to seek to the start, and be read again in (possibly) another encoding... Still, I think this is the way I would go. Any feedback is highly appreciated!

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2011-12-29 15:42:41
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Settings changed) made by nijtmans

>Category: 44. UTF-8 Strings

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2012-01-09 14:12:17
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by dgp

>Comment By: Don Porter (dgp)
Date: 2012-01-09 06:12

Message:
I think that when Tcl_FSEvalFileEx() receives a non-NULL value of encodingName, that request ought to be honored. When the caller hasn't made an explicit request, then I can see some value in using BOM contents as a way to make a better guess than blindly using the system encoding.

----------------------------------------------------------------------
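The precedence Don Porter describes (explicit request first, then the BOM, then the system encoding) can be sketched as below. This is an illustration only: ChooseScriptEncoding is a hypothetical function, not something in Tcl, the encoding labels are placeholders, and only the UTF-8 and UTF-16 signatures are checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pick the encoding to use for a script file.
 * explicitName  - encoding requested by the caller, or NULL for "no request"
 * head, headLen - the first bytes of the file
 * systemName    - name of the system encoding, used as the last resort */
const char *
ChooseScriptEncoding(const char *explicitName,
                     const unsigned char *head, size_t headLen,
                     const char *systemName)
{
    assert(head != NULL && systemName != NULL);

    /* An explicit request is always honored; the BOM is not consulted. */
    if (explicitName != NULL) {
        return explicitName;
    }
    /* No request: a BOM is a better hint than the system encoding. */
    if (headLen >= 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0) {
        return "utf-8";
    }
    if (headLen >= 2 && memcmp(head, "\xFF\xFE", 2) == 0) {
        return "utf-16le";
    }
    if (headLen >= 2 && memcmp(head, "\xFE\xFF", 2) == 0) {
        return "utf-16be";
    }
    /* Nothing to go on: fall back to the system encoding. */
    return systemName;
}
```

With this ordering, a caller that passes an explicit encoding name gets exactly that encoding even if the file starts with a BOM for a different one.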
From: SourceForge.net <no...@so...> - 2012-01-09 15:07:32
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by nijtmans

>Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-09 07:07

Message:
>I think that when Tcl_FSEvalFileEx()
>receives a non-NULL value of encodingName,
>that request ought to be honored.

Agreed, but it's a little bit trickier. If the user explicitly specifies "utf-8" or "unicode", that should be honored too.

Actually, I am thinking about splitting the functionality between Tcl_FSEvalFileEx and the encoding machinery. Currently the encoding "" is synonymous with the system encoding. We could also create an additional encoding with the name "", which reads the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to the corresponding encoding (utf-8, utf-16, ...); otherwise go on using the default system encoding. Then the only thing left to be done in Tcl_FSEvalFileEx is to strip the BOM.

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2012-02-19 15:27:38
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by nijtmans

>Resolution: Fixed

>Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-19 07:27

Message:
New attempt in bug-3466099 branch (threw away the old one)

Tcl_FSEvalFileEx now throws away the BOM when it is the first character in the stream. If the encoding is set correctly (e.g. to UTF-8) this will work on Unix and Windows. Added test case source-2.7 to prove that.

Advantage: no seek is needed, unlike in the previous implementation. So it is harmless for Tcl 8.4/8.5 as well.

This could be improved by adding a new encoding named "", which is almost the same as the system encoding. The only difference is that, if the first character of the stream is a BOM in any of the UTF-8 or UTF-16 forms, it will switch to that encoding; otherwise it will behave exactly like the system encoding. I'll leave that for some other day (and for Tcl 8.6 only)

----------------------------------------------------------------------
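The "throw away the BOM when it is the first character" approach can be sketched as follows. This is a guess at the general shape, not the code committed to the branch: SkipLeadingBom is a hypothetical helper, and it assumes the script has already been converted to UTF-8, where U+FEFF is the three bytes EF BB BF. Because it works on already-decoded text, no seek on the channel is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given script text already converted to UTF-8, return a
 * pointer past a leading U+FEFF (bytes EF BB BF) and shrink *lenPtr to match.
 * The buffer is untouched and a BOM anywhere but the very start is kept. */
const char *
SkipLeadingBom(const char *script, size_t *lenPtr)
{
    assert(script != NULL && lenPtr != NULL);
    if (*lenPtr >= 3 && memcmp(script, "\xEF\xBB\xBF", 3) == 0) {
        *lenPtr -= 3;
        return script + 3;
    }
    return script;
}
```

The caller would then evaluate the returned pointer and adjusted length instead of the original buffer. If the file was read in the wrong encoding, the BOM bytes decode to something other than U+FEFF and nothing is stripped, which matches the "if the encoding is set correctly" caveat in the comment above.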
From: SourceForge.net <no...@so...> - 2012-02-20 13:16:42
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by dkf

>Comment By: Donal K. Fellows (dkf)
Date: 2012-02-20 05:16

Message:
Your test passes and I think it is correctly testing this feature. (I made the test clearer so that it has the file contents set up in the test body; that's clearer if the test fails.)

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2012-02-20 15:14:30
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by nijtmans You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Fixed Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode Initial Comment: I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30 I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away. ---------------------------------------------------------------------- >Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-20 07:14 Message: Thanks! Yes I agree with your changes. ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-02-20 05:16 Message: Your test passes and I think it is correctly testing this feature. (I made the test clearer so that it has the file contents setup in the test body; that's clearer if the test fails.) 
---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-19 07:27 Message: New attempt in bug-3466099 branch (threw away the old one) Tcl_FSEvalFileEx now throws away the BOM when it is the first character in the stream. If the encoding is set correctly (e.g. to UTF-8) this will work on Unix and Windows. Added test case source-2.7 to prove that. Advantage: no seek is needed, as in the previous implementation. So it is harmless for Tcl 8.4/8.5 as well. This could be improved by adding a new encoding named "", which is almost the same as the system encoding. The only difference is that, if the first characters of the stream is a BOM in any of the UTF-8 or UTF-16 forms it will swith to this encoding, otherwise it will behave exactly like the system encoding. I'll leave that for some other day (and for Tcl 8.6 only) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-01-09 07:07 Message: >I think that when Tcl_FSEvalFileEx() >receives a non-NULL value of encodingName, >that request ought to be honored. Agreed, but it's a little bit trickier. If the user explicitly speciefies "utf-8" or "unicode" that should be honored too. Actually, I am thinking about splitting the functionality between Tcl_FSEvalFileEx and the encoding machinery. Currently the encoding "" is synonymous with the system encoding. We could also create an additional encoding with the name "", which reads the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to the corresponding encoding (utf-8, utf-16, ....) otherwise go on using default system encoding. Then the only thing left to be done in Tcl_FSEvalFileEx is to strip the BOM. 
---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2012-01-09 06:12 Message: I think that when Tcl_FSEvalFileEx() receives a non-NULL value of encodingName, that request ought to be honored. When the caller hasn't made an explicit request, then I can see some value in using BOM contents as a way to make a better guess than blindly using the system encoding. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 15:32 Message: First attempt implemented in branch bug-3466099. Donal, do you see any negative effects of this? The disadvantage is that any stream which does not contain a BOM will need to seek to the start, and be read again in (possibly) another encoding... Still, I think this is the way I would go. Any feedback is highly appreciated! ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2011-12-28 10:13 Message: Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx assume that the file's contents are supposed to be a script and so do a bit more magic than normal. (Theoretically, we also ought to think about doing progressive evaluation of "large" files, say over 1MB. That's for another time.) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 01:26 Message: I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM as first character (in any of the forms allowed by Unicode), it would switch the encoding accordingly. Then it would work with UTF-16 as well, in both little- and big-endian formats. It will be about the same amount of work. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
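The 2012-02-19 approach above (decode with the encoding the caller asked for, then discard a BOM only if it is the very first character) can be sketched as follows. This is a minimal illustration in Python, not Tcl's C code; `decode_script` is a hypothetical helper name, and the sample script bytes are made up for the example.

```python
BOM = "\ufeff"  # U+FEFF, the byte-order mark as a decoded character


def decode_script(raw: bytes, encoding: str) -> str:
    """Decode script bytes and drop one leading BOM character, if present.

    Hypothetical helper: the caller's encoding is always honored; the BOM
    is only discarded here, never used to choose the encoding.
    """
    text = raw.decode(encoding)
    # Only position 0 is special; a U+FEFF later in the text is kept.
    if text.startswith(BOM):
        text = text[1:]
    return text


with_bom = b"\xef\xbb\xbfputs hello\n"
without_bom = b"puts hello\n"

print(repr(decode_script(with_bom, "utf-8")))
print(repr(decode_script(without_bom, "utf-8")))
```

Because the check happens on the decoded text, nothing has to be re-read, which matches the "no seek is needed" remark. It only helps when the encoding is already right for the file.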
From: SourceForge.net <no...@so...> - 2012-02-29 22:48:50
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by nijtmans Category: 44. UTF-8 Strings Group: current: 8.5.11 >Status: Closed Resolution: Fixed Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-29 14:48 Message: Committed to core-8-4-branch, core-8-5-branch and trunk. The situation described in the reference from the initial comment, where the Tcl script file starts with a BOM, now works as expected. I don't think that putting a BOM at the start of a UTF-8 file is wrong; see http://unicode.org/faq/utf_bom.html Q: When a BOM is used, is it only in 16-bit Unicode text? A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Now Tcl conforms to that, which it never did. |
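The FAQ statement quoted in the 2012-02-29 comment, that the BOM bytes are simply U+FEFF run through each transformation format, can be checked directly. A small Python sketch; the codec names are Python's, and the `signatures` table is only an illustration.

```python
BOM = "\ufeff"

# Python codec names. The explicit -le/-be variants do not prepend a BOM
# of their own, so each value is exactly the encoding of U+FEFF.
FORMATS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
signatures = {name: BOM.encode(name) for name in FORMATS}

for name, raw in signatures.items():
    print(name, raw.hex(" "))
```

Note that the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one, so any detector has to test the longer signature first.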
From: SourceForge.net <no...@so...> - 2012-03-01 22:49:03
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by nijtmans Category: 44. UTF-8 Strings Group: current: 8.5.11 >Status: Open >Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-01 14:49 Message: Re-opening, because one situation is not handled yet, which can cause problems. On Windows, the system encoding is normally cp1252 (actually one of cp1250-1258). The previous fix only handles the situation where the encoding is set to utf-8 explicitly. However, the BOM indicates that the remainder of the file should be handled as utf-8, regardless of the system encoding. Therefore, I created a new branch bug-3466099, meant as an experiment (again). What it does: if the system encoding is cp125[0-8] or identity and the file starts with a BOM, the BOM is skipped and the encoding is automatically set to utf-8 while reading the remainder of the file. This experiment is committed in branch bug-3466099. Remarks more than welcome. Test cases are source-2.8 up to source-2.17 for the different system encodings. Specific question: is there any other encoding commonly encountered on Windows which should be handled the same way? Regards, Jan Nijtmans |
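To make the re-opened case concrete: when a UTF-8 file with a BOM is read as cp1252, the three BOM bytes become three ordinary characters glued to the first command word. The sketch below, in Python rather than Tcl's C, mirrors the rule described in the 2012-03-01 comment under stated assumptions: `decode_with_system_encoding` is a hypothetical name, only the cp125[0-8] family is handled (Python has no codec called "identity", so that case is left out), and the sample script is invented.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
# System encodings the experiment treats specially; "identity" is omitted
# here because Python has no codec of that name.
BOM_AWARE = re.compile(r"cp125[0-8]")


def decode_with_system_encoding(raw: bytes, system_encoding: str) -> str:
    """Hypothetical helper mirroring the described rule, not Tcl's code."""
    if BOM_AWARE.fullmatch(system_encoding) and raw.startswith(UTF8_BOM):
        # The BOM marks the file as UTF-8: skip it and override the
        # system encoding for the remainder of the file.
        return raw[len(UTF8_BOM):].decode("utf-8")
    return raw.decode(system_encoding)


script = UTF8_BOM + "puts \u00e9t\u00e9\n".encode("utf-8")

# A plain cp1252 read keeps the BOM bytes as three stray characters.
print(repr(script.decode("cp1252")))
# With the rule applied, the script text comes out clean.
print(repr(decode_with_system_encoding(script, "cp1252")))
```

A system encoding outside the listed family (for example latin-1) falls through to the plain decode and still shows the stray characters, which is the gap the question at the end of the comment is asking about.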
From: SourceForge.net <no...@so...> - 2012-03-02 10:05:16
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by sebres Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Serg G. Brester (sebres) Date: 2012-03-02 02:05 Message: The solution with cpBomTable is good, but theoretically the parameter 'encodingName' could be another single-byte encoding such as iso8859-X, etc. So the prepared array 'cpBomTable' would have to grow, and it would no longer be practical. I see 2 solutions here: 1) read the first 4 characters as binary; if they are not a BOM, convert them using the given 'encodingName', set the channel encoding to 'encodingName', and read further. 2) read the first character in utf-8; if it is not a BOM, seek to the start, set the channel encoding to 'encodingName', and read further. I'm now trying solution number 1. |
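Solution 1 from the 2012-03-02 comment (look at the first four raw bytes, then either switch encoding or decode those same bytes with the requested encoding and carry on) might look roughly like this. It is a Python sketch, not the committed C change; `read_script` and the `BOMS` table are hypothetical, and letting a detected BOM override `encoding_name` is an assumption of the sketch rather than something the thread settles.

```python
import codecs
import io

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE one.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def read_script(stream: io.BufferedIOBase, encoding_name: str) -> str:
    """Read a whole script without seeking (hypothetical helper)."""
    head = stream.read(4)  # raw bytes, nothing decoded yet
    encoding = encoding_name
    for bom, detected in BOMS:
        if head.startswith(bom):
            encoding = detected      # assumption: the BOM wins
            head = head[len(bom):]   # drop the signature itself
            break
    # An incremental decoder copes with a multi-byte character that
    # straddles the 4-byte boundary, so nothing has to be re-read.
    decoder = codecs.getincrementaldecoder(encoding)()
    return decoder.decode(head) + decoder.decode(stream.read(), final=True)


print(repr(read_script(io.BytesIO(b"\xef\xbb\xbfputs hi\n"), "cp1252")))
print(repr(read_script(io.BytesIO(b"puts \xe9\n"), "cp1252")))
```

Compared with solution 2, this never needs a seekable channel, at the cost of handling the already-consumed bytes by hand.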
From: SourceForge.net <no...@so...> - 2012-03-06 14:23:21
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by sebres Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:23 Message: Commit of another solution (1) with auto recognition (without fixed cpBomTable). See http://core.tcl.tk/tcl/info/8da0451f94 Tests: source.test: Total 31 Passed 31 Skipped 0 Failed 0 |
From: SourceForge.net <no...@so...> - 2012-03-06 14:31:15
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by sebres You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode Initial Comment: I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30 I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away. ---------------------------------------------------------------------- >Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:31 Message: A pity that parameter "encodingName" of the function "Tcl_FSEvalFileEx" is not a Tcl_Obj. ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:23 Message: Commit of another solution (1) with auto recognition (without fixed cpBomTable). See http://core.tcl.tk/tcl/info/8da0451f94 Tests: source.test: Total 31 Passed 31 Skipped 0 Failed 0 ---------------------------------------------------------------------- Comment By: Serg G. 
Brester (sebres) Date: 2012-03-02 02:05 Message: The solution with cpBomTable is good, but theoretically the parameter 'encodingName' could be another single-byte encoding such as iso8859-X, etc. So the prepared array 'cpBomTable' would have to be larger, and it would no longer be practical. I see 2 solutions here: 1) read the first 4 characters in binary; if they are not a BOM, convert them to the given 'encodingName', set the channel encoding to 'encodingName', and read further. 2) read the first character in utf-8; if it is not a BOM, seek to the start, set the channel encoding to 'encodingName', and read further. I'm now trying solution number 1. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-01 14:49 Message: Re-opening, because one situation is not handled yet, which can cause problems. On Windows, normally the system encoding is cp1252 (actually cp1250-1258). The previous part only handles the situation where the encoding is set to utf-8 explicitly. However, the BOM indicates that the remainder should be handled as utf-8, regardless of the system encoding. Therefore, I created a new branch bug-3466099, meant as an experiment (again). What it does: If the system encoding is cp125[0-8] or identity and the file starts with a BOM, the BOM is skipped and the encoding is automatically set to utf-8 while reading the remainder of the file. This experiment is committed in branch bug-3466099. Remarks more than welcome. Test cases are source-2.8 up to source-2.17 for the different system encodings. Specific question: is there any other encoding commonly encountered on Windows which should be handled the same way? Regards, Jan Nijtmans ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-29 14:48 Message: Committed to core-8-4-branch, core-8-5-branch and trunk. The situation as described in the above reference, where the Tcl script file started with a BOM, now works as expected. 
I don't think that putting a BOM at the start of a UTF-8 file is wrong, see http://unicode.org/faq/utf_bom.html Q: When a BOM is used, is it only in 16-bit Unicode text? A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Now Tcl conforms to that, which it never did. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-20 07:14 Message: Thanks! Yes, I agree with your changes. ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-02-20 05:16 Message: Your test passes and I think it is correctly testing this feature. (I made the test clearer so that it has the file contents setup in the test body; that's clearer if the test fails.) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-19 07:27 Message: New attempt in bug-3466099 branch (threw away the old one). Tcl_FSEvalFileEx now throws away the BOM when it is the first character in the stream. If the encoding is set correctly (e.g. to UTF-8) this will work on Unix and Windows. Added test case source-2.7 to prove that. Advantage: no seek is needed, as in the previous implementation. So it is harmless for Tcl 8.4/8.5 as well. This could be improved by adding a new encoding named "", which is almost the same as the system encoding. The only difference is that, if the first character of the stream is a BOM in any of the UTF-8 or UTF-16 forms, it will switch to that encoding; otherwise it will behave exactly like the system encoding. 
I'll leave that for some other day (and for Tcl 8.6 only). ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-01-09 07:07 Message: >I think that when Tcl_FSEvalFileEx() >receives a non-NULL value of encodingName, >that request ought to be honored. Agreed, but it's a little bit trickier. If the user explicitly specifies "utf-8" or "unicode" that should be honored too. Actually, I am thinking about splitting the functionality between Tcl_FSEvalFileEx and the encoding machinery. Currently the encoding "" is synonymous with the system encoding. We could also create an additional encoding with the name "", which reads the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to the corresponding encoding (utf-8, utf-16, ...); otherwise go on using the default system encoding. Then the only thing left to be done in Tcl_FSEvalFileEx is to strip the BOM. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2012-01-09 06:12 Message: I think that when Tcl_FSEvalFileEx() receives a non-NULL value of encodingName, that request ought to be honored. When the caller hasn't made an explicit request, then I can see some value in using BOM contents as a way to make a better guess than blindly using the system encoding. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 15:32 Message: First attempt implemented in branch bug-3466099. Donal, do you see any negative effects of this? The disadvantage is that any stream which does not contain a BOM will need to seek to the start, and be read again in (possibly) another encoding... Still, I think this is the way I would go. Any feedback is highly appreciated! ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2011-12-28 10:13 Message: Makes sense. 
I think we can get away just fine with making Tcl_FSEvalFileEx assume that the file's contents are supposed to be a script and so do a bit more magic than normal. (Theoretically, we also ought to think about doing progressive evaluation of "large" files, say over 1MB. That's for another time.) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 01:26 Message: I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM as first character (in any of the forms allowed by Unicode), it would switch the encoding accordingly. Then it would work with UTF-16 as well, in both little- and big-endian formats. It will be about the same amount of work. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
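The BOM check discussed in this thread (look at the first few bytes and switch encoding if they are U+FEFF in one of its encoded forms) might look roughly like the following C sketch. It is not the committed Tcl code: SniffBomEncoding is a hypothetical helper, the returned strings are illustrative labels rather than guaranteed Tcl encoding names, and only the UTF-8 and UTF-16 signatures mentioned in the thread are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the Tcl API): inspect the first
 * bytes of a file, read in binary mode, for a byte-order mark.
 *
 * Returns a label for the encoding the BOM implies and stores the
 * BOM length in *bomLen, or returns NULL (and stores 0) when the
 * buffer does not start with a recognized BOM.
 *
 * Assumption: UTF-32 is not handled, so a UTF-32LE BOM
 * (FF FE 00 00) would be reported here as UTF-16LE.
 */
const char *
SniffBomEncoding(const unsigned char *buf, size_t len, size_t *bomLen)
{
    assert(buf != NULL && bomLen != NULL);

    /* U+FEFF encoded as UTF-8 */
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *bomLen = 3;
        return "utf-8";
    }
    /* U+FEFF encoded as UTF-16, little-endian */
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) {
        *bomLen = 2;
        return "utf-16le";
    }
    /* U+FEFF encoded as UTF-16, big-endian */
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) {
        *bomLen = 2;
        return "utf-16be";
    }
    *bomLen = 0;
    return NULL;
}
```

A caller would skip bomLen bytes and set the channel encoding before reading the rest of the script; when NULL is returned, it would fall back to the requested or system encoding and must not discard the bytes it already read.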
From: SourceForge.net <no...@so...> - 2012-03-07 13:54:04
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by dkf Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 05:54 Message: What's the problem with “encodingName” being a “const char *”? That's the type of the argument to Tcl_SetChannelOption… ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
From: SourceForge.net <no...@so...> - 2012-03-07 14:09:58
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by sebres Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode ---------------------------------------------------------------------- >Comment By: Serg G. Brester (sebres) Date: 2012-03-07 06:09 Message: Because of pair Tcl_GetEncoding/Tcl_FreeEncoding and corresponding part of Tcl_SetChannelOption can be extracted/extended with such as "Tcl_SetChannelEncoding"/"Tcl_GetChannelEncoding" or something as "Tcl_SetChannelObjOption". Idea here would be to use the function "Tcl_GetEncodingFromObj"... I'm optimizer ad infinitum :) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
From: SourceForge.net <no...@so...> - 2012-03-07 14:34:49
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by dkf You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode Initial Comment: I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30 I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away. ---------------------------------------------------------------------- >Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 06:34 Message: But then we'd need to deal with the problem of how to send a Tcl_Obj through the channel API, and that's an API that may cross thread boundaries (which Tcl_Obj values _must not_ due to the way their memory is managed) and it's going to be hard to make it all work with source potentially getting data out of a VFS. Find something else to optimize. Something easy. ---------------------------------------------------------------------- Comment By: Serg G. 
Brester (sebres) Date: 2012-03-07 06:09 Message: Because of pair Tcl_GetEncoding/Tcl_FreeEncoding and corresponding part of Tcl_SetChannelOption can be extracted/extended with such as "Tcl_SetChannelEncoding"/"Tcl_GetChannelEncoding" or something as "Tcl_SetChannelObjOption". Idea here would be to use the function "Tcl_GetEncodingFromObj"... I'm optimizer ad infinitum :) ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 05:54 Message: What's the problem with “encodingName” being a “const char *”? That's the type of the argument to Tcl_SetChannelOption… ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:31 Message: A pity that parameter "encodingName" of the function "Tcl_FSEvalFileEx" is not a Tcl_Obj. ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:23 Message: Commit of another solution (1) with auto recognition (without fixed cpBomTable). See http://core.tcl.tk/tcl/info/8da0451f94 Tests: source.test: Total 31 Passed 31 Skipped 0 Failed 0 ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-02 02:05 Message: The solution with cpBomTable is although good, but theoretically the parameter 'encodingName' could be another single byte encoding such iso8859-X, etc. So the prepared array 'cpBomTable' should be greater and it will be no more practical. I see 2 solution here: 1) read 4 first characters binary, if not BOM convert its to given 'encodingName', set channel encoding to 'encodingName', read further. 2) read 1 first characters in utf-8, if not BOM seek to start, set channel encoding to 'encodingName', read further. I'm trying now the solution number 1. 
---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-01 14:49 Message: Re-opening, because one situation is not handled yet, which can cause problems. On Windows, normally the system encoding is cp1252 (actually cp1250-1258). The previous part only handles the situation that the encoding is set to utf-8 explicitely. However, the BOM indicates that the remaining should be handled as utf-8, regardless of the system encoding. Therefore, I created a new branch bug-3466099, meant as an experiment (again). What it does: If the system encoding is cp125[0-8] or identity and the file starts with a BOM, the BOM is skipped and the encoding is automatically set to utf-8 while reading the remaining of the file This experiment is committed in branch bug-3466099. Remarks more than welcome. Test cases are source-2.8 up to source-2.17 for the different system encodings. Specific question. Is there any other encoding commonly encountered on Windows, which should be handled the same way? Regards, Jan Nijtmans ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-29 14:48 Message: Committed to core-8-4-branch, core-8-5-branch and trunk. The situation as described in the above reference, where the Tcl script file started with BOM, now works as expected. I don't thing that putting a BOM as start in an UTF-8 file is wrong, see http://unicode.org/faq/utf_bom.html Q: When a BOM is used, is it only in 16-bit Unicode text? A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Now Tcl conforms to that, which it never did. 
---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-20 07:14 Message: Thanks! Yes I agree with your changes. ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-02-20 05:16 Message: Your test passes and I think it is correctly testing this feature. (I made the test clearer so that it has the file contents set up in the test body; that's clearer if the test fails.) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-02-19 07:27 Message: New attempt in bug-3466099 branch (threw away the old one). Tcl_FSEvalFileEx now throws away the BOM when it is the first character in the stream. If the encoding is set correctly (e.g. to UTF-8) this will work on Unix and Windows. Added test case source-2.7 to prove that. Advantage: no seek is needed, as in the previous implementation. So it is harmless for Tcl 8.4/8.5 as well. This could be improved by adding a new encoding named "", which is almost the same as the system encoding. The only difference is that, if the first characters of the stream are a BOM in any of the UTF-8 or UTF-16 forms, it will switch to this encoding, otherwise it will behave exactly like the system encoding. I'll leave that for some other day (and for Tcl 8.6 only) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-01-09 07:07 Message: >I think that when Tcl_FSEvalFileEx() >receives a non-NULL value of encodingName, >that request ought to be honored. Agreed, but it's a little bit trickier. If the user explicitly specifies "utf-8" or "unicode" that should be honored too. Actually, I am thinking about splitting the functionality between Tcl_FSEvalFileEx and the encoding machinery. Currently the encoding "" is synonymous with the system encoding. 
We could also create an additional encoding with the name "", which reads the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to the corresponding encoding (utf-8, utf-16, ....) otherwise go on using default system encoding. Then the only thing left to be done in Tcl_FSEvalFileEx is to strip the BOM. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2012-01-09 06:12 Message: I think that when Tcl_FSEvalFileEx() receives a non-NULL value of encodingName, that request ought to be honored. When the caller hasn't made an explicit request, then I can see some value in using BOM contents as a way to make a better guess than blindly using the system encoding. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 15:32 Message: First attempt implemented in branch bug-3466099 Donal, do you see any negative effects of this? The disadvantage is that any stream which does not contain a BOM will need to seek to the start, and be read again in (possibly) another encoding... Still, I think this is the way I would go. Any feedback is highly appreciated! ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2011-12-28 10:13 Message: Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx assume that the file's contents are supposed to be a script and so do a bit more magic than normal. (Theoretically, we also ought to think about doing progressive evaluation of "large" files, say over 1MB. That's for another time.) ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2011-12-28 01:26 Message: I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM as first character (in any of the forms allowed by Unicode), it would switch the encoding accordingly. 
Then it would work with UTF-16 as well, in both little- and big-endian formats. It will be about the same amount of work. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
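Note for readers of the thread: the BOM check discussed in the comments above (look at the first few bytes of the file, and only then decide on an encoding) can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the code committed to the bug-3466099 branch: the names BomKind and DetectBom are hypothetical, nothing here is a Tcl API, and only the UTF-8 and UTF-16 byte sequences for U+FEFF are recognised.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; none of them are Tcl APIs. */
typedef enum {
    BOM_NONE,       /* no recognised BOM at the start of the buffer */
    BOM_UTF8,       /* EF BB BF */
    BOM_UTF16_BE,   /* FE FF */
    BOM_UTF16_LE    /* FF FE */
} BomKind;

/*
 * Inspect the first bytes of a file and report which BOM, if any,
 * they form. On return, *bomLen holds the number of bytes the caller
 * should skip (0 when no BOM is found).
 */
BomKind
DetectBom(const unsigned char *buf, size_t len, size_t *bomLen)
{
    *bomLen = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bomLen = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bomLen = 2;
        return BOM_UTF16_BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bomLen = 2;
        return BOM_UTF16_LE;
    }
    return BOM_NONE;
}
```

A caller in the spirit of the thread would read the leading bytes in binary mode, call DetectBom, skip bomLen bytes, and fall back to the requested or system encoding when the result is BOM_NONE. How an explicit encodingName should interact with a detected BOM is exactly what the comments above are debating, so the sketch deliberately leaves that policy out.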
From: SourceForge.net <no...@so...> - 2012-03-07 20:37:47
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by nijtmans You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode Initial Comment: I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30 I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away. ---------------------------------------------------------------------- >Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-07 12:37 Message: So, now we have 3 attempts in this branch. Thanks! However, I'm not satisfied with any of them, although up to now sebres' one was the most clever. Currently I am thinking about building it in the channel code: something like a channel option "-checkbom true", which has the effect that the channel first waits for the next 3 bytes, and checks if it is the BOM. If so, it outputs utf-8 BOM and switches encoding to utf-8, otherwise continue as before. Advantage: the channel code has access to the binary buffer, so doesn't need to do so much trickery. Disadvantage: A new channel option needs a TIP. 
---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 06:34 Message: But then we'd need to deal with the problem of how to send a Tcl_Obj through the channel API, and that's an API that may cross thread boundaries (which Tcl_Obj values _must not_ due to the way their memory is managed) and it's going to be hard to make it all work with source potentially getting data out of a VFS. Find something else to optimize. Something easy. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
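Note for readers of the thread: the "-checkbom" idea in the newest comment of this message (hold back the first bytes of the channel until it is clear whether they form the UTF-8 BOM) could look roughly like the incremental check below. This is only a sketch under assumptions: BomCheck, BomCheckInit and BomCheckFeed are hypothetical names, not part of Tcl's channel code, and unlike the description in the comment the sketch decides as soon as one byte mismatches instead of always waiting for 3 bytes.

```c
#include <stddef.h>

/* Hypothetical names for illustration; none of these exist in Tcl. */
typedef enum {
    CHECK_NEED_MORE,    /* bytes so far are a proper prefix of the BOM */
    CHECK_IS_UTF8_BOM,  /* EF BB BF seen: skip it, switch to utf-8 */
    CHECK_NO_BOM        /* mismatch: replay pending[] in the old encoding */
} CheckResult;

typedef struct {
    unsigned char pending[3];   /* bytes held back while undecided */
    size_t count;               /* number of valid bytes in pending[] */
    CheckResult result;         /* sticky once a decision is made */
} BomCheck;

void
BomCheckInit(BomCheck *c)
{
    c->count = 0;
    c->result = CHECK_NEED_MORE;
}

/* Feed the next input byte and return the current decision. */
CheckResult
BomCheckFeed(BomCheck *c, unsigned char byte)
{
    static const unsigned char utf8Bom[3] = {0xEF, 0xBB, 0xBF};

    if (c->result != CHECK_NEED_MORE) {
        return c->result;       /* already decided; byte is not stored */
    }
    c->pending[c->count] = byte;
    if (byte != utf8Bom[c->count]) {
        c->count++;
        c->result = CHECK_NO_BOM;
    } else if (++c->count == 3) {
        c->result = CHECK_IS_UTF8_BOM;
    }
    return c->result;
}
```

If the input ends while the result is still CHECK_NEED_MORE, a caller would treat that as "no BOM" and pass the held-back bytes on unchanged. Keeping the undecided bytes in a small side buffer is what avoids the seek-and-reread that the earlier attempts in the thread needed.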
From: SourceForge.net <no...@so...> - 2012-03-07 22:03:49
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by nijtmans ---------------------------------------------------------------------- >Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-07 14:03 Message: So, started a 4th attempt, building it in the channel code. Implemented is the addition of a -checkbom channel flag (name is open for discussion). Actual handling of this flag is not implemented yet, so the source-2.x testcases still fail. A complete implementation should make those testcases pass. Still didn't figure out where the flag handling should be done, but knowing that the implementation should not be difficult any more...... ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
From: SourceForge.net <no...@so...> - 2012-03-08 09:15:59
|
Bugs item #3466099, was opened at 2011-12-27 09:31 Message generated for change (Comment added) made by sebres You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 44. UTF-8 Strings Group: current: 8.5.11 Status: Open Resolution: Accepted Priority: 5 Private: No Submitted By: Donal K. Fellows (dkf) Assigned to: Jan Nijtmans (nijtmans) Summary: BOM in Unicode Initial Comment: I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30 I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away. ---------------------------------------------------------------------- >Comment By: Serg G. Brester (sebres) Date: 2012-03-08 01:15 Message: Hi Jan, you have placed Tcl_SetChannelOption "-checkbom" in else scope of if (encodingName != NULL) ... that means the following code does not ignore BOM from utf-8 file "test.tcl": source -encoding utf-8 test.tcl only for this one: source test.tcl ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-07 14:03 Message: So, started a 4th attempt, building it in the channel code. Implemented is the addition of a -checkbom channel flag (name is open for discussion). 
Actual handling of this flag is not implemented yet, so the source-2.x testcases still fail. A complete implementation should make those testcases pass. Still didn't figure out where the flag handling should be done, but knowing that the implementation should not be difficult any more...... ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-07 12:37 Message: So, now we have 3 attempts in this branch. Thanks! However, I'm not satisfied with any of them, although up to now sebres' one was the most clever. Currently I am thinking about building it in the channel code: something like a channel option "-checkbom true", which has the effect that the channel first waits for the next 3 bytes, and checks if it is the BOM. If so, it outputs utf-8 BOM and switches encoding to utf-8, otherwise continue as before. Advantage: the channel code has access to the binary buffer, so doesn't need to do so much trickery. Disadvantage: A new channel option needs a TIP. ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 06:34 Message: But then we'd need to deal with the problem of how to send a Tcl_Obj through the channel API, and that's an API that may cross thread boundaries (which Tcl_Obj values _must not_ due to the way their memory is managed) and it's going to be hard to make it all work with source potentially getting data out of a VFS. Find something else to optimize. Something easy. ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-07 06:09 Message: Because of pair Tcl_GetEncoding/Tcl_FreeEncoding and corresponding part of Tcl_SetChannelOption can be extracted/extended with such as "Tcl_SetChannelEncoding"/"Tcl_GetChannelEncoding" or something as "Tcl_SetChannelObjOption". Idea here would be to use the function "Tcl_GetEncodingFromObj"... 
I'm optimizer ad infinitum :) ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2012-03-07 05:54 Message: What's the problem with “encodingName” being a “const char *”? That's the type of the argument to Tcl_SetChannelOption… ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:31 Message: A pity that parameter "encodingName" of the function "Tcl_FSEvalFileEx" is not a Tcl_Obj. ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-06 06:23 Message: Commit of another solution (1) with auto recognition (without fixed cpBomTable). See http://core.tcl.tk/tcl/info/8da0451f94 Tests: source.test: Total 31 Passed 31 Skipped 0 Failed 0 ---------------------------------------------------------------------- Comment By: Serg G. Brester (sebres) Date: 2012-03-02 02:05 Message: The solution with cpBomTable is although good, but theoretically the parameter 'encodingName' could be another single byte encoding such iso8859-X, etc. So the prepared array 'cpBomTable' should be greater and it will be no more practical. I see 2 solution here: 1) read 4 first characters binary, if not BOM convert its to given 'encodingName', set channel encoding to 'encodingName', read further. 2) read 1 first characters in utf-8, if not BOM seek to start, set channel encoding to 'encodingName', read further. I'm trying now the solution number 1. ---------------------------------------------------------------------- Comment By: Jan Nijtmans (nijtmans) Date: 2012-03-01 14:49 Message: Re-opening, because one situation is not handled yet, which can cause problems. On Windows, normally the system encoding is cp1252 (actually cp1250-1258). The previous part only handles the situation that the encoding is set to utf-8 explicitely. 
However, the BOM indicates that the remainder should be handled as utf-8, regardless of the system encoding.

Therefore, I created a new branch bug-3466099, meant as an experiment (again). What it does: if the system encoding is cp125[0-8] or identity and the file starts with a BOM, the BOM is skipped and the encoding is automatically set to utf-8 while reading the remainder of the file.

This experiment is committed in branch bug-3466099. Remarks more than welcome. Test cases are source-2.8 up to source-2.17 for the different system encodings.

Specific question: is there any other encoding commonly encountered on Windows which should be handled the same way?

Regards,
Jan Nijtmans

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-29 14:48

Message:
Committed to core-8-4-branch, core-8-5-branch and trunk. The situation as described in the above reference, where the Tcl script file started with a BOM, now works as expected.

I don't think that putting a BOM at the start of a UTF-8 file is wrong, see http://unicode.org/faq/utf_bom.html

Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.

Now Tcl conforms to that, which it never did.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-20 07:14

Message:
Thanks! Yes I agree with your changes.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-02-20 05:16

Message:
Your test passes and I think it is correctly testing this feature.
(I made the test clearer so that it has the file contents set up in the test body; that's clearer if the test fails.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-19 07:27

Message:
New attempt in bug-3466099 branch (threw away the old one).

Tcl_FSEvalFileEx now throws away the BOM when it is the first character in the stream. If the encoding is set correctly (e.g. to UTF-8) this will work on Unix and Windows. Added test case source-2.7 to prove that.

Advantage: no seek is needed, unlike in the previous implementation. So it is harmless for Tcl 8.4/8.5 as well.

This could be improved by adding a new encoding named "", which is almost the same as the system encoding. The only difference is that, if the first characters of the stream are a BOM in any of the UTF-8 or UTF-16 forms, it will switch to that encoding; otherwise it will behave exactly like the system encoding. I'll leave that for some other day (and for Tcl 8.6 only).

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-09 07:07

Message:
>I think that when Tcl_FSEvalFileEx()
>receives a non-NULL value of encodingName,
>that request ought to be honored.

Agreed, but it's a little bit trickier. If the user explicitly specifies "utf-8" or "unicode" that should be honored too.

Actually, I am thinking about splitting the functionality between Tcl_FSEvalFileEx and the encoding machinery. Currently the encoding "" is synonymous with the system encoding. We could also create an additional encoding with the name "", which reads the first 2/3/4 bytes to see if it is some kind of BOM. If it is, switch to the corresponding encoding (utf-8, utf-16, ....), otherwise go on using the default system encoding. Then the only thing left to be done in Tcl_FSEvalFileEx is to strip the BOM.
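
The sniffing step proposed in the comment above (look at the first 2/3/4 bytes, decide whether they form a BOM, and otherwise fall back to the system encoding) can be sketched in plain C roughly as follows. This is only an illustration, not code from the bug-3466099 branch: the names `BomKind` and `SniffBom` are hypothetical and are not part of the Tcl API, and the UTF-32 cases are an assumption added here because the comment mentions reading up to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a leading byte-order mark. */
typedef enum {
    BOM_NONE,     /* no BOM: caller keeps using the system encoding */
    BOM_UTF8,     /* EF BB BF */
    BOM_UTF16LE,  /* FF FE */
    BOM_UTF16BE,  /* FE FF */
    BOM_UTF32LE,  /* FF FE 00 00 */
    BOM_UTF32BE   /* 00 00 FE FF */
} BomKind;

/*
 * Hypothetical helper (not a Tcl function): inspect at most the first
 * four bytes of buf. On return, *bomLength holds the number of leading
 * bytes that belong to the BOM (0 when there is none).
 */
static BomKind
SniffBom(const unsigned char *buf, size_t len, size_t *bomLength)
{
    assert(bomLength != NULL);
    assert(buf != NULL || len == 0);
    *bomLength = 0;

    /*
     * UTF-32LE has to be tested before UTF-16LE because its BOM starts
     * with the same two bytes. Note the inherent ambiguity: FF FE 00 00
     * could also be a UTF-16LE BOM followed by U+0000.
     */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE
            && buf[2] == 0x00 && buf[3] == 0x00) {
        *bomLength = 4;
        return BOM_UTF32LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00
            && buf[2] == 0xFE && buf[3] == 0xFF) {
        *bomLength = 4;
        return BOM_UTF32BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bomLength = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bomLength = 2;
        return BOM_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bomLength = 2;
        return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

A caller would map the returned kind to whatever encoding name its runtime uses and skip `*bomLength` bytes before decoding. How that mapping would look inside Tcl is deliberately left open here.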

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2012-01-09 06:12

Message:
I think that when Tcl_FSEvalFileEx() receives a non-NULL value of encodingName, that request ought to be honored.

When the caller hasn't made an explicit request, then I can see some value in using BOM contents as a way to make a better guess than blindly using the system encoding.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 15:32

Message:
First attempt implemented in branch bug-3466099.

Donal, do you see any negative effects of this? The disadvantage is that any stream which does not contain a BOM will need to seek to the start, and be read again in (possibly) another encoding... Still, I think this is the way I would go.

Any feedback is highly appreciated!

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-12-28 10:13

Message:
Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx assume that the file's contents are supposed to be a script and so do a bit more magic than normal.

(Theoretically, we also ought to think about doing progressive evaluation of "large" files, say over 1MB. That's for another time.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 01:26

Message:
I think I would modify Tcl_FSEvalFileEx such that when it encounters a BOM as first character (in any of the forms allowed by Unicode), it would switch the encoding accordingly. Then it would work with UTF-16 as well, in both little- and big-endian formats. It will be about the same amount of work.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |
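
The cost noted for the first attempt in the thread above (a stream without a BOM has to be sought back to its start and read again) can be illustrated with plain C stdio. This is a sketch under stated assumptions, not the code from the branch: `SkipUtf8BomOrRewind` is a hypothetical helper, it works on a `FILE *` rather than a Tcl channel, it only knows the UTF-8 BOM, and it assumes the stream is positioned at its start when called.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (plain stdio, not the Tcl channel API): leave fp
 * positioned just past a leading UTF-8 BOM if there is one.
 * Returns 1 if a BOM was skipped, 0 if there was none and the stream
 * was rewound to its start, -1 if the seek back failed.
 */
static int
SkipUtf8BomOrRewind(FILE *fp)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char head[3];
    size_t n;

    assert(fp != NULL);
    n = fread(head, 1, sizeof head, fp);

    if (n == sizeof head && memcmp(head, bom, sizeof bom) == 0) {
        return 1;   /* BOM consumed; the rest can be read as UTF-8 */
    }

    /*
     * No BOM (or a file shorter than 3 bytes): the bytes already read
     * belong to the script, so seek back. This is the extra seek that
     * the first attempt had to pay for.
     */
    clearerr(fp);
    return (fseek(fp, 0L, SEEK_SET) == 0) ? 0 : -1;
}
```

The seek only works on seekable streams; on a pipe `fseek` fails and the helper returns -1. Avoiding the seek is what the later attempt in the thread lists as its advantage.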
From: SourceForge.net <no...@so...> - 2012-03-08 09:48:43
|
Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by sebres
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 44. UTF-8 Strings
Group: current: 8.5.11
Status: Open
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Donal K. Fellows (dkf)
Assigned to: Jan Nijtmans (nijtmans)
Summary: BOM in Unicode

Initial Comment:
I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference: https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30

I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away.

----------------------------------------------------------------------

>Comment By: Serg G. Brester (sebres)
Date: 2012-03-08 01:48

Message:
I have committed a revised "source.test".

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-08 01:15

Message:
Hi Jan,

you have placed the Tcl_SetChannelOption "-checkbom" call in the else branch of
if (encodingName != NULL) ...
That means the following code does not ignore the BOM in the utf-8 file "test.tcl":
  source -encoding utf-8 test.tcl
The BOM is ignored only for this one:
  source test.tcl

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-03-07 14:03

Message:
So, started a 4th attempt, building it in the channel code.
Implemented is the addition of a -checkbom channel flag (name is open for discussion). Actual handling of this flag is not implemented yet, so the source-2.x testcases still fail. A complete implementation should make those testcases pass. Still didn't figure out where the flag handling should be done, but knowing that the implementation should not be difficult any more......

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894 |