[Tcl-bugs] [ tcl-Bugs-3466099 ] BOM in Unicode


Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Comment added) made by dkf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 44. UTF-8 Strings
Group: current: 8.5.11
Status: Open
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Donal K. Fellows (dkf)
Assigned to: Jan Nijtmans (nijtmans)
Summary: BOM in Unicode

Initial Comment:
I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference:
https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30

I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away.

----------------------------------------------------------------------

>Comment By: Donal K. Fellows (dkf)
Date: 2012-03-07 06:34

Message:
But then we'd need to deal with the problem of how to send a Tcl_Obj
through the channel API, and that's an API that may cross thread boundaries
(which Tcl_Obj values _must not_ due to the way their memory is managed)
and it's going to be hard to make it all work with source potentially
getting data out of a VFS.

Find something else to optimize. Something easy.

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-07 06:09

Message:
Because of the Tcl_GetEncoding/Tcl_FreeEncoding pair. The corresponding
part of Tcl_SetChannelOption could be extracted or extended into something
like "Tcl_SetChannelEncoding"/"Tcl_GetChannelEncoding", or something like
"Tcl_SetChannelObjOption".

The idea here would be to use the function "Tcl_GetEncodingFromObj"...
I'm an optimizer ad infinitum :)

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-03-07 05:54

Message:
What's the problem with “encodingName” being a “const char *”?
That's the type of the argument to Tcl_SetChannelOption…

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-06 06:31

Message:
It's a pity that the "encodingName" parameter of the function
"Tcl_FSEvalFileEx" is not a Tcl_Obj.

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-06 06:23

Message:
Committed another solution (number 1), with automatic recognition (without
a fixed cpBomTable).
See http://core.tcl.tk/tcl/info/8da0451f94

Tests:
source.test:    Total   31      Passed  31      Skipped 0       Failed  0

----------------------------------------------------------------------

Comment By: Serg G. Brester (sebres)
Date: 2012-03-02 02:05

Message:
The solution with cpBomTable is good, but theoretically the parameter
'encodingName' could be another single-byte encoding such as iso8859-X,
etc.
So the prepared array 'cpBomTable' would have to grow, and it would no
longer be practical.
I see 2 solutions here:
  1) read the first 4 bytes in binary; if they are not a BOM, convert them
using the given 'encodingName', set the channel encoding to 'encodingName',
and read further.
  2) read the first character in utf-8; if it is not a BOM, seek to the
start, set the channel encoding to 'encodingName', and read further.

I'm now trying solution number 1.
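
For comparison, solution 2 (peek at the start, seek back if there is no
BOM) can be sketched with plain stdio rather than Tcl channels. This is
only an illustration under stated assumptions: SkipUtf8BomOrRewind is a
hypothetical helper, not a Tcl function, it checks the raw UTF-8 bytes
instead of decoding one character, and it assumes a seekable stream
positioned at its start.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Tcl). Returns 1 if a UTF-8 BOM was
 * found and consumed, 0 if there was no BOM and the stream was rewound
 * to the start, and -1 if the seek failed.
 */
static int SkipUtf8BomOrRewind(FILE *fp)
{
    unsigned char head[3];
    size_t n;

    assert(fp != NULL);
    n = fread(head, 1, sizeof head, fp);
    if (n == 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0) {
        return 1;               /* stream now points just past the BOM */
    }
    /* No BOM: go back so the caller re-reads everything from the start. */
    if (fseek(fp, 0L, SEEK_SET) != 0) {
        return -1;
    }
    return 0;
}
```

The seek is the cost of this variant: it only works on seekable streams,
which is the drawback solution 1 avoids.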

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-03-01 14:49

Message:
Re-opening, because one situation is not handled yet,
which can cause problems. On Windows, normally the
system encoding is cp1252 (actually cp1250-1258). The
previous part only handles the situation where the encoding
is set to utf-8 explicitly.

However, the BOM indicates that the remainder should be
handled as utf-8, regardless of the system encoding.

Therefore, I created a new branch bug-3466099,
meant as an experiment (again). What it does:
If the system encoding is cp125[0-8] or identity
and the file starts with a BOM, the BOM is
skipped and the encoding is automatically
set to utf-8 while reading the rest of the file.
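
The rule just described can be sketched in standalone C. This is not the
code in the branch; ChooseScriptEncoding and IsBomSensitiveEncoding are
hypothetical names, and the sketch only checks the UTF-8 form of the BOM
(bytes EF BB BF).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: 1 if the name is "identity" or cp1250 .. cp1258. */
static int IsBomSensitiveEncoding(const char *name)
{
    if (strcmp(name, "identity") == 0) {
        return 1;
    }
    return strlen(name) == 6 && strncmp(name, "cp125", 5) == 0
            && name[5] >= '0' && name[5] <= '8';
}

/*
 * Hypothetical: pick the encoding for reading a script, given the system
 * encoding name and the first bytes of the file. Returns "utf-8" when
 * the file starts with a UTF-8 BOM and the system encoding is one of the
 * encodings listed above; otherwise returns systemEncoding unchanged.
 */
static const char *ChooseScriptEncoding(const char *systemEncoding,
        const unsigned char *head, size_t headLen)
{
    int hasBom;

    assert(systemEncoding != NULL);
    hasBom = headLen >= 3
            && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
    if (hasBom && IsBomSensitiveEncoding(systemEncoding)) {
        return "utf-8";
    }
    return systemEncoding;
}
```

With this rule, a BOM-prefixed file read under cp1252 would be treated as
utf-8, while the same file under, say, iso8859-1 would not be switched.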

This experiment is committed in branch
bug-3466099. Remarks more than welcome.

Test cases are source-2.8 up to source-2.17 for
the different system encodings.

Specific question: is there any other encoding
commonly encountered on Windows which should
be handled the same way?

Regards,
           Jan Nijtmans

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-29 14:48

Message:
Committed to core-8-4-branch, core-8-5-branch and trunk.

The situation as described in the above reference, where
the Tcl script file started with BOM, now works as expected.

I don't think that putting a BOM at the start of a UTF-8 file
is wrong, see http://unicode.org/faq/utf_bom.html

Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is
transformed: UTF-16, UTF-8, or UTF-32. The exact bytes comprising the BOM
will be whatever the Unicode character U+FEFF is converted into by that
transformation format. In that form, the BOM serves to indicate both that
it is a Unicode file, and which of the formats it is in.

Now Tcl conforms to that, which it never did.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-20 07:14

Message:
Thanks! Yes I agree with your changes.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-02-20 05:16

Message:
Your test passes and I think it is correctly testing this feature. (I made
the test clearer so that it has the file contents setup in the test body;
that's clearer if the test fails.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-02-19 07:27

Message:
New attempt in bug-3466099 branch (threw away the old one)

Tcl_FSEvalFileEx now throws away the BOM when it is the
first character in the stream. If the encoding is set correctly
(e.g. to UTF-8) this will work on Unix and Windows.
Added test case source-2.7 to prove that. Advantage:
no seek is needed, as in the previous implementation.
So it is harmless for Tcl 8.4/8.5 as well.
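
To illustrate the idea of dropping the BOM when it is the first character,
here is a minimal sketch that works on an already-decoded, NUL-terminated
UTF-8 string (U+FEFF is the three bytes EF BB BF in UTF-8). SkipLeadingBom
is a hypothetical name, and this is not the code committed to the branch,
which operates on a Tcl channel.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Tcl). Returns a pointer just past a
 * leading U+FEFF in a NUL-terminated UTF-8 string, or the string itself
 * if it does not start with one. A BOM anywhere else is left alone.
 */
static const char *SkipLeadingBom(const char *script)
{
    assert(script != NULL);
    /* strncmp stops at the terminating NUL, so short strings are safe. */
    if (strncmp(script, "\xEF\xBB\xBF", 3) == 0) {
        return script + 3;
    }
    return script;
}
```

Because only the start of the data is inspected, nothing has to be
re-read, which matches the "no seek is needed" point above.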

This could be improved by adding a new encoding
named "", which is almost the same as the system
encoding. The only difference is that, if the first
characters of the stream are a BOM in any of the
UTF-8 or UTF-16 forms, it will switch to that
encoding; otherwise it will behave exactly
like the system encoding. I'll leave that
for some other day (and for Tcl 8.6 only).

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-09 07:07

Message:
>I think that when Tcl_FSEvalFileEx()
>receives a non-NULL value of encodingName,
>that request ought to be honored.

Agreed, but it's a little bit trickier. If the user
explicitly specifies "utf-8" or "unicode", that
should be honored too.

Actually, I am thinking about splitting
the functionality between Tcl_FSEvalFileEx
and the encoding machinery.

Currently the encoding "" is synonymous with the
system encoding. We could also create an
additional encoding with the name "", which
reads the first 2/3/4 bytes to see if it is some kind
of BOM. If it is, switch to the corresponding
encoding (utf-8, utf-16, ....) otherwise go on
using default system encoding. Then the
only thing left to be done in Tcl_FSEvalFileEx
is to strip the BOM.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2012-01-09 06:12

Message:
I think that when Tcl_FSEvalFileEx() 
receives a non-NULL value of encodingName,
that request ought to be honored.

When the caller hasn't made an explicit request, then
I can see some value in using BOM contents as a way
to make a better guess than blindly using the system encoding.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 15:32

Message:
First attempt implemented in branch bug-3466099

Donal, do you see any negative effects of this? The disadvantage is that
any stream which does not contain a BOM will need to seek to the
start, and be read again in (possibly) another encoding...
Still, I think this is the way I would go.

Any feedback is highly appreciated!

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-12-28 10:13

Message:
Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx
assume that the file's contents are supposed to be a script and so do a bit
more magic than normal. (Theoretically, we also ought to think about doing
progressive evaluation of "large" files, say over 1MB. That's for another
time.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 01:26

Message:
I think I would modify Tcl_FSEvalFileEx such that when it
encounters a BOM as first character (in any of the forms
allowed by Unicode), it would switch the encoding
accordingly. Then it would work with UTF-16 as well, in
both little- and big-endian formats. It will be about the
same amount of work.
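
As a rough illustration of recognizing the BOM in its different forms,
here is a minimal C sketch that checks the start of a byte buffer for the
UTF-8 and UTF-16 byte-order marks. DetectBom and BomInfo are hypothetical
names, not part of the Tcl API, and the returned labels are illustrative
rather than guaranteed Tcl encoding names. UTF-32 is deliberately left
out, since its little-endian BOM begins with the same two bytes as the
UTF-16 little-endian one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: which BOM was seen and how long it is. */
typedef struct {
    const char *encoding;   /* illustrative label, NULL if no BOM found */
    size_t bomLength;       /* number of bytes to skip, 0 if no BOM */
} BomInfo;

/* Hypothetical helper: inspect the first bytes of a file for a BOM. */
static BomInfo DetectBom(const unsigned char *buf, size_t len)
{
    BomInfo info = { NULL, 0 };

    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        info.encoding = "utf-8";        /* U+FEFF encoded as UTF-8 */
        info.bomLength = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        info.encoding = "utf-16le";     /* U+FEFF, little-endian */
        info.bomLength = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        info.encoding = "utf-16be";     /* U+FEFF, big-endian */
        info.bomLength = 2;
    }
    return info;
}
```

A caller would skip bomLength bytes and read the rest using the detected
encoding, falling back to its default when encoding is NULL.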

----------------------------------------------------------------------
