[Tcl-bugs] [ tcl-Bugs-3466099 ] BOM in Unicode


Bugs item #3466099, was opened at 2011-12-27 09:31
Message generated for change (Settings changed) made by nijtmans
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3466099&group_id=10894

>Category: 44. UTF-8 Strings
Group: current: 8.5.11
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Donal K. Fellows (dkf)
Assigned to: Jan Nijtmans (nijtmans)
Summary: BOM in Unicode

Initial Comment:
I was reading about the problems that some people are having with Tcl scripts on Windows due to that platform's insistence on putting a byte-order mark at the start of a UTF-8 file. (Arguably wrong, but we're stuck with it.) For reference:
https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/cb6fbae11b95fac6/c4211cabc90a8b30?hl=en#c4211cabc90a8b30

I was wondering if the most effective way of dealing with this would be to make Tcl treat a stray BOM as whitespace for the purpose of script parsing? I don't know exactly how practical this is, but it would make the most pressing part of the problem Go Away.
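
To illustrate the idea of ignoring a stray BOM before parsing, here is a minimal sketch in C. It is not taken from the Tcl sources: the helper name leading_bom_length is hypothetical, and it assumes the script is already held in memory as UTF-8, where U+FEFF appears as the bytes EF BB BF.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Tcl API.
 * Returns the number of bytes to skip when the buffer starts with a
 * UTF-8 encoded byte-order mark (U+FEFF, bytes EF BB BF), else 0. */
size_t leading_bom_length(const char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0) {
        return sizeof bom;
    }
    return 0;
}
```

A caller would advance the script pointer by the returned count before handing the text to the parser. This only covers a BOM at the very start of the script, which is the case the report describes.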

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 15:32

Message:
First attempt implemented in branch bug-3466099

Donal, do you see any negative effects of this? The disadvantage is that
any stream which does not contain a BOM will need to seek to the
start, and be read again in (possibly) another encoding...
Still, I think this is the way I would go.

Any feedback is highly appreciated!
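
The seek-back cost mentioned above can be sketched with plain C stdio. This is only an illustration under stated assumptions, not the code in the branch: consume_utf8_bom is a hypothetical name, it uses FILE * rather than Tcl channels, it assumes the stream is positioned at the start, and it checks for the UTF-8 BOM only.

```c
#include <stdio.h>

/* Hypothetical helper using stdio, not Tcl channels.
 * Assumes fp is positioned at the start of the stream.
 * Returns  1 if a UTF-8 BOM was found; fp is left just after it.
 * Returns  0 if no BOM was found; fp is rewound to the start so the
 *            content can be read again from the first byte.
 * Returns -1 if the stream could not be repositioned (not seekable). */
int consume_utf8_bom(FILE *fp)
{
    unsigned char head[3];
    size_t n = fread(head, 1, sizeof head, fp);

    if (n == sizeof head
            && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF) {
        return 1;
    }
    /* No BOM: the bytes just read belong to the script, so go back. */
    if (fseek(fp, 0L, SEEK_SET) != 0) {
        return -1;
    }
    return 0;
}
```

The -1 case is where the disadvantage shows up: a stream that cannot seek has already lost its first bytes unless the caller buffers them.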

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2011-12-28 10:13

Message:
Makes sense. I think we can get away just fine with making Tcl_FSEvalFileEx
assume that the file's contents are supposed to be a script and so do a bit
more magic than normal. (Theoretically, we also ought to think about doing
progressive evaluation of "large" files, say over 1MB. That's for another
time.)

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-28 01:26

Message:
I think I would modify Tcl_FSEvalFileEx such that when it
encounters a BOM as first character (in any of the forms
allowed by Unicode), it would switch the encoding
accordingly. Then it would work with UTF-16 as well, in
both little- and big-endian formats. It will be about the
same amount of work.
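
A rough sketch of what detecting the BOM forms could look like in C. The names BomInfo and detect_bom are hypothetical and the encoding labels are illustrative, not necessarily the names Tcl uses for its encodings. The UTF-32 entries are an assumption beyond the UTF-8 and UTF-16 cases mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: an illustrative encoding label and the
 * number of BOM bytes to skip. name is NULL when no BOM is present. */
typedef struct {
    const char *name;
    size_t bom_len;
} BomInfo;

/* Hypothetical helper, not part of the Tcl API. */
BomInfo detect_bom(const unsigned char *buf, size_t len)
{
    /* UTF-32LE must come before UTF-16LE because its BOM (FF FE 00 00)
     * starts with the UTF-16LE BOM (FF FE). Note the inherent ambiguity:
     * a UTF-16LE file whose first character is NUL looks the same. */
    static const struct {
        const char *name;
        unsigned char bytes[4];
        size_t len;
    } table[] = {
        { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
        { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    };
    BomInfo result = { NULL, 0 };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (len >= table[i].len
                && memcmp(buf, table[i].bytes, table[i].len) == 0) {
            result.name = table[i].name;
            result.bom_len = table[i].len;
            break;
        }
    }
    return result;
}
```

The caller would skip bom_len bytes and decode the rest in the indicated encoding, falling back to the default encoding when name is NULL.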

----------------------------------------------------------------------
