From: Mojca M. <moj...@gm...> - 2011-01-27 16:14:24
On Thu, Jan 27, 2011 at 14:46, <mw...@gm...> wrote:
> Hi,
>
> just my 2 cents:
>
> * as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
> (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.

It definitely doesn't need it. The fact is that files do have it (if nothing else, to signal that the file is UTF-8 and not, for example, Latin1), because some tools add it by default when they create a file.

> The 3 byte sequence however serves as a hint to the encoding of the file. On
> the other hand U+FEFF is a valid and normal unicode character ("ZERO WIDTH
> NO-BREAK SPACE") even if it is at the beginning of a file

However Wikipedia also says:

    If the BOM character appears in the middle of a data stream, it should, according to Unicode, be interpreted as a "zero-width non-breaking space" (essentially a null character). Its deliberate use for this purpose is deprecated in Unicode 3.2, however, with the "Word Joiner" character, U+2060, strongly preferred.

> * The 3 byte sequence should only be skipped if it is at the beginning of a
> file or string.

But in addition to the statement above ... unless one has a super-advanced typographically-aware terminal with hyphenation enabled ... this character is supposed to be ignored anyway.

> * Having an optional 3 byte sequence at the beginning of a file complicates
> things a lot. I think a script to "fix" damaged utf-8 files is probably the
> best solution:
>
> awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
> # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

Unless somebody is working on Windows and awk comes preinstalled with the system ... :) :) :)

> * Nevertheless being tolerant with respect to input is in general a good
> thing.
>
> * My approach would look like:

Your code works for me as well, with one exception:

    gnuplot < testscript.plt

or

    cat testscript.plt | gnuplot

breaks with your code, while it works with the one I sent. Of course

    gnuplot testscript.plt

still works.
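
To make the two policies under discussion concrete, here is a minimal sketch in Python. It is not either of the gnuplot patches (neither is shown in this message) and it does not try to explain why the piped case breaks. It only contrasts "strip the BOM on the first line" (what the awk one-liner does) with "strip the BOM at the start of any line". The function names are hypothetical, and reading "ignore BOM in any line" as "at the start of any line" is an assumption.

```python
# UTF-8 encoding of U+FEFF, the byte order mark.
BOM = b"\xef\xbb\xbf"


def strip_bom_first_line(lines):
    """Drop a leading BOM from the first line only (like the awk one-liner)."""
    for index, line in enumerate(lines):
        if index == 0 and line.startswith(BOM):
            line = line[len(BOM):]
        yield line


def strip_bom_any_line(lines):
    """Drop a leading BOM from every line that starts with one.

    Assumption: "ignore BOM in any line" means "at the start of any line".
    """
    for line in lines:
        if line.startswith(BOM):
            line = line[len(BOM):]
        yield line


# A script with a BOM at the very start: both policies agree.
script = [BOM + b"set encoding utf8\n", b"plot sin(x)\n"]
first_only = list(strip_bom_first_line(script))
any_line = list(strip_bom_any_line(script))

# Two BOM-prefixed files concatenated into one stream: the policies differ.
stream = script + [BOM + b"plot cos(x)\n"]
stream_first_only = list(strip_bom_first_line(stream))
stream_any_line = list(strip_bom_any_line(stream))
```

The second policy needs no notion of "the start of a file", so it behaves the same no matter how the input reaches the reader.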

My personal preferences are:

- I find it better to ignore the BOM in any line, so that piping is supported as well (I don't see where it could break anything, except in data files, which are read with different routines anyway). Does that sequence represent anything sensible in any other encoding?
- Either solution is better than no patch at all.
- (I'm not sure whether it is better to issue warnings or not. Or at least ... maybe one would want to issue the warning just once per gnuplot session; otherwise it probably gets really annoying if one doesn't fix the file. The patch then just becomes a better place to spot the message than the documentation, but one needs to fix the file anyway.)

Mojca
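
On the question of what the three bytes mean in other encodings, a quick check in Python may help. This is only an illustration covering Latin-1 and Windows-1252, not a survey of every encoding; the variable names are made up for the example.

```python
# The three bytes that encode U+FEFF in UTF-8.
BOM = b"\xef\xbb\xbf"

# Decode the same bytes under a few single-byte encodings and under UTF-8.
decoded = {
    encoding: BOM.decode(encoding)
    for encoding in ("utf-8", "latin-1", "cp1252")
}

for encoding, text in decoded.items():
    # Print code points rather than glyphs so the result is terminal-independent.
    print(encoding, [f"U+{ord(ch):04X}" for ch in text])
```

In Latin-1 and Windows-1252 the bytes are three printable characters: i with diaeresis (U+00EF), a right-pointing guillemet (U+00BB) and an inverted question mark (U+00BF). So the sequence is valid there, although it is an unlikely way to begin a line of a script.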