Re: utf-8 description (specify to use utf-8 without BOM)


On Thu, Jan 27, 2011 at 14:46,  <mw...@gm...> wrote:
> Hi,
>
> just my 2 cents:
>
> * as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
>  (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.

It definitely doesn't need it. The fact is that some tools write it by
default when creating files (if nothing else, to signal that the
encoding is UTF-8 and not, for example, Latin-1).
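
For illustration, the three bytes in question are simply U+FEFF encoded
as UTF-8. The small Python check below is only a sketch and not part of
gnuplot; the name describe_bom is made up for this example. It shows
the bytes in hex and what a reader assuming Latin-1 would make of them.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three bytes under discussion.
BOM_BYTES = "\ufeff".encode("utf-8")


def describe_bom():
    """Return the BOM bytes as hex and as a Latin-1 reader would decode them."""
    as_hex = BOM_BYTES.hex()
    as_latin1 = BOM_BYTES.decode("latin-1")
    return as_hex, as_latin1


if __name__ == "__main__":
    print(describe_bom())
```

The standard library exposes the same three bytes as codecs.BOM_UTF8.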

>  The 3-byte sequence, however, serves as a hint to the encoding of the file. On
>  the other hand, U+FEFF is a valid and normal Unicode character ("ZERO WIDTH
>  NO-BREAK SPACE") even if it is at the beginning of a file.

However Wikipedia also says:

If the BOM character appears in the middle of a data stream, it
should, according to Unicode, be interpreted as a "zero-width
non-breaking space" (essentially a null character). Its deliberate use
for this purpose is deprecated in Unicode 3.2, however, with the "Word
Joiner" character, U+2060, strongly preferred.

> * The 3 byte sequence should only be skipped if it is at the beginning of a
>  file or string.

But in addition to the statement above: unless one has a
super-advanced, typographically aware terminal with hyphenation
enabled, this character is supposed to be ignored anyway.

> * Having an optional 3 byte sequence at the beginning of a file complicates
>  things a lot. I think a script to "fix" damaged utf-8 files is probably the
>  best solution:
>
>    awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
>    # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

Unless somebody is working on Windows, where awk does not come
preinstalled with the system ... :) :) :)
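
Where awk is not available, the same "fix the file" idea could be
sketched in Python instead. This is only an illustration under the
assumption that a Python interpreter is installed; the function name
strip_leading_bom and the script name in the usage comment are
hypothetical. Like the awk one-liner, it removes the sequence only at
the very beginning of the input.

```python
import codecs
import sys


def strip_leading_bom(data: bytes) -> bytes:
    """Remove a UTF-8 BOM only if it is the very first thing in data."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def main(argv):
    # Hypothetical usage: python strip_bom.py text.txt > fixed.txt
    with open(argv[1], "rb") as handle:
        sys.stdout.buffer.write(strip_leading_bom(handle.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

Working on bytes rather than decoded text keeps the rest of the file
untouched, whatever it contains.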

> * Nevertheless being tolerant with respect to input is in general a good
>  thing.
>
> * My approach would look like:

Your code works for me as well, with one exception:
    gnuplot < testscript.plt
or
    cat testscript.plt | gnuplot
breaks with your code while it works with the one I sent. Of course
    gnuplot testscript.plt
still works.

My personal preferences are:
- I find it better to ignore the BOM in any line, so that cases with
piping are also supported (I don't see where it could break anything,
except in data files, which are read by different routines anyway).
Does that sequence represent anything sensible in any other encoding?
- Either solution is better than no patch at all.
- I'm not sure whether it is better to issue a warning or not. If one
is issued, maybe it should appear just once per gnuplot session;
otherwise it probably gets really annoying when one doesn't fix the
file. The warning is then just a better place to spot the message than
the documentation, but one needs to fix the file anyway.
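
To make the first and last preferences concrete, here is a rough Python
sketch. It is not gnuplot's code (gnuplot is written in C) and not
either of the patches discussed; the class BomFilter and its warn
callback are hypothetical. It assumes "in any line" means at the start
of a line, drops the sequence there, and reports it only once per
filter object, which stands in for one session.

```python
import sys

UTF8_BOM = b"\xef\xbb\xbf"


class BomFilter:
    """Strip a UTF-8 BOM from the start of any line, warning only once."""

    def __init__(self, warn=None):
        # warn is a hypothetical callback; by default it writes to stderr.
        self._warn = warn or (lambda msg: print(msg, file=sys.stderr))
        self._warned = False

    def clean(self, line: bytes) -> bytes:
        if not line.startswith(UTF8_BOM):
            return line
        if not self._warned:
            self._warn("warning: UTF-8 byte order mark ignored")
            self._warned = True
        return line[len(UTF8_BOM):]
```

Because each line is checked on its own, the filter does not care
whether the lines come from a named file or from standard input, for
example by calling clean() on every line read from sys.stdin.buffer.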

Mojca
