Thread: utf-8 description (specify to use utf-8 without BOM)

gnuplot-beta

From: Tatsuro M. <tma...@ya...> - 2011-01-24 22:11:20

Hello

gnuplot accepts UTF-8 only without the BOM (Byte Order Mark); UTF-8 with the BOM is not accepted.

I think it would be better to mention this at an appropriate place in the manual.

My proposal is below.

**************************
--- gnuplot.orig.doc    2011-01-17 08:00:38 +0900
+++ gnuplot.doc 2011-01-25 07:05:29 +0900
@@ -7520,7 +7520,7 @@
     cp1251      - codepage for 8-bit Russian, Serbian, Bulgarian, Macedonian
     cp1254      - codepage for MS Windows, Turkish (superset of Latin5)
     utf8        - variable-length (multibyte) representation of Unicode
-                  entry point for each character
+                  entry point for each character (use utf-8 without BOM (Byte Order Mark))

  The command `set encoding locale` is different from the other options.
  It attempts to determine the current locale from the runtime environment.

*****************************

Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-24 22:21:41

On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> Hello
>
> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>
> I think it is better to mention it in a proper position  in the manual

Or even better: to fix the source code :)

Mojca

Re: utf-8 description (specify to use utf-8 without BOM)

From: Tatsuro M. <tma...@ya...> - 2011-01-24 23:23:40

Hello

--- Mojca Miklavec wrote:

> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > Hello
> >
> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >
> > I think it is better to mention it in a proper position  in the manual
> 
> Or even better: to fix the source code :)
> 
> Mojca

When a script is saved as UTF-8 with the BOM, the byte order mark is attached to the script contents.
I think that it is not practical to rewrite the gnuplot code to accept scripts saved as UTF-8 with the BOM.

Regards

Tatsuro 



Re: utf-8 description (specify to use utf-8 without BOM)

From: Allin C. <cot...@wf...> - 2011-01-25 02:12:09

On Tue, 25 Jan 2011, Tatsuro MATSUOKA wrote:

> --- Mojca Miklavec wrote:
>
> > On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > >
> > > gnuplot only accepts utf-8 without the BOM (Byte Order Mark)
> > > but not that with the BOM.
> > >
> > > I think it is better to mention it in a proper position
> > > in the manual
> >
> > Or even better: to fix the source code :)

I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
UTF-8:

"While Unicode standard allows BOM in UTF-8, it does not require
or recommend it. Byte order has no meaning in UTF-8 so a BOM only
serves to identify a text stream or file as UTF-8 or that it was
converted from another format that has a BOM."

Some MS Windows applications add these redundant bytes to UTF-8
files but "proper" UTF-8 gets by fine without them.

Allin Cottrell

Re: utf-8 description (specify to use utf-8 without BOM)

From: Tatsuro M. <tma...@ya...> - 2011-01-25 02:48:20

Hello Allin

--- Allin Cottrell  wrote:

> I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> UTF-8:
> 
> "While Unicode standard allows BOM in UTF-8, it does not require
> or recommend it. Byte order has no meaning in UTF-8 so a BOM only
> serves to identify a text stream or file as UTF-8 or that it was
> converted from another format that has a BOM."
> 
> Some MS Windows applications add these redundant bytes to UTF-8
> files but "proper" UTF-8 gets by fine without them.

Thanks for explanation. 

All the text editors I have used on MS Windows seem to add these bytes when a file is saved in the
UTF-8-with-BOM format. I have never been able to use script files saved with the BOM.

Even if this phenomenon is specific to MS Windows,
it would be better to mention it somewhere.

I think it is best to mention it in gnuplot.doc (i.e. the manual and help).
Another candidate is the FAQ, I think.

My preference is gnuplot.doc, but I am not against describing this issue in the FAQ or some
other place.  What is important is that users can easily find out that scripts written in UTF-8 with
the BOM cannot be used with gnuplot on Windows.

Regards

Tatsuro


Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-26 02:51:01

On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
>
> I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> UTF-8:
>
> "While Unicode standard allows BOM in UTF-8, it does not require
> or recommend it.

However ... this has to be read as: gnuplot is not required to
*output* files with BOM (and thus doesn't need to be fixed to create
BOM marks in output), but it should better support them when *opening*
external files. Even if the marks are not required by the standard,
they are still there. Even worse ... from what some people here say
they are even there by default in some standard Windows tools.

Mojca

(But once again: I don't know the source well enough, so I have no
idea how difficult it would be to fix that particular behaviour.)

Re: utf-8 description (specify to use utf-8 without BOM)

From: Ethan M. <merritt@u.washington.edu> - 2011-01-26 04:07:04

On Tuesday, January 25, 2011, Mojca Miklavec wrote:
> On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
> >
> > I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> > UTF-8:
> >
> > "While Unicode standard allows BOM in UTF-8, it does not require
> > or recommend it.

That same Wikipedia paragraph goes on to say:
   The BOM will make a batch file not executable on Windows, so batch
   files must be saved as ANSI, not Unicode[...] On any platform, 
   a UTF-8 BOM will interfere with the interpretation of source code
   for compiler and tools that don't recognise it but could otherwise
   handle UTF-8. 

> However ... this has to be read as: gnuplot is not required to
> *output* files with BOM (and thus doesn't need to be fixed to create
> BOM marks in output), but it should better support them when *opening*
> external files. Even if the marks are not required by the standard,
> they are still there. Even worse ... from what some people here say
> they are even there by default in some standard Windows tools.

It is worse than you may think.  Notepad cannot even read _its own files_
reliably. I'm sure you can find many discussions on Notepad and the BOM
problem via Google;  here are pointers to a couple: 
   http://www.eeggs.com/items/48383.html 
   http://www.datamystic.com/forums/viewtopic.php?t=586 

Best to view it as some Windows-specific craziness that must be
stripped from the file when transferring it to unix/linux, exactly
the same as we must strip the extra ^M at the end of every line.

I realize that may leave you with a problem if you are both creating
and using the files on Windows, but I do not have a good solution for
that.  I did come across several recommendations to replace Notepad with
Notepad++, which offers the option to edit and save UTF-8 files without
adding a BOM.

It's not just the script files, by the way.  The same problem with
presence or absence of a BOM applies to data files as well, including
so far as I know binary files.  So if you are unlucky enough to have
a binary data file that just happens to contain the BOM bit pattern at
the start, many Windows tools will handle it incorrectly.

> (But once again: I don't know the source good enough, so I have no
> idea how difficult it would be to fix that particular behaviour.)

A check for BOM would have to be made every time a file is opened.
So it might have to be handled in the readline library, and/or by providing
a custom fopen() routine.  But even that wouldn't help if you fed the
input file to gnuplot via
    gnuplot < my-file-with-BOM.gp
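
The "custom fopen() routine" idea can be sketched in C as follows. This is only an illustration, not code from gnuplot: the name `fopen_skip_bom` is hypothetical, and the sketch assumes the file is opened for reading. It relies on `rewind()`, which needs a seekable stream, so it shows the same limitation noted above for input arriving on stdin.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of gnuplot: open a file for reading and
 * leave the stream positioned just past a leading UTF-8 BOM (EF BB BF)
 * if one is present. */
FILE *fopen_skip_bom(const char *path, const char *mode)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char head[3];
    FILE *fp = fopen(path, mode);

    if (fp == NULL)
        return NULL;

    /* Read up to three bytes; anything other than an exact BOM match
     * means the stream has to go back to the start of the file. */
    if (fread(head, 1, sizeof head, fp) != sizeof head
        || memcmp(head, bom, sizeof bom) != 0)
        rewind(fp);     /* needs a seekable stream, so not a pipe or stdin */

    return fp;
}
```

A caller would use it exactly like `fopen(path, "r")`; files shorter than three bytes and files without a BOM are read from the beginning as usual.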

Re: utf-8 description (specify to use utf-8 without BOM)

From: Tatsuro M. <tma...@ya...> - 2011-01-26 22:37:07

Hello

Judging from the posts in this thread, perhaps no one wants to implement support for UTF-8 with the BOM;
otherwise Mojca himself will try to do it.

What is not good for users is that the fact that UTF-8 with the BOM cannot be used in the current gnuplot is
not documented anywhere public.

In the first mail of this thread I proposed a modification to the manual (gnuplot.doc).
I would be grateful if it were accepted.  If the place I proposed is not suitable, please suggest
a more appropriate place for it.

Regards

Tatsuro
 
--- Mojca Miklavec wrote:

> On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
> >
> > I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> > UTF-8:
> >
> > "While Unicode standard allows BOM in UTF-8, it does not require
> > or recommend it.
> 
> However ... this has to be read as: gnuplot is not required to
> *output* files with BOM (and thus doesn't need to be fixed to create
> BOM marks in output), but it should better support them when *opening*
> external files. Even if the marks are not required by the standard,
> they are still there. Even worse ... from what some people here say
> they are even there by default in some standard Windows tools.
> 
> Mojca
> 
> (But once again: I don't know the source good enough, so I have no
> idea how difficult it would be to fix that particular behaviour.)
> 

Re: utf-8 description (specify to use utf-8 without BOM)

From: Tatsuro M. <tma...@ya...> - 2011-01-26 22:45:45

Hello 

Mojca has made a patch for this, so please ignore what I wrote.

Regards

Tatsuro

--- Tatsuro MATSUOKA  wrote:

> Hello
> 
> Judging from posts in the perhaps no one want to implement functionality to use UTF-8 with the
> BOM
> otherwise Mojca himself will try to it.  
> 
> What is not good for users, fact that the utf-8 with the BOM cannot be used in the current
> gnuplot is
> not open to the public. 
> 
> In the first mail to this thread I have proposed manual modification (gnuplot.doc).
> If it is accepted, it is grateful for me.  If the place I have proposed is not good, please
> suggest
> where is the appropriate place to write it.
> 
> Regards
> 
> Tatsuro
>  
> --- Mojca Miklavec wrote:
> 
> > On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
> > >
> > > I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> > > UTF-8:
> > >
> > > "While Unicode standard allows BOM in UTF-8, it does not require
> > > or recommend it.
> > 
> > However ... this has to be read as: gnuplot is not required to
> > *output* files with BOM (and thus doesn't need to be fixed to create
> > BOM marks in output), but it should better support them when *opening*
> > external files. Even if the marks are not required by the standard,
> > they are still there. Even worse ... from what some people here say
> > they are even there by default in some standard Windows tools.
> > 
> > Mojca
> > 
> > (But once again: I don't know the source good enough, so I have no
> > idea how difficult it would be to fix that particular behaviour.)
> > 

Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-26 22:55:50

2011/1/26 Tatsuro MATSUOKA wrote:
> Hello
>
> Mojca made a patch on this matter so that what I wrote is to be ignored.

Did you manage to try it out?

I created a file with BOM and I did some basic tests on mac (except
with data files which need another patch), but I would be grateful if
you could try it out and do some more tests to see if it is working
properly in all the border cases.

(In particular I would say that an additional "if" is desirable to
check that the variable "expression" is long enough.)

>> otherwise Mojca himself will try to do it.

(herself, actually)

Best regards,
    Mojca

Re: utf-8 description (specify to use utf-8 without BOM)

From: Ethan A M. <sf...@us...> - 2011-01-24 23:16:05

On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > Hello
> >
> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >
> > I think it is better to mention it in a proper position  in the manual
> 
> Or even better: to fix the source code :)

You mean the source code for Notepad?

Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-26 02:38:53

On Mon, Jan 24, 2011 at 23:57, Ethan A Merritt wrote:
> On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
>> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
>> > Hello
>> >
>> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>> >
>> > I think it is better to mention it in a proper position  in the manual
>>
>> Or even better: to fix the source code :)
>
> You mean the source code for Notepad?

I understand that that was sarcasm, but still ...

BOM is allowed by the standard. One could argue that Notepad could
offer a few more advanced settings, but it is definitely not
misbehaving, while gnuplot *is* misbehaving according to the standard
if it doesn't accept and ignore the BOM mark.

2011/1/25 Tatsuro MATSUOKA wrote:
>
> When script saved in utf-8 with BOM , bit order marks are attached to the script contests.
> I think that it is not practical to rewrite gnuplot code to accept the script with the utf-8 with BOM.

I don't know enough about gnuplot's source, so I don't know how
difficult it is to change it, but if there is no problem to support
comments (in both data files and scripts), I don't see why ignoring
the first two bytes would not be doable. I consider it "equally hard".

It might be even less practical for users to do dirty tricks to remove
BOM marks from their files. Source code needs to be fixed just once,
while users need to repeat the process over and over again. I never
had any problem with BOM marks, so I don't know how serious problem
that presents in practice.

Mojca

PS: I definitely have to give a compliment: it was a really nice
surprise to see unicode work almost satisfactorily with the latest wxt
terminal in windows (compared to the old one with its own console) ...
It could still be improved (supporting the whole range of unicode as
opposed to just a subset that corresponds to local codepage; and using
unicode automatically/by default), but it is already lightyears ahead
of what it was before that change. This tiny change with BOM seems
nothing compared to the horrible zero-nonascii-support before the new
terminal.

Re: utf-8 description (specify to use utf-8 without BOM)

From: sfeam (E. Merritt) <eam...@gm...> - 2011-01-26 04:16:20

On Tuesday, January 25, 2011, Mojca Miklavec wrote:
> I don't know enough about gnuplot's source, so I don't know how
> difficult it is to change it, but if there is no problem to support
> comments (in both data files and scripts), I don't see why ignoring
> the first two bytes would not be doable. I consider it "equally hard".

If you want to experiment with that approach,  you can find the
relevant switch statement at line 201 of scanner.c (scanner):

            switch (expression[current]) {
            case '#':           /* DFK: add comments to gnuplot */
                goto endline;   /* ignore the rest of the line */
            case '^':
            case '+':

That isn't going to help with data files, however.
Only with command lines that unexpectedly contain the BOM sequence.

Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-26 22:37:49

On Wed, Jan 26, 2011 at 05:16, sfeam (Ethan Merritt) wrote:
> On Tuesday, January 25, 2011, Mojca Miklavec wrote:
>> I don't know enough about gnuplot's source, so I don't know how
>> difficult it is to change it, but if there is no problem to support
>> comments (in both data files and scripts), I don't see why ignoring
>> the first two bytes would not be doable. I consider it "equally hard".
>
> If you want to experiment with that approach,  you can find the
> relevant switch statement at line 201 of scanner.c (scanner):
>
>            switch (expression[current]) {
>            case '#':           /* DFK: add comments to gnuplot */
>                goto endline;   /* ignore the rest of the line */
>            case '^':
>            case '+':
>
> That isn't going to help with data files, however.
> Only with command lines that unexpectedly contain the BOM sequence.

I can catch BOM with the following code:

--- a/src/scanner.c
+++ b/src/scanner.c
@@ -114,8 +114,14 @@ scanner(char **expressionp, size_t *expressionlenp)
            /* leave space for dummy end token */
            extend_token_table();
        }
-       if (isspace((unsigned char) expression[current]))
+       if (isspace((unsigned char) expression[current])) {
            continue;           /* skip the whitespace */
+       } else if (((unsigned char)expression[current] == 0xef) && ((unsigned char)expression[current+1] == 0xbb) && ((unsigned char)expression[current+2] == 0xbf)) {
+           current += 2;
+           // optional warning
+           // int_warn(t_num, "Your file starts with a BOM character; you might want to remove it.");
+           continue;
+       }
        token[t_num].start_index = current;
        token[t_num].length = 1;
        token[t_num].is_token = TRUE;   /* to start with... */

(NOTE 1: to avoid possible segmentation faults or other problems on
files with less than 3 characters one would probably want to test if
expression is long enough first. I didn't test if it really segfaults
or not though, but it is probably polite to check if
expression[current+2] is valid at all ...)
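
On the concern in NOTE 1, a minimal sketch of why the three-way test may already be safe, assuming `expression` is a NUL-terminated string (an assumption about the caller, not something verified against the gnuplot source). The helper name `bom_at` is hypothetical. Because `&&` short-circuits, each byte is only examined when the previous one matched, so no byte after the terminating NUL is ever read:

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if a UTF-8 BOM (EF BB BF) starts at s[i].
 * s must be NUL-terminated and i must not be past the terminator. */
int bom_at(const char *s, size_t i)
{
    const unsigned char *p = (const unsigned char *)s + i;

    /* && short-circuits: p[1] is read only if p[0] == 0xEF (so p[0] is
     * not the NUL), and p[2] only if p[1] == 0xBB. No read goes past
     * the terminating NUL. */
    return p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;
}
```

If the buffer were not NUL-terminated, an explicit length check would still be needed before the comparison.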

(NOTE 2: I'm not sure if that is a good idea or not; one might want to
set "utf-8" encoding by default in case that BOM is encountered. But
on the other hand doing that might encourage users to always use BOM
to avoid the need to set encoding.)

This would catch any of the following:
- gnuplot filewithbom.plt
- gnuplot < filewithbom.plt
- load 'filewithbom.plt'

However it wouldn't catch problematic datafiles (as already
mentioned), but it might be enough to patch df_readascii in datafile.c
to account for those as well. I didn't play with that yet, but I would
like to know what you think about the patch mentioned above.

Mojca

Re: utf-8 description (specify to use utf-8 without BOM)

From: <ri...@pi...> - 2011-01-26 09:18:13

On 01/24/11 23:57, Ethan A Merritt wrote:
> On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
>> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
>>> Hello
>>>
>>> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>>>
>>> I think it is better to mention it in a proper position  in the manual
>>
>> Or even better: to fix the source code :)
>
> You mean the source code for Notepad?
>
>

Is this just Notepad or other more general MS silliness?

Wouldn't it be simpler to just parse the data with awk or similar?

Peter.

Re: utf-8 description (specify to use utf-8 without BOM)

From: Ethan A M. <sf...@us...> - 2011-01-27 19:21:17

On Wednesday, January 26, 2011 01:21:52 am ri...@pi... wrote:
> On 01/24/11 23:57, Ethan A Merritt wrote:
> > On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
> >> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> >>> Hello
> >>>
> >>> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >>>
> >>> I think it is better to mention it in a proper position  in the manual
> >>
> >> Or even better: to fix the source code :)
> >
> > You mean the source code for Notepad?
> >
> >
> 
> Is this just Notepad or other more general MS sillyness?

I gather that other tools allow you to set a preference for +/- BOM,
but Notepad gives you no such option.   There is, I am told, 
an equivalent program called Notepad++ that does allow you to set
a preference.

	Ethan

Re: utf-8 description (specify to use utf-8 without BOM)

From: Hans-Bernhard B. <HBB...@t-...> - 2011-01-27 21:22:30

On 27.01.2011 20:20, Ethan A Merritt wrote:

> I gather that other tools allow you to set a preference for +/- BOM,
> but Notepad gives you no such option.   There is, I am told,
> an equivalent program called Notepad++ that does allow you to set
> a preference.

Calling Notepad++ an equivalent to MS Notepad would be grievously 
unjust.  It is *way* more than that.  I haven't seen many better 
open-source, free programmers' text editors, and none of those as 
seamlessly native to the MS Windows look&feel as Notepad++.

Re: utf-8 description (specify to use utf-8 without BOM)

From: <mw...@gm...> - 2011-01-27 13:47:00

Hi,

just my 2 cents:

* as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
  (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.
  The 3-byte sequence, however, serves as a hint to the encoding of the file. On
  the other hand U+FEFF is a valid and normal unicode character ("ZERO WIDTH
  NO-BREAK SPACE") even if it is at the beginning of a file. Treating it
  specially is just a well-educated guess.

* The 3 byte sequence should only be skipped if it is at the beginning of a
  file or string.

* Having an optional 3 byte sequence at the beginning of a file complicates
  things a lot. I think a script to "fix" damaged utf-8 files is probably the
  best solution:

    awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
    # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

* Nevertheless being tolerant with respect to input is in general a good
  thing.

* My approach would look like:

diff --git a/src/misc.c b/src/misc.c
index afe3967..ac8ddb4 100644
--- a/src/misc.c
+++ b/src/misc.c
@@ -213,6 +213,8 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
     int more;
     int stop = FALSE;
 
+    bool start_of_file = true;
+
     lf_push(fp, name, NULL); /* save state for errors and recursion */
     do_load_arg_substitution = can_do_args;
 
@@ -274,6 +276,24 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
 		}
 	    }
 
+            /* ignore "BOM" ([which is] "only an encoding signature to
+             * distinguish UTF-8 from other encodings - it has nothing to do
+             * with byte order [in the case of UTF-8]"
+             * <http://unicode.org/faq/utf_bom.html>) */
+            if (start_of_file
+                    && strlen(gp_input_line) >= 3
+                    && ((unsigned char)gp_input_line[0] == 0xef)
+                    && ((unsigned char)gp_input_line[1] == 0xbb)
+                    && ((unsigned char)gp_input_line[2] == 0xbf)) {
+
+                int_warn(NO_CARET, "Your file starts with a BOM (byte order mark). UTF-8 has no byte order, please see <http://unicode.org/faq/utf_bom.html>. You also might want to remove it.");
+
+                char *inlptr = gp_input_line + 3;
+                memmove(gp_input_line, inlptr, strlen(inlptr));
+                gp_input_line[strlen(inlptr)] = NUL;
+            }
+            start_of_file = false; /* only check at once */
+
 	    /* process line */
 	    if (strlen(gp_input_line) > 0) {
 		if (can_do_args)


Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-27 16:14:24

On Thu, Jan 27, 2011 at 14:46,  <mw...@gm...> wrote:
> Hi,
>
> just my 2 cents:
>
> * as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
>  (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.

It definitely doesn't need it. The fact is that files do have it (if
nothing else to signal that it is UTF-8 and not Latin1 encoding for
example), by default when created with some tools.

>  The 3 byte sequens however serve as a hint to the encoding of the file. On
>  the other hand U+FEFF is a valid and normal unicode character ("ZERO WIDTH
>  NO-BREAK SPACE") even if it is at the beginning of a file

However Wikipedia also says:

If the BOM character appears in the middle of a data stream, it
should, according to Unicode, be interpreted as a "zero-width
non-breaking space" (essentially a null character). Its deliberate use
for this purpose is deprecated in Unicode 3.2, however, with the "Word
Joiner" character, U+2060, strongly preferred.

> * The 3 byte sequence should only be skipped if it is at the beginning of a
>  file or string.

But in addition to the statement above ... unless one will have a
super-advanced typographically-aware terminal with enabled hyphenation
... this character is supposed to be ignored anyway.

> * Having an optional 3 byte sequence at the beginning of a file complicates
>  things a lot. I think a script to "fix" damaged utf-8 files is probably the
>  best solution:
>
>    awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
>    # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

Unless somebody is working on windows and awk comes preinstalled with
the system ... :) :) :)
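
For systems without awk, the quoted one-liner could be approximated by a few lines of C. This is a hedged sketch, not an existing tool: `copy_without_bom` is a hypothetical function that copies one stream to another and drops a BOM only if it is the very first thing in the input. It never seeks, so it would also work on a pipe.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical filter body: copy `in` to `out`, dropping a UTF-8 BOM
 * (EF BB BF) if it is the first thing in the stream.
 * Returns 0 on success, -1 on an I/O error. */
int copy_without_bom(FILE *in, FILE *out)
{
    unsigned char buf[4096];
    size_t n = fread(buf, 1, 3, in);    /* at most the first three bytes */

    /* Pass the first bytes through unless they are exactly the BOM. */
    if (!(n == 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        && fwrite(buf, 1, n, out) != n)
        return -1;

    /* Copy the rest of the stream unchanged. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;

    return ferror(in) ? -1 : 0;
}
```

Wrapped in a `main` that calls `copy_without_bom(stdin, stdout)`, this could sit in front of gnuplot in a pipeline. On Windows the streams would probably have to be switched to binary mode first to avoid newline translation; that detail is left out here.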

> * Nevertheless being tolerant with respect to input is in general a good
>  thing.
>
> * My approach would look like:

Your code works for me as well, with one exception:
    gnuplot < testscript.plt
or
    cat testscript.plt | gnuplot
breaks with your code while it works with the one I sent. Of course
    gnuplot testscript.plt
still works.

My personal preferences are:
- I find it better to ignore BOM in any line to also support cases
with piping (I don't see where it could break anything except in data
file that are read with different routines anyway). Does that sequence
represent anything sensible in any other encoding?
- Either solution is better than no patch at all.
- (I'm not sure if it is better to issue warnings or not. Or at least
... maybe one would want to issue it just once per gnuplot session,
else it probably gets really annoying if one doesn't fix it, so the
fix becomes just a better place to spot the message when compared to
documentation, but one needs to fix it anyway.)
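
The "just once per gnuplot session" idea from the last point could be sketched with a static flag. The function name and the message text below are hypothetical, and a real patch would presumably go through gnuplot's own warning routine rather than `fputs`:

```c
#include <stdio.h>

/* Hypothetical helper: print the BOM warning the first time it is
 * called and stay silent afterwards. Returns 1 if a warning was
 * printed, 0 otherwise. */
int warn_bom_once(FILE *log)
{
    static int warned = 0;   /* persists for the lifetime of the process */

    if (warned)
        return 0;
    warned = 1;
    fputs("warning: input starts with a UTF-8 BOM; it was ignored\n", log);
    return 1;
}
```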

Mojca

Re: utf-8 description (specify to use utf-8 without BOM)

From: <pl...@pi...> - 2011-01-27 16:10:08

On 01/27/11 14:46, mw...@gm... wrote:
> * Having an optional 3 byte sequence at the beginning of a file complicates
>    things a lot. I think a script to "fix" damaged utf-8 files is probably the
>    best solution:
>
>      awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
>      #http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8
>

Hi,

thanks for the script, that is what I suggested doing a couple of days 
ago but I now find I sent from the wrong account so the list apparently 
dropped it.  (Didn't it used to send a warning for that ??)

Since it appears that this BOM is a valid utf-8 white space character, 
isn't it conceivable that trying to dance around MS non-standard stupidity 
could mess up interpretation of a valid input file or gnuplot script?

regards

Re: utf-8 description (specify to use utf-8 without BOM)

From: Mojca M. <moj...@gm...> - 2011-01-27 16:40:52

On Thu, Jan 27, 2011 at 17:10,  <pl...@pi...> wrote:
>
> Since it appears that this BOM is a valid uft-8 white space character
> isn't it conceivable that try to dance around MS non-standard stupidity
> could mess up interpretation of a valid input file or gnuplot script?

My questions are:

- What is the percentage of windows users who have no idea what BOM is
and would want to run the script? (Imagine ... you are not even able
to see it with any given editor apart from hex viewer.) I think that
this is not negligible.

- What would you need the character for in gnuplot scripting? Can you
give me an example of when you would want to use it? (I really cannot
think of any. Maybe "set xlabel 'abc<zerowidthspace>def'", but what
good does that do, even if the terminal supports the character?)
Gnuplot is not supposed to do high-quality typography with hyphenation
or to implement spell-checker for words ...

Even if there are some obscure examples that do make sense, the
percentage of people that would want to misuse the character in script
is negligible compared to the poor windows users with no control of
Notepad behaviour. (Seriously: what could be the example?)

- In what way exactly could "please ignore that character" instruction
mess up with "valid input file"? To the contrary. Current implementation
without BOM support that "doesn't dance around the stupidity" might at
best reserve three extra character widths to fit that "zero width"
character between "abc" and "def" in the above example, so that
something that gets printed as "abcdef" would consume 9 character
widths.

(I didn't test what my patch would do with 'abc<zerowidthspace>def',
but no matter whether it does or doesn't do anything, there is no harm
being done if interpreter just ignores the <zerowidthspace>.)

- The only valid reason when this would break something is when
somebody using Latin1 encoding would want to type

    set xlabel 'abc\
    ï»¿ def'

What is the percentage of those users?
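
The collision in the example above comes from how U+FEFF is encoded. A small sketch of the three-byte UTF-8 encoding (the function name `utf8_encode3` is hypothetical, and the sketch deliberately handles only the range U+0800..U+FFFF without rejecting surrogates):

```c
#include <stddef.h>

/* Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
 * Other ranges use one, two or four bytes and are not handled here.
 * Returns the number of bytes written (3), or 0 if out of range. */
size_t utf8_encode3(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x800 || cp > 0xFFFF)
        return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 3;
}
```

For `cp = 0xFEFF` this yields the bytes EF BB BF. Read as Latin1, those same byte values are the characters U+00EF, U+00BB and U+00BF, which is the three-character sequence shown in the `set xlabel` example.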

Mojca