Re: utf-8 description (specify to use utf-8 without BOM)

On Wed, Jan 26, 2011 at 05:16, sfeam (Ethan Merritt) wrote:
> On Tuesday, January 25, 2011, Mojca Miklavec wrote:
>> I don't know enough about gnuplot's source, so I don't know how
>> difficult it is to change it, but if there is no problem to support
>> comments (in both data files and scripts), I don't see why ignoring
>> the first two bytes would not be doable. I consider it "equally hard".
>
> If you want to experiment with that approach,  you can find the
> relevant switch statement at line 201 of scanner.c (scanner):
>
>            switch (expression[current]) {
>            case '#':           /* DFK: add comments to gnuplot */
>                goto endline;   /* ignore the rest of the line */
>            case '^':
>            case '+':
>
> That isn't going to help with data files, however.
> Only with command lines that unexpectedly contain the BOM sequence.

I can catch the BOM with the following patch:

--- a/src/scanner.c
+++ b/src/scanner.c
@@ -114,8 +114,14 @@ scanner(char **expressionp, size_t *expressionlenp)
            /* leave space for dummy end token */
            extend_token_table();
        }
-       if (isspace((unsigned char) expression[current]))
+       if (isspace((unsigned char) expression[current])) {
            continue;           /* skip the whitespace */
+       } else if (((unsigned char)expression[current] == 0xef) && ((unsigned char)expression[current+1] == 0xbb) && ((unsigned char)expression[current+2] == 0xbf)) {
+           current += 2;
+           // optional warning
+           // int_warn(t_num, "Your file starts with a BOM character; you might want to remove it.");
+           continue;
+       }
        token[t_num].start_index = current;
        token[t_num].length = 1;
        token[t_num].is_token = TRUE;   /* to start with... */

(NOTE 1: to avoid possible segmentation faults or other problems on
input shorter than 3 bytes, one would probably want to test first
whether expression is long enough. I didn't test whether it really
segfaults, but it is probably polite to check that
expression[current+2] is valid at all ...)
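
The length check from NOTE 1 could be sketched as a small standalone
helper. This is only an illustration, not code from gnuplot: the name
utf8_bom_length and the explicit len parameter are assumptions made for
the example. It reports how many leading bytes belong to a UTF-8 BOM
and never reads past the end of the buffer.

```c
#include <stddef.h>

/* Hypothetical helper (not part of gnuplot): return 3 if buf starts
 * with the UTF-8 BOM bytes EF BB BF, otherwise 0.  The len argument
 * guards against reading beyond a buffer shorter than 3 bytes. */
static size_t utf8_bom_length(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *) buf;

    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return 3;
    return 0;
}
```

If the buffer is known to be NUL-terminated, the chained && comparisons
in the patch already stop at the terminator, because a NUL byte matches
neither 0xbb nor 0xbf; the explicit length test matters for buffers
without that guarantee.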

(NOTE 2: I'm not sure whether this is a good idea, but one might want
to set the encoding to "utf-8" by default when a BOM is encountered.
On the other hand, doing that might encourage users to always add a
BOM to avoid having to set the encoding.)

This would catch any of the following:
- gnuplot filewithbom.plt
- gnuplot < filewithbom.plt
- load 'filewithbom.plt'

However, it wouldn't catch problematic data files (as already
mentioned), but it might be enough to patch df_readascii in datafile.c
to account for those as well. I haven't played with that yet, but I
would like to know what you think about the patch above.
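
For data files, one possible approach is to strip the BOM from the
first line after it has been read. The sketch below assumes the line is
available as a writable, NUL-terminated buffer; strip_utf8_bom is a
hypothetical name, and how df_readascii actually holds its input has
not been checked here.

```c
#include <string.h>

/* Hypothetical helper (not part of gnuplot): if line begins with the
 * UTF-8 BOM, shift the rest of the string (including the terminating
 * NUL) to the front.  Returns 1 if a BOM was removed, 0 otherwise. */
static int strip_utf8_bom(char *line)
{
    /* strncmp stops at the first NUL, so short lines are safe */
    if (strncmp(line, "\xEF\xBB\xBF", 3) == 0) {
        memmove(line, line + 3, strlen(line + 3) + 1);
        return 1;
    }
    return 0;
}
```

Such a helper would only need to be called for the first line of each
file, since a BOM is meaningful only at the very start of the stream.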

Mojca



