From: Tatsuro M. <tma...@ya...> - 2011-01-24 22:11:20
|
Hello

gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.

I think it is better to mention it in a proper position in the manual.
The below is my proposal.

**************************
--- gnuplot.orig.doc 2011-01-17 08:00:38 +0900
+++ gnuplot.doc 2011-01-25 07:05:29 +0900
@@ -7520,7 +7520,7 @@
 cp1251 - codepage for 8-bit Russian, Serbian, Bulgarian, Macedonian
 cp1254 - codepage for MS Windows, Turkish (superset of Latin5)
 utf8 - variable-length (multibyte) representation of Unicode
- entry point for each character
+ entry point for each character (use utf-8 without BOM (Byte Order Mark))

 The command `set encoding locale` is different from the other options.
 It attempts to determine the current locale from the runtime environment.
*****************************
From: Mojca M. <moj...@gm...> - 2011-01-24 22:21:41
|
On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> Hello
>
> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>
> I think it is better to mention it in a proper position in the manual

Or even better: to fix the source code :)

Mojca
From: Tatsuro M. <tma...@ya...> - 2011-01-24 23:23:40
|
Hello

--- Mojca Miklavec wrote:
> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > Hello
> >
> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >
> > I think it is better to mention it in a proper position in the manual
>
> Or even better: to fix the source code :)
>
> Mojca

When a script is saved as utf-8 with BOM, the byte order mark is prepended to the script contents.
I think that it is not practical to rewrite the gnuplot code to accept scripts saved as utf-8 with BOM.

Regards

Tatsuro
From: Allin C. <cot...@wf...> - 2011-01-25 02:12:09
|
On Tue, 25 Jan 2011, Tatsuro MATSUOKA wrote:

> --- Mojca Miklavec wrote:
>
> > On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > >
> > > gnuplot only accepts utf-8 without the BOM (Byte Order Mark)
> > > but not that with the BOM.
> > >
> > > I think it is better to mention it in a proper position
> > > in the manual
> >
> > Or even better: to fix the source code :)

I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
UTF-8:

"While Unicode standard allows BOM in UTF-8, it does not require
or recommend it. Byte order has no meaning in UTF-8 so a BOM only
serves to identify a text stream or file as UTF-8 or that it was
converted from another format that has a BOM."

Some MS Windows applications add these redundant bytes to UTF-8
files but "proper" UTF-8 gets by fine without them.

Allin Cottrell
From: Tatsuro M. <tma...@ya...> - 2011-01-25 02:48:20
|
Hello Allin

--- Allin Cottrell wrote:
> I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> UTF-8:
>
> "While Unicode standard allows BOM in UTF-8, it does not require
> or recommend it. Byte order has no meaning in UTF-8 so a BOM only
> serves to identify a text stream or file as UTF-8 or that it was
> converted from another format that has a BOM."
>
> Some MS Windows applications add these redundant bytes to UTF-8
> files but "proper" UTF-8 gets by fine without them.

Thanks for the explanation.

All text editors I have used on MS Windows seem to add these bytes when files are saved in the utf-8 with BOM format.
I have never been able to use script files saved with the BOM.

Even if this phenomenon is specific to MS Windows, it would be better to mention it somewhere.

I think it is better to mention it in gnuplot.doc (i.e. manual and help).
Another candidate is the FAQ, I think.

My preference is gnuplot.doc, but I am not against describing this issue in the FAQ or some other place.

What is important is that users can easily find out that scripts written in the utf-8 with BOM format cannot be used with gnuplot on Windows.

Regards

Tatsuro
From: Mojca M. <moj...@gm...> - 2011-01-26 02:51:01
|
On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
>
> I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> UTF-8:
>
> "While Unicode standard allows BOM in UTF-8, it does not require
> or recommend it.

However ... this has to be read as: gnuplot is not required to
*output* files with a BOM (and thus doesn't need to be fixed to create
BOMs in output), but it had better support them when *opening*
external files. Even if the marks are not required by the standard,
they are still there. Even worse ... from what some people here say,
they are even there by default in some standard Windows tools.

Mojca

(But once again: I don't know the source well enough, so I have no
idea how difficult it would be to fix that particular behaviour.)
From: Ethan M. <merritt@u.washington.edu> - 2011-01-26 04:07:04
|
On Tuesday, January 25, 2011, Mojca Miklavec wrote:
> On Tue, Jan 25, 2011 at 03:12, Allin Cottrell wrote:
> >
> > I'm not sure I'd call this a "fix". Wikipedia says of the BOM in
> > UTF-8:
> >
> > "While Unicode standard allows BOM in UTF-8, it does not require
> > or recommend it.

That same Wikipedia paragraph goes on to say:

  The BOM will make a batch file not executable on Windows, so batch
  files must be saved as ANSI, not Unicode[...] On any platform, a
  UTF-8 BOM will interfere with the interpretation of source code for
  compiler and tools that don't recognise it but could otherwise
  handle UTF-8.

> However ... this has to be read as: gnuplot is not required to
> *output* files with BOM (and thus doesn't need to be fixed to create
> BOM marks in output), but it should better support them when *opening*
> external files. Even if the marks are not required by the standard,
> they are still there. Even worse ... from what some people here say
> they are even there by default in some standard Windows tools.

It is worse than you may think. Notepad cannot even read _its own files_
reliably. I'm sure you can find many discussions on Notepad and the
BOM problem via Google; here are pointers to a couple:

http://www.eeggs.com/items/48383.html
http://www.datamystic.com/forums/viewtopic.php?t=586

Best to view it as some Windows-specific craziness that must be stripped
from the file when transferring it to unix/linux, exactly the same as
we must strip the extra ^M at the end of every line.

I realize that may leave you with a problem if you are both creating and
using the files on Windows, but I do not have a good solution for that.
I did come across several recommendations to replace Notepad with
Notepad++, which offers the option to edit and save UTF-8 files without
adding a BOM.

It's not just the script files, by the way. The same problem with presence
or absence of a BOM applies to data files as well, including, so far as I
know, binary files. So if you are unlucky enough to have a binary data file
that just happens to contain the BOM bit pattern at the start, many Windows
tools will handle it incorrectly.

> (But once again: I don't know the source good enough, so I have no
> idea how difficult it would be to fix that particular behaviour.)

A check for BOM would have to be made every time a file is opened.
So it might have to be handled in the readline library, and/or by
providing a custom fopen() routine. But even that wouldn't help
if you fed the input file to gnuplot via

  gnuplot < my-file-with-BOM.gp
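
The "check every time a file is opened" idea can be sketched as a small helper called right after fopen(). This is only a sketch under stated assumptions, not gnuplot code: the name skip_utf8_bom is hypothetical, and it relies on the stream being seekable. On a pipe, ftell() fails and the helper leaves the stream untouched, and input redirected to stdin never passes through such a wrapper at all, which is the limitation noted above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: consume a UTF-8 BOM (EF BB BF) if the stream
 * starts with one at its current position.
 * Returns 1 if a BOM was skipped, 0 otherwise.
 */
int skip_utf8_bom(FILE *fp)
{
    unsigned char head[3];
    long start;
    size_t n;

    assert(fp != NULL);

    start = ftell(fp);
    if (start < 0)
        return 0;   /* not seekable (e.g. a pipe): do not touch the stream */

    n = fread(head, 1, sizeof head, fp);
    if (n == sizeof head &&
        head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return 1;   /* BOM consumed; the next read starts after it */

    /* No BOM (or fewer than 3 bytes): go back to where we started. */
    fseek(fp, start, SEEK_SET);
    return 0;
}
```

A caller would open the file as usual and call skip_utf8_bom(fp) once before handing the stream to the normal line reader.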
From: Tatsuro M. <tma...@ya...> - 2011-01-26 22:37:07
|
Hello

Judging from the posts in this thread, perhaps no one wants to implement support for UTF-8 with the BOM,
otherwise Mojca himself will try to do it.

What is not good for users is that the fact that utf-8 with the BOM cannot be used in the current gnuplot is
not documented publicly.

In the first mail to this thread I proposed a modification to the manual (gnuplot.doc).
I would be grateful if it were accepted. If the place I proposed is not good, please suggest
an appropriate place to write it.

Regards

Tatsuro
From: Tatsuro M. <tma...@ya...> - 2011-01-26 22:45:45
|
Hello

Mojca made a patch on this matter so that what I wrote is to be ignored.

Regards

Tatsuro
From: Mojca M. <moj...@gm...> - 2011-01-26 22:55:50
|
2011/1/26 Tatsuro MATSUOKA wrote:
> Hello
>
> Mojca made a patch on this matter so that what I wrote is to be ignored.

Did you manage to try it out?

I created a file with a BOM and did some basic tests on Mac (except
with data files, which need another patch), but I would be grateful if
you could try it out and do some more tests to see whether it works
properly in all the border cases. (In particular I would say that an
additional "if" is desirable to check that the variable "expression" is
long enough.)

>> otherwise Mojca himself will try to do it.

(herself, actually)

Best regards,
Mojca
From: Ethan A M. <sf...@us...> - 2011-01-24 23:16:05
|
On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> > Hello
> >
> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >
> > I think it is better to mention it in a proper position in the manual
>
> Or even better: to fix the source code :)

You mean the source code for Notepad?
From: Mojca M. <moj...@gm...> - 2011-01-26 02:38:53
|
On Mon, Jan 24, 2011 at 23:57, Ethan A Merritt wrote:
> On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
>> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
>> > Hello
>> >
>> > gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>> >
>> > I think it is better to mention it in a proper position in the manual
>>
>> Or even better: to fix the source code :)
>
> You mean the source code for Notepad?

I understand that that was sarcasm, but still ... the BOM is allowed by
the standard. One could argue that Notepad could offer a few more
advanced settings, but it is definitely not misbehaving, while gnuplot
*is* misbehaving according to the standard if it doesn't accept and
ignore the BOM.

2011/1/25 Tatsuro MATSUOKA wrote:
>
> When script saved in utf-8 with BOM , byte order marks are attached to the script contents.
> I think that it is not practical to rewrite gnuplot code to accept the script with the utf-8 with BOM.

I don't know enough about gnuplot's source, so I don't know how
difficult it is to change it, but if there is no problem to support
comments (in both data files and scripts), I don't see why ignoring
the first three bytes would not be doable. I consider it "equally hard".

It might be even less practical for users to do dirty tricks to remove
BOMs from their files. Source code needs to be fixed just once,
while users need to repeat the process over and over again.

I never had any problem with BOMs, so I don't know how serious a
problem that presents in practice.

Mojca

PS: I definitely have to give a compliment about a really nice
surprise to see unicode work almost satisfactorily with the latest wxt
terminal in windows (compared to the old one with its own console) ...
It could still be improved (supporting the whole range of unicode as
opposed to just a subset that corresponds to the local codepage; and
using unicode automatically/by default), but it is already lightyears
ahead of what it was before that change. This tiny change with the BOM
seems nothing compared to the horrible zero-nonascii-support before the
new terminal.
From: sfeam (E. Merritt) <eam...@gm...> - 2011-01-26 04:16:20
|
On Tuesday, January 25, 2011, Mojca Miklavec wrote:
> I don't know enough about gnuplot's source, so I don't know how
> difficult it is to change it, but if there is no problem to support
> comments (in both data files and scripts), I don't see why ignoring
> the first three bytes would not be doable. I consider it "equally hard".

If you want to experiment with that approach, you can find the
relevant switch statement at line 201 of scanner.c (scanner):

    switch (expression[current]) {
    case '#':               /* DFK: add comments to gnuplot */
        goto endline;       /* ignore the rest of the line */
    case '^':
    case '+':

That isn't going to help with data files, however.
Only with command lines that unexpectedly contain the BOM sequence.
From: Mojca M. <moj...@gm...> - 2011-01-26 22:37:49
|
On Wed, Jan 26, 2011 at 05:16, sfeam (Ethan Merritt) wrote:
> On Tuesday, January 25, 2011, Mojca Miklavec wrote:
>> I don't know enough about gnuplot's source, so I don't know how
>> difficult it is to change it, but if there is no problem to support
>> comments (in both data files and scripts), I don't see why ignoring
>> the first three bytes would not be doable. I consider it "equally hard".
>
> If you want to experiment with that approach, you can find the
> relevant switch statement at line 201 of scanner.c (scanner):
>
>     switch (expression[current]) {
>     case '#':               /* DFK: add comments to gnuplot */
>         goto endline;       /* ignore the rest of the line */
>     case '^':
>     case '+':
>
> That isn't going to help with data files, however.
> Only with command lines that unexpectedly contain the BOM sequence.

I can catch BOM with the following code:

--- a/src/scanner.c
+++ b/src/scanner.c
@@ -114,8 +114,14 @@ scanner(char **expressionp, size_t *expressionlenp)
             /* leave space for dummy end token */
             extend_token_table();
         }
-        if (isspace((unsigned char) expression[current]))
+        if (isspace((unsigned char) expression[current])) {
             continue;           /* skip the whitespace */
+        } else if (((unsigned char)expression[current] == 0xef) && ((unsigned char)expression[current+1] == 0xbb) && ((unsigned char)expression[current+2] == 0xbf)) {
+            current += 2;
+            // optional warning
+            // int_warn(t_num, "Your file starts with a BOM character; you might want to remove it.");
+            continue;
+        }
         token[t_num].start_index = current;
         token[t_num].length = 1;
         token[t_num].is_token = TRUE;   /* to start with... */

(NOTE 1: to avoid possible segmentation faults or other problems on
files with fewer than 3 characters one would probably want to test
whether expression is long enough first. I didn't test if it really
segfaults or not though, but it is probably polite to check that
expression[current+2] is valid at all ...)

(NOTE 2: I'm not sure if that is a good idea or not; one might want to
set "utf-8" encoding by default in case a BOM is encountered. But
on the other hand doing that might encourage users to always use a BOM
to avoid the need to set the encoding.)

This would catch any of the following:
- gnuplot filewithbom.plt
- gnuplot < filewithbom.plt
- load 'filewithbom.plt'

However it wouldn't catch problematic data files (as already mentioned),
but it might be enough to patch df_readascii in datafile.c to account
for those as well. I didn't play with that yet, but I would like to
know what you think about the patch mentioned above.

Mojca
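
The length check raised in NOTE 1 can be sketched as a standalone helper. This is only an illustration, not part of the patch: the name utf8_bom_length_at and its explicit length parameter are assumptions. If the buffer is NUL-terminated, the short-circuit && chain in the patch already stops at the terminator, so the explicit check matters mostly for buffers that are not guaranteed to be terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return 3 if the UTF-8 BOM (EF BB BF) starts at
 * buf[pos], else 0. Never reads at or beyond buf[len].
 */
size_t utf8_bom_length_at(const char *buf, size_t len, size_t pos)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(buf != NULL);

    /* Fewer than 3 bytes left: a BOM cannot fit, so do not look. */
    if (pos > len || len - pos < sizeof bom)
        return 0;

    return memcmp(buf + pos, bom, sizeof bom) == 0 ? sizeof bom : 0;
}
```

A scanner loop could then advance its index by the returned value and continue, instead of indexing current+1 and current+2 directly.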
From: <ri...@pi...> - 2011-01-26 09:18:13
|
On 01/24/11 23:57, Ethan A Merritt wrote:
> On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
>> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
>>> Hello
>>>
>>> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
>>>
>>> I think it is better to mention it in a proper position in the manual
>>
>> Or even better: to fix the source code :)
>
> You mean the source code for Notepad?

Is this just Notepad or other more general MS silliness?

Wouldn't it be simpler to just parse the data with awk or similar?

Peter.
From: Ethan A M. <sf...@us...> - 2011-01-27 19:21:17
|
On Wednesday, January 26, 2011 01:21:52 am ri...@pi... wrote:
> On 01/24/11 23:57, Ethan A Merritt wrote:
> > On Monday, January 24, 2011 02:21:34 pm Mojca Miklavec wrote:
> >> On Mon, Jan 24, 2011 at 23:11, Tatsuro MATSUOKA wrote:
> >>> Hello
> >>>
> >>> gnuplot only accepts utf-8 without the BOM (Byte Order Mark) but not that with the BOM.
> >>>
> >>> I think it is better to mention it in a proper position in the manual
> >>
> >> Or even better: to fix the source code :)
> >
> > You mean the source code for Notepad?
>
> Is this just Notepad or other more general MS silliness?

I gather that other tools allow you to set a preference for +/- BOM,
but Notepad gives you no such option. There is, I am told,
an equivalent program called Notepad++ that does allow you to set
a preference.

Ethan
From: Hans-Bernhard B. <HBB...@t-...> - 2011-01-27 21:22:30
|
On 27.01.2011 20:20, Ethan A Merritt wrote:
> I gather that other tools allow you to set a preference for +/- BOM,
> but Notepad gives you no such option. There is, I am told,
> an equivalent program called Notepad++ that does allow you to set
> a preference.

Calling Notepad++ an equivalent to MS Notepad would be grievously
unjust. It is *way* more than that. I haven't seen many better
open-source, free programmers' text editors, and none of those as
seamlessly native to the MS Windows look&feel as Notepad++.
From: <mw...@gm...> - 2011-01-27 13:47:00
|
Hi,

just my 2 cents:

* as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
  (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.
  The 3 byte sequence however serves as a hint to the encoding of the file. On
  the other hand U+FEFF is a valid and normal unicode character ("ZERO WIDTH
  NO-BREAK SPACE") even if it is at the beginning of a file. Treating it
  specially is just a well educated guess.

* The 3 byte sequence should only be skipped if it is at the beginning of a
  file or string.

* Having an optional 3 byte sequence at the beginning of a file complicates
  things a lot. I think a script to "fix" damaged utf-8 files is probably the
  best solution:

  awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
  # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

* Nevertheless being tolerant with respect to input is in general a good
  thing.

* My approach would look like:

diff --git a/src/misc.c b/src/misc.c
index afe3967..ac8ddb4 100644
--- a/src/misc.c
+++ b/src/misc.c
@@ -213,6 +213,8 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
     int more;
     int stop = FALSE;

+    bool start_of_file = true;
+
     lf_push(fp, name, NULL);    /* save state for errors and recursion */
     do_load_arg_substitution = can_do_args;

@@ -274,6 +276,24 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
                 }
             }

+            /* ignore "BOM" ([which is] "only an encoding signature to
+             * distinguish UTF-8 from other encodings - it has nothing to do
+             * with byte order [in the case of UTF-8]"
+             * <http://unicode.org/faq/utf_bom.html>) */
+            if (start_of_file
+                && strlen(gp_input_line) >= 3
+                && ((unsigned char)gp_input_line[0] == 0xef)
+                && ((unsigned char)gp_input_line[1] == 0xbb)
+                && ((unsigned char)gp_input_line[2] == 0xbf)) {
+
+                int_warn(NO_CARET, "Your file starts with a BOM (byte order mark). UTF-8 has no byte order, please see <http://unicode.org/faq/utf_bom.html>. You also might want to remove it.");
+
+                char *inlptr = gp_input_line + 3;
+                memmove(gp_input_line, inlptr, strlen(inlptr));
+                gp_input_line[strlen(inlptr)] = NUL;
+            }
+            start_of_file = false; /* only check at once */
+
             /* process line */
             if (strlen(gp_input_line) > 0) {
                 if (can_do_args)
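
The in-place removal in the patch above can be isolated into a small function for experimenting outside gnuplot. This is a minimal sketch, assuming a NUL-terminated, writable line buffer; the name strip_utf8_bom is hypothetical. It moves the terminator together with the text, so no separate NUL assignment is needed.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: if `line` starts with the UTF-8 BOM (EF BB BF),
 * shift the rest of the string left by three bytes, in place.
 * Returns 1 if a BOM was removed, 0 otherwise.
 */
int strip_utf8_bom(char *line)
{
    size_t len;

    assert(line != NULL);

    len = strlen(line);
    if (len < 3 || memcmp(line, "\xEF\xBB\xBF", 3) != 0)
        return 0;

    /* Regions overlap, so memmove is required; +1 copies the NUL too. */
    memmove(line, line + 3, len - 3 + 1);
    return 1;
}
```

As in the patch, a caller would apply this to the first line read from a file only, so that a U+FEFF appearing later in the input is left alone.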
From: Mojca M. <moj...@gm...> - 2011-01-27 16:14:24
|
On Thu, Jan 27, 2011 at 14:46, <mw...@gm...> wrote:
> Hi,
>
> just my 2 cents:
>
> * as <http://unicode.org/faq/utf_bom.html> points out, utf-8 has no byte order
>   (in contrast to utf-16 and utf-32) and thus does not need a byte order mark.

It definitely doesn't need it. The fact is that files do have it (if
nothing else to signal that it is UTF-8 and not Latin1 encoding for
example), by default when created with some tools.

>   The 3 byte sequence however serves as a hint to the encoding of the file. On
>   the other hand U+FEFF is a valid and normal unicode character ("ZERO WIDTH
>   NO-BREAK SPACE") even if it is at the beginning of a file

However Wikipedia also says:

  If the BOM character appears in the middle of a data stream, it
  should, according to Unicode, be interpreted as a "zero-width
  non-breaking space" (essentially a null character). Its deliberate use
  for this purpose is deprecated in Unicode 3.2, however, with the "Word
  Joiner" character, U+2060, strongly preferred.

> * The 3 byte sequence should only be skipped if it is at the beginning of a
>   file or string.

But in addition to the statement above ... unless one will have a
super-advanced typographically-aware terminal with enabled hyphenation
... this character is supposed to be ignored anyway.

> * Having an optional 3 byte sequence at the beginning of a file complicates
>   things a lot. I think a script to "fix" damaged utf-8 files is probably the
>   best solution:
>
>   awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
>   # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

Unless somebody is working on windows and awk comes preinstalled with
the system ... :) :) :)

> * Nevertheless being tolerant with respect to input is in general a good
>   thing.
>
> * My approach would look like:

Your code works for me as well, with one exception:
    gnuplot < testscript.plt
or
    cat testscript.plt | gnuplot
breaks with your code while it works with the one I sent. Of course
    gnuplot testscript.plt
still works.

My personal preferences are:
- I find it better to ignore the BOM in any line to also support cases
  with piping (I don't see where it could break anything except in data
  files that are read with different routines anyway). Does that sequence
  represent anything sensible in any other encoding?
- Either solution is better than no patch at all.
- (I'm not sure if it is better to issue warnings or not. Or at least
  ... maybe one would want to issue it just once per gnuplot session,
  else it probably gets really annoying if one doesn't fix it, so the
  fix becomes just a better place to spot the message when compared to
  documentation, but one needs to fix it anyway.)

Mojca
From: <pl...@pi...> - 2011-01-27 16:10:08
|
On 01/27/11 14:46, mw...@gm... wrote:
> * Having an optional 3 byte sequence at the beginning of a file complicates
>   things a lot. I think a script to "fix" damaged utf-8 files is probably the
>   best solution:
>
>   awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
>   #http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

Hi,

thanks for the script, that is what I suggested doing a couple of days
ago but I now find I sent from the wrong account so the list apparently
dropped it. (Didn't it used to send a warning for that ??)

Since it appears that this BOM is a valid UTF-8 white space character,
isn't it conceivable that trying to dance around MS non-standard stupidity
could mess up interpretation of a valid input file or gnuplot script?

regards
From: Mojca M. <moj...@gm...> - 2011-01-27 16:40:52
|
On Thu, Jan 27, 2011 at 17:10, <pl...@pi...> wrote:
>
> Since it appears that this BOM is a valid UTF-8 white space character
> isn't it conceivable that try to dance around MS non-standard stupidity
> could mess up interpretation of a valid input file or gnuplot script?

My questions are:

- What is the percentage of windows users who have no idea what a BOM is
  and would want to run the script? (Imagine ... you are not even able
  to see it with any given editor apart from a hex viewer.) I think that
  this is not negligible.

- What would you need the character for in gnuplot scripting? Can you
  give me an example of when you would want to use it? (I really cannot
  think of any. Maybe "set xlabel 'abc<zerowidthspace>def'", but what
  good does that do, even if the terminal supports the character?)
  Gnuplot is not supposed to do high-quality typography with hyphenation
  or to implement a spell-checker for words ... Even if there are some
  obscure examples that do make sense, the percentage of people that
  would want to misuse the character in a script is negligible compared
  to the poor windows users with no control of Notepad behaviour.
  (Seriously: what could be the example?)

- In what way exactly could a "please ignore that character" instruction
  mess up a "valid input file"? To the contrary. The current
  implementation without BOM support that "doesn't dance around the
  stupidity" might at best reserve three extra character widths to fit
  that "zero width" character between "abc" and "def" in the above
  example, so that something that gets printed as "abcdef" would consume
  9 character widths. (I didn't test what my patch would do with
  'abc<zerowidthspace>def', but no matter whether it does or doesn't do
  anything, there is no harm being done if the interpreter just ignores
  the <zerowidthspace>.)

- The only valid case in which this would break something is when
  somebody using Latin1 encoding would want to type
      set xlabel 'abc\  def'
  What is the percentage of those users?

Mojca