From: <mw...@gm...> - 2011-01-27 13:47:00
Hi,

just my 2 cents:

* As <http://unicode.org/faq/utf_bom.html> points out, UTF-8 has no byte order (in contrast to UTF-16 and UTF-32) and thus does not need a byte order mark. The 3-byte sequence does, however, serve as a hint to the encoding of the file. On the other hand, U+FEFF is a valid and normal Unicode character ("ZERO WIDTH NO-BREAK SPACE") even if it is at the beginning of a file. Treating it specially is just a well-educated guess.

* The 3-byte sequence should only be skipped if it is at the beginning of a file or string.

* Having an optional 3-byte sequence at the beginning of a file complicates things a lot. I think a script to "fix" damaged UTF-8 files is probably the best solution:

    awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' text.txt
    # http://www.linuxask.com/questions/how-to-remove-bom-from-utf-8

* Nevertheless, being tolerant with respect to input is in general a good thing.

* My approach would look like:

diff --git a/src/misc.c b/src/misc.c
index afe3967..ac8ddb4 100644
--- a/src/misc.c
+++ b/src/misc.c
@@ -213,6 +213,8 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
     int more;
     int stop = FALSE;
 
+    bool start_of_file = true;
+
     lf_push(fp, name, NULL); /* save state for errors and recursion */
     do_load_arg_substitution = can_do_args;
 
@@ -274,6 +276,24 @@ load_file(FILE *fp, char *name, TBOOLEAN can_do_args)
                 }
             }
 
+            /* ignore "BOM" ([which is] "only an encoding signature to
+             * distinguish UTF-8 from other encodings - it has nothing to do
+             * with byte order [in the case of UTF-8]"
+             * <http://unicode.org/faq/utf_bom.html>) */
+            if (start_of_file
+                && strlen(gp_input_line) >= 3
+                && ((unsigned char)gp_input_line[0] == 0xef)
+                && ((unsigned char)gp_input_line[1] == 0xbb)
+                && ((unsigned char)gp_input_line[2] == 0xbf)) {
+
+                int_warn(NO_CARET, "Your file starts with a BOM (byte order mark). UTF-8 has no byte order, please see <http://unicode.org/faq/utf_bom.html>. You also might want to remove it.");
+
+                char *inlptr = gp_input_line + 3;
+                memmove(gp_input_line, inlptr, strlen(inlptr));
+                gp_input_line[strlen(inlptr)] = NUL;
+            }
+            start_of_file = false; /* only check once */
+
             /* process line */
             if (strlen(gp_input_line) > 0) {
                 if (can_do_args)
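
For anyone who wants to try the idea outside of gnuplot: the in-place removal done in the patch above could also be written as a small standalone helper. This is only a sketch; the name strip_utf8_bom is made up for illustration and is not an existing gnuplot function, and it assumes the line is a writable, NUL-terminated buffer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of gnuplot: removes a leading UTF-8
 * signature (bytes EF BB BF) from a NUL-terminated line in place.
 * Returns 1 if a signature was removed, 0 otherwise. */
int strip_utf8_bom(char *line)
{
    static const unsigned char sig[3] = { 0xef, 0xbb, 0xbf };
    size_t len;

    assert(line != NULL);
    len = strlen(line);

    /* Too short to hold the signature, or the first 3 bytes differ. */
    if (len < 3 || memcmp(line, sig, 3) != 0)
        return 0;

    /* Shift the rest of the line down by 3 bytes; the "+ 1" copies the
     * terminating NUL as well, so no separate termination is needed. */
    memmove(line, line + 3, len - 3 + 1);
    return 1;
}
```

As in the patch, the caller would still have to make sure this runs only on the first line of a file, since U+FEFF anywhere else is an ordinary character.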