From: Bastian M. <bma...@we...> - 2020-11-02 10:24:29
Allin,

Thank you for this nice contribution. On Windows, though, we already
replace fopen() (and many other functions). In particular, win_fopen()
in winmain.c already handles encodings, including UTF-8. There are also
two functions, AnsiText() and UnicodeText(), to convert to/from UTF-16
according to gnuplot's encoding. They use the simple-to-use Windows
functions WideCharToMultiByte() and MultiByteToWideChar(), so there is
no need for GLib.

That really raises the question of whether we should change the default
internal encoding from "ANSI" (whatever that is depends on the Windows
locale) to UTF-8. I agree with that (and in fact my personal gnuplot.ini
has included "set encoding utf8" for a long time). But how do we make
this backward compatible, since it will inevitably break load commands
in old "ANSI"-encoded user scripts (as does your current patch, btw)?

Possible solutions include a new command-line option (like -u / --utf8)
or a wgnuplot.ini setting.

Bastian

> -----Original Message-----
> From: Allin Cottrell <cot...@wf...>
> Sent: Sunday, November 1, 2020 01:36
> To: gnuplot-beta <gnu...@li...>
> Subject: Re: filenames on MS Windows
>
> On Sun, 25 Oct 2020, Allin Cottrell wrote:
>
> > I tried googling this but didn't find an answer -- sorry if I should
> > have just tried harder! My question is: can gnuplot on Windows handle
> > a unicode filename argument passed in UTF-16? As in
> >
> > path/to/wgnuplot.exe <UTF-16 input filename>
>
> OK, that question was under-researched, but now I've done my homework.
> Sorry, this is a bit long but I hope I can arouse some interest in the
> topic.
>
> Why bother with UTF-16 filename arguments? Nowadays a fair number of
> Windows users construct paths (directory names or filenames) which are
> "out of codepage" -- that is, unicode names which cannot be represented
> in the (retro) "system codepage", which is typically just an 8-bit
> encoding.
> Since Windows has supported unicode since NT came out, it's a
> reasonable expectation that any filename one can construct on the
> platform should be accessible via any program of interest. But a
> program that restricts itself to the "ANSI" form of filenames simply
> cannot access files with out-of-codepage paths.
> (Sane modern OSes don't have this problem because they use UTF-8
> throughout.)
>
> So what about gnuplot? I may be wrong but it seems to me that gnuplot
> on Windows is stuck with "ANSI" filenames at present. Even with UNICODE
> and _UNICODE defined when compiling the program, the command-line
> arguments are retrieved in winmain.c using either _argv or __argv
> (depending on the compiler), and these get the ANSI-form arguments (as
> opposed to __wargv which gets the arguments in UTF-16 form).
>
> It would be easy to swap out __argv for __wargv but by itself this
> would be very disruptive. The subsequent code in winmain.c, and then
> the code in plot.c (gnu_main) to which the args array is passed, all
> assumes the elements of argv are plain "char *", not "wide char"
> arrays. Handling UTF-16, which is chock-full of NUL bytes, would
> require lots of messy "ifdefs".
>
> I have a proposal for fixing this. I realise it may not be acceptable
> as it stands but maybe someone else might want to take it up. I'm
> attaching patches for src/win/winmain.c and src/misc.c for reference
> but here I'll try to explain the strategy.
>
> 1) In winmain.c, grab the command-line arguments as UTF-16 but
> immediately convert them to UTF-8, so they can be handled by the
> regular string.h APIs, both here and in plot.c (gnu_main).
>
> 2) When we actually go to open a command-line file argument
> (loadpath_fopen, in misc.c, called from gnu_main), we first try opening
> the file using the filename as-is, but if that fails (and the filename
> validates as UTF-8) we convert it to UTF-16 and try again.
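To make the "validates as UTF-8" condition in step 2 concrete, here is a minimal sketch in portable C. It is an illustration only: is_valid_utf8() is a hypothetical name, not a function from the patch or from gnuplot, and how the patch itself performs the check is not shown in this thread. The sketch rejects overlong forms, surrogate code points, and values above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch or from gnuplot):
 * returns true if the NUL-terminated string s is well-formed UTF-8. */
bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    assert(s != NULL);

    while (*p) {
        unsigned long cp;   /* code point being decoded */
        int extra;          /* continuation bytes expected */

        if (*p < 0x80) {
            cp = *p; extra = 0;             /* plain ASCII */
        } else if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1F; extra = 1;      /* 2-byte sequence */
        } else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0F; extra = 2;      /* 3-byte sequence */
        } else if ((*p & 0xF8) == 0xF0) {
            cp = *p & 0x07; extra = 3;      /* 4-byte sequence */
        } else {
            return false;   /* stray continuation or invalid lead byte */
        }
        p++;

        for (int i = 0; i < extra; i++, p++) {
            /* every continuation byte must look like 10xxxxxx;
             * this also catches a premature NUL terminator */
            if ((*p & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* reject overlong encodings */
        if (extra == 1 && cp < 0x80) return false;
        if (extra == 2 && cp < 0x800) return false;
        if (extra == 3 && cp < 0x10000) return false;
        /* reject UTF-16 surrogates and values beyond Unicode's range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
    }
    return true;
}
```

A check like this matters for the backward-compatibility concern: a within-codepage name such as "ä.gp" encoded in Windows-1252 (byte 0xE4 followed by '.') fails validation, so an "ANSI" argument would not be mistaken for UTF-8.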
>
> Since UTF-8 is a superset of ASCII, ASCII filename arguments should
> pass through transparently. Within-codepage non-ASCII filenames should
> get converted back to UTF-16 and opened OK. And the bonus is that
> out-of-codepage arguments should also be converted and opened OK.
>
> I've tested this on Windows 10, with the system codepage set to Windows
> 1252 ("Western Europe"), and have successfully opened files with names
> in Russian and Greek. (I think this should also work if the user has
> the system codepage set to UTF-8 (65001), which is a "beta" option on
> Windows.)
>
> My implementation uses GLib APIs (nice and simple) to convert from
> UTF-16 to UTF-8 and back again (if needed). GLib is required anyway if
> one is building the Cairo-based terminals. I suppose one could use
> native Windows APIs to the same purpose but I suspect it would be a lot
> more bother.
>
> In my test setup this whole deal is triggered by the CFLAGS define
>
> -DWIDE_ARGS
>
> which is respected only when building for Windows -- and admittedly has
> only been tested when cross-compiling for Windows from Linux using
> Mingw-w64. In my mingw Makefile, I have:
>
> WIDE_ARGS = 1
>
> ...
>
> ifdef WIDE_ARGS
> CFLAGS += -DWIDE_ARGS
> CFLAGS += $(shell pkg-config --cflags glib-2.0)
> endif
>
> --
> Allin Cottrell
> Department of Economics
> Wake Forest University
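
For reference, the UTF-16 to UTF-8 conversion discussed above amounts to roughly the following. This is a portable C sketch so that it can be tried off Windows. It is not the code from the patch or from winmain.c, and utf16_to_utf8() is a hypothetical name. On Windows itself the same job is done by WideCharToMultiByte() with the CP_UTF8 code page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch or from gnuplot):
 * convert a NUL-terminated UTF-16 string to NUL-terminated UTF-8.
 * Returns the number of bytes written, excluding the NUL, or -1 if
 * the input has an unpaired surrogate or the buffer is too small. */
long utf16_to_utf8(const uint16_t *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    while (*in) {
        uint32_t cp = *in++;
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* high surrogate: a low surrogate must follow */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t) (*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;      /* stray low surrogate */
        }

        need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= outsize)
            return -1;      /* keep one byte free for the NUL */

        if (need == 1) {
            out[n++] = (char) cp;
        } else if (need == 2) {
            out[n++] = (char) (0xC0 | (cp >> 6));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char) (0xE0 | (cp >> 12));
            out[n++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char) (0xF0 | (cp >> 18));
            out[n++] = (char) (0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return (long) n;
}
```

The resulting buffer is an ordinary "char *" string with no embedded NUL bytes, so it can pass through the regular string.h functions unchanged.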