From: Ethan A M. <me...@uw...> - 2020-11-01 17:44:10
On Saturday, 31 October 2020 17:36:03 PST Allin Cottrell wrote:
> On Sun, 25 Oct 2020, Allin Cottrell wrote:
>
> > I tried googling this but didn't find an answer -- sorry if I should have
> > just tried harder! My question is: can gnuplot on Windows handle a unicode
> > filename argument passed in UTF-16? As in
> >
> > path/to/wgnuplot.exe <UTF-16 input filename>
>
> OK, that question was under-researched, but now I've done my
> homework. Sorry, this is a bit long but I hope I can arouse some
> interest in the topic.

I don't have any direct insight into this issue other than to note
that the filesystem itself may be a factor. The standard Windows
filesystems impose an encoding on filenames, whereas Linux filesystems
are agnostic to encoding; any null-terminated byte sequence not
containing '/' is a legal file name.

The following entry from the R developer blog is of interest
https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/

I gather from the discussion there that Windows 10 can be made to
support UTF-8 as a native encoding, calling it "extended ASCII".
In that mode R (and I suppose gnuplot) can use the existing generic
Linux code paths rather than multiple layers of text conversion.

	Ethan

>
> Why bother with UTF-16 filename arguments? Nowadays a fair number of
> Windows users construct paths (directory names or filenames) which
> are "out of codepage" -- that is, unicode names which cannot be
> represented in the (retro) "system codepage", which is typically
> just an 8-bit encoding. Since Windows has supported unicode since NT
> came out, it's a reasonable expectation that any filename one can
> construct on the platform should be accessible via any program of
> interest. But a program that restricts itself to the "ANSI" form of
> filenames simply cannot access files with out-of-codepage paths.
> (Sane modern OSes don't have this problem because they use UTF-8
> throughout.)
>
> So what about gnuplot?
> I may be wrong but it seems to me that
> gnuplot on Windows is stuck with "ANSI" filenames at present. Even
> with UNICODE and _UNICODE defined when compiling the program, the
> command-line arguments are retrieved in winmain.c using either _argv
> or __argv (depending on the compiler), and these get the ANSI-form
> arguments (as opposed to __wargv which gets the arguments in UTF-16
> form).
>
> It would be easy to swap out __argv for __wargv but by itself this
> would be very disruptive. The subsequent code in winmain.c, and then
> the code in plot.c (gnu_main) to which the args array is passed, all
> assumes the elements of argv are plain "char *", not "wide char"
> arrays. Handling UTF-16, which is chock-full of NUL bytes, would
> require lots of messy "ifdefs".
>
> I have a proposal for fixing this. I realise it may not be
> acceptable as it stands but maybe someone else might want to take it
> up. I'm attaching patches for src/win/winmain.c and src/misc.c for
> reference but here I'll try to explain the strategy.
>
> 1) In winmain.c, grab the command-line arguments as UTF-16 but
> immediately convert them to UTF-8, so they can be handled by the
> regular string.h APIs, both here and in plot.c (gnu_main).
>
> 2) When we actually go to open a command-line file argument
> (loadpath_fopen, in misc.c, called from gnu_main), we first try
> opening the file using the filename as-is, but if that fails (and
> the filename validates as UTF-8) we convert it to UTF-16 and try
> again.
>
> Since UTF-8 is a superset of ASCII, ASCII filename arguments should
> pass through transparently. Within-codepage non-ASCII filenames
> should get converted back to UTF-16 and opened OK. And the bonus is
> that out-of-codepage arguments should also be converted and opened
> OK.
>
> I've tested this on Windows 10, with the system codepage set to
> Windows 1252 ("Western Europe"), and have successfully opened files
> with names in Russian and Greek.
> (I think this should also work if
> the user has the system codepage set to UTF-8 (65001), which is a
> "beta" option on Windows.)
>
> My implementation uses GLib APIs (nice and simple) to convert from
> UTF-16 to UTF-8 and back again (if needed). GLib is required anyway
> if one is building the Cairo-based terminals. I suppose one could
> use native Windows APIs to the same purpose but I suspect it would
> be a lot more bother.
>
> In my test setup this whole deal is triggered by the CFLAGS define
>
> -DWIDE_ARGS
>
> which is respected only when building for Windows -- and admittedly
> has only been tested when cross-compiling for Windows from Linux
> using Mingw-w64. In my mingw Makefile, I have:
>
> WIDE_ARGS = 1
>
> ...
>
> ifdef WIDE_ARGS
> CFLAGS += -DWIDE_ARGS
> CFLAGS += $(shell pkg-config --cflags glib-2.0)
> endif
>
> --
> Allin Cottrell
> Department of Economics
> Wake Forest University
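
For reference, step 2 of the quoted proposal (try the name as-is, then
retry via UTF-16 if it validates as UTF-8) might be sketched roughly as
below. This is not the attached patch: the patch uses GLib, whereas this
sketch assumes the native Windows calls MultiByteToWideChar and _wfopen
instead. The function names utf8_is_valid and fopen_utf8_fallback are
hypothetical and do not exist in gnuplot. On non-Windows builds the
fallback branch is compiled out and only the plain fopen remains.

```c
#include <stdio.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/* Hypothetical helper: return 1 if str is well-formed UTF-8, else 0.
 * Rejects overlong encodings, UTF-16 surrogate values (U+D800..U+DFFF)
 * and code points above U+10FFFF. */
int utf8_is_valid(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;

    while (*s) {
        unsigned long cp;
        int extra, i;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return 0;  /* stray continuation byte or invalid lead byte */
        s++;

        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return 0;  /* not a continuation byte (or premature NUL) */
            cp = (cp << 6) | (unsigned long) (*s & 0x3F);
        }

        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return 0;
    }
    return 1;
}

/* Hypothetical stand-in for the open step described in the proposal:
 * first try the name unchanged; if that fails and the name is valid
 * UTF-8, convert name and mode to UTF-16 and retry (Windows only). */
FILE *fopen_utf8_fallback(const char *name, const char *mode)
{
    FILE *fp = fopen(name, mode);

#ifdef _WIN32
    if (fp == NULL && utf8_is_valid(name)) {
        wchar_t wname[MAX_PATH], wmode[16];

        /* A return value of 0 means the conversion failed or the
         * result did not fit in the fixed-size buffer. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                name, -1, wname, MAX_PATH) > 0 &&
            MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) > 0)
            fp = _wfopen(wname, wmode);
    }
#endif
    return fp;
}
```

The fixed MAX_PATH buffer is a simplification for the sketch; a real
implementation would probably size the buffer from a first
MultiByteToWideChar call made with a zero-length output.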