From: Allin C. <cot...@wf...> - 2020-11-01 00:36:31
On Sun, 25 Oct 2020, Allin Cottrell wrote:
> I tried googling this but didn't find an answer -- sorry if I should have
> just tried harder! My question is: can gnuplot on Windows handle a unicode
> filename argument passed in UTF-16? As in
>
> path/to/wgnuplot.exe <UTF-16 input filename>
OK, that question was under-researched, but now I've done my
homework. Sorry, this is a bit long but I hope I can arouse some
interest in the topic.
Why bother with UTF-16 filename arguments? Nowadays a fair number of
Windows users construct paths (directory names or filenames) which
are "out of codepage" -- that is, unicode names which cannot be
represented in the (retro) "system codepage", which is typically
just an 8-bit encoding. Since Windows has supported unicode since NT
came out, it's a reasonable expectation that any filename one can
construct on the platform should be accessible via any program of
interest. But a program that restricts itself to the "ANSI" form of
filenames simply cannot access files with out-of-codepage paths.
(Sane modern OSes don't have this problem because they use UTF-8
throughout.)
So what about gnuplot? I may be wrong but it seems to me that
gnuplot on Windows is stuck with "ANSI" filenames at present. Even
with UNICODE and _UNICODE defined when compiling the program, the
command-line arguments are retrieved in winmain.c using either _argv
or __argv (depending on the compiler), and these get the ANSI-form
arguments (as opposed to __wargv which gets the arguments in UTF-16
form).
It would be easy to swap out __argv for __wargv but by itself this
would be very disruptive. The subsequent code in winmain.c, and then
the code in plot.c (gnu_main) to which the args array is passed, all
assumes the elements of argv are plain "char *", not "wide char"
arrays. Handling UTF-16, which is chock-full of NUL bytes, would
require lots of messy "ifdefs".
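
To illustrate the NUL-byte point, here is a minimal sketch in portable C. It is not taken from the attached patches, and the function names and the sample filename "plot.gp" are made up for illustration. It shows that a byte-oriented routine such as strlen() stops after the first character of a UTF-16LE string, while counting 16-bit units gives the real length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "plot.gp" encoded by hand as UTF-16LE (hypothetical sample name):
   every ASCII character is followed by a zero byte, and the string
   terminator is two zero bytes. */
static const unsigned char utf16le_name[] = {
    'p', 0, 'l', 0, 'o', 0, 't', 0, '.', 0, 'g', 0, 'p', 0, 0, 0
};

/* Length as a char-based routine sees it: stops at the first NUL byte. */
size_t byte_string_length(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *) s);
}

/* Length in 16-bit code units, reading the buffer two bytes at a time. */
size_t utf16le_length(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[2 * n] != 0 || s[2 * n + 1] != 0) {
        n++;
    }
    return n;
}
```

For the sample buffer, byte_string_length() reports 1 and utf16le_length() reports 7, which is why passing wide strings through code written for "char *" does not work without changes.
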
I have a proposal for fixing this. I realise it may not be
acceptable as it stands but maybe someone else might want to take it
up. I'm attaching patches for src/win/winmain.c and src/misc.c for
reference but here I'll try to explain the strategy.
1) In winmain.c, grab the command-line arguments as UTF-16 but
immediately convert them to UTF-8, so they can be handled by the
regular string.h APIs, both here and in plot.c (gnu_main).
2) When we actually go to open a command-line file argument
(loadpath_fopen, in misc.c, called from gnu_main), we first try
opening the file using the filename as-is, but if that fails (and
the filename validates as UTF-8) we convert it to UTF-16 and try
again.
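
The "validates as UTF-8" check in step 2 can be sketched in portable C as follows. This is an illustration under my own assumptions, not the code in the attached patches: is_valid_utf8 is a hypothetical name, and with GLib available one would more likely call g_utf8_validate(). The check rejects stray or missing continuation bytes, overlong forms, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated string is
   well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;

    assert(str != NULL);
    while (*s) {
        unsigned long cp;
        int extra, i;

        if (*s < 0x80) {            /* plain ASCII byte */
            s++;
            continue;
        } else if ((*s & 0xE0) == 0xC0) {
            cp = *s & 0x1F; extra = 1;
        } else if ((*s & 0xF0) == 0xE0) {
            cp = *s & 0x0F; extra = 2;
        } else if ((*s & 0xF8) == 0xF0) {
            cp = *s & 0x07; extra = 3;
        } else {
            return 0;               /* stray continuation or bad lead byte */
        }
        s++;
        for (i = 0; i < extra; i++, s++) {
            /* a NUL here also fails, so a truncated sequence is rejected
               without reading past the terminator */
            if ((*s & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*s & 0x3F);
        }
        /* overlong encodings */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000)) return 0;
        /* UTF-16 surrogates and out-of-range values */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        if (cp > 0x10FFFF) return 0;
    }
    return 1;
}
```

A pure-ASCII name passes, as does a Cyrillic name in UTF-8, whereas a name containing the Windows-1252 byte 0xE9 followed by an ASCII letter does not, so a name that fails this test is not sent down the UTF-16 retry path.
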
Since UTF-8 is a superset of ASCII, ASCII filename arguments should
pass through transparently. Within-codepage non-ASCII filenames
should get converted back to UTF-16 and opened OK. And the bonus is
that out-of-codepage arguments should also be converted and opened
OK.
I've tested this on Windows 10, with the system codepage set to
Windows 1252 ("Western Europe"), and have successfully opened files
with names in Russian and Greek. (I think this should also work if
the user has the system codepage set to UTF-8 (65001), which is a
"beta" option on Windows.)
My implementation uses GLib APIs (nice and simple) to convert from
UTF-16 to UTF-8 and back again (if needed). GLib is required anyway
if one is building the Cairo-based terminals. I suppose one could
use native Windows APIs to the same purpose but I suspect it would
be a lot more bother.
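
For readers who want to see what the UTF-16 to UTF-8 step involves, here is a hand-rolled sketch in portable C. It is neither the GLib route used in the patches (GLib offers g_utf16_to_utf8() and g_utf8_to_utf16() for this job) nor the native Windows route (WideCharToMultiByte() and MultiByteToWideChar() with CP_UTF8); the name utf16_to_utf8_sketch is hypothetical and the input is assumed to be in native byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert a 0-terminated UTF-16 string (native byte order) into a newly
   allocated UTF-8 string. Returns NULL on an unpaired surrogate or on
   allocation failure. The caller frees the result. */
char *utf16_to_utf8_sketch(const uint16_t *in)
{
    size_t n = 0, o = 0, i;
    char *out;

    assert(in != NULL);
    while (in[n] != 0) n++;

    /* one UTF-16 unit never needs more than 3 UTF-8 bytes; a surrogate
       pair (2 units) needs 4, so 3 * n + 1 is always enough */
    out = malloc(3 * n + 1);
    if (out == NULL) return NULL;

    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* high surrogate: the next unit must be a low surrogate
               (in[n] is the terminator, so in[i + 1] is readable) */
            uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) { free(out); return NULL; }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            free(out);              /* low surrogate without a high one */
            return NULL;
        }

        if (cp < 0x80) {
            out[o++] = (char) cp;
        } else if (cp < 0x800) {
            out[o++] = (char) (0xC0 | (cp >> 6));
            out[o++] = (char) (0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char) (0xE0 | (cp >> 12));
            out[o++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char) (0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char) (0xF0 | (cp >> 18));
            out[o++] = (char) (0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char) (0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}
```

The result contains no embedded NUL bytes, so it can be passed around as an ordinary "char *" and converted back to UTF-16 only at the point where the file is opened.
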
In my test setup this whole deal is triggered by the CFLAGS define

  -DWIDE_ARGS

which is respected only when building for Windows -- and admittedly
has only been tested when cross-compiling for Windows from Linux
using Mingw-w64. In my mingw Makefile, I have:

  WIDE_ARGS = 1
  ...
  ifdef WIDE_ARGS
  CFLAGS += -DWIDE_ARGS
  CFLAGS += $(shell pkg-config --cflags glib-2.0)
  endif

--
Allin Cottrell
Department of Economics
Wake Forest University