I have been giving thought to how best to handle text in Hermes
given the desire to produce versions for Linux and Mac. Broadly
there are three options, as follows:
(1) Handle all text as UTF-8 and, in Windows, convert to UTF-16 at a
low level when dealing with the Windows API.
(2) Handle all text as UTF-16 and, in Linux, convert to UTF-8 at a
low level when calling Linux APIs.
(3) Use MSVC's TCHAR mechanism, which allows text to be either UTF-8
or UTF-16 depending on a compilation option. For Windows we would
use UTF-16 and for Linux UTF-8. I am unclear which would be better
for Mac.
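To make the "convert at a low level" step in (1) and (2) concrete, here is
a minimal portable sketch of a UTF-8 to UTF-16 conversion. The names
utf8_to_utf16 and append_utf16 are hypothetical and not taken from the
Hermes source. The sketch replaces malformed or truncated input with U+FFFD
and, to stay short, does not reject overlong encodings or encoded
surrogates; on Windows the same job could instead be delegated to the
platform's own conversion routines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point as UTF-16 code units.
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // callers must pass a valid code point
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // split into a surrogate pair
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
// Malformed or truncated sequences become U+FFFD.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (i + extra >= in.size()) {  // sequence runs off the end
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) {
            out.push_back(0xFFFD);
            ++i;  // resynchronise on the next byte
            continue;
        }
        append_utf16(out, cp);
        i += extra + 1;
    }
    return out;
}
```

The reverse direction, needed for (2), would follow the same pattern with
the roles of the two encodings swapped.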
I dislike (1) mostly because it would require the most extensive
changes for Windows and I think we should give priority to getting a
Windows version completed with as few changes as practical. I
dislike (2) because, while there can be no problem storing text as
UTF-16 on all platforms, facilities for manipulating text as UTF-16
may be limited on Linux (and possibly Mac). The bottom line is that
I find myself favouring (3). It makes for great consistency:
everywhere text is stored in chars it is assumed to be UTF-8
(except, of course, in contexts where it is being converted from
other single- and multi-byte character sets to UTF-8), and everywhere
text is stored in wchar_ts it is assumed to be UTF-16.
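A rough sketch of what a TCHAR-style switch for (3) might look like in
portable C++ follows. The names hchar, hstring, HTEXT, make_label and the
build flag HERMES_WIDE_TEXT are all hypothetical, invented for
illustration. One caveat worth flagging: wchar_t is 16 bits with MSVC but
typically 32 bits with GCC and Clang on Linux and Mac, so "wchar_t means
UTF-16" only holds as written on Windows.

```cpp
#include <cassert>
#include <string>

// HERMES_WIDE_TEXT is a hypothetical build flag playing the role that
// _UNICODE plays for MSVC's TCHAR.
#if defined(HERMES_WIDE_TEXT)
using hchar = wchar_t;       // assumed to hold UTF-16 (true for MSVC)
#define HTEXT(s) L##s        // wide string literal
#else
using hchar = char;          // assumed to hold UTF-8
#define HTEXT(s) s           // narrow string literal
#endif

using hstring = std::basic_string<hchar>;

// Hypothetical helper written once against hchar; it compiles unchanged
// in either build.
inline hstring make_label(const hchar* text) {
    assert(text != nullptr);
    return hstring(text);
}
```

Code written against hstring and HTEXT would then build either way, for
example make_label(HTEXT("Inbox")).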
Do we have any consensus on this? Before making a final decision, it
would be useful to know what wxWidgets works with -- I hope either.
Soren, could you answer this for us, please?
I am not a Mac programmer, but have been around long enough to have
some ideas....
Firstly, in what context are you wanting to handle text? Messages can
arrive in multiple encodings, and presumably will need to be left in
that form? Or are you thinking of text strings as part of the
program? Or even the source code itself?
As for what Mac supports, TextWrangler (a version of the very highly
regarded BBEdit programmer's editor) supports all these encodings:
I believe UTF-8 is most common. Note also that there are two formats
of UTF-16 with bytes switched. Mac uses one, Windows uses the other
iirc. No idea what BOM above means.
Note also that line endings on Mac were originally a carriage return,
but when it switched to a Unix base with OS X, the default became a
line feed. Both are still supported. This compares with Windows which
I believe is a carriage return/line feed pair.
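If the line-ending differences matter, one common approach is to
normalise everything to a line feed on input and convert back only when
writing for a particular platform. A minimal sketch, assuming the text is
held as UTF-8 or plain ASCII in a std::string (the function name
normalize_newlines is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turn CR LF pairs and lone CRs into a single LF.
// Safe on UTF-8 because the CR and LF bytes never occur inside a
// multi-byte sequence.
inline std::string normalize_newlines(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            out.push_back('\n');
            // Swallow the LF of a Windows-style CR LF pair.
            if (i + 1 < in.size() && in[i + 1] == '\n') {
                ++i;
            }
        } else {
            out.push_back(in[i]);
        }
    }
    assert(out.size() <= in.size());  // normalising never adds bytes
    return out;
}
```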
You might want to sign up to Apple's developer site:
https://developer.apple.com/
It is free - you just need an Apple ID. It has all the documentation
about Apple products. There are also forums where you can interact
with other developers to clarify techy issues. You may even be able
to recruit a Mac programmer to provide more practical input.
Cheers
David