Hello all,
Yesterday, I took a good look at how the UTF-8 encoding works and
contemplated how BP console can support it. The encoding scheme of
UTF-8 is actually quite different from what I had thought or assumed, and
that turns out to be good for backwards compatibility with MacOS Roman, I
think. I was under the mistaken impression that most single-byte codes
in the range 128-255 were mapped to characters in the ISO/IEC 8859-1
encoding and that only a few codes were used to prefix multi-byte
characters. I was confused because the Unicode character map IS
compatible with ISO/IEC 8859-1 up to code point 255. However, as some
of you may know, UTF-8 re-encodes all characters with code points above
127 using 2-4 bytes. Wikipedia has a very nice explanation of how this
works and of the significant advantages that this encoding scheme has
for backwards compatibility and auto-detection when using Unix/C-based
software that operates on strings as sequences of single bytes.
One of the difficulties that I thought we would face in supporting both
UTF-8 and MacOS Roman input files was being able to tell the difference
between them for the "high ASCII" characters that are common in BP2
files (such as •, …, ¬, etc.). But I now think this won't be difficult
at all because of the clever design of UTF-8. NONE of the byte values
between 128-255 represent single characters in UTF-8. And the patterns
they form in multi-byte sequences to encode non-ASCII characters are
very unlikely to occur in normal text input. So, when BP console reads
in a file in MacOS Roman encoding it will be possible to detect that it
is not UTF-8 because those "high ASCII" characters will look like
invalid UTF-8 byte sequences! As long as we limit ourselves to
supporting UTF-8 and MacOS Roman I think it will be quite easy to
auto-detect which encoding the input files are in.
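To make the detection idea concrete, here is a minimal sketch (in C, since that is what BP2 is written in) of a UTF-8 validity check. The function name is my own invention, not anything in BP2, and for brevity it only checks the byte-pattern structure -- it does not reject overlong encodings or surrogate code points, which a production check probably should. The point is that a MacOS Roman file containing "high ASCII" characters such as the bullet (A5) fails immediately, because in UTF-8 every byte in the 80-FF range must be part of a well-formed multi-byte sequence:

```c
#include <stddef.h>

/* Sketch: returns 1 if buf looks like valid UTF-8, 0 otherwise.
   A file that fails this check but contains bytes >= 0x80 can then
   be assumed to be MacOS Roman. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;                                  /* continuation bytes */
        if (b < 0x80)                { i++; continue; }  /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) need = 1;        /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) need = 2;        /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) need = 3;        /* 4-byte sequence */
        else return 0;          /* 80-BF cannot start a sequence */
        if (i + need >= len) return 0;                /* truncated sequence */
        for (size_t j = 1; j <= need; j++)
            if ((buf[i + j] & 0xC0) != 0x80) return 0;  /* bad continuation */
        i += need + 1;
    }
    return 1;
}
```

For example, the MacRoman bullet byte A5 fails (it matches none of the lead-byte patterns), while the UTF-8 sequence E2 80 A2 passes.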
A couple of examples should help for those who are interested. (All
byte values are in hexadecimal).
Character      MacRoman   8859-1   Unicode   UTF-8
--------------------------------------------------
Bullet •       A5         n/a(1)   U+2022    E2 80 A2
Ellipsis …     C9         n/a      U+2026    E2 80 A6
Not ¬          C2         AC       U+00AC    C2 AC
(1) The closest character to the bullet in ISO/IEC 8859-1 is the
"interpunct" with a code point of B7. (We could support the interpunct
as an alternative to the bullet if desired).
The UTF-8 byte sequences for • and … appear as '‚Ä¢' and '‚Ä¶' in MacOS
Roman so there is little chance of confusing them. I just now see the
coincidence that ¬ is C2 in MacOS Roman and C2 AC in UTF-8 which appears
as '¬¨' in MacOS Roman. There is some potential for confusion if ¬ is
the only non-ASCII character in an input file, but since BP2 uses ¬ to
represent a line continuation and (I believe) it must be immediately
followed by a newline sequence, we can probably use that fact to tell
the difference.
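That tiebreak could look something like the following sketch (again my own, not BP2 code, and resting on the assumption stated above that ¬ is always immediately followed by a newline in BP2 files). It scans for 0xC2 bytes and votes on the encoding based on what follows each one:

```c
#include <stddef.h>

/* Sketch: returns 1 if the 0xC2 bytes in buf look like UTF-8 "¬"
   (C2 AC), 0 if they look like MacOS Roman "¬" (C2 followed directly
   by a newline), and -1 if there is no evidence or it conflicts. */
int classify_c2_bytes(const unsigned char *buf, size_t len)
{
    int verdict = -1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xC2 || i + 1 >= len) continue;
        int vote;
        if (buf[i + 1] == 0xAC)
            vote = 1;                 /* UTF-8: C2 AC encodes ¬ */
        else if (buf[i + 1] == '\r' || buf[i + 1] == '\n')
            vote = 0;                 /* MacRoman: ¬ then newline */
        else
            continue;                 /* no evidence either way */
        if (verdict == -1) verdict = vote;
        else if (verdict != vote) return -1;  /* conflicting evidence */
    }
    return verdict;
}
```

In practice this would only need to run on files that otherwise pass the UTF-8 validity check, since those are the only ambiguous cases.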
So, this is just a very long way of saying that there is good news on
the text encoding front and that I have a better idea of how to deal
with supporting two very different encodings. There are probably some
small hurdles in the code to get over -- one is that BP2 expects all of
the above characters to fit in a single-byte char type -- but those are
(hopefully) just implementation details.
Best regards,
Anthony
P.S. What I have learned about UTF-8 also explains why my test program
to output "high ASCII" UTF-8 characters failed to work as expected in
the Terminal app on OS X. I wasn't outputting valid UTF-8-encoded
characters! If I just "cat" a valid UTF-8 file, then I can see the
non-ASCII characters just fine in the terminal! :)