Hello all,
Yesterday, I took a good look at how the UTF-8 encoding works and
contemplated how BP console can support it. The encoding scheme of
UTF-8 is actually quite different from what I had thought or assumed, and
that turns out to be good for backwards compatibility with MacOS Roman, I
think. I was under the mistaken impression that most single-byte codes
in the range 128-255 were mapped to characters in the ISO/IEC 8859-1
encoding and that only a few codes were used to prefix multi-byte
characters. I was confused because the Unicode character map IS
compatible with ISO/IEC 8859-1 up to code point 255. However, as some
of you may know, UTF-8 re-encodes all characters with code points above
127 using 2-4 bytes. Wikipedia has a very nice explanation of how this
works and of the significant advantages that this encoding scheme has
for backwards compatibility and auto-detection when using Unix/C-based
software that operates on strings as sequences of single bytes.
One of the difficulties that I thought we would face in supporting both
UTF-8 and MacOS Roman input files was being able to tell the difference
between them for the "high ASCII" characters that are common in BP2
files (such as •, …, ¬, etc.). But I now think this won't be difficult
at all because of the clever design of UTF-8. NONE of the byte values
between 128-255 represent single characters in UTF-8. And the patterns
they form in multi-byte sequences to encode non-ASCII characters are
very unlikely to occur in normal text input. So, when BP console reads
in a file in MacOS Roman encoding it will be possible to detect that it
is not UTF-8 because those "high ASCII" characters will look like
invalid UTF-8 byte sequences! As long as we limit ourselves to
supporting UTF-8 and MacOS Roman I think it will be quite easy to
auto-detect which encoding the input files are in.
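To make the detection idea concrete, here is a minimal sketch (in C, since that is what BP2 is written in) of a UTF-8 validity check. The function name is my own invention, not anything in BP2, and for brevity it only checks the byte-pattern structure -- it does not reject overlong encodings or surrogate code points, which a production check probably should. The point is that a MacOS Roman file containing "high ASCII" characters such as the bullet (A5) fails immediately, because in UTF-8 every byte in the 80-FF range must be part of a well-formed multi-byte sequence:

```c
#include <stddef.h>

/* Sketch: returns 1 if buf looks like valid UTF-8, 0 otherwise.
   A file that fails this check but contains bytes >= 0x80 can then
   be assumed to be MacOS Roman. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;                                  /* continuation bytes */
        if (b < 0x80)                { i++; continue; }  /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) need = 1;        /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) need = 2;        /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) need = 3;        /* 4-byte sequence */
        else return 0;          /* 80-BF cannot start a sequence */
        if (i + need >= len) return 0;                /* truncated sequence */
        for (size_t j = 1; j <= need; j++)
            if ((buf[i + j] & 0xC0) != 0x80) return 0;  /* bad continuation */
        i += need + 1;
    }
    return 1;
}
```

For example, the MacRoman bullet byte A5 fails (it matches none of the lead-byte patterns), while the UTF-8 sequence E2 80 A2 passes.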
A couple of examples should help for those who are interested. (All
byte values are in hexadecimal).
Character      MacRoman   8859-1   Unicode   UTF-8
--------------------------------------------------
Bullet •       A5         n/a(1)   U+2022    E2 80 A2
Ellipsis …     C9         n/a      U+2026    E2 80 A6
Not ¬          C2         AC       U+00AC    C2 AC
(1) The closest character to the bullet in ISO/IEC 8859-1 is the
"interpunct" with a code point of B7. (We could support the interpunct
as an alternative to the bullet if desired).
The UTF-8 byte sequences for • and … appear as '‚Ä¢' and '‚Ä¶' in MacOS
Roman so there is little chance of confusing them. I just now see the
coincidence that ¬ is C2 in MacOS Roman and C2 AC in UTF-8 which appears
as '¬¨' in MacOS Roman. There is some potential for confusion if ¬ is
the only non-ASCII character in an input file, but since BP2 uses ¬ to
represent a line continuation and (I believe) it must be immediately
followed by a newline sequence, we can probably use that fact to tell
the difference.
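That tiebreak could look something like the following sketch (again my own, not BP2 code, and resting on the assumption stated above that ¬ is always immediately followed by a newline in BP2 files). It scans for 0xC2 bytes and votes on the encoding based on what follows each one:

```c
#include <stddef.h>

/* Sketch: returns 1 if the 0xC2 bytes in buf look like UTF-8 "¬"
   (C2 AC), 0 if they look like MacOS Roman "¬" (C2 followed directly
   by a newline), and -1 if there is no evidence or it conflicts. */
int classify_c2_bytes(const unsigned char *buf, size_t len)
{
    int verdict = -1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xC2 || i + 1 >= len) continue;
        int vote;
        if (buf[i + 1] == 0xAC)
            vote = 1;                 /* UTF-8: C2 AC encodes ¬ */
        else if (buf[i + 1] == '\r' || buf[i + 1] == '\n')
            vote = 0;                 /* MacRoman: ¬ then newline */
        else
            continue;                 /* no evidence either way */
        if (verdict == -1) verdict = vote;
        else if (verdict != vote) return -1;  /* conflicting evidence */
    }
    return verdict;
}
```

In practice this would only need to run on files that otherwise pass the UTF-8 validity check, since those are the only ambiguous cases.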
So, this is just a very long way of saying that there is good news on
the text encoding front and that I have a better idea of how to deal
with supporting two very different encodings. There are probably some
small hurdles in the code to get over -- one is that BP2 expects all of
the above characters to fit in a single-byte char type -- but those are
(hopefully) just implementation details.
Best regards,
Anthony
P.S. What I have learned about UTF-8 also explains why my test program
to output "high ASCII" UTF-8 characters failed to work as expected in
the Terminal app on OS X. I wasn't outputting valid UTF-8-encoded
characters! If I just "cat" a valid UTF-8 file, then I can see the
non-ASCII characters just fine in the terminal! :)