Has anyone taken a look at BOCU-1? It's a method of encoding Unicode
code points. Apparently, IBM's using it in their (open-source, X-style
license) ICU code, for internal representation of Unicode strings.
It's been looking like something that might be used for the UTF support in
Here's some info about it, which explains it better than I can:
"BOCU : Binary-Ordered Compression for Unicode"
"BOCU-1: MIME-Compatible Unicode Compression"
"Compact Encodings of Unicode"
...and yet, none of those seems to give a complete explanation of the BOCU-1
encoding algorithm[s] -- just a warning, there.
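
For what it's worth, the basic idea behind BOCU, as I understand it, is that each code point is stored as its difference from a base value derived from the previous code point, so text that stays within one script produces small numbers. Here is a rough Python sketch of only that difference step. It is not the BOCU-1 byte format: the function names are made up for illustration, and the state rule (middle of the previous code point's 128-code-point block, starting from 0x40) is a simplification that leaves out whatever special cases the real encoding has.

```python
def deltas(text, initial_prev=0x40):
    """Turn a string into signed differences against a moving base value.

    Simplified illustration only; the byte-level packing of each
    difference, which is the bulk of real BOCU-1, is not attempted here.
    """
    prev = initial_prev
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(cp - prev)
        # Assumed state rule: move the base to the middle of the
        # 128-code-point block containing the current code point.
        prev = (cp & ~0x7F) + 0x40
    return out


def undeltas(diffs, initial_prev=0x40):
    """Inverse of deltas(): rebuild the string from the differences."""
    prev = initial_prev
    chars = []
    for d in diffs:
        cp = prev + d
        chars.append(chr(cp))
        prev = (cp & ~0x7F) + 0x40
    return "".join(chars)


if __name__ == "__main__":
    sample = "\u0434\u0430"  # two Cyrillic letters
    print(deltas(sample))    # one large jump into the block, then a small step
    print(undeltas(deltas(sample)) == sample)
```

After the first jump into a script's block, the differences stay within roughly -64..+63, which is what lets an encoder spend a single byte on most characters.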
An example C implementation, kind of hairy, is here:
CVS module: icuhtml/design/conversion/bocu1
(the password is in the CVSROOT, there)
I haven't checked the ICU sources, yet, to see what code they've been
using for the BOCU-1 encoding; I may get to that, sometime..
I'd been working on some *early* preliminary support, myself, trying to
translate the CVS'd BOCU-1 code (from the icuhtml docs) from C into Common
Lisp; it's kind of ugly, frankly -- a lot of repetitive code that's easy to
mistype, a bunch of C #defines, and I'll be darned if I understand what
all of it's supposed to do, yet. Maybe someone will
find a way of making it "neater" in the Common Lisp (I'm trying to do so,
myself, but I'm still new with the thing); either way, here's mention of
Looks like a keeper?