How about having each Scheme implementation provide a
customized procedure that converts a character reference
to a list of characters? The default for such a procedure
can simply be (lambda (code) (list (integer->char code))).
If the implementation doesn't support wide characters but
treats every byte of a character string as a 'character',
then the procedure may return the sequence of bytes that is the
UTF-8 encoding of the given Unicode character.
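
For the byte-oriented case, here is a rough sketch of what such a procedure might look like. The name code->utf8-chars is made up for illustration (it is not part of SSAX), and the code assumes that the argument is a valid Unicode code point and that integer->char accepts values in [0, 255].

```scheme
;; Hypothetical helper, not part of SSAX: encode a Unicode code point
;; as a list of 8-bit "characters" holding its UTF-8 bytes.
;; Assumes CODE is a valid code point (no range checking is done).
(define (code->utf8-chars code)
  ;; Build a continuation byte (10xxxxxx) from the 6 bits of CODE
  ;; that start at the position selected by SHIFT (a power of 64).
  (define (cont shift)
    (integer->char (+ #x80 (modulo (quotient code shift) 64))))
  (cond
    ((< code #x80)                      ; 1 byte: 0xxxxxxx
     (list (integer->char code)))
    ((< code #x800)                     ; 2 bytes: 110xxxxx 10xxxxxx
     (list (integer->char (+ #xC0 (quotient code 64)))
           (cont 1)))
    ((< code #x10000)                   ; 3 bytes: 1110xxxx + 2 continuation
     (list (integer->char (+ #xE0 (quotient code 4096)))
           (cont 64)
           (cont 1)))
    (else                               ; 4 bytes: 11110xxx + 3 continuation
     (list (integer->char (+ #xF0 (quotient code 262144)))
           (cont 4096)
           (cont 64)
           (cont 1)))))

;; The character reference from the report below, U+2019:
(map char->integer (code->utf8-chars 8217))   ; => (226 128 153)
```
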
If the implementation uses UCS-2 as its internal encoding,
the procedure might still return more than one (UCS-2) character,
when the referenced character is represented by a surrogate pair.
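
The surrogate-pair case could be sketched roughly as follows. Both procedure names are hypothetical. The arithmetic is kept separate from the conversion to characters, because only an implementation whose characters are 16-bit code units will accept surrogate values in integer->char; a Scheme with full Unicode characters would normally reject them.

```scheme
;; Hypothetical helper: return the UTF-16 code units (as integers)
;; for a Unicode code point. Assumes CODE is a valid code point.
(define (code->utf16-units code)
  (if (< code #x10000)
      (list code)                              ; fits in one 16-bit unit
      (let ((v (- code #x10000)))              ; 20 bits to split
        (list (+ #xD800 (quotient v 1024))     ; high surrogate: top 10 bits
              (+ #xDC00 (modulo v 1024))))))   ; low surrogate: low 10 bits

;; On an implementation with 16-bit characters (an assumption),
;; the character-reference procedure could then be:
(define (code->ucs2-chars code)
  (map integer->char (code->utf16-units code)))

(code->utf16-units #x1D11E)   ; => (55348 56606), i.e. #xD834 #xDD1E
```
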
If the implementation uses an internal character encoding
other than Unicode, the procedure can convert
from the Unicode code point to the internal character. In Gauche,
the procedure ucs->char does that, and I'd rather use it instead
of integer->char (which assumes the given integer is in the
internal encoding) to resolve the character reference.
From: MJ Ray <markj@...>
Subject: [ssax-sxml] UTF-8 in XML and other animals
Date: Fri, 11 Oct 2002 14:00:28 GMT
> Here's an interesting problem, I think. I'm using SSAX to read into SXML
> a UTF-8 XML file. The Scheme being used is PLT Scheme, which doesn't handle
> UTF-8 natively yet. When SSAX hits an entity of the form &#8217; or
> similar, it calls (integer->char 8217) and MzScheme dies:
> integer->char: expects argument of type <exact in [0, 255]>; given 8217
> What is the *right* solution here? Adapt the ssax module for PLT to use its
> own integer->char routine? Move to an implementation which does support
> UTF-8? Write UTF-8 routines for PLT and make ssax.plt use them?
> The UTF-8 XML files are reality, and not going to go away.
> (The "other animals" are Scheme implementations.)
> MJR| v
> ---|--[ Luminas internet applications http://www.luminas.co.uk/ ]-----|
> `--[ http://mjr.towers.org.uk/ ]---------[ slef at jabber.at ]-----'