[ CCed all over the place to people who might be interested. ]
Terje Bless <link@...> wrote:
>Peter Newcomb <peter.newcomb@...> wrote:
>>This says that code points 9-10, 13, 32-126, 160-55295, 57344-65533,
>>and 65536-1114111 (0x10000-0x10FFFF) are defined. If you need to use
>>code points above 0x10FFFF, just add a line to the DESCSET... say:
>>1114112 14680064 1114112
>A grand idea. However, it falls down in practice. :-(
Everyone, please forgive me. I'm an idiot! :-(
It's not an issue with characters > 0x10FFFF at all! I was fooled by the
message about "characters above 65536 not supported" to such a degree that
I never actually tested my assumption. The relevant messages emitted
when attempting to parse the XHTML+MathML DTD are:
"cannot convert character reference to number 120171 because
character not in internal character set"
As you can see, 120171 (0x01D56B) is clearly < 0x10FFFF.
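For anyone who wants to check the arithmetic, here is a minimal Python sketch. The range table is copied from Peter's description quoted above; the names DECLARED_RANGES and is_declared are made up for illustration and are not part of OpenSP.

```python
# Inclusive decimal ranges as listed in Peter Newcomb's quoted message.
# DECLARED_RANGES and is_declared are hypothetical helper names.
DECLARED_RANGES = [
    (9, 10), (13, 13), (32, 126), (160, 55295),
    (57344, 65533), (65536, 1114111),
]


def is_declared(code_point):
    """Return True if code_point falls inside one of the quoted ranges."""
    return any(lo <= code_point <= hi for lo, hi in DECLARED_RANGES)


cp = 120171
print(hex(cp))           # 0x1d56b
print(cp <= 0x10FFFF)    # True
print(is_declared(cp))   # True
```

So the character is covered by the ranges Peter described, and the problem has to lie elsewhere.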
What appears to be happening is either one or both of the following:
1) OpenSP uses UCS-2 as its internal character set when operating
in Fixed Charset Mode. The UCS-2 repertoire only covers the Basic
Multilingual Plane (Plane 0) of Unicode; i.e. only the 2^16 code
points up to 0xFFFF (65535) are supported.
"UCS-4 allows 2^31 code points, while UCS-2 only allows 2^16.
XML allows characters in the first 17 Planes of UCS, i.e. 2^20,
so if you specify ISO-IR 176 as the BASESET, you effectively
exclude characters beyond the Basic Multilingual Plane (BMP),
thus, MathML Plane 1 characters cannot be used, for example." 
2) The Unicode tables included with OpenSP are badly out of date.
The current ones in the CVS are from a version before any code
points outside the Basic Multilingual Plane were assigned. MathML2
is the first SGML Application from the W3C to actually use any
Plane 1 (Supplementary Multilingual Plane) characters, which
is probably why this issue has never surfaced before.
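Both hypotheses come down to the character living outside Plane 0. A rough Python sketch of the plane arithmetic follows. The helper names plane and fits_ucs2 are invented for illustration, and this only models the numeric limits, not anything OpenSP actually does internally.

```python
def plane(code_point):
    """Plane number: each plane holds 2**16 code points."""
    return code_point >> 16


def fits_ucs2(code_point):
    """True if the code point fits in a single 16-bit value (Plane 0 only)."""
    return 0 <= code_point <= 0xFFFF


cp = 0x1D56B                 # the character from the error message
print(plane(cp))             # 1
print(fits_ucs2(cp))         # False

UCS2_SIZE = 2 ** 16          # 65536 code points in Plane 0
XML_RANGE_SIZE = 17 * 2 ** 16
print(XML_RANGE_SIZE)        # 1114112, i.e. 0x10FFFF + 1
```

A 16-bit internal character type cannot represent 0x1D56B at all, which would be consistent with hypothesis 1 regardless of what the tables in hypothesis 2 say.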
For issue #1 I've updated the included sp/pubtext/xml.dcl to refer to
ISO-IR 177, "ISO/IEC 10646:1993, UCS-4, Level 3". But AFAICT this does
not affect what OpenSP uses internally when operating in Fixed Charset
Mode. It only affects the behaviour when not operating in Fixed Charset
Mode and that file is given as the SGML System Declaration.
In fact, I'm having big trouble actually finding out where OpenSP's implied
default (internal) SGML Declaration comes from. Is it read in from a file
somewhere at compile time? IOW, is e.g. sp/unicode/unicode.sd read at
compile time to determine the internal SGML Declaration? Or is it read at
run-time? Or maybe even not specified in an SGML Declaration at all;
perhaps it's just a bunch of variables set in a header file somewhere?
My experiments along those lines have so far yielded little or no result.
Modifying the files I mentioned -- both at compile-time and at run-time --
seems to have no effect.
Issue #2 may of course also be the culprit here. If the Unicode tables
OpenSP uses do not contain any assigned code points outside Plane 0, it
follows that the code point above, 0x01D56B, cannot be assigned. It may
well be that this is what is triggering the error message from onsgmls. But
where does OpenSP get its notion of what code points are assigned? Is this
the function of the sp/unicode/unicode.syn file?
That seems to be auto-generated from (the ancient version of) the Unicode
tables from unicode.org by the gensyntax.pl script. Hacking that script
to update the unicode.syn file to match the latest version of the Unicode
tables seems within reach. But it's also a PITA to do because the code is
grotty and badly documented; and, of course, the format of the Unicode
tables on unicode.org seems to have changed significantly in the meantime.
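As a starting point for such an update, here is a rough Python sketch that pulls the assigned code points beyond Plane 0 out of data in the semicolon-separated UnicodeData.txt layout. It is not a replacement for gensyntax.pl and does not emit the unicode.syn format. The function name and the sample lines are illustrative assumptions, not taken from any particular Unicode release.

```python
def supplementary_ranges(lines):
    """Collect inclusive (start, end) ranges of code points above 0xFFFF.

    Expects lines in the UnicodeData.txt layout: hex code point in
    field 0, name in field 1, with large blocks given as a pair of
    '<..., First>' and '<..., Last>' entries.
    """
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            range_start = cp          # remember start, wait for Last
            continue
        if name.endswith(", Last>") and range_start is not None:
            start, end = range_start, cp
            range_start = None
        else:
            start = end = cp
        if end <= 0xFFFF:
            continue                  # entirely inside Plane 0
        start = max(start, 0x10000)
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)   # merge adjacent entries
        else:
            ranges.append((start, end))
    return ranges


# Illustrative sample lines only (assumed, not a real data file).
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "1D56A;MATHEMATICAL DOUBLE-STRUCK SMALL Y;Ll;0;L;<font> 0079;;;;N;;;;;",
    "1D56B;MATHEMATICAL DOUBLE-STRUCK SMALL Z;Ll;0;L;<font> 007A;;;;N;;;;;",
    "20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;",
    "2A6D6;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;",
]

for start, end in supplementary_ranges(SAMPLE):
    print(f"{start:X}-{end:X}")
```

Whether ranges like these can simply be appended to unicode.syn is exactly the open question; the sketch only shows that extracting them from the newer tables is the easy part.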
Can anyone shed /any/ light on this for me? I'm getting desperate here and
not getting anywhere. I might even offer to _pay_ someone to fix this for
me! Though any pay would come out of my own pocket so adjust expectations
 - Proving once again that:
"Assumption is the mother of all fuckups!"
 - XHTML+MathML DTD:
MathML 2.0 modular DTDs:
 - Sample XHTML+MathML Document Instance test case:
 - Quote from private email on the subject.
 - This is an unfortunate mistake inherited from James Clark's
original version, and which is present in both the W3C "Note
on SGML and XML" and the XHTML 1.0 Recommendation. The
correct SGML Declaration for XML should make reference to
ISO-IR 177, "ISO/IEC 10646:1993, UCS-4, Level 3", and not
ISO-IR 176, "ISO/IEC 10646:1993, UCS-2, Level 3".
Well... Or so I've been told, anyway. :-)
 - <URL:http://www.unicode.org/Public/UNIDATA/>
 - <URL:http://www.w3.org/TR/NOTE-sgml-xml>
 - <URL:http://www.w3.org/TR/xhtml1>
We've gotten to a point where a human-readable, human-editable text format
for structured data has become a complex nightmare where somebody can safely
say "As many threads on xml-dev have shown, text-based processing of XML is
hazardous at best" and be perfectly valid in saying it. -- Tom Bradford