From: <je...@fo...> - 2025-07-08 12:33:00
On 2025-07-07 20:27, John Selverian wrote:
> I've used Unicode successfully in the past. For instance for a
> subscripted 't' I used:
>
> UNICODE_SUBSCRIPT_t_SYMBOL = unescape("\\u209c");
>
> I want to now use a subscripted 'y' and found this:
>
> 1E05F
>
> CYRILLIC SUBSCRIPT SMALL LETTER U
>
> <sub> 0443 y
>
> But this code:
>
> unescape(\\u1E05F)
>
> Does not work. How do I encode a Unicode character with a 5 digit
> code?

Unescape needs to take in a so-called surrogate pair. A surrogate pair
is two 16-bit code units that together encode one code point above
U+FFFF. For historic reasons, Unicode started out with 16-bit wide
characters, but as these things go, the number of code points
eventually outgrew the 16-bit space, and surrogate pairs were needed to
encode characters that do not fit in 16 bits.

32-bit (UTF-32) character -> 16-bit (UTF-16) surrogate pair:
  CH = ((U - 0x10000) >> 10) + 0xD800
  CL = (U & 0x03FF) + 0xDC00

16-bit (UTF-16) surrogate pair -> 32-bit (UTF-32) character:
  U = (0x10000 - (0xD800<<10) - 0xDC00) + (CH<<10) + CL

The magic term in the above compensates for the 0x10000 that is
subtracted before splitting and for the two constants added to CH and
CL. Those constants place the pair in a range of the 16-bit code space
(0xD800-0xDFFF) that was not yet assigned at the time.

-- 
JVZ
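
As a cross-check of the conversion described above, here is a minimal Python sketch applied to the code point from the question, U+1E05F. The function names to_surrogates and from_surrogates are made up for this illustration and are not part of any tool discussed in this thread. Note that 0x10000 is subtracted before the high half is shifted; the low half is unaffected because 0x10000 has no bits set in its low 10 bits.

```python
def to_surrogates(u):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= u <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = u - 0x10000                  # 20-bit value left after removing the offset
    ch = (v >> 10) + 0xD800          # top 10 bits -> high surrogate
    cl = (v & 0x03FF) + 0xDC00       # low 10 bits -> low surrogate
    return ch, cl


def from_surrogates(ch, cl):
    """Combine a (high, low) surrogate pair back into one code point."""
    magic = 0x10000 - (0xD800 << 10) - 0xDC00   # undoes the offset and both constants
    return magic + (ch << 10) + cl


ch, cl = to_surrogates(0x1E05F)
print("\\u%04X\\u%04X" % (ch, cl))   # escape sequence for the pair
print(hex(from_surrogates(ch, cl)))  # round trip back to the code point
```

For U+1E05F this works out to the pair 0xD838, 0xDC5F. Assuming unescape accepts two consecutive \\u escapes the same way it accepts the single one in the working 't' example, the call would presumably be unescape("\\uD838\\uDC5F"); that has not been tested here.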