From: Arthur N. <ac...@ca...> - 2014-03-20 12:16:22
I just checked in a bunch of updates that I believe work on both the PSL and CSL versions and that provide what is, at present, limited support for characters outside simple ASCII.

Source files should still all be in simple ASCII, and for a while should not contain characters whose codes are over 0x7f, i.e. accented characters, currency marks etc. However there is now a notation #word; (where word is an HTML5 name for an entity) or #hexdigits; or #Udecimaldigits; that lets you express a reasonable range of Unicode characters. The code packs these in utf-8, and they then go into Lisp-level strings and symbol names that way. When items are displayed via a terminal that supports utf-8 and that has a suitable font in use you get nice displays. As a concrete example I can go

    (#alpha; + #beta;)^3;

One thing to be aware of is that the above special treatment of "#" happens ahead of the use of "!" as an escape character, so !#alpha; is an escaped alpha character, not an escaped hash followed by the word alpha. The code in packages/rlisp/tok.red has rather more commentary.

There are a lot of very substantial limitations and dangers in trying to use this at present. But people who never write a "#" followed by characters that form a recognised word or a hex or decimal number and then a ";" should never be hurt. So in case of worry, please put whitespace next to the "#" or before the ";" to avoid this.

Here are some of the special problems, in no particular order. Some of them can only be fixed by work within the Lisp systems...

(1) PSL just plain crashes if you use EXPLODE on a symbol or string that contains characters whose codes are too large. There are functions like id2list and wideid2list (in tok.red) that may sometimes be useful alternatives.

(2) If you use prin2 on an item with utf-8 encoded data all is well, but print may insert exclamation marks before each byte of a multibyte utf-8 sequence, thus messing things up.

(3) Position across the line (posn and linelength) can be messed up by utf-8 sequences because bytes are counted rather than characters, so output will end up badly formatted.

(4) The CSL gui does not understand utf-8 at all, so will not display things at all nicely. The same goes for parts of reduce that try to generate TeX.

(5) As well as EXPLODE being an issue, COMPRESS will be. All uses of it will really need review.

(6) utf-8 input from files is not supported.

======

Right now I have tested this at least a little with CSL on Windows (using a cygwin terminal), Linux and Mac. I have tried PSL on Windows and Linux, but for reasons that at present look impossible the build on a Mac fails. Specifically, if I trace id2string I see

    *** Function `token' has been redefined
    *** (token): base 16#100351970, length 10#21 bytes
    token< id2string being entered a1: ! >< id2string = 4295002707

and id2string hands back something that is not a string. Its argument ought to be either a letter "s" or a newline, I think!

======

An issue this all introduces is that strings (and by extension the names of identifiers) become things that can be considered either as sequences of bytes or as sequences of characters. Everything that looks inside them may need two variants to cope with the two interpretations!

Arthur
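
As a rough illustration of the #...; notation and of the bytes-versus-characters point, here is a minimal sketch in Python. It is not the code in tok.red. The regular expression, the helper names expand and _decode, and the order in which the three forms are tried (entity name first, then U-decimal, then hex) are all assumptions made for the example; the real tokenizer may resolve ambiguous cases differently.

```python
import re
import string
from html.entities import html5  # maps names such as "alpha;" to characters

# Hypothetical pattern for the "#...;" notation; the real tokenizer may differ.
_ESCAPE = re.compile(r"#([A-Za-z0-9]+);")


def _decode(match):
    """Turn one #...; item into a character, or leave it untouched."""
    name = match.group(1)
    if name + ";" in html5:                            # #word;
        return html5[name + ";"]
    try:
        if name[0] == "U" and name[1:].isdigit():      # #Udecimaldigits;
            return chr(int(name[1:]))
        if all(c in string.hexdigits for c in name):   # #hexdigits;
            return chr(int(name, 16))
    except (ValueError, OverflowError):
        pass                                           # code point out of range
    return match.group(0)                              # not recognised


def expand(text):
    """Expand every recognised #...; item in text (assumed precedence, see above)."""
    return _ESCAPE.sub(_decode, text)


s = expand("(#alpha; + #beta;)^3;")
print(s)
# The same text has two lengths: 10 characters but 12 bytes once packed
# as utf-8, which is why a column counter that counts bytes drifts.
print(len(s), len(s.encode("utf-8")))
# Whitespace next to the "#" or before the ";" stops the item being recognised.
print(expand("# alpha ;"))
```

In this sketch #3b1; and #U945; both give the same alpha as #alpha;, and the gap between the two lengths is the kind of mismatch described in (3) and in the closing paragraph above.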