From: Hoehle, Joerg-C. <Joe...@t-...> - 2003-08-07 14:43:32
Sam wrote:
>> Really, I consider the 8/16/32 bit character arrays in CLISP a
>> problematic issue when interfacing to the foreign world.
> why?
> these strings can be converted to 32-bit string at any time you want.

Please explain. I know that CLISP will upgrade the strings (and fix all
pointers at the next GC, I guess), but I don't know the details. If my
hypothetical --with-unicode module gets passed an 8- or 16-bit string,
what should it do (and especially how)?
a- create (and later forget) a 32-bit version of that string?
b- create the 32-bit version and have the rest of CLISP forget the
   8/16-bit version, including the caller?

Also, I'm still unsure whether there wouldn't be an advantage in letting
programmers create strings of known width, e.g. for better interfacing
with UTF-16 (cf. (sys::string-info (make-string 1)) vs.
(make-array 1 :element-type 'character)).

In particular, I wonder if the following would be a useful thing to
have:
  (= 16 (sys::string-info (make-array x :element-type charset:utf-16)))
[An encoding is an acceptable argument to the array element type
according to the CLHS, and it works in CLISP.]

Once that is clear, creating the regexp module is trivial:
- get one of the newer regexp codes known to work not only on 8-bit
  characters;
- compile it with chartype=[unsigned]long (to hopefully get 32 bits);
- write a few adequate wrappers (a sketch follows my signature);
- verify that the module indeed works with 32-bit characters;
- measure and report the speed improvement (or decrease: a newer regexp
  may be slower than the current five-year-old one that has fewer
  features);
- compare against CL-PPCRE (speed and test cases).

Thanks for enlightenment,
Jorg Hohle.
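
P.S. To make option a- concrete, here is a minimal sketch of the
copy-in approach: build a temporary 32-bit buffer for the foreign call
and leave the original string (and its narrow representation) alone.
STRING-TO-UINT32 is a name I invented for illustration, not an existing
CLISP function:

  ;; Option a-, sketched: copy the characters into a fresh 32-bit
  ;; buffer and later just let that copy be garbage-collected.
  ;; CHAR-CODE yields the full Unicode code point no matter how
  ;; narrowly CLISP happens to store the string internally.
  (defun string-to-uint32 (s)
    (map '(vector (unsigned-byte 32)) #'char-code s))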
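
P.P.S. And a rough sketch of what the "adequate wrappers" could look
like with CLISP's FFI package, assuming the regexp library was indeed
recompiled with a 32-bit chartype and exports an entry point named
wregcomp (an invented name; the real one depends on which regexp code
we pick):

  ;; Hypothetical wrapper: the pattern is passed as a zero-terminated
  ;; array of 32-bit units; CLISP's FFI copies the Lisp vector out for
  ;; the call, i.e. exactly option a- above.
  (ffi:def-call-out wregcomp
    (:name "wregcomp")
    (:language :stdc)
    (:arguments (pattern (ffi:c-array-ptr ffi:uint32))
                (flags ffi:int))
    (:return-type ffi:c-pointer))  ; opaque compiled-pattern handle

  ;; usage sketch, reusing the helper from the P.S.:
  ;; (wregcomp (string-to-uint32 "a.*b") 0)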