From: Hoehle, Joerg-C. <Joe...@t-...> - 2002-05-03 11:17:21
|
Sam Steingold wrote:
> take a lisp string.
> convert it to bytes using any encoding.
> append a NULL.

This is misdefined, as explained previously. A single 8bit 0 is not enough.

> process as a C (ASCII, 8bit byte) string. [what use for C is
> the #(0 0 0 65)?!]

That's multibyte at work. #\A can be represented using 16 or even 32bits. This example shows UNICODE-32-BIG-ENDIAN.

C/C++/Java/xyz code which does UTF-16/UCS-2 processes strings in terms of UINT16 (or UINT32). A terminating UINT16 0 is the natural extension to 16 bits of the old 8bit C 0-termination convention for strings. That's what I believe MS-Windows and all wide-char manipulating programs etc. pp. use. A single UINT8 0 is not enough, as the example shows.

In other words, such code does access in terms of (C-ARRAY-PTR UINT16), which expects a trailing 16bit 0.

C-STRING is equivalent to (C-ARRAY-PTR CHARACTER), somewhat to (C-ARRAY-PTR UINT8). That's suitable for classic 8bit strings. (C-ARRAY-PTR UINT16) is of a different kind, which necessitates own manipulating functions, e.g. strlenW().

What I'm still looking for is: how many 8bit zeroes to add to correctly 0-terminate my string and tell the actual byte-count to the programmer?

> It can't be that simple.

Sadly, min_bytes_per_char still looks like the solution - if the slots were correct for all encodings, which they are obviously not for UTF-16 (must be 2;2, but is 1;8 nonsense for UTF-16).

> you must strip the NULL before converting the foreign string back to a
> LISP string

I know, but here I'm still at Lisp->C, not the converse. But I'll get the same problem again if trying to implement LispWorks convert-from-foreign-string :null-terminated-p. The 0 to detect is 8, 16 or 32 bits wide...

  (CONVERT-FROM-FOREIGN-STRING #<foreign 0 0 0 65 0 0 0 0)>
    :encoding charset:ucs-4 :null-terminated-p T)

must yield "A", not "" (because 8bit zero was found first). That's the character equivalent of (C-ARRAY-PTR UINT32).

Regards,
Jorg Hohle.
|
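To make the terminator-width point concrete, here is a minimal C sketch. The helper name `unit_strlen` is hypothetical (it is not part of CLISP, LispWorks, or any C library); it simply counts code units up to the first all-zero unit of the given width, assuming the buffer really does contain such a terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: count the code units that
   precede the first all-zero unit. A unit is `unit` bytes wide (1, 2 or 4),
   and units are read at offsets that are multiples of `unit`.
   Precondition: the buffer contains an all-zero unit. */
size_t unit_strlen(const unsigned char *buf, size_t unit)
{
    assert(unit == 1 || unit == 2 || unit == 4);
    size_t n = 0;
    for (;;) {
        unsigned nonzero = 0;
        for (size_t i = 0; i < unit; i++)
            nonzero |= buf[n * unit + i];   /* any non-zero byte in this unit? */
        if (!nonzero)
            return n;                       /* found the terminator unit */
        n++;
    }
}
```

With the bytes 0 0 0 65 0 0 0 0 from the example, scanning in 4-byte units finds one character before the terminator, while scanning in 1-byte units (what plain strlen() does) stops at the very first byte and reports an empty string.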
From: Sam S. <sd...@gn...> - 2002-05-03 13:05:49
|
> * In message <DFD875E85664D3118FA6080006277DE705822BDD@U8PN2.blf01.telekom.de>
> * On the subject of "AW: [clisp-list] help about [ clisp-Bugs-550603 ] UTF-16 min/max_bytes_per_char must be 2"
> * Sent on Fri, 3 May 2002 13:17:04 +0200
> * Honorable "Hoehle, Joerg-Cyril" <Joe...@t-...> writes:
>
> What I'm still looking for is: how many 8bit zeroes to add to
> correctly 0-terminate my string and tell the actual
> byte-count to the programmer?

play safe: append max_bytes_per_char (8) NULL bytes, give the character count as the length of the original LISP string +1.

> > It can't be that simple.
> Sadly, min_bytes_per_char still looks like the solution - if
> the slots were correct for all encodings, which they are obviously
> not for UTF-16 (must be 2;2, but is 1;8 nonsense for UTF-16).

this cannot be done properly for iconv-based encodings since there is no way to query libiconv for this information. I think these are set correctly for built-in encodings. Bruno?

--
Sam Steingold (http://www.podval.org/~sds) running RedHat7.2 GNU/Linux
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.palestine-central.com/> <http://www.mideasttruth.com/>
Lisp: it's here to save your butt.
|
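The "play safe" suggestion can be sketched in C as follows. This is only an illustration under stated assumptions: `copy_with_zero_padding` is a hypothetical name, and the pad width is whatever the caller passes in (the message above suggests 8, the reported max_bytes_per_char).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, for illustration only: copy `len` already-encoded
   bytes and append `pad` zero bytes after them. Returns a malloc'ed buffer
   of len + pad bytes that the caller must free, or NULL on allocation
   failure. */
unsigned char *copy_with_zero_padding(const unsigned char *src,
                                      size_t len, size_t pad)
{
    assert(pad > 0);                 /* at least one terminating byte */
    unsigned char *dst = malloc(len + pad);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len);           /* the encoded string itself */
    memset(dst + len, 0, pad);       /* over-wide zero terminator */
    return dst;
}
```

As long as the encoded data is a whole number of code units and `pad` is at least as wide as one unit, the first unit after the data is all zeroes, so code that scans in 8, 16 or 32 bit units finds a terminator. The cost is that the buffer is longer than strictly necessary, which is why the exact terminator width is still worth knowing.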
From: Bruno H. <ha...@il...> - 2002-05-06 11:17:46
|
Sam writes:
> > Sadly, min_bytes_per_char still looks like the solution - if
> > the slots were correct for all encodings, which they are obviously
> > not for UTF-16 (must be 2;2, but is 1;8 nonsense for UTF-16).
>
> this cannot be done properly for iconv-based encodings since there is
> no way to query libiconv for this information.

There is no direct way, but an indirect one that works for all encodings except UTF-7 is as follows.

(defun number-of-zero-bytes (encoding)
  (if (eq encoding charset:utf-7)
      1
      (- (length (convert-string-to-bytes
                   (coerce (list #\Null #\Null) 'string) encoding))
         (length (convert-string-to-bytes
                   (coerce (list #\Null) 'string) encoding)))))

UTF-7 is special: When you zero-terminate an UTF-7 encoded string, it is not valid UTF-7 any more. Just forget about this obsolete encoding. In the function I return 1 for it because this is what most people will expect.

Bruno
|