Thread: AW: [clisp-list] help about [ clisp-Bugs-550603 ] UTF-16 min/max_bytes_per_char must be 2


Sam Steingold wrote:
> take a lisp string.
> convert it to bytes using any encoding.
> append a NULL.
This is ill-defined, as explained previously: a single 8-bit 0 is not enough.

> process as a C (ASCII, 8bit byte) string. [what use for C is 
> the #(0 0 0 65)?!]
That's multibyte at work. #\A can be represented using 16 or even 32bits.
This example shows UNICODE-32-BIG-ENDIAN.

C/C++/Java/etc. code that handles UTF-16/UCS-2 processes strings
in terms of UINT16 (or UINT32). A terminating UINT16 0 is
the natural extension to 16 bits of the old 8-bit C 0-termination
convention for strings. I believe that is what MS-Windows and other
wide-char manipulating programs use.
A single UINT8 0 is not enough, as the example shows.

In other words, such code does access in terms of (C-ARRAY-PTR UINT16),
which expects a trailing 16bit 0.
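
To make the point concrete, here is a minimal C sketch. The names strlen_u16, sample_u16 and byte_strlen_of_sample are made up for illustration (strlen_u16 is only similar in spirit to strlenW(), not a real library function). It shows that code working on 16-bit units looks for a 16-bit 0, while a byte-oriented strlen() stops at the first zero byte inside the first character:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count 16-bit code units up to, but not
   including, the terminating 16-bit zero. */
static size_t strlen_u16(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* "AB" as 16-bit code units, followed by a 16-bit terminator. */
static const uint16_t sample_u16[] = { 0x0041, 0x0042, 0x0000 };

/* What an 8-bit strlen() sees when handed the same memory.
   It stops at the first zero byte, which is part of the first
   character (its high byte or its low byte, depending on the
   byte order of the machine). */
static size_t byte_strlen_of_sample(void)
{
    return strlen((const char *)sample_u16);
}
```

strlen_u16(sample_u16) counts both characters, whereas byte_strlen_of_sample() gives 0 or 1 depending on byte order, never the real length.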

C-STRING is equivalent to (C-ARRAY-PTR CHARACTER), and roughly
to (C-ARRAY-PTR UINT8). That's suitable for classic 8-bit strings.
(C-ARRAY-PTR UINT16) is of a different kind, which requires its
own manipulating functions, e.g. strlenW().

What I'm still looking for is: how many 8-bit zeroes must I add to
correctly 0-terminate my string, and how do I report the actual
byte count to the programmer?
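
As a sketch of what I mean, assuming the width of one code unit (1, 2 or 4 bytes) is known for the encoding: append that many zero bytes and report the total. terminate_encoded and its parameters are hypothetical names, not CLISP internals, and the caller is assumed to free() the result:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` already-encoded bytes and append
   `unit` zero bytes, where `unit` is the assumed width of one code
   unit in the target encoding (1, 2 or 4).
   Returns a malloc'ed buffer (NULL on allocation failure) and stores
   the total byte count, terminator included, in *total. */
static unsigned char *terminate_encoded(const unsigned char *bytes,
                                        size_t len, size_t unit,
                                        size_t *total)
{
    assert(unit == 1 || unit == 2 || unit == 4);
    unsigned char *out = malloc(len + unit);
    if (out == NULL)
        return NULL;
    if (len > 0)
        memcpy(out, bytes, len);
    memset(out + len, 0, unit);   /* the terminator is `unit` bytes wide */
    *total = len + unit;
    return out;
}
```

For the #(0 0 0 65) example with unit 4 this yields 8 bytes in total: the 4 bytes of #\A plus a 32-bit 0.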

> It can't be that simple.
Sadly, min_bytes_per_char still looks like the solution, provided
the slots were correct for all encodings, which they obviously are
not for UTF-16 (they must be 2;2, but are a nonsensical 1;8).

> you must strip the NULL before converting the foreign string back to a
> LISP string
I know, but here I'm still at Lisp->C, not the converse.
But I'll get the same problem again when trying to implement LispWorks' convert-from-foreign-string :null-terminated-p. The 0 to detect is 8, 16 or 32 bits wide...
(CONVERT-FROM-FOREIGN-STRING #<foreign 0 0 0 65 0 0 0 0> :encoding charset:ucs-4 :null-terminated-p T) must yield "A", not "" (which is what you get if an 8-bit zero is found first).
That's the character equivalent of (C-ARRAY-PTR UINT32).
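
The detection side could look roughly like the following C sketch. terminated_length is a hypothetical name, and the `limit` argument is my own assumption to keep the scan bounded; a real foreign string would carry no such limit. The scan advances one code unit at a time and only accepts a unit whose bytes are all zero:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the number of bytes that precede the
   first all-zero code unit of width `unit` (1, 2 or 4 bytes).
   The scan steps in whole units, so zero bytes inside a character
   are not mistaken for the terminator. If no terminator is found
   within `limit` bytes, the number of bytes scanned is returned. */
static size_t terminated_length(const unsigned char *p, size_t limit,
                                size_t unit)
{
    assert(unit == 1 || unit == 2 || unit == 4);
    size_t i = 0;
    while (i + unit <= limit) {
        size_t k = 0;
        while (k < unit && p[i + k] == 0)
            k++;
        if (k == unit)      /* every byte of this unit is zero */
            return i;
        i += unit;
    }
    return i;
}
```

On the bytes 0 0 0 65 0 0 0 0 a unit width of 4 gives 4 bytes, i.e. the single character "A", while a unit width of 1 gives 0 bytes, i.e. the wrong "".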

Regards,
	Jorg Hohle.