>could somebody please comment whether I'm doing something wrong here?
>o (setq custom:*foreign-encoding* charset:utf-16)
Further tests reveal that
1. The FFI does not seem to have received much testing with encodings like UTF-16, whereas encodings for which strlen() basically works seem fine (typically those where the 8th bit signals a multibyte sequence and the pure 7-bit codes are identical to ASCII).
2. Furthermore, with UTF-16 etc., programmers might get unexpected results because of the BOM that CLISP takes pleasure in adding in front of the bytes.
-- room(s) <> 2*(LENGTH s) because of the BOM.
I haven't seen anything in impnotes saying that CLISP might be restricted to these strlen()-compatible encodings.
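To make the two observations above concrete, here is a minimal C sketch. The helper names are hypothetical and are not taken from CLISP; the buffer is simply "AB" written out by hand as UTF-16LE with a leading BOM.

```c
#include <stddef.h>
#include <string.h>

/* "AB" as UTF-16LE, preceded by a BOM and followed by a two-byte
   zero terminator (hand-written bytes, not produced by CLISP). */
static const unsigned char utf16le_ab[] = {
    0xFF, 0xFE,   /* BOM */
    0x41, 0x00,   /* 'A' */
    0x42, 0x00,   /* 'B' */
    0x00, 0x00    /* terminator */
};

/* What a strlen()-based conversion sees: the number of bytes
   before the first zero byte, wherever that byte happens to be. */
size_t bytes_seen_by_strlen(const unsigned char *buf) {
    return strlen((const char *)buf);
}

/* Bytes occupied by nchars 16-bit characters, excluding the
   terminator: 2*nchars, plus 2 more when a BOM is prepended. */
size_t utf16_payload_bytes(size_t nchars, int with_bom) {
    return 2 * nchars + (with_bom ? 2 : 0);
}
```

On this buffer strlen() stops after 3 bytes (FF FE 41), in the middle of the first character, while the payload actually occupies 6 bytes rather than 2*2 = 4 because of the BOM.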
WITH-FOREIGN-STRING is specified to support all encodings. That's when I had to check those unreliable encoding_min/max_bytes and introduce encoding_zeroes() etc. (remember?). I would have expected at least similar complications when making the rest of the FFI support those.
Is there some reason, or a mnemonic, to understand why Linux mblen() says it returns a number of bytes, while Encoding_mblen() yields a count of characters?
Distinctions such as these give me headaches when reviewing foreign.d w.r.t. encoding issues, and I'm quite fed up with it.
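For illustration only, the two conventions can be put side by side for UTF-8. Both functions below are hypothetical helpers written for this message, not the libc or CLISP implementations: the first answers "how many bytes does the next character take" (the mblen() style), the second answers "how many characters are in this byte range" (the style described above for Encoding_mblen()).

```c
#include <stddef.h>

/* Bytes in the UTF-8 sequence introduced by lead byte b, judged
   from its high bits only (no validation of the whole sequence).
   Returns 0 for a continuation byte or an impossible lead byte. */
size_t utf8_seq_bytes(unsigned char b) {
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII      */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx             */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx             */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx             */
    return 0;
}

/* Characters in the first nbytes bytes of s: every byte that is
   not a continuation byte (10xxxxxx) starts one character. */
size_t utf8_char_count(const unsigned char *s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the four bytes 61 C3 A9 62 the first convention reports 1, 2 and 1 bytes for the three lead bytes, while the second reports 3 characters for the whole range.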
So far, all I have found is that:
0. lispbibl.d and encoding.d aren't explicit enough about the precise interface to the encoding functions. As a result I waste time (and I remember having spent a lot of time when I implemented with-foreign-string just to ensure that I got all those mbs/wbs unpack_string_alooca_and_or_ro() right).
1. conversion from c-string outside of 1:1 is bogus because it defers to asciz_to_string(), which assumes a single 0 byte terminator.
2. There seems to be no wmbsh*t_len() that works on unbounded buffers the way strlen() does; they all expect a buffer limit. Of course, one could throw in an artificial max_array_or_string_index_limit * sizeof(character/byte), or what would you suggest?
3. I'm convinced that what I reported a few days ago under this subject are bugs in CLISP, not on my side.
4. It looks like ENCODING-ZEROES raises its head again.
5. Besides C-STRING, C-ARRAY-MAX also depends on a correct discovery of the end of a string. I believe convert_from_foreign:c_array_max to be broken for strings because of this (or at least, it does not do what I would expect).
(c-array-max #([ff fe] 65 0 66 0 67 0 68 0 0 0):utf-16) -> "ABCD"
Currently, it gives an error.
6. Conversion from c-array-ptr is broken for the same reasons.
7. Once the FFI correctly supports arbitrary encodings, the phrase "must be an ASCII-compatible encoding" should be removed from impnotes:with-foreign-string.
8. A work-around (= status quo) may be to declare that the FFI does not support arbitrary encodings, but only ASCII-compatible ones...
UTF-8 is in, UTF-16 is out.
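
As a rough sketch of what points 1, 2 and 5 ask for, the function below finds the end of a string in an unbounded buffer when the terminator is wider than one byte. It is a hypothetical helper, not code from foreign.d; its `unit` parameter only stands in for the per-encoding terminator width that something like encoding_zeroes() would supply, and the real signature of that function is not assumed here.

```c
#include <stddef.h>

/* Byte length of a string that ends with `unit` consecutive zero
   bytes starting on a multiple of `unit` (unit must be >= 1).
   unit = 1 behaves like strlen(); unit = 2 suits UTF-16, unit = 4
   suits UTF-32. The terminator itself is not counted.
   Like strlen(), this trusts that a terminator exists and will
   read past the buffer if it does not. */
size_t terminated_byte_length(const unsigned char *buf, size_t unit) {
    size_t off = 0;
    for (;;) {
        size_t k = 0;
        /* Count leading zero bytes of the code unit at offset off. */
        while (k < unit && buf[off + k] == 0)
            k++;
        if (k == unit)
            return off;      /* a whole code unit of zeros: the end */
        off += unit;         /* otherwise step to the next code unit */
    }
}
```

With unit = 2, the byte sequence from point 5 (ff fe 65 0 66 0 67 0 68 0 0 0) yields 10 bytes: the BOM plus four 16-bit characters. With unit = 1, the same buffer is cut off after 3 bytes, which is the strlen() behaviour complained about above.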