Thread: [Sbcl-devel] string/octets and alien string conversion

I've got an implementation of string->octets and octets->string which
handles errors in UTF-8 encoding according to the UTF-8 decoder test at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. Almost,
anyway: it doesn't catch the UTF-16 surrogates or "other illegal code
positions" from sections 5.1, 5.2, and 5.3, so that octets->string
remains an exact inverse of string->octets with :external-format :utf-8.
I'm not sure that's right, and it's easy to fix if necessary: just
another check or two in the 3-byte path of bytes-per-utf8-character.
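
For illustration only, here is a rough sketch of what that extra check
might look like. The function name decode-3-byte-sequence is hypothetical
and is not taken from octets.lisp; it assumes the three octets have
already been picked out of the input, and it simply returns NIL for
anything the decoder test would consider malformed.

```lisp
;; Hypothetical helper, not the code in octets.lisp: decode one 3-byte
;; UTF-8 sequence and reject the code points from sections 5.1-5.3 of
;; the decoder test.
(defun decode-3-byte-sequence (b0 b1 b2)
  "Return the code point encoded by octets B0 B1 B2, or NIL if invalid."
  (when (and (= (ldb (byte 4 4) b0) #b1110)   ; lead byte 1110xxxx
             (= (ldb (byte 2 6) b1) #b10)     ; continuation 10xxxxxx
             (= (ldb (byte 2 6) b2) #b10))    ; continuation 10xxxxxx
    (let ((code (logior (ash (ldb (byte 4 0) b0) 12)
                        (ash (ldb (byte 6 0) b1) 6)
                        (ldb (byte 6 0) b2))))
      (cond ((< code #x800) nil)              ; overlong encoding
            ((<= #xD800 code #xDFFF) nil)     ; UTF-16 surrogate (5.1, 5.2)
            ((<= #xFFFE code #xFFFF) nil)     ; U+FFFE and U+FFFF (5.3)
            (t code)))))
```

With the surrogate and U+FFFE/U+FFFF clauses in place, the decoder
rejects sequences such as ED A0 80 (U+D800), which is exactly the case
where octets->string would stop being an exact inverse of string->octets
for strings that contain those code points.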

My patches to source files (build-order.lisp-expr and target-c-call.lisp)
are at:
  http://www.rojoma.com/octets/sbcl-octets-patch.diff
and the added "octets.lisp" file is at
  http://www.rojoma.com/octets/octets.lisp

A few random thoughts:

Currently the implementation of %naturalize-utf8-string in
target-c-call.lisp unconditionally replaces bad UTF-8 sequences with
the #xfffd character (this is what Mozilla does), on the grounds that
it'd be really annoying to have to wrap every call out to alienland in
a handler-bind of some kind. Perhaps octet-{en,de}coding-error ought
not to be errors at all: user code could still catch them if it feels
like it, and if it chooses not to, one of the various possible "right
things" is done.

I'm not sure it makes sense for external-format to be a non-required
parameter, except for symmetry with streams.

I'd like to make the external representation of #\Newline part of the
external-format.
-- 
Robert Macomber
sf-...@ro...
