From: Robert J. M. <sf-...@ro...> - 2005-01-02 10:07:53
|
I've got an implementation of string->octets and octets->string which handles errors in UTF-8 encoding according to the UTF-8 decoder test at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Almost, anyway: it doesn't catch the UTF-16 surrogates or "other illegal code positions" from 5.1, 5.2, or 5.3, in order to ensure that octets->string is an exact inverse of string->octets with :external-format :utf-8. I'm not sure that's right, and it's easy to fix if necessary: just another check or two in the 3-byte path of bytes-per-utf8-character.

My patches to source files (build-order.lisp-expr and target-c-call.lisp) are at:
http://www.rojoma.com/octets/sbcl-octets-patch.diff
and the added "octets.lisp" file is at
http://www.rojoma.com/octets/octets.lisp

A few random thoughts:

Currently the implementation of %naturalize-utf8-string in target-c-call.lisp unconditionally uses a #xfffd character for replacing bad UTF-8 sequences (this is what Mozilla does), on the grounds that it'd be really annoying to have to wrap every call out to alienland in a handler-bind of some kind. Perhaps octet-{en,de}coding-error ought not to be errors at all, so user code can catch them if it feels like it, but if it chooses not to, one of the various possible "right things" is done?

I'm not sure it makes sense for external-format to be a non-required parameter, except for symmetry with streams.

I'd like to make the external representation of #\Newline part of the external-format.

--
Robert Macomber
sf-...@ro...
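The two ideas in the message above (the extra checks in the 3-byte path, and a decoding problem that is signalled without being an error) can be sketched together in portable Common Lisp. This is only an illustration under stated assumptions, not the code from octets.lisp: the names sketch-decoding-problem, decode-3-byte and decode-or-question-mark are hypothetical, and the rejected ranges are the surrogates U+D800..U+DFFF plus U+FFFE and U+FFFF, as listed in sections 5.1 to 5.3 of the test file.

```lisp
;; Hypothetical names throughout; this is not the code from octets.lisp.

(define-condition sketch-decoding-problem (condition)
  ((octets :initarg :octets :reader problem-octets))
  (:documentation "A plain CONDITION (not an ERROR) for a bad 3-byte sequence."))

(defparameter *replacement-code* #xFFFD
  "Code point substituted when no handler chooses something else.")

(defun decode-3-byte (b0 b1 b2)
  "Decode one 3-byte UTF-8 sequence to a code point.
Returns *REPLACEMENT-CODE* for a bad sequence unless a handler
invokes the USE-VALUE restart."
  (flet ((continuation-p (b) (= (logand b #xC0) #x80)))
    (let ((code (logior (ash (logand b0 #x0F) 12)
                        (ash (logand b1 #x3F) 6)
                        (logand b2 #x3F))))
      (if (and (= (logand b0 #xF0) #xE0)
               (continuation-p b1)
               (continuation-p b2)
               (>= code #x800)                ; reject overlong forms
               (not (<= #xD800 code #xDFFF))  ; reject UTF-16 surrogates
               (< code #xFFFE))               ; reject U+FFFE and U+FFFF
          code
          (restart-case
              (progn
                ;; SIGNAL returns NIL when no handler transfers control,
                ;; so an unhandled problem falls through to the default.
                (signal 'sketch-decoding-problem :octets (list b0 b1 b2))
                *replacement-code*)
            (use-value (value) value))))))

(defun decode-or-question-mark (b0 b1 b2)
  "Example caller that prefers #\\? over U+FFFD for bad sequences."
  (handler-bind ((sketch-decoding-problem
                   (lambda (condition)
                     (declare (ignore condition))
                     (use-value (char-code #\?)))))
    (decode-3-byte b0 b1 b2)))
```

Because the condition is not a subclass of ERROR and is raised with SIGNAL, a caller that installs no handler simply gets the replacement character, while a caller that cares can pick its own substitute. Note that adding the surrogate check means such sequences no longer round-trip, which is the trade-off the message describes.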
From: Robert J. M. <sf-...@ro...> - 2005-01-06 20:15:40
|
New version of the new .lisp file (the patch hasn't changed):
http://www.rojoma.com/octets/octets3.lisp

Differences:
  Each utf-8 decoding error is a different class of condition
  CHECK-TYPEs are gone
  Names changed to OCTETS-TO-STRING and STRING-TO-OCTETS
  Internal macros changed from MAKE-{internal function name} to DEFINE-{ifn}
  Added :null-terminate keyword argument to STRING-TO-OCTETS

--
Robert Macomber
sf-...@ro...
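For readers unfamiliar with the :null-terminate argument mentioned above, here is a small sketch of its effect. It assumes an SBCL in which STRING-TO-OCTETS is exported from the SB-EXT package, as it is in later releases; the package and exact signature at the time of this thread may have differed. The helper name encode-both-ways is hypothetical.

```lisp
;; Assumption: a later SBCL where SB-EXT:STRING-TO-OCTETS accepts
;; :EXTERNAL-FORMAT and :NULL-TERMINATE. ENCODE-BOTH-WAYS is a
;; hypothetical helper written only for this illustration.

(defun encode-both-ways (string)
  "Return two lists of octets: STRING as UTF-8 without and with
a trailing zero octet."
  (values
   (coerce (sb-ext:string-to-octets string :external-format :utf-8)
           'list)
   (coerce (sb-ext:string-to-octets string :external-format :utf-8
                                           :null-terminate t)
           'list)))

;; (encode-both-ways "abc")
;; first value:  (97 98 99)
;; second value: (97 98 99 0)
```

The extra zero octet is convenient when handing the vector to C code, but it changes the data: anything that hashes or compares the octets sees a different sequence.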
From: Christophe R. <cs...@ca...> - 2005-01-09 00:16:24
|
"Robert J. Macomber" <sf-...@ro...> writes:

> New version of the new .lisp file (the patch hasn't changed):
> http://www.rojoma.com/octets/octets3.lisp

Thank you. I've merged this, with a few extra hacks, into sbcl-0.8.18.21. You might want to review the diff between your octets3.lisp and what I eventually committed for possible problems.

> Added :null-terminate keyword argument to STRING-TO-OCTETS

Yeah. I'm not convinced this is right, for what it's worth, though I left it in. If it is, then I think that callers (such as SB-MD5:MD5SUM-STRING) should probably have such an argument too.

In any case, with this (and the adjustments I made to the sb-md5 contrib) we now have restored the lost functionality from before the Unicode merge. Now all we have to do is have less OAOOM in the code...

In any case, thank you again.

Cheers,

Christophe
From: Robert J. M. <sf-...@ro...> - 2005-01-09 18:44:11
Attachments:
octets-bugfix.diff
octets.pure.lisp
|
Attached are two files: a diff which fixes bugs in ascii and latin-9 conversion which prevented them from working at all and another couple in latin-9 and utf-8 which were more subtle (and which were both present in octets3.lisp), and a fairly comprehensive testfile to ensure that future changes don't regress in these or other areas.

Speaking of future changes, I think that getting rid of octets-to-string* and string-to-octets* would be a good idea. I originally put them in because they fell out of the "find destination length, then convert" design that I started with, but when I abandoned it they became nothing more than extra complexity and duplicated logic. `Non-consing' versions of o-t-s and s-t-o might be nice to have if in fact they were actually non-consing, but they're not.

--
Robert Macomber
sf-...@ro...
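The kind of regression check described above can be sketched as a round-trip test over the three encodings mentioned. This is not the attached octets.pure.lisp. It assumes a Unicode-enabled SBCL that exports STRING-TO-OCTETS and OCTETS-TO-STRING from SB-EXT and accepts :ascii, :latin-9 and :utf-8 as external format names; the function names round-trips-p and run-round-trip-checks are hypothetical.

```lisp
;; Assumption: a later, Unicode-enabled SBCL with SB-EXT:STRING-TO-OCTETS
;; and SB-EXT:OCTETS-TO-STRING. Function names here are hypothetical.

(defun round-trips-p (string external-format)
  "True if encoding STRING and decoding the result gives STRING back."
  (let ((octets (sb-ext:string-to-octets
                 string :external-format external-format)))
    (string= string
             (sb-ext:octets-to-string
              octets :external-format external-format))))

(defun run-round-trip-checks ()
  "Check a few strings against the encodings the bugfix touched."
  (let ((euro (string (code-char #x20AC))))   ; EURO SIGN
    (loop for (string format) in (list (list "hello" :ascii)
                                       (list "hello" :latin-9)
                                       (list euro :latin-9)
                                       (list euro :utf-8))
          always (round-trips-p string format))))
```

A fuller test file would also pin down the exact octets produced (the euro sign is the single octet #xA4 in latin-9 and the three octets #xE2 #x82 #xAC in UTF-8) and exercise the malformed-input paths, since a pure round-trip check cannot detect an encoder and decoder that are wrong in the same way.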