From: Pascal J. B. <pj...@in...> - 2009-09-09 00:48:58
On Sep 7, 2009, at 9:24 AM, Christophe Rhodes wrote:

> John Fremlin <jf...@ms...> writes:
>
>> On CCL the implementation checks for invalid code points by itself but
>> on SBCL, the encoder lets them through. I agree that it is nice to
>> deal with it, but the sensible default option is not `error'.
>
> I strongly disagree, and here's why: in a dynamic scope, with
> sufficiently expressive restarts (as I'm trying to provide in the
> patches you're not apparently commenting on) the programmer can specify
> the recovery strategy; when an `error' is not in fact the same as a
> program `crash', there is no particular need to fear it, and because
> "UTF-8" has a standardized meaning and specifies certain conditions as
> error situations, it seems reasonable to model those conditions as
> Common Lisp errors.
>
> As an example of a situation where one recovery strategy does not fit
> all, imagine a user deciding that, when reading files corresponding to
> source code, a decoding error while reading a string literal should
> cause Unicode replacement characters to be substituted, but a decoding
> error in other contexts should be an error that demands human
> intervention -- for a simpler example of that, consider how sbcl deals
> with decoding errors within comments.
>
> The exception, of course, is when presentation of error information and
> the error recovery strategies available would cause a further error:
> such as when attempting to write a string with a noncharacter in it to
> the same low-level stream as would be used for the debugger. As I said
> in the message you replied to, my aim is to provide external formats
> with, effectively, the recovery strategy predetermined for such cases.
>
> If the OUTPUT-REPLACEMENT restart I've implemented, along with the
> analogous INPUT-REPLACEMENT restart for decoding errors, is not
> sufficient to express most useful recovery strategies, then clearly I'm
> going down the wrong path.
> But I think it is sufficient for many purposes; for example, output
> of #\uFFFD for each encoding error is
>
> (handler-bind ((encoding-error
>                 (lambda (c)
>                   (invoke-restart 'output-replacement #\uFFFD))))
>   ...)

Does the encoding-error condition include a slot with the erroneous
code sequence?

Could we provide several characters as output-replacement?

Given a Lisp string, how could we output mostly a UTF-8 byte sequence,
but with some invalid codes interspersed (i.e. to reproduce the
original byte stream)?

It seems to me that in a number of situations, it would be desirable to
transparently transmit the "error" in UTF-8 data. One way to do so
would be to encode invalid UTF-8 byte sequences as a sequence of
noncharacter code points (U+FDD0..U+FDEF) when reading, and of course,
to do the reverse transformation when writing, assuming these
noncharacter code points are Lisp CHARACTERs. Or better, some other
Lisp CHARACTER, if there exist characters beyond the Unicode set.

Concerning the use of conditions, perhaps efficiency considerations
would call for a more proactive mechanism. For example, in CLISP, the
handling of invalid code sequences may be specified in the encoding
structure (which can be used as external-format).

http://clisp.cons.org/impnotes/encoding.html#make-encoding

-- 
__Pascal Bourguignon__
http://www.informatimago.com
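
To make the round-trip idea concrete, here is a minimal sketch in
portable Common Lisp. It is not the patch under discussion: every name
in it (SKETCH-DECODING-ERROR, DECODE-UTF-8, ENCODE-UTF-8,
ESCAPE-INVALID-OCTETS, ROUND-TRIP) is hypothetical, and the restart
only borrows the INPUT-REPLACEMENT name from the thread. It assumes the
condition carries the offending octets in a slot and that the restart
accepts a list of replacements rather than a single one. Code points
are plain integers, so nothing depends on whether an implementation
treats noncharacters as CHARACTERs. Since U+FDD0..U+FDEF holds only 32
code points, each invalid octet is assumed to be escaped as two of
them, one per nibble.

```lisp
;;; Sketch only: all names are hypothetical, not SBCL's or CLISP's API.

(define-condition sketch-decoding-error (error)
  ((octets :initarg :octets :reader sketch-decoding-error-octets))
  (:report (lambda (condition stream)
             (format stream "Invalid UTF-8 octets: ~S"
                     (sketch-decoding-error-octets condition)))))

(defun utf-8-length (cp)
  "Number of octets in the shortest UTF-8 encoding of code point CP."
  (cond ((< cp #x80) 1) ((< cp #x800) 2) ((< cp #x10000) 3) (t 4)))

(defun decode-one (octets start)
  "Return the code point strictly encoded at START and its octet count, or NIL."
  (let* ((lead (aref octets start))
         (len (cond ((< lead #x80) 1)
                    ((<= #xC2 lead #xDF) 2)
                    ((<= #xE0 lead #xEF) 3)
                    ((<= #xF0 lead #xF4) 4))))
    (when (and len (<= (+ start len) (length octets)))
      (let ((cp (logand lead (ecase len (1 #x7F) (2 #x1F) (3 #x0F) (4 #x07)))))
        (loop for i from (1+ start) below (+ start len)
              for octet = (aref octets i)
              do (if (= (logand octet #xC0) #x80)
                     (setf cp (logior (ash cp 6) (logand octet #x3F)))
                     (return-from decode-one nil)))
        ;; Reject overlong forms, out-of-range values and surrogates. The
        ;; escape range itself is rejected too, so that a validly encoded
        ;; U+FDD0..U+FDEF in the input cannot be mistaken for an escape.
        (when (and (= len (utf-8-length cp))
                   (<= cp #x10FFFF)
                   (not (<= #xD800 cp #xDFFF))
                   (not (<= #xFDD0 cp #xFDEF)))
          (values cp len))))))

(defun decode-utf-8 (octets)
  "Decode the vector OCTETS into a list of code points.
An octet that does not start a valid sequence is signalled as a
SKETCH-DECODING-ERROR with an INPUT-REPLACEMENT restart available."
  (let ((result '()) (i 0))
    (loop while (< i (length octets))
          do (multiple-value-bind (cp len) (decode-one octets i)
               (cond (cp (push cp result) (incf i len))
                     (t (restart-case
                            (error 'sketch-decoding-error
                                   :octets (list (aref octets i)))
                          (input-replacement (code-points)
                            (dolist (r code-points) (push r result))))
                        (incf i)))))
    (nreverse result)))

(defun escape-invalid-octets (condition)
  "Handler: replace each offending octet by two noncharacters, one per nibble."
  (invoke-restart 'input-replacement
                  (loop for octet in (sketch-decoding-error-octets condition)
                        collect (+ #xFDD0 (ash octet -4))
                        collect (+ #xFDE0 (logand octet #x0F)))))

(defun encode-code-point (cp)
  "Return the UTF-8 octets of code point CP as a list."
  (let ((len (utf-8-length cp)))
    (if (= len 1)
        (list cp)
        (cons (logior (ecase len (2 #xC0) (3 #xE0) (4 #xF0))
                      (ash cp (* -6 (1- len))))
              (loop for shift from (* -6 (- len 2)) to 0 by 6
                    collect (logior #x80 (logand #x3F (ash cp shift))))))))

(defun encode-utf-8 (code-points)
  "Encode the list CODE-POINTS; an escape pair becomes its original octet."
  (let ((octets '()))
    (loop while code-points
          do (let* ((cp (pop code-points))
                    (next (first code-points)))
               (cond ((and next (<= #xFDD0 cp #xFDDF) (<= #xFDE0 next #xFDEF))
                      (pop code-points)
                      (push (+ (ash (- cp #xFDD0) 4) (- next #xFDE0)) octets))
                     (t (dolist (o (encode-code-point cp)) (push o octets))))))
    (coerce (nreverse octets) 'vector)))

(defun round-trip (octets)
  "Decode OCTETS with invalid octets escaped, then encode the result again."
  (encode-utf-8
   (handler-bind ((sketch-decoding-error #'escape-invalid-octets))
     (decode-utf-8 octets))))
```

With this handler bound, (round-trip (vector 72 #xFF #xC3 #xA9 #x80))
should give back the same five octets, while calling DECODE-UTF-8 on
them with no handler signals the error as usual. The price of the
scheme is that a validly encoded U+FDD0..U+FDEF in the input has to be
escaped as well, which is one reason a character outside the Unicode
set would be a better carrier where the implementation offers one.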