On Sep 7, 2009, at 9:24 AM, Christophe Rhodes wrote:
> John Fremlin <jf@...> writes:
>> On CCL the implementation checks for invalid code points by itself;
>> on SBCL, the encoder lets them through. I agree that it is nice to
>> deal with it, but the sensible default option is not `error'.
> I strongly disagree, and here's why: in a dynamic scope, with
> sufficiently expressive restarts (as I'm trying to provide in the
> patches you're not apparently commenting on) the programmer can choose
> the recovery strategy; when an `error' is not in fact the same as a
> program `crash', there is no particular need to fear it, and because
> "UTF-8" has a standardized meaning and specifies certain conditions as
> error situations, it seems reasonable to model those conditions as
> Common Lisp errors.
> As an example of a situation where one recovery strategy does not fit
> all, imagine a user deciding that, when reading files corresponding to
> source code, a decoding error while reading a string literal should
> cause Unicode replacement characters to be substituted, but a decoding
> error in other contexts should be an error that demands human
> intervention -- for a simpler example of that, consider how sbcl deals
> with decoding errors within comments.
> The exception, of course, is when presentation of error information or
> the error recovery strategies available would cause a further error:
> such as when attempting to write a string with a noncharacter in it to
> the same low-level stream as would be used for the debugger. As I said
> in the message you replied to, my aim is to provide external formats
> with, effectively, the recovery strategy predetermined for such cases.
> If the OUTPUT-REPLACEMENT restart I've implemented, along with the
> analogous INPUT-REPLACEMENT restart for decoding errors, is not
> sufficient to express most useful recovery strategies, then clearly I'm
> going down the wrong path. But I think it is sufficient for many
> purposes; for example, output of #\uFFFD for each encoding error is
>   (handler-bind ((encoding-error
>                   (lambda (c)
>                     (invoke-restart 'output-replacement #\uFFFD))))
>     ...)
Does the encoding-error condition include a slot with the erroneous
character?
Could we provide several characters as output-replacement?
Given a Lisp string, how could we output a mostly UTF-8 byte sequence,
but with some invalid codes interspersed (i.e., to reproduce the
original byte stream)?
It seems to me that in a number of situations, it would be desirable to
transmit the "error" in UTF-8 data transparently. One way to do so
would be to represent each invalid UTF-8 byte sequence as a sequence of
noncharacter code points (U+FDD0..U+FDEF) when reading and, of course,
to do the reverse transformation when writing, assuming these
noncharacter code points are Lisp CHARACTERs. Or better, some other Lisp
CHARACTER, if characters exist beyond the Unicode set.
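
A minimal sketch of such a round trip, under assumptions that are not
part of any implementation discussed here: the range U+FDD0..U+FDEF
holds only 32 code points, so one undecodable octet cannot map to a
single noncharacter. This hypothetical scheme spends two per octet, one
for each nibble. The names ESCAPE-OCTET and UNESCAPE-PAIR are invented
for illustration, and the code only works where CODE-CHAR returns a
CHARACTER for these code points.

```lisp
;; Hypothetical scheme: the high nibble of an undecodable octet maps into
;; U+FDD0..U+FDDF and the low nibble into U+FDE0..U+FDEF.

(defun escape-octet (octet)
  "Return a two-character string standing in for the undecodable OCTET."
  (let ((high (code-char (+ #xFDD0 (ldb (byte 4 4) octet))))
        (low  (code-char (+ #xFDE0 (ldb (byte 4 0) octet)))))
    ;; CODE-CHAR may return NIL if the implementation does not treat
    ;; noncharacter code points as CHARACTERs.
    (assert (and high low) ()
            "Noncharacter code points are not CHARACTERs here.")
    (coerce (list high low) 'string)))

(defun unescape-pair (high low)
  "Return the octet represented by the characters HIGH and LOW,
or NIL if they do not form an escape pair."
  (let ((h (- (char-code high) #xFDD0))
        (l (- (char-code low) #xFDE0)))
    (when (and (<= 0 h 15) (<= 0 l 15))
      ;; Put H back into bits 4..7 above the low nibble L.
      (dpb h (byte 4 4) l))))
```

A decoder would call ESCAPE-OCTET for each octet it cannot decode, and
an encoder would test adjacent characters with UNESCAPE-PAIR before
falling back to ordinary UTF-8 output. A real design would also have to
decide what to do with noncharacters already present in the input.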
Concerning the use of conditions, perhaps efficiency considerations
would call for a more proactive mechanism. For example, in clisp, the
handling of invalid code sequences may be specified in the encoding
structure (which can be used as external-format).
http://clisp.cons.org/impnotes/encoding.html#make-encoding
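
As a rough illustration of that approach (CLISP only; the variable name
and the file name below are made up for the example), the error actions
are fixed when the encoding object is created, so no handler needs to be
established around each read:

```lisp
;; CLISP-specific sketch: EXT:MAKE-ENCODING and the CHARSET package are
;; CLISP extensions and are not available in SBCL or CCL.
(defparameter *lenient-utf-8*
  (ext:make-encoding :charset charset:utf-8
                     ;; Substitute U+FFFD for each invalid byte sequence.
                     :input-error-action (code-char #xFFFD)
                     ;; Silently drop characters that cannot be encoded.
                     :output-error-action :ignore))

;; "data.txt" is a placeholder file name.
(with-open-file (in "data.txt" :external-format *lenient-utf-8*)
  (read-line in nil))
```

The trade-off is that the recovery strategy is chosen once per encoding
object rather than per dynamic context, as it would be with restarts.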