On Sun, 13 Feb 2011 04:41:27 -0500
Matthew Mondor <mm_lists@...> wrote:
> Yes, I think that supporting that encoding would be very easy too. The
> only possibly tricky part is for users of that encoding to output, as
> necessary, a more conventional UTF-8 stream to some streams, such as
> for display, possibly with bad sequences converted to Latin-1. But it
> could read data from a UTF-8B external-format stream and write it back
> to another UTF-8B stream and be sure that the original data was
> transparently copied as-is, without being bothered by decoding/encoding
> errors on streams with that external format.
> I'm not sure whether ECL should itself treat those invalid octets
> transparently as LATIN-1 when doing output on a UTF-8
> external-format stream, however. It's possible that without this, some
> problems would occur in the debugger, SLIME, etc., which would be
> presented with invalid UTF-8 characters in the UTF-16 surrogate range.
So I had some time tonight and wanted to write a test implementation.
However, there is indeed a problem at decoding time: several bytes in a
row might be invalid octets, in which case several UTF-16 surrogates
must be used to represent the multiple literal octets.
You said that the streams lacked push/pop buffers, and it seems that at
least a minimal one would be necessary to implement this (i.e. the
decoding_error() function would, instead of signaling an error with the
octets, insert them into the push-back buffer if the stream has a
UTF-8B external format). The stream reading routine would then have to
drain that buffer before decoding more characters...
My first idea was to replace the ecl_read_byte8(stream, buffer+1,
nbytes) < nbytes call with one that reads a single byte in the
following loop, but that would still lose bytes if the second, third,
etc. octet was invalid, and could only really return a character for
the last one.
Attached is the attempt, but it's by no means complete.
Thanks and good night,