On Sun, Aug 28, 2011 at 11:53 AM, Matthew Mondor <mm_lists@pulsar-zone.net> wrote:
> On Sun, 28 Aug 2011 11:40:29 +0200
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll@googlemail.com> wrote:
>
> > What is the use of writing bytes to a string? In many cases you may end up
> > with corrupt sequences, since the bytes produced by an external format do not
> > need to correspond to valid strings in the latin-1 and ucs4 formats which
> > are used internally by ECL.
>
> What I would have expected is for bytes to be decoded to characters as
> per the specified external-format, just like when reading from a file
> or network stream.  This way, the ECL unicode encoders and decoders are
> no longer a black box and no external resources are needed when custom
> encoders/decoders are used in user code...
>
> For instance in this case, URLs are encoded as UTF-8 octets with % HEX
> HEX for every needed and non-ASCII byte.  I then could perform the
> needed decoding to bytes in a custom function and read the bytes as
> UTF-8 characters.
>
> Of course, if that worked, any decoding error would be expected to also
> signal an error just like when reading bytes from a file or socket.

I think we have two different models in mind. This is how I see it:

* READ/WRITE-BYTE do not have external formats, period. They are binary I/O functions and the only customizable things they have are the word size and the endianness; they do not know about characters.

* Binary sequence streams are streams built on arrays of integers. They are interpreted as a collection of octets. The way these octets are handled is determined by two things:
   - The array element type influences the byte size of READ/WRITE-BYTE.
   - The external format determines how to read characters *from the octets*, independently of the byte size.
This means that if you construct a binary sequence stream you will have to pay attention to the endianness of your data!

* Common Lisp strings have a fixed external format: latin-1 and unicode in ECL. This cannot be changed and I do not want to change it in the foreseeable future. Consequently, my code did not contemplate reinterpreting strings with different external formats. I still feel uneasy about this idea, because needing it is usually a sign that you got your data wrong. Nevertheless, I have made the following changes.

- If no external format is supplied, a sequence stream based on a string works just like a string stream. In that case you can stop reading here.

- Otherwise, if the string is a base-char string, it works like a binary stream with a byte size of 8 bits. This means it can be used for converting to and from different external formats by reinterpreting the same characters as octets.

- If the string contains extended characters, this fact is ignored and the string is interpreted as if it contained just 8-bit characters for external encodings. This means that you can now recode strings regardless of whether they are stored as Unicode or base-char strings.
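
The binary sequence stream model described above can be sketched as follows. This is only an illustration under stated assumptions: *octets*, octets-as-bytes and octets-as-characters are hypothetical names, and the two functions are ECL-specific because they rely on ext:make-sequence-input-stream with the same keyword used in the examples in this message.

```lisp
;; Hypothetical sample data: the UTF-8 encoding of "é" (#xC3 #xA9)
;; followed by "A" (#x41).
(defparameter *octets*
  (make-array 3 :element-type '(unsigned-byte 8)
                :initial-contents '(#xC3 #xA9 #x41)))

;; Binary view (hypothetical helper): no external format is given,
;; so READ-BYTE is used and no character decoding takes place.
#+ecl
(defun octets-as-bytes (octets)
  (with-open-stream (s (ext:make-sequence-input-stream octets))
    (loop for b = (read-byte s nil nil)
          while b
          collect b)))

;; Character view (hypothetical helper): the external format decides
;; how characters are decoded from the same octets.
#+ecl
(defun octets-as-characters (octets)
  (with-open-stream (s (ext:make-sequence-input-stream octets
                                                       :external-format :utf-8))
    (loop for ch = (read-char s nil nil)
          while ch
          collect ch)))
```

Calling both helpers on *octets* is meant to show the two readings of the same data side by side: a list of integers in the first case and a list of characters in the second.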

Your examples modified:
;; Encode a string into a UTF-8 binary stream (this is the safest alternative)
(setf *bytes*
     (make-array 16
                 :element-type '(unsigned-byte 8)
                 :fill-pointer 0))

(with-open-stream
    (s (ext:make-sequence-output-stream *bytes*
                                        :external-format :utf-8))
  ;; Note: WRITE-STRING does not interpret "~%"; it is written literally.
  (write-string "Héhéhéhé~%" s)
  (print *bytes*))

(with-open-stream
    (s (ext:make-sequence-input-stream *bytes*
                                       :start 0
                                       :external-format :utf-8))
  (print (read-line s)))

;; Convert the binary representation into a string
(setf *string* (make-array 16
                           :element-type 'character
                           :fill-pointer 0
                           :adjustable t))
(with-open-stream (s (ext:make-sequence-output-stream *string*))
  ;; Copy each octet of *bytes* into the string-backed stream.
  (loop
    for b across *bytes*
    do (write-byte b s))
  (print *string*))

;; Encode directly into a string

(setf (fill-pointer *string*) 0)
(with-open-stream
    (s (ext:make-sequence-output-stream *string*
                                        :external-format :utf-8))
  (write-string "Héhéhéhé~%" s)
  (print *string*))

;; Decode the UTF-8 string
(with-open-stream (s (ext:make-sequence-input-stream *string*
                         :external-format :utf-8))
  (print (read-line s)))
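
For the URL use case mentioned in the quoted message, the percent-decoding step could be sketched as follows. This is only an illustration: percent-decode-to-octets and decode-utf-8-octets are hypothetical names that are not part of ECL, unescaped characters are assumed to be ASCII, and strict validation of the two hexadecimal digits is left out. The second function is ECL-specific and assumes that ext:make-sequence-input-stream accepts an octet vector with a fill pointer, as *bytes* is used in the examples above.

```lisp
(defun percent-decode-to-octets (text)
  "Hypothetical helper: return the octets described by percent-encoded TEXT."
  (let ((octets (make-array (length text)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (loop with i = 0
          while (< i (length text))
          do (let ((ch (char text i)))
               (cond ((char= ch #\%)
                      ;; "%HH": two hexadecimal digits give one octet.
                      ;; PARSE-INTEGER signals an error on non-numeric input,
                      ;; and a truncated escape runs past the end of TEXT.
                      (vector-push (parse-integer text
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     (t
                      ;; Assumption: unescaped characters are plain ASCII.
                      (vector-push (char-code ch) octets)
                      (incf i)))))
    octets))

#+ecl
(defun decode-utf-8-octets (octets)
  "Hypothetical helper: decode OCTETS as UTF-8 through an ECL sequence stream."
  (with-open-stream (s (ext:make-sequence-input-stream octets
                                                       :external-format :utf-8))
    (with-output-to-string (out)
      (loop for ch = (read-char s nil nil)
            while ch
            do (write-char ch out)))))
```

For example, (percent-decode-to-octets "H%C3%A9") yields a vector holding 72, 195 and 169. Reading the result back character by character, rather than with READ-LINE, keeps any decoded newline in the output.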

Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)