From: Adam W. <li...@co...> - 2002-09-08 01:50:21
On Sun, 2002-09-08 at 12:43, Sam Steingold wrote:
> > * In message <1031440965.1510.43.camel@work>
> > * On the subject of "Re: [clisp-list] ((SETF STREAM-ELEMENT-TYPE) new-element-type stream)"
> > * Sent on 08 Sep 2002 11:22:44 +1200
> > * Honorable Adam Warner <li...@co...> writes:
> >
> > The kludge is that I currently have to simulate a binary stream using
> > with-output-to-string, convert that to a vector and then convert that
> > to a UTF-8 encoded string. If I had a vector stream of (unsigned-byte
> > 8) I could bypass the conversion from the simulated binary stream to
> > the actual vector stream (plus writing to a vector stream should be
> > faster than a string stream).
>
> why use streams at all?
> doesn't VECTOR-PUSH-EXTEND do what you want?

Why yes it does. Thanks for pointing out this functionality!

Speed-wise--since all non-escaped (%xx encoded) string elements are simply
printed to the stream using write-string--any byte-by-byte approach would
probably be slower (since I can't compile to native code). The way to make
it faster would be to avoid converting the plain ASCII portions to vectors
in the first place. The output string would be a concatenation of the
plain ASCII portions and the VECTOR-PUSH-EXTENDed portions converted back
to strings.

Thanks for the advice, Sam. It was much appreciated. CLISP's great Unicode
support was one of the clear reasons for choosing to develop using CLISP
instead of CMUCL.

Regards,
Adam

(Note from the impnotes: UTF-8 only supports Unicode-16 encoding up to
three bytes. Longer UTF-8 sequences do not yet appear to be supported:

"UTF-8, the 16-bit UNICODE character set. Every character is represented
as one to three bytes. ASCII characters represent themselves and need one
byte per character. Most Latin/Greek/Cyrillic/Hebrew characters need two
bytes per character, and the remaining characters need three bytes per
character. This is therefore, in general, the most space-efficient
encoding of all of Unicode-16."
UTF-8 encoded characters may be up to six bytes long. The 16-bit character
subset is up to three bytes long as stated above.)
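The approach discussed above can be sketched roughly as follows. This is a
minimal illustration, not Adam's actual code: UTF-8-ESCAPE is a hypothetical
name, and it assumes CLISP's EXT:CONVERT-STRING-TO-BYTES extension and the
CHARSET:UTF-8 encoding object as described in the impnotes. Plain ASCII runs
are written straight through with write-char, while non-ASCII characters have
their UTF-8 octets accumulated byte-by-byte with VECTOR-PUSH-EXTEND into an
adjustable (unsigned-byte 8) vector and then emitted as %xx escapes:

```lisp
(defun utf-8-escape (string)
  "Return STRING with non-ASCII characters %xx-escaped as UTF-8 octets.
Hypothetical sketch; assumes CLISP's EXT:CONVERT-STRING-TO-BYTES."
  (with-output-to-string (out)
    ;; Reusable octet buffer, extended as needed by VECTOR-PUSH-EXTEND.
    (let ((bytes (make-array 8 :element-type '(unsigned-byte 8)
                               :adjustable t :fill-pointer 0)))
      (loop for char across string
            if (char<= #\Space char #\~)
              ;; Plain ASCII: no conversion to a vector needed at all.
              do (write-char char out)
            else
              ;; Non-ASCII: collect the character's UTF-8 octets, then
              ;; print each one as a %xx escape.
              do (setf (fill-pointer bytes) 0)
                 (loop for b across (ext:convert-string-to-bytes
                                     (string char) charset:utf-8)
                       do (vector-push-extend b bytes))
                 (loop for b across bytes
                       do (format out "%~2,'0X" b))))))
```

For example, U+03BB (GREEK SMALL LETTER LAMDA) encodes in UTF-8 as the two
octets #xCE #xBB, so it would appear in the output as "%CE%BB", while an
ASCII string passes through unchanged.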