From: Adam W. <li...@co...> - 2002-09-08 01:50:21
On Sun, 2002-09-08 at 12:43, Sam Steingold wrote:
> > * In message <1031440965.1510.43.camel@work>
> > * On the subject of "Re: [clisp-list] ((SETF STREAM-ELEMENT-TYPE) new-element-type stream)"
> > * Sent on 08 Sep 2002 11:22:44 +1200
> > * Honorable Adam Warner <li...@co...> writes:
> >
> > The kludge is that I currently have to simulate a binary stream using
> > with-output-to-string, convert that to a vector and then convert that
> > to a UTF-8 encoded string. If I had a vector stream of (unsigned-byte
> > 8) I could bypass the conversion from the simulated binary stream to
> > the actual vector stream (plus writing to a vector stream should be
> > faster than a string stream).
>
> why use streams at all?
> doesn't VECTOR-PUSH-EXTEND do what you want?

Why yes it does. Thanks for pointing out this functionality!

Speed-wise--since all non-escaped (%xx encoded) string elements are simply
printed to the stream using write-string--any byte-by-byte approach would
probably be slower (since I can't compile to native code). The way to make
it faster would be to avoid converting the plain ASCII portions to vectors
in the first place. The output string would be a concatenation of the
plain ASCII portions and the VECTOR-PUSH-EXTENDed portions converted back
to strings.

Thanks for the advice, Sam. It was much appreciated. CLISP's great Unicode
support was one of the clear reasons for choosing to develop using CLISP
instead of CMUCL.

Regards,
Adam

(Note from the impnotes: UTF-8 only supports Unicode-16 encoding up to
three bytes. Longer UTF-8 sequences do not yet appear to be supported:

"UTF-8, the 16-bit UNICODE character set. Every character is represented
as one to three bytes. ASCII characters represent themselves and need one
byte per character. Most Latin/Greek/Cyrillic/Hebrew characters need two
bytes per character, and the remaining characters need three bytes per
character. This is therefore, in general, the most space-efficient
encoding of all of Unicode-16."
UTF-8 encoded characters may be up to six bytes long. The 16-bit character
subset is up to three bytes long as stated above.)
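The approach discussed above can be sketched roughly as follows. This is a
minimal illustration, not Adam's actual code: UTF-8-ESCAPE is a hypothetical
name, and it assumes CLISP's EXT:CONVERT-STRING-TO-BYTES extension and the
CHARSET:UTF-8 encoding object as described in the impnotes. Plain ASCII runs
are written straight through with write-char, while non-ASCII characters have
their UTF-8 octets accumulated byte-by-byte with VECTOR-PUSH-EXTEND into an
adjustable (unsigned-byte 8) vector and then emitted as %xx escapes:

```lisp
(defun utf-8-escape (string)
  "Return STRING with non-ASCII characters %xx-escaped as UTF-8 octets.
Hypothetical sketch; assumes CLISP's EXT:CONVERT-STRING-TO-BYTES."
  (with-output-to-string (out)
    ;; Reusable octet buffer, extended as needed by VECTOR-PUSH-EXTEND.
    (let ((bytes (make-array 8 :element-type '(unsigned-byte 8)
                               :adjustable t :fill-pointer 0)))
      (loop for char across string
            if (char<= #\Space char #\~)
              ;; Plain ASCII: no conversion to a vector needed at all.
              do (write-char char out)
            else
              ;; Non-ASCII: collect the character's UTF-8 octets, then
              ;; print each one as a %xx escape.
              do (setf (fill-pointer bytes) 0)
                 (loop for b across (ext:convert-string-to-bytes
                                     (string char) charset:utf-8)
                       do (vector-push-extend b bytes))
                 (loop for b across bytes
                       do (format out "%~2,'0X" b))))))
```

For example, U+03BB (GREEK SMALL LETTER LAMDA) encodes in UTF-8 as the two
octets #xCE #xBB, so it would appear in the output as "%CE%BB", while an
ASCII string passes through unchanged.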