From: R. M. <rm...@mh...> - 2005-04-06 14:05:33
|
Hello list, is there a way to create strings with different encodings in recent SBCL releases? I'm running 0.8.20 with Unicode support but need to convert strings to ISO-8859-1 in some places in my program. Is there any support for this at all? Is there syntactic sugar available (for example, a :character-encoding keyword for with-output-to-string)? TIA Ralf Mattes |
From: Christophe R. <cs...@ca...> - 2005-04-06 14:30:37
|
"R. Mattes" <rm...@mh...> writes: > is there a way to create strings with different encodings in recent > sbcls? I'm running 0.8.20 with unicode support but need to convert strings > to ISO-8859-1 at some places in my program. Is there any support for this > at all? Is there syntactic sugar available (like, for example an > :character-encoding keyword for with-output-to-string). Strings are not encoded; they are sequences of characters; as such, a :character-encoding or :external-format for string output streams makes no sense, because a character is fundamentally not an integer. So, what are you actually trying to do? I'm guessing you want an iso-8859-1 encoding of your string, maybe to write to a file, or to pass to a foreign function; for that, you'd need a sequence of bytes. To do that, look at the sb-ext:string-to-octets function. Cheers, Christophe |
From: R. M. <rm...@mh...> - 2005-04-06 16:14:46
|
On Wed, 06 Apr 2005 15:22:25 +0100, Christophe Rhodes wrote: > "R. Mattes" <rm...@mh...> writes: > >> is there a way to create strings with different encodings in recent >> sbcls? I'm running 0.8.20 with unicode support but need to convert strings >> to ISO-8859-1 at some places in my program. Is there any support for this >> at all? Is there syntactic sugar available (like, for example an >> :character-encoding keyword for with-output-to-string). > > Strings are not encoded; they are sequences of characters; as such, a > :character-encoding or :external-format for string output streams > makes no sense, because a character is fundamentally not an integer. Yes, i think my terminology was a bit unclear and i haven't had a closer look at the implementation details for :UNICODE. So a string isn't coupled with its external representation? Nice. > So, what are you actually trying to do? I'm guessing you want an > iso-8859-1 encoding of your string, maybe to write to a file, or to pass > to a foreign function; for that, you'd need a sequence of bytes. To do > that, look at the sb-ext:string-to-octets function. My final task is to create a valid URL to be sent to a webserver. The server assumes the URL to be encoded in ISO-8859-1 (well, implicitly; the PHP programmers have no idea about character->code point mapping). So, after looking at octets.lisp: i need to convert my string into a vector of octets, map a function that URL-escapes all non-ASCII chars as sequences of ASCII chars, and then convert back with (octets-to-string ... :external-format :ascii) so that i can send it over a socket? Thanks a lot RalfD |
From: Christophe R. <cs...@ca...> - 2005-04-06 16:29:35
|
"R. Mattes" <rm...@mh...> writes: > On Wed, 06 Apr 2005 15:22:25 +0100, Christophe Rhodes wrote: > >> So, what are you actually trying to do? I'm guessing you want an >> iso-8859-1 encoding of your string, maybe to write to a file, or to pass >> to a foreign function; for that, you'd need a sequence of bytes. To do >> that, look at the sb-ext:string-to-octets function. > > My final task is to create a valid URL to be send to a webserver. The > server assumes the URL to be encoded in ISO-8859-1 (well, implicitly, the > PHP-programmers have no idea about character->code point mapping). > So, after looking at octets.lisp: i need to convert my string into a > vector of octets, map a function that URL-escapes all non-ASCII chars as > sequences of ASCII chars and then convert back with > (octet-to-string ... :external-format :ascii) so that i can send it over > a socket? I wouldn't do it that way. Rather, I would do something like (with-output-to-string (s) (dotimes (i (length string)) (let ((char (char string i))) (cond ((safe-char-p char) (write-char char s)) (t (write-char #\% s) (write-string (format nil "~X" (char-code char)))))))) and use the return value of that. Cheers, Christophe |
From: R. M. <rm...@mh...> - 2005-04-06 18:46:18
|
On Wed, 06 Apr 2005 17:27:17 +0100, Christophe Rhodes wrote: > I wouldn't do it that way. Rather, I would do something like > (with-output-to-string (s) > (dotimes (i (length string)) > (let ((char (char string i))) > (cond > ((safe-char-p char) (write-char char s)) > (t (write-char #\% s) > (write-string (format nil "~X" (char-code char)))))))) > and use the return value of that. Hmm, maybe last night was way too short for me, but where in your code would the actual conversion from UTF-8 to ISO-8859-1 happen? doesn't (char-code char) return the character code of char in unicode? Cheers RalfD |
From: Christophe R. <cs...@ca...> - 2005-04-06 18:56:17
|
"R. Mattes" <rm...@mh...> writes: > On Wed, 06 Apr 2005 17:27:17 +0100, Christophe Rhodes wrote: > >> I wouldn't do it that way. Rather, I would do something like >> (with-output-to-string (s) >> (dotimes (i (length string)) >> (let ((char (char string i))) >> (cond >> ((safe-char-p char) (write-char char s)) >> (t (write-char #\% s) >> (write-string (format nil "~X" (char-code char)))))))) >> and use the return value of that. > > Hmm, maybe last night was way too short for me, but where in your code > would the actual conversion from UTF-8 to ISO-8859-1 happen? doesn't > (char-code char) return the character code of char in unicode? Well, yes, but the Unicode people were clever, in that the first 256 code points in Unicode encode the same characters as the 256 code points of ISO-8859-1. UTF-8 is a different encoding, and never enters the picture at all. Cheers, Christophe |
From: R. M. <rm...@mh...> - 2005-04-06 19:10:39
|
On Wed, 06 Apr 2005 19:54:05 +0100, Christophe Rhodes wrote: > "R. Mattes" <rm...@mh...> writes >> Hmm, maybe last night was way too short for me, but where in your code >> would the actual conversion from UTF-8 to ISO-8859-1 happen? doesn't >> (char-code char) return the character code of char in unicode? > > Well, yes, but the Unicode people were clever, in that the first 256 > code points in Unicode encode the same characters as the 256 code > points of ISO-8859-1. UTF-8 is a different encoding, and never enters > the picture at all. Ok, i was afraid of that answer - i think i'd prefer a more generic solution where i have control over the output encoding (it might be ISO-8859-9 or some sort of MS-Windows code page; none of the web folks was able to tell ...). Cheers RalfD |
From: Thomas F. B. <tfb@OCF.Berkeley.EDU> - 2005-04-06 19:30:04
|
R. Mattes writes: > On Wed, 06 Apr 2005 19:54:05 +0100, Christophe Rhodes wrote: > > > "R. Mattes" <rm...@mh...> writes > >> Hmm, maybe last night was way too short for me, but where in your code > >> would the actual conversion from UTF-8 to ISO-8859-1 happen? doesn't > >> (char-code char) return the character code of char in unicode? > > > > Well, yes, but the Unicode people were clever, in that the first 256 > > code points in Unicode encode the same characters as the 256 code > > points of ISO-8859-1. UTF-8 is a different encoding, and never enters > > the picture at all. > > Ok, i was afraid of that anwer - i think i'd prefer a more generic > solution where i have control over the output encoding (might be > ISO-8859-9 or some sort of MS-Windows code page. None of the web folks > was able to tell ....). Well, in that case, when you do find the encoding you need, make a string listing the characters in order, and use (position char +ms-windows-page-blah+) instead of (char-code char). Or make an array such that you can do (aref +ms-code-page-blah+ (char-code char)) |
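As a rough illustration of the table approach, here is a sketch for ISO-8859-9, which differs from ISO-8859-1 in six positions. The names *ISO-8859-9* and CHAR-OCTET are made up for this example, and signalling an error for unrepresentable characters is an assumed policy.

```lisp
;; A 256-character string in which index N holds the character that
;; ISO-8859-9 assigns to octet N. Start from the identity mapping
;; (ISO-8859-1) and patch the six Turkish letters.
(defparameter *iso-8859-9*
  (let ((table (make-string 256)))
    (dotimes (i 256)
      (setf (char table i) (code-char i)))
    (loop for (octet code-point) in '((#xD0 #x011E) (#xDD #x0130) (#xDE #x015E)
                                      (#xF0 #x011F) (#xFD #x0131) (#xFE #x015F))
          do (setf (char table octet) (code-char code-point)))
    table))

;; Hypothetical helper: the octet for CHAR in the code page TABLE.
(defun char-octet (char table)
  (or (position char table)
      (error "~S is not representable in this code page" char)))

(char-octet #\A *iso-8859-9*)                 ; ASCII is unchanged
(char-octet (code-char #x011F) *iso-8859-9*)  ; g-breve lives at #xF0
```

POSITION is a linear search; for long inputs a hash table from character to octet would be the obvious replacement.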
From: R. M. <rm...@mh...> - 2005-04-06 20:06:07
|
On Wed, 06 Apr 2005 12:29:53 -0700, Thomas F. Burdick wrote: > R. Mattes writes: > > On Wed, 06 Apr 2005 19:54:05 +0100, Christophe Rhodes wrote: > > > > > "R. Mattes" <rm...@mh...> writes > > >> Hmm, maybe last night was way too short for me, but where in your code > > >> would the actual conversion from UTF-8 to ISO-8859-1 happen? doesn't > > >> (char-code char) return the character code of char in unicode? > > > > > > Well, yes, but the Unicode people were clever, in that the first 256 > > > code points in Unicode encode the same characters as the 256 code > > > points of ISO-8859-1. UTF-8 is a different encoding, and never enters > > > the picture at all. > > > > Ok, i was afraid of that anwer - i think i'd prefer a more generic > > solution where i have control over the output encoding (might be > > ISO-8859-9 or some sort of MS-Windows code page. None of the web folks > > was able to tell ....). > > Well, in that case, when you do find the encoding you need, make a > string listing the characters in order, and use > (find char +ms-windows-page-blah+) > instead of (code-char char). Or make an array such that you can do > (aref +ms-code-page-blah+ (char-code char)) Yes, of course - coming back to my original question then: since input/output en/recoding seems to be a rather generic task (and one that will haunt us for the next several years), does SBCL provide such functionality? I've no problem writing my own character recoders but it would hurt since it's all there, in octets.lisp - just not public :-/ So, maybe a humble request: can we get the functionality of octets.lisp (and maybe some syntactic sugar) into a public API? Also, how can we humble web/xml/desktop application programmers get a "stringish" datatype that can be fed to format/write-string et al. (unfortunately one can't subtype built-in classes)? 
Thanks RalfD |
From: Nikodemus S. <nik...@ra...> - 2005-04-06 20:12:50
|
On Wed, 6 Apr 2005, R. Mattes wrote: > So, maybe a humble request: can we get the functionality of octets.lisp > (and maybe some syntactic sugar) into a public API? Also, how can we > humble web/xml/desktop application programmers get a "stringish" datatype > that can be fed to format/write-string et al. (unfortunately one can't > subtype build-in classes). You're confused here. Strings are exactly what you feed those. The output encoding to the _stream_ is determined by the used external format. If you need to interface with funky external formats via FFI as opposed to streams the same applies, but to STRING-TO-OCTETS. Cheers, -- Nikodemus Schemer: "Buddha is small, clean, and serious." Lispnik: "Buddha is big, has hairy armpits, and laughs." |
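To illustrate the stream side of this, a sketch that writes an ordinary string through a character stream opened with :external-format :latin-1 and then reads the file back as raw octets. The function name and the /tmp path are assumptions chosen for the example.

```lisp
;; Hypothetical helper: write STRING to PATH as ISO-8859-1, then return
;; the octets that actually ended up in the file.
(defun write-latin-1-file (path string)
  ;; The string is plain characters; the stream does the encoding.
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :latin-1)
    (write-string string out))
  ;; Reopen as a binary stream to inspect the result.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

;; "Gr", u-umlaut, sharp s, "e": five characters, five octets on disk
(write-latin-1-file "/tmp/latin-1-demo.txt"
                    (format nil "Gr~C~Ce" (code-char #xFC) (code-char #xDF)))
```

FORMAT, WRITE-STRING and friends work unchanged on such a stream, which is the point being made: the "stringish" type is simply STRING.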
From: R. M. <rm...@mh...> - 2005-04-06 20:37:31
|
On Wed, 06 Apr 2005 23:12:42 +0300, Nikodemus Siivola wrote: > On Wed, 6 Apr 2005, R. Mattes wrote: > >> So, maybe a humble request: can we get the functionality of octets.lisp >> (and maybe some syntactic sugar) into a public API? Also, how can we >> humble web/xml/desktop application programmers get a "stringish" datatype >> that can be fed to format/write-string et al. (unfortunately one can't >> subtype build-in classes). > > You're confused here. Strings are exactly what you feed those. The > output encoding to the _stream_ is determined by the used external format. > If you need to interface with funky external formats via FFI as opposed > to streams the same applies, but to STRING-TO-OCTETS. Well, but that model breaks in certain use cases. Both MIME and HTTP support (sometimes even require) streams with changing output encoding (i.e. the mapping between the characters you put in and their representation in the stream isn't the same for all characters written or read). Any proper first line in an HTTP request _must_ be ASCII (hence the restriction on URLs) but the body of the request can be in any encoding whatsoever. In a multipart file upload there might be a different encoding in each part of the MIME message. The same goes for SMTP. Of course, we might get away on the terminology level by just declaring the underlying stream as of type octet - but then: will we loose such niceties as 'format'? Cheers, RalfD And thanks for all your input on this! |
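One possible way to keep FORMAT while writing differently encoded parts to a single octet stream is to format into a string first and encode that string per part. This is only a sketch of that idea, not something proposed in the thread; FORMATTED-OCTETS and WRITE-PART are made-up names, and STREAM is assumed to have element type (unsigned-byte 8).

```lisp
;; Hypothetical helper: run FORMAT into a string, then encode the result.
(defun formatted-octets (external-format control &rest args)
  (sb-ext:string-to-octets (apply #'format nil control args)
                           :external-format external-format))

;; Hypothetical helper: write one formatted, encoded part to a binary STREAM.
(defun write-part (stream external-format control &rest args)
  (write-sequence (apply #'formatted-octets external-format control args)
                  stream))

;; Intended use, with SOCKET standing for some octet stream:
;;   (write-part socket :ascii "GET ~A HTTP/1.0~C~C" path #\Return #\Linefeed)
;;   (write-part socket :latin-1 "~A" body)

;; The same character encoded two ways (e-acute, U+00E9)
(formatted-octets :latin-1 "~C" (code-char #xE9))
(formatted-octets :utf-8 "~C" (code-char #xE9))
```

The cost is an intermediate string per part, which is usually acceptable for headers and small bodies.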
From: Christophe R. <cs...@ca...> - 2005-04-06 20:51:19
|
"R. Mattes" <rm...@mh...> writes: > Of course, we might get away on the terminology level by just declaring > the underlying stream as of type octet - but then: will we loose such > niceties as 'format'? I wouldn't normally pick up on this, but this sentence actually means the opposite of what you meant: "loose" is not the same as "lose". You are quite right that changing the external-format, and indeed probably element-type, of streams, is a desireable feature eventually. It will be implemented just as soon as someone finds the time to do it properly. Cheers, Christophe |
From: R. M. <rm...@mh...> - 2005-04-06 21:02:25
|
On Wed, 06 Apr 2005 21:44:08 +0100, Christophe Rhodes wrote: > > I wouldn't normally pick up on this, but this sentence actually means > the opposite of what you meant: "loose" is not the same as "lose". Oh, please do so. > You are quite right that changing the external-format, and indeed > probably element-type, of streams, is a desireable feature eventually. > It will be implemented just as soon as someone finds the time to do > it properly. Would this be possible with gray-streams? This issue seems to pop up quite often recently (esp. in Web/LISP related mailing lists). Cheers RalfD |
From: Christophe R. <cs...@ca...> - 2005-04-06 20:18:46
|
"R. Mattes" <rm...@mh...> writes: > So, maybe a humble request: can we get the functionality of octets.lisp > (and maybe some syntactic sugar) into a public API? Not in the form you're asking for, because you are still confused. Characters are characters. Octets are octets. When you want to work with the abstract character data type, for instance to url-encode something, operate on characters. When you want to work with octets, for instance to compute a transfer length, operate on octets. To encode characters to octets, or decode characters from octets, use the public functionality that has already been advertised: the :external-format argument to character stream creation, or sb-ext:octets-to-string / sb-ext:string-to-octets. If you take anything away from this message, please take away the information contained in my second and third sentences. Cheers, Christophe |
From: Nikodemus S. <nik...@ra...> - 2005-04-06 20:12:55
|
On Wed, 6 Apr 2005, R. Mattes wrote: > Ok, i was afraid of that anwer - i think i'd prefer a more generic > solution where i have control over the output encoding (might be > ISO-8859-9 or some sort of MS-Windows code page. None of the web folks > was able to tell ....). If it's an URL, then it's ASCII with ranges 00-1F, 7F, 80-FF encoded as Christophe showed you. No encodings above FF in URLs following RFC 1738, (see RFC 2397 if you're curious). Cheers, -- Nikodemus Schemer: "Buddha is small, clean, and serious." Lispnik: "Buddha is big, has hairy armpits, and laughs." |
From: R. M. <rm...@mh...> - 2005-04-06 20:51:54
|
On Wed, 06 Apr 2005 22:38:20 +0300, Nikodemus Siivola wrote: > On Wed, 6 Apr 2005, R. Mattes wrote: > >> Ok, i was afraid of that anwer - i think i'd prefer a more generic >> solution where i have control over the output encoding (might be >> ISO-8859-9 or some sort of MS-Windows code page. None of the web folks >> was able to tell ....). > > If it's an URL, then it's ASCII with ranges 00-1F, 7F, 80-FF > encoded as Christophe showed you. No encodings above FF in URLs > following RFC 1738, (see RFC 2397 if you're curious). Geeez, believe me, i've read RFC2397. Unfortunately that exact RFC is rather muddy in certain spots. It claims that a URI is "a sequence of characters", but also that a "... sequence of characters may be used to represent a sequence of octets". Now, this is all fine for an RFC that only deals with the format of a URI and not with its semantics. But even the authors realize that there "... is a second translation for some resources: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters." The text continues: "For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used. It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification." For a more humorous (humourous?) report on the resulting problems read Daniel Barlow's blog entry from March 17, 2005 and http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html I really try to read and understand your posts. I'm trying not to be stubborn (hard at my age) but i do see a problem here. Thanks again for your input RalfD |