From: Don G. <do...@ge...> - 2011-08-13 01:20:21

I'm working with cmucl (20B unicode) and sbcl (1.0.50.0.debian) on both
Debian and Ubuntu. I've got some code that takes strings, and finds a
"best fit" ascii representation for them:

    http://don.geddis.org/lisp/asciify.lisp

This code works fine in cmucl:

-------------------------------------------------------------------------------
unix> cmucl
CMU Common Lisp Debian build (20B Unicode), running on yoda
With core: /usr/lib/cmucl/lisp-sse2.core
Dumped on: Tue, 2011-08-09 08:56:17-07:00 on yoda
See <http://www.cons.org/cmucl/> for support information.
Loaded subsystems:
    Unicode 1.8.4.1 with Unicode version 5.1.0
    Python 1.1, target Intel x86/sse2
    CLOS based on Gerd's PCL 2010-03-19 15:19:03
* (load "asciify.lisp")
; Loading #P"/home/geddis/www/don/lisp/asciify.lisp".
T
* (asciify "José árbol niño")
"Jose arbol nino"
* (quit)
-------------------------------------------------------------------------------

But I can't even load it in sbcl, with the literal unicode characters in
the strings in the code:

-------------------------------------------------------------------------------
unix> sbcl
This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (load "asciify.lisp")
STYLE-WARNING:
  Character decoding error in a ;-comment at position 619 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1147 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1170 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1173 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1207 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1259 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1314 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD "initial thread" RUNNING {AA73821}>:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>
  (:EXTERNAL-FORMAT :ASCII):
    the octet sequence (225) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC ] Attempt to resync the stream at a character boundary and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
  3: [ABORT ] Abort loading file "/home/geddis/opt/lisp/asciify.lisp".
  4: Exit debugger, returning to top level.

(SB-INT:STREAM-DECODING-ERROR #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}> (225))
0] (quit)
-------------------------------------------------------------------------------

I note also that sbcl has :SB-UNICODE on *features*:

-------------------------------------------------------------------------------
* *features*
(:SB-FUTEX :ANSI-CL :COMMON-LISP :SBCL :SB-DOC :SB-TEST :SB-LDB
 :SB-PACKAGE-LOCKS :SB-UNICODE :SB-EVAL :SB-SOURCE-LOCATIONS
 :IEEE-FLOATING-POINT :X86 :UNIX :ELF :LINUX :SB-THREAD :LARGEFILE :GENCGC
 :STACK-GROWS-DOWNWARD-NOT-UPWARD :C-STACK-IS-CONTROL-STACK
 :COMPARE-AND-SWAP-VOPS :UNWIND-TO-FRAME-AND-CALL-VOP
 :RAW-INSTANCE-INIT-VOPS :STACK-ALLOCATABLE-CLOSURES
 :STACK-ALLOCATABLE-VECTORS :STACK-ALLOCATABLE-LISTS
 :STACK-ALLOCATABLE-FIXED-OBJECTS :ALIEN-CALLBACKS :CYCLE-COUNTER
 :INLINE-CONSTANTS :MEMORY-BARRIER-VOPS :LINKAGE-TABLE :OS-PROVIDES-DLOPEN
 :OS-PROVIDES-DLADDR :OS-PROVIDES-PUTWC :OS-PROVIDES-BLKSIZE-T
 :OS-PROVIDES-SUSECONDS-T :OS-PROVIDES-GETPROTOBY-R :OS-PROVIDES-POLL)
-------------------------------------------------------------------------------

I also note that the documentation

    http://sbcl-internals.cliki.net/Unicode

suggests that there's a close connection between the cmucl unicode code,
and the sbcl unicode code.

So I assume this difference was on purpose. Obviously I'm missing
something in my understanding about how this is supposed to work. Can
anyone give me advice for how I can get my code working in sbcl?

Thanks,

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
Minds are like parachutes. Just because you've lost yours doesn't mean you can
borrow mine. -- Despair.com
From: Stas B. <sta...@gm...> - 2011-08-13 02:06:28

Don Geddis <do...@ge...> writes:

> I'm working with cmucl (20B unicode) and sbcl (1.0.50.0.debian) on both
> Debian and Ubuntu. I've got some code that takes strings, and finds a
> "best fit" ascii representation for them:
> http://don.geddis.org/lisp/asciify.lisp
> But I can't even load it in sbcl, with the literal unicode characters in
> the strings in the code:

Make sure that your locale uses UTF-8 encoding (LANG environment variable).

--
With best regards, Stas.
From: Don G. <do...@ge...> - 2011-08-13 03:08:42

Stas Boukarev <sta...@gm...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>> I'm working with cmucl (20B unicode) and sbcl (1.0.50.0.debian) on both
>> Debian and Ubuntu. I've got some code that takes strings, and finds a
>> "best fit" ascii representation for them:
>> http://don.geddis.org/lisp/asciify.lisp
>> But I can't even load it in sbcl, with the literal unicode characters in
>> the strings in the code:
> Make sure that your locale uses UTF-8 encoding (LANG environment variable).

Doesn't seem to help. I did "dpkg-reconfigure locales", and selected
UTF-8. Same error when attempting to load the code in sbcl:

-------------------------------------------------------------------------------
unix> echo $LANG
en_US.UTF-8
unix> sbcl
This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (load "asciify.lisp")
STYLE-WARNING:
  Character decoding error in a ;-comment at position 619 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
STYLE-WARNING:
  Character decoding error in a ;-comment at position 1147 reading source stream #<FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>, resyncing.
[...]

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD "initial thread" RUNNING {AA73821}>:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}>
  (:EXTERNAL-FORMAT :ASCII):
    the octet sequence (225) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC ] Attempt to resync the stream at a character boundary and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
  3: [ABORT ] Abort loading file "/home/geddis/opt/lisp/asciify.lisp".
  4: Exit debugger, returning to top level.

(SB-INT:STREAM-DECODING-ERROR #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp" {AA7C181}> (225))
0] :4
* (quit)
unix>
-------------------------------------------------------------------------------

(Same effect on Ubuntu 11.04, running 1.0.45.0.debian, BTW.)

Any other suggestions?

Thanks,

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
From: Don G. <do...@ge...> - 2011-08-13 03:10:21

Don Geddis <do...@ge...> wrote on Fri, 12 Aug 2011:
> I'm working with cmucl (20B unicode) and sbcl (1.0.50.0.debian) on both
> Debian and Ubuntu. I've got some code that takes strings, and finds a
> "best fit" ascii representation for them:
> http://don.geddis.org/lisp/asciify.lisp
[...]
> But I can't even load it in sbcl, with the literal unicode characters in
> the strings in the code:

Louis Turk <lo...@da...> wrote:
> I'm no expert, but perhaps seeing how I work with utf-8 will help (note
> :external-format :utf-8) in the code below:
> (with-open-file (stream-out "/home/lat/lisp/interl/8-lines.tmp"
>                  :external-format :utf-8
>                  :direction :output
>                  :if-exists :supersede)

It's a good hint, and I saw things like this on the web. In fact,
reading from a file is where I initially saw the problem. I just thought
it made a simpler test case to load literal strings in code.

But I've failed with your suggestion too. I put a test case here:

    http://don.geddis.org/other/unicode/

There's a short function, and a data file. The lisp function

    (defun last-line (&optional (file "play.log"))
      (with-open-file (f file :direction :input :external-format :utf-8)
        (loop with last-line = ""
              for line = (read-line f nil nil)
              while line do
              (setq last-line line)
              finally (return last-line) )))

is just trying to duplicate the unix "tail -1" functionality.

Adding ":external-format :utf-8" doesn't seem to help (me). CMUCL works
with or without it. SBCL fails, with or without it.

(I do find it interesting that CMUCL seems to autocompile some magic
source code about a UTF-8 external-format, while executing this code!)

FWIW, SBCL gives a _different_ octet sequence for its "decoding error",
depending on whether I add the :external-format or not. But no matter
what, I can't seem to get SBCL to read the characters into a string.

-- Don

-------------------------------------------------------------------------------
unix:~> echo $LANG
en_US.UTF-8
unix> ls -l
total 12K
-rw-r--r-- 1 geddis geddis 251 Aug 12 19:26 last.lisp
-rw-r--r-- 1 geddis geddis 7.9K Aug 12 19:22 play.log
unix> cat last.lisp
(defun last-line (&optional (file "play.log"))
  (with-open-file (f file :direction :input :external-format :utf-8)
    (loop with last-line = ""
          for line = (read-line f nil nil)
          while line do
          (setq last-line line)
          finally (return last-line) )))
unix> tail -1 play.log
Zepplin - Kashmir -- Pickin On (5m8s, 5699 KB, 44 kHz, 160 kbps, 10/0)
unix> cmucl
CMU Common Lisp Debian build (20B Unicode), running on yoda
With core: /usr/lib/cmucl/lisp-sse2.core
Dumped on: Tue, 2011-08-09 08:56:17-07:00 on yoda
See <http://www.cons.org/cmucl/> for support information.
Loaded subsystems:
    Unicode 1.8.4.1 with Unicode version 5.1.0
    Python 1.1, target Intel x86/sse2
    CLOS based on Gerd's PCL 2010-03-19 15:19:03
* (load "last.lisp")
; Loading #P"/home/geddis/www/don/other/unicode/last.lisp".
T
* (last-line)
; Comment: $Header: /project/cmucl/cvsroot/src/pcl/simple-streams/external-formats/utf-8.lisp,v 1.14.4.1 2010-08-14 23:51:08 rtoy Exp $

; Compiling DEFINE-EXTERNAL-FORMAT UTF-8:
; Compiling DEFINE-EXTERNAL-FORMAT UTF-8:
; Byte Compiling Top-Level Form:

; In: LAMBDA (STREAM::%SLOTS%)

; (STREAM::OCTETS-TO-CHAR :UTF-8 STREAM::STATE
;                         (AREF STREAM::OCOUNT STREAM::K)
;                         (IF # # #)
;                         ...)
; --> LET IF LET STREAM::OCTETS-TO-CODEPOINT MULTIPLE-VALUE-BIND
; --> MULTIPLE-VALUE-CALL LABELS BLOCK LET DOTIMES DO BLOCK LET TAGBODY LET
; --> TAGBODY LET IF SETF LET* MULTIPLE-VALUE-BIND LET LET
; ==>
; (SETQ #:G29 #:G40)
; Note: Doing signed word to integer coercion (cost 20) to #:G29.
;
; Compilation unit finished.
; 1 note

"Zepplin - Kashmir -- Pickin On (5m8s, 5699 KB, 44 kHz, 160 kbps, 10/0)"
* (quit)
unix> sbcl
This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (load "last.lisp")
T
* (last-line)

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD "initial thread" RUNNING {AA73909}>:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/other/unicode/play.log" {AAD65B9}>
  (:EXTERNAL-FORMAT :UTF-8):
    the octet sequence (252 114 32 69) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC ] Attempt to resync the stream at a character boundary and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
  3: [ABORT ] Exit debugger, returning to top level.

(SB-INT:STREAM-DECODING-ERROR #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/other/unicode/play.log" {AAD65B9}> (252 114 32 69))
0] :0

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD "initial thread" RUNNING {AA73909}>:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/other/unicode/play.log" {AAD65B9}>
  (:EXTERNAL-FORMAT :UTF-8):
    the octet sequence (252 114 32 69) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC ] Attempt to resync the stream at a character boundary and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
  3: [ABORT ] Exit debugger, returning to top level.

(SB-INT:STREAM-DECODING-ERROR #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/other/unicode/play.log" {AAD65B9}> (252 114 32 69))
0] (quit)
unix>
-------------------------------------------------------------------------------
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
Filament magazine: "At what age is it best to crush a child's dreams so that
they have an easier time stepping in to the status quo?"
From: Paul K. <pv...@pv...> - 2011-08-13 03:34:49

In article <877...@ma...>, Don Geddis <do...@ge...> wrote:

> Adding ":external-format :utf-8" doesn't seem to help (me). CMUCL works
> with or without it. SBCL fails, with or without it.
[...]
>
> debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD
> "initial
> thread" RUNNING
> {AA73909}>:
> decoding error on stream
> #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/other/unicode/play.log"
> {AAD65B9}>
> (:EXTERNAL-FORMAT :UTF-8):
> the octet sequence (252 114 32 69) cannot be decoded.

Does that log include "F\"{u}r Elise"? Looks like your file is in
something like ISO-8859-1, not UTF-8.

Paul Khuong
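As a rough illustration of Paul's diagnosis, the octets from the error message can be examined directly. The sketch below assumes a Lisp whose character codes are Unicode code points (true of an SBCL with :SB-UNICODE); the names *error-octets*, latin-1-decode and utf-8-continuation-p are made up for this example.

```lisp
;; Octets taken from the SBCL decoding error quoted above.
(defparameter *error-octets* '(252 114 32 69))

;; ISO-8859-1 maps each octet to the code point with the same number,
;; so CODE-CHAR is enough to decode it (assumes Unicode char codes).
(defun latin-1-decode (octets)
  (map 'string #'code-char octets))

;; In UTF-8, every octet after the first one of a multi-octet character
;; must have the bit pattern 10xxxxxx.
(defun utf-8-continuation-p (octet)
  (= (ldb (byte 2 6) octet) #b10))

(latin-1-decode *error-octets*)   ; => "ür E"
(utf-8-continuation-p 114)        ; => NIL
```

Octet 252 is above 127, so it is not ASCII, and it is followed by a plain ASCII "r" rather than a continuation octet, so the sequence cannot be UTF-8 either. Read as ISO-8859-1, the same octets are simply the middle of "Für Elise".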
From: Don G. <do...@ge...> - 2011-08-13 06:08:30

Paul Khuong <pv...@pv...> wrote on Fri, 12 Aug 2011:
> In article <877...@ma...>, Don Geddis <do...@ge...> wrote:
>> Adding ":external-format :utf-8" doesn't seem to help (me). CMUCL works
>> with or without it. SBCL fails, with or without it.
>
> Does that log include "F\"{u}r Elise"?

Yes! :-)

> Looks like your file is in something like ISO-8859-1, not UTF-8.

That's it! You solved my problem. Adding

    :external-format :iso-8859-1

to WITH-OPEN-FILE allows me to read strings from my data file.

I had been unable to find a list of possible external-formats in sbcl,
so I couldn't just iterate through the choices.

It also doesn't quite solve my secondary problem, of loading lisp code
that has literal strings (which apparently are in ISO-8859-1 also).
Although I guess LOAD takes an EXTERNAL-FORMAT keyword also, so I
suppose I could change all my LOADs.

Too bad there doesn't seem to be anything I can do inside the source
code that is being loaded.

But gosh darn, if you didn't figure why I couldn't get anything to work,
and how to fix it. Thanks very much!

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
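The fix Don describes might look like the following sketch. It keeps his last-line function but makes the external format a parameter that defaults to :iso-8859-1. The extra parameter and the sample.log file name are assumptions for illustration, and external-format names are implementation-specific (this is the SBCL spelling).

```lisp
(defun last-line (&optional (file "play.log") (external-format :iso-8859-1))
  "Return the last line of FILE, decoding it with EXTERNAL-FORMAT."
  (with-open-file (f file :direction :input
                          :external-format external-format)
    (loop with last-line = ""
          for line = (read-line f nil nil)
          while line
          do (setq last-line line)
          finally (return last-line))))

;; Write "Für Elise" as raw ISO-8859-1 octets (ü is the single octet 252),
;; then read it back as characters.
(with-open-file (out "sample.log" :direction :output
                                  :if-exists :supersede
                                  :element-type '(unsigned-byte 8))
  (write-sequence '(70 252 114 32 69 108 105 115 101 10) out))

(last-line "sample.log")   ; => "Für Elise" on SBCL
```

LOAD and COMPILE-FILE take the same keyword, so the source file case would be along the lines of (load "asciify.lisp" :external-format :iso-8859-1).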
From: Christophe R. <cs...@ca...> - 2011-08-13 06:51:20

Don Geddis <do...@ge...> writes:

> I had been unable to find a list of possible external-formats in sbcl,
> so I couldn't just iterate through the choices.

For next time, you can find a list of possible external-formats in the
manual, at <http://www.sbcl.org/manual/index.html#External-Formats>.

Cheers,

Christophe
From: Pascal J. B. <pj...@in...> - 2011-08-13 15:36:06

Don Geddis <do...@ge...> writes:

> Paul Khuong <pv...@pv...> wrote on Fri, 12 Aug 2011:
>> In article <877...@ma...>, Don Geddis <do...@ge...> wrote:
>>> Adding ":external-format :utf-8" doesn't seem to help (me). CMUCL works
>>> with or without it. SBCL fails, with or without it.
>>
>> Does that log include "F\"{u}r Elise"?
>
> Yes! :-)
>
>> Looks like your file is in something like ISO-8859-1, not UTF-8.
>
> That's it! You solved my problem. Adding
> :external-format :iso-8859-1
> to WITH-OPEN-FILE allows me to read strings from my data file.
>
> I had been unable to find a list of possible external-formats in sbcl,
> so I couldn't just iterate through the choices.

This is not a good way to determine the encoding of a file, because a
lot of byte sequences can be decoded with multiple encoding systems.
You could add a heuristic testing for the decoded characters, but this
would still leave ambiguous results on non-word data.

> It also doesn't quite solve my secondary problem, of loading lisp code
> that has literal strings (which apparently are in ISO-8859-1 also).
> Although I guess LOAD takes an EXTERNAL-FORMAT keyword also, so I
> suppose I could change all my LOADs.

The default encoding in sbcl is set from the locale environment
variables.

> Too bad there doesn't seem to be anything I can do inside the source
> code that is being loaded.

For files that have a non-default encoding, have a look at

    https://gitorious.org/com-informatimago/com-informatimago/blobs/master/tools/make-depends.lisp

there's code to read the Emacs File Local Variables, such as
"-*- coding:utf-8 -*-" (or "Local Variables:" at the end), using

    com.informatimago.common-lisp.cesarum.file:safe-text-file-to-string-list

and

    com.informatimago.common-lisp.cesarum.character-sets:emacs-encoding-to-lisp-external-format

which allows scan-source-file to open the source files with the correct
encoding.

Notably, see the definition of

    com.informatimago.common-lisp.cesarum.character-sets::*lisp-encodings*

for the implementation specific code used to collect the coding systems
(external formats) in various implementations.

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.
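The idea of reading the encoding from the file itself can be sketched in a few lines. This is not Pascal's code; it is a simplified stand-in that only handles a "coding:" entry inside a first-line "-*- ... -*-" cookie. The function name coding-cookie is hypothetical, and mapping the returned name to an implementation's external format is left out.

```lisp
(defun coding-cookie (line)
  "Return the value of the coding: entry in an Emacs -*- ... -*- cookie
found in LINE, as a string, or NIL if there is none."
  (let* ((start (search "-*-" line))
         (end (and start (search "-*-" line :start2 (+ start 3)))))
    (when end
      (let* ((vars (subseq line (+ start 3) end))
             (pos (search "coding:" vars)))
        (when pos
          (let* ((value-start (+ pos (length "coding:")))
                 ;; Entries are separated by semicolons.
                 (value-end (or (position #\; vars :start value-start)
                                (length vars))))
            (string-trim " " (subseq vars value-start value-end))))))))

(coding-cookie ";;; -*- mode: lisp; coding: iso-8859-1 -*-")
;; => "iso-8859-1"
(coding-cookie ";;; no cookie here")
;; => NIL
```

Since the cookie has to be read before the encoding is known, opening the file as ISO-8859-1 for that first read is a reasonable choice: every octet decodes to some character, so the read itself should not fail.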
From: Teemu L. <tli...@ik...> - 2011-08-13 04:49:45

* 2011-08-12T17:28:00-07:00 * Don Geddis wrote:
> I've got some code that takes strings, and finds a "best fit" ascii
> representation for them: http://don.geddis.org/lisp/asciify.lisp

Not answering your question, but just pointing out that on GNU/Linux
systems transliterating to ASCII can be done with the ICONV package.
Most likely it uses the libiconv library.

    CL-USER> (let ((string "áàäąâãāåǎăạȧ"))
               (babel:octets-to-string
                (iconv:iconv "" "US-ASCII//TRANSLIT"
                             (babel:string-to-octets string))))
    "aaaaaaaaaaaa"
From: Christophe R. <cs...@ca...> - 2011-08-13 06:43:45

Don Geddis <do...@ge...> writes:

> But I can't even load it in sbcl, with the literal unicode characters in
> the strings in the code:

OK, firstly: when you're talking about LOAD, as well as the normal READ
and EVAL phases there is a phase converting from bytes on the disk into
Lisp characters, and it's at this stage that things (lots of things) can
go wrong.

> decoding error on stream
> #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp"
> {AA7C181}>
> (:EXTERNAL-FORMAT :ASCII):
> the octet sequence (225) cannot be decoded.

Your external format for this stream is :ASCII, meaning that code points
(bytes) 0-127 are mapped to characters, and everything else is an error:
ASCII is a 128-character repertoire. The first step in fixing this
problem is going to be opening asciify.lisp in a non-ASCII external
format. (I'm surprised that you're getting ASCII as the default, even
after you did something explicitly to select a UTF-8 locale; checking
with "locale -k LC_CTYPE | grep charmap" might be worthwhile). SBCL
takes its default external format from the environment; you can check
that it's doing that correctly by looking at
sb-impl::*default-external-format*, once your Unix environment is sorted
out.

The second step is going to be to identify the actual encoding of your
source file: is it stored on the disk in ISO-8859-1, UTF-8 or something
else? From looking at the file that's served up from your webserver, I
suspect ISO-8859-1 (an 8-bit encoding that I think we can just about
call "legacy" these days), which means that even if you have a UTF-8
locale you will have to tell sbcl explicitly to load it with that
external format -- or else you'll have to set up your Unix to use an
ISO-8859-1 LC_CTYPE. (I don't recommend this unless you really know
what you're doing).

> I note also that sbcl has :SB-UNICODE on *features*:

This means that the sbcl has the capability to represent all Unicode
characters in memory. It doesn't mean that it can guess from all the
different ways of converting in-memory characters to or from a stream of
bytes to pick the one the user wants, sadly.

> I also note that the documentation
> http://sbcl-internals.cliki.net/Unicode
> suggests that there's a close connection between the cmucl unicode code,
> and the sbcl unicode code.

Hm, I'm not sure that's in fact true; there were in the end divergences
of philosophy.

> So I assume this difference was on purpose. Obviously I'm missing
> something in my understanding about how this is supposed to work. Can
> anyone give me advice for how I can get my code working in sbcl?

I hope that the above gives you some ideas; the important thing is to
understand character encodings, at which point it should all become
blindingly obvious.

Best,

Christophe
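To see where a file leaves the 128-character ASCII repertoire, it can be read as raw octets, which sidesteps external formats entirely. A minimal sketch in portable Common Lisp follows; first-non-ascii-octet and the probe.txt file name are invented for the example.

```lisp
(defun first-non-ascii-octet (file)
  "Return the position and value of the first octet above 127 in FILE,
or NIL if the file is pure ASCII."
  (with-open-file (in file :element-type '(unsigned-byte 8))
    (loop for pos from 0
          for octet = (read-byte in nil nil)
          while octet
          when (> octet 127)
            return (values pos octet))))

;; Build a small test file: "ni", then octet 241 (ñ in ISO-8859-1), then "o".
(with-open-file (out "probe.txt" :direction :output
                                 :if-exists :supersede
                                 :element-type '(unsigned-byte 8))
  (write-sequence '(110 105 241 111) out))

(first-non-ascii-octet "probe.txt")   ; => 2, 241
```

A lone high octet followed by ASCII, as here, points towards a single-byte encoding such as ISO-8859-1; in UTF-8 the same letter would occupy two octets, both above 127.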
From: Don G. <do...@ge...> - 2011-08-13 16:08:37

Christophe Rhodes <cs...@ca...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>> decoding error on stream
>> #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp"
>> {AA7C181}>
>> (:EXTERNAL-FORMAT :ASCII):
>> the octet sequence (225) cannot be decoded.
> Your external format for this stream is :ASCII, meaning that code
> points (bytes) 0-127 are mapped to characters, and everything else is
> an error: ASCII is a 128-character repertoire. The first step in
> fixing this problem is going to be opening asciify.lisp in a non-ASCII
> external format. (I'm surprised that you're getting ASCII as the
> default, even after you did something explicitly to select a UTF-8
> locale; checking with "locale -k LC_CTYPE | grep charmap" might be
> worthwhile). SBCL takes its default external format from the
> environment; you can check that it's doing that correctly by looking
> at sb-impl::*default-external-format*, once your Unix environment is
> sorted out.

Just to complete this report:

    unix:~> echo $LANG
    en_US.UTF-8

    unix:~> locale -k LC_CTYPE | grep charmap
    charmap="ANSI_X3.4-1968"

    unix> sbcl
    This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
    More information about SBCL is available at <http://www.sbcl.org/>.

    SBCL is free software, provided as is, with absolutely no warranty.
    It is mostly in the public domain; some portions are provided under
    BSD-style licenses. See the CREDITS and COPYING files in the
    distribution for more information.
    * sb-impl::*default-external-format*

    :ANSI_X3.4-1968
    * (quit)

> The second step is going to be to identify the actual encoding of your
> source file: is it stored on the disk in ISO-8859-1, UTF-8 or something
> else? From looking at the file that's served up from your webserver, I
> suspect ISO-8859-1 (an 8-bit encoding that I think we can just about
> call "legacy" these days), which means that even if you have a UTF-8
> locale you will have to tell sbcl explicitly to load it with that
> external format -- or else you'll have to set up your Unix to use an
> ISO-8859-1 LC_CTYPE. (I don't recommend this unless you really know
> what you're doing).

And, obviously, I barely know what I'm doing.

It seems like I would have more luck if my source file were UTF-8
instead of ISO-8859-1. I think it came from typing a text file into
emacs, perhaps copying particular characters from usenet postings read
in GNUS (again, with emacs). Or possibly from a web browser, into an
emacs text file.

Doesn't seem to help, though. I did

    iconv -f ISO-8859-1 -t UTF-8 asciify.lisp > a2.lisp

but an attempt to

    (load "a2.lisp")

in sbcl fails in the same way, for the obvious reason: as you noted
above, my sbcl seems to default to

    :external-format :ascii

so all of these encodings are going to fail (by default).

> I hope that the above gives you some ideas; the important thing is to
> understand character encodings, at which point it should all become
> blindingly obvious.

Thanks for everybody's help. I now understand this topic (a little)
better.

It still seems ... unhelpful? ... for sbcl to default to :ASCII as an
external format. You seem surprised by this. But I have the same
behavior on both a Debian box and also an Ubuntu box (running different
sbcl versions), so it seems intentional.

Nonetheless, I have a workaround: I can add an :EXTERNAL-FORMAT argument
to all my LOADs and COMPILE-FILEs.

Thanks, all.

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
Parents seem to think their kids are like clay, that you mould them into the
right shape when they're wet. A better metaphor is that kids are like flexible
plastic -- they respond to pressure, but when you release the pressure they
tend to pop back to their original shape. -- Bryan Caplan
From: Christophe R. <cs...@ca...> - 2011-08-13 16:21:28

Don Geddis <do...@ge...> writes:

> Just to complete this report:
>
> unix:~> echo $LANG
> en_US.UTF-8
>
> unix:~> locale -k LC_CTYPE | grep charmap
> charmap="ANSI_X3.4-1968"

ANSI_X3.4-1968 is code for ASCII -- I suspect that this is telling you
that although you've set that LANG on your system, that locale isn't in
fact supported. What does

    locale -a

say? FWIW, on my Debian system it says

    csr21@omega:~$ locale -a
    C
    en_GB.utf8
    fr_FR.utf8
    POSIX
    csr21@omega:~$ echo $LANG
    en_GB.UTF-8
    csr21@omega:~$ locale -k LC_CTYPE | grep charmap
    charmap="UTF-8"

And my sbcl defaults to a utf-8 external format. If you're root, and if
no plausible locales are showing up for "locale -a", you could run
"dpkg-reconfigure locales" to choose some plausible ones to generate for
your system.

> It still seems ... unhelpful? ... for sbcl to default to :ASCII as an
> external format. You seem surprised by this. But I have the same
> behavior on both a Debian box and also a Ubuntu box (running different
> sbcl versions), so it seems intentional.

What is intentional is to default to whatever the Unix/nl_langinfo()
LC_CTYPE character map is. On your systems, that's ASCII, for an
unknown reason; on mine it's definitely not...

Good luck!

Cheers,

Christophe
From: Stas B. <sta...@gm...> - 2011-08-13 17:02:25

Don Geddis <do...@ge...> writes:

> Christophe Rhodes <cs...@ca...> wrote on Sat, 13 Aug 2011:
>> Don Geddis <do...@ge...> writes:
>>> decoding error on stream
>>> #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp"
>>> {AA7C181}>
>>> (:EXTERNAL-FORMAT :ASCII):
>>> the octet sequence (225) cannot be decoded.
>> Your external format for this stream is :ASCII, meaning that code
>> points (bytes) 0-127 are mapped to characters, and everything else is
>> an error: ASCII is a 128-character repertoire. The first step in
>> fixing this problem is going to be opening asciify.lisp in a non-ASCII
>> external format. (I'm surprised that you're getting ASCII as the
>> default, even after you did something explicitly to select a UTF-8
>> locale; checking with "locale -k LC_CTYPE | grep charmap" might be
>> worthwhile). SBCL takes its default external format from the
>> environment; you can check that it's doing that correctly by looking
>> at sb-impl::*default-external-format*, once your Unix environment is
>> sorted out.
>
> Just to complete this report:
>
> unix:~> echo $LANG
> en_US.UTF-8
>
> unix:~> locale -k LC_CTYPE | grep charmap
> charmap="ANSI_X3.4-1968"
>
> unix> sbcl
> This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
> More information about SBCL is available at <http://www.sbcl.org/>.
>
> SBCL is free software, provided as is, with absolutely no warranty.
> It is mostly in the public domain; some portions are provided under
> BSD-style licenses. See the CREDITS and COPYING files in the
> distribution for more information.
> * sb-impl::*default-external-format*
>
> :ANSI_X3.4-1968
> * (quit)

What is echo $LC_CTYPE?

--
With best regards, Stas.
From: Don G. <do...@ge...> - 2011-08-13 18:08:34

Stas Boukarev <sta...@gm...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>> Christophe Rhodes <cs...@ca...> wrote on Sat, 13 Aug 2011:
>>> Don Geddis <do...@ge...> writes:
>>>> decoding error on stream
>>>> #<SB-SYS:FD-STREAM for "file /home/geddis/www/don/lisp/asciify.lisp"
>>>> {AA7C181}>
>>>> (:EXTERNAL-FORMAT :ASCII):
>>>> the octet sequence (225) cannot be decoded.
>>> Your external format for this stream is :ASCII, meaning that code
>>> points (bytes) 0-127 are mapped to characters, and everything else is
>>> an error: ASCII is a 128-character repertoire. The first step in
>>> fixing this problem is going to be opening asciify.lisp in a non-ASCII
>>> external format. (I'm surprised that you're getting ASCII as the
>>> default, even after you did something explicitly to select a UTF-8
>>> locale; checking with "locale -k LC_CTYPE | grep charmap" might be
>>> worthwhile). SBCL takes its default external format from the
>>> environment; you can check that it's doing that correctly by looking
>>> at sb-impl::*default-external-format*, once your Unix environment is
>>> sorted out.
>>
>> Just to complete this report:
>>
>> unix:~> echo $LANG
>> en_US.UTF-8
>>
>> unix:~> locale -k LC_CTYPE | grep charmap
>> charmap="ANSI_X3.4-1968"
>>
>> unix> sbcl
>> This is SBCL 1.0.50.0.debian, an implementation of ANSI Common Lisp.
>> More information about SBCL is available at <http://www.sbcl.org/>.
>>
>> SBCL is free software, provided as is, with absolutely no warranty.
>> It is mostly in the public domain; some portions are provided under
>> BSD-style licenses. See the CREDITS and COPYING files in the
>> distribution for more information.
>> * sb-impl::*default-external-format*
>>
>> :ANSI_X3.4-1968
>> * (quit)
> What is echo $LC_CTYPE?

"Undefined variable", on both my Debian and Ubuntu servers (running
tcsh).
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
From: Don G. <do...@ge...> - 2011-08-13 18:10:21

Stas Boukarev <sta...@gm...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>> unix:~> echo $LANG
>> en_US.UTF-8
>> unix:~> locale -k LC_CTYPE | grep charmap
>> charmap="ANSI_X3.4-1968"
> What is echo $LC_CTYPE?

I don't seem to have this variable, but (reading more docs) I see that
the LC_* environment variables are supposed to be determining this
stuff. On both my Debian and Ubuntu boxes, I get:

    unix> setenv | grep LC
    LC_ALL=C
    NLSPATH=/usr/share/locale/%L/LC_MESSAGES/%N.cat

Perhaps my problem has something to do with $LC_ALL being "C", and
$LC_CTYPE being missing? Might that somehow override $LANG?

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
From: Don G. <do...@ge...> - 2011-08-13 18:15:19
|
Stas Boukarev <sta...@gm...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>> unix:~> echo $LANG
>> en_US.UTF-8
>> unix:~> locale -k LC_CTYPE | grep charmap
>> charmap="ANSI_X3.4-1968"
> What is echo $LC_CTYPE?

Ah, that was the clue that solved it for me.

I didn't have an $LC_CTYPE variable. But it turns out that, something
like ten years ago, I put the following in my .tcshrc:

setenv LC_ALL C

The associated comment says

# Fix the sort order of "ls"

Who knows what was happening at the time, and what kind of filename
sorting in "ls" I was seeing, and how I thought this would fix it.

However, that line has been in my .tcshrc shell init file ever since.
And, of course, I copied my old shell init file to my new server when I
built it too, a month ago.

I've commented out that line in the init file, and everything works now
just like everybody here has been saying.

locale -k LC_CTYPE | grep charmap

now returns "UTF-8", sbcl correctly loads source files with such
encodings, etc.

(As you all explained to me earlier in this thread, I had a second
problem in that it appears my actual source files were ISO-8859-1. But,
using "iconv", I changed them all to UTF-8, and now everything works
great.)

Thanks everyone, for the advice and suggestions, and for the patience to
help me track this down.

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
Hard work never killed anybody, but why take a chance?
-- Charlie McCarthy (ventriloquist puppet)
|
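The iconv step mentioned in the message above might look roughly like the following. This is only a sketch: the thread does not show the exact command Don ran, the function name latin1_to_utf8 is made up for illustration, and it assumes the input really is ISO-8859-1 (running it on a file that is already UTF-8 would double-encode the accented characters).

```shell
# Hypothetical helper: convert one file from ISO-8859-1 to UTF-8 in place.
# Assumes iconv and mktemp are installed and that the file is Latin-1.
latin1_to_utf8() {
    src=$1
    tmp=$(mktemp) || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$src" > "$tmp"; then
        # Copy the contents back instead of moving the temp file,
        # so the original file keeps its permissions.
        cat "$tmp" > "$src"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as, for example, `latin1_to_utf8 asciify.lisp`. The single octet 225 from the SBCL error message is "á" in ISO-8859-1; after conversion it becomes the two-byte UTF-8 sequence C3 A1.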
From: Christophe R. <cs...@ca...> - 2011-08-13 20:11:19
|
Don Geddis <do...@ge...> writes:
> Ah, that was the clue that solved it for me.

\o/  Hooray!
 |
/ \

> I didn't have an $LC_CTYPE variable. But it turns out that, something
> like ten years ago, I put the following in my .tcshrc:
> setenv LC_ALL C
> The associated comment says
> # Fix the sort order of "ls"
> Who knows what was happening at the time, and what kind of filename
> sorting in "ls" I was seeing, and how I thought this would fix it.

Well, another aspect of the Unix locale is collation: how lists are
alphabetized. In the C/POSIX locale, you basically go in codepoint
order, so lower-case and uppercase letters are separate; in more
"natural language" locales, you go alphabetically. Compare:

csr21@aleph-null:~$ ls
Desktop Documents Downloads games Music Pictures Public src teclo Templates Videos VirtualBox VMs

csr21@aleph-null:~$ LC_COLLATE=C ls
Desktop Documents Downloads Music Pictures Public Templates Videos VirtualBox VMs games src teclo

(In other words, you can preserve both a UTF-8 default character set and
`standard' collation, using LANG=en_US.UTF-8, as you currently do, and
LC_COLLATE=C. Again, there are excitingly exotic possible collation
orders to do with different alphabetical orders in different natural
languages, none of which should be enabled unless you really want them.)

Cheers,
Christophe
|
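Christophe's suggestion (keep LANG as it is and override only the collation category) can be sketched as a small POSIX sh wrapper. The name with_c_collation is hypothetical and not something from the thread; the sketch assumes LC_ALL must be cleared first, since a set LC_ALL would otherwise take priority over LC_COLLATE.

```shell
# Hypothetical wrapper: run a command with C (codepoint) collation only,
# leaving LANG, and therefore the character set, untouched.
with_c_collation() {
    (
        unset LC_ALL        # a set LC_ALL would override LC_COLLATE
        LC_COLLATE=C
        export LC_COLLATE
        "$@"
    )
}
```

For example, `with_c_collation ls` lists capitalized names before lower-case ones, as in the second listing above. In tcsh, the shell used in this thread, the persistent equivalent would be `setenv LC_COLLATE C` in .tcshrc in place of `setenv LC_ALL C`.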
From: Teemu L. <tli...@ik...> - 2011-08-14 04:04:43
|
* 2011-08-13T10:48:45-07:00 * Don Geddis wrote:
> I didn't have an $LC_CTYPE variable. But it turns out that, something
> like ten years ago, I put the following in my .tcshrc:
> setenv LC_ALL C
> The associated comment says
> # Fix the sort order of "ls"
> Who knows what was happening at the time, and what kind of filename
> sorting in "ls" I was seeing, and how I thought this would fix it.

Glibc checks LC_ALL first and uses its value if it exists. If LC_ALL is
empty then the appropriate LC_* (e.g., LC_CTYPE, LC_COLLATE) variable is
inspected. If that is empty then LANG variable is inspected.

> locale -k LC_CTYPE | grep charmap
> now returns "UTF-8", sbcl correctly loads source files with such
> encodings, etc.

You can just type "locale charmap" to get the effective character
encoding.
|
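The lookup order Teemu describes can be mimicked in a few lines of POSIX sh. This is an illustration of the precedence rule only, not glibc's actual code: the function name effective_locale is invented here, and the final fallback to "C" when nothing is set is an assumption about the default.

```shell
# Hypothetical helper: print the locale name that applies to one category,
# following the order LC_ALL, then the category variable, then LANG.
effective_locale() {
    category=$1                      # e.g. LC_CTYPE or LC_COLLATE
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
        return 0
    fi
    # Indirect lookup of the variable whose name is stored in $category.
    eval "value=\${$category:-}"
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'                 # assumed default when nothing is set
    fi
}
```

With LC_ALL=C and LANG=en_US.UTF-8, as in Don's original environment, `effective_locale LC_CTYPE` prints C, which matches the ASCII charmap he was seeing. With LC_ALL removed and LC_CTYPE unset, it prints en_US.UTF-8.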
From: Don G. <do...@ge...> - 2011-08-13 18:20:20
|
Christophe Rhodes <cs...@ca...> wrote on Sat, 13 Aug 2011:
> Don Geddis <do...@ge...> writes:
>
>> Just to complete this report:
>>
>> unix:~> echo $LANG
>> en_US.UTF-8
>>
>> unix:~> locale -k LC_CTYPE | grep charmap
>> charmap="ANSI_X3.4-1968"
>
> ANSI_X3.4-1968 is code for ASCII -- I suspect that this is telling you
> that although you've set that LANG on your system, that locale isn't in
> fact supported. What does locale -a say? FWIW, on my Debian system it
> says
>
> csr21@omega:~$ locale -a
> C
> en_GB.utf8
> fr_FR.utf8
> POSIX
>
> csr21@omega:~$ echo $LANG
> en_GB.UTF-8
>
> csr21@omega:~$ locale -k LC_CTYPE | grep charmap
> charmap="UTF-8"
>
> And my sbcl defaults to a utf-8 external format.

My "locale -a" seems reasonable. On my Debian server:

debian:~> locale -a
C
C.UTF-8
POSIX
en_US.utf8

and on (newly installed) Ubuntu 11.04:

ubuntu:~> locale -a
C
POSIX
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
zh_CN.utf8
zh_SG.utf8

And, just to repeat, on both systems I get the same results for the
previous tests:

unix:~> echo $LANG
en_US.UTF-8

unix:~> locale -k LC_CTYPE | grep charmap
charmap="ANSI_X3.4-1968"

> If you're root, and if no plausible locales are showing up for "locale
> -a", you could run "dpkg-reconfigure locales" to choose some plausible
> ones to generate for your system.

The plausible locales seem to be already there. Nonetheless, I have done
the dpkg-reconfigure. That's how I (yesterday) set my $LANG to UTF-8 (on
the Debian server).

> What is intentional is to default to whatever the Unix/nl_langinfo()
> LC_CTYPE character map is. On your systems, that's ASCII, for an
> unknown reason; on mine it's definitely not...

Very odd. So apparently, the behavior I'm seeing with SBCL is not the
behavior that everybody else is seeing, because something is strange
about my locale setup?

And yet, one of the systems is a brand-new Ubuntu install within the
last month. I don't recall doing anything strange with configuring
locales. I wonder why I seem to be getting an ASCII charmap, but nobody
else is?

-- Don
_______________________________________________________________________________
Don Geddis http://don.geddis.org/ do...@ge...
|