From: edgar <edg...@we...> - 2010-04-27 19:32:07
Multiplatform encoding solutions

I hope that this mail does not get filed under some other topic again.

To show that I can write more than nag messages, here are some results from last weekend's encoding research. Originally I had prepared a rather long list of what works and what does not, in which language and encoding, under which operating system, until I realized that I can write it in a single short sentence:

CLISP produces exactly the same quirks as the GNU 'iconv' program (the encoding converter that gettext uses).

This proves to me that the problems have nothing to do with CLISP itself but are simply the usual multiplatform encoding quirks.

So I decided to go in another direction. Below is a very simple binary pattern matcher that reads up to the first 16k of a text file and searches for typical umlaut byte patterns. This is how web browsers like Firefox try to determine the encoding when the declaration in the HTML header is missing. For international use the program is of course too simple: it only works because German has just 7 umlaut characters (ä ö ü Ä Ö Ü ß), and the sets of byte values that represent them in the different encodings do not even intersect.

With the code below I can read ISO-8859-1 and UTF-8 encoded Lisp code on Windows and Linux with no scrambled umlauts or any other CLISP conversion errors. I would be interested to hear whether there have been similar attempts in the past.
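As a quick check of the claim that the byte sets do not intersect, here is a minimal sketch in portable Common Lisp (no CLISP extensions). It derives the UTF-8 byte pairs from the ISO-8859-1 values of the seven characters. The names *iso-umlauts*, utf-8-two-bytes and *utf-pairs* are made up for this illustration and are not part of the detector code.

```lisp
;; ISO-8859-1 byte values of the seven German umlaut characters.
;; In Latin-1 the byte value equals the Unicode code point.
(defparameter *iso-umlauts* '(#xe4 #xf6 #xfc #xc4 #xd6 #xdc #xdf))

(defun utf-8-two-bytes (code)
  "Return the two UTF-8 bytes for a code point in the range #x80..#x7FF."
  (list (logior #xc0 (ash code -6))          ; lead byte:         110xxxxx
        (logior #x80 (logand code #x3f))))   ; continuation byte: 10xxxxxx

;; One (lead trail) pair per umlaut, in the same order as *iso-umlauts*.
(defparameter *utf-pairs* (mapcar #'utf-8-two-bytes *iso-umlauts*))

;; Print one line per character: the Latin-1 byte, then the UTF-8 pair.
(loop for iso in *iso-umlauts*
      for (lead trail) in *utf-pairs*
      do (format t "~&;; iso:~2,'0x  utf:~2,'0x ~2,'0x~%" iso lead trail))
```

Every pair starts with #xC3, and none of the second bytes appears among the ISO-8859-1 values, which is why a plain byte scan can tell the two encodings apart for German text.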
Here's the code:

;; ------------------------------ encoding -----------------------------

(defun get-file-encoding (file &optional verbose)
  "Guess the encoding of FILE from the umlaut bytes in its first 16k."
  (let ((encoding custom:*default-file-encoding*)
        (iso-umlauts (list #xe4 #xf6 #xfc #xc4 #xd6 #xdc #xdf))
        (utf-umlauts (list #xa4 #xb6 #xbc #x84 #x96 #x9c #x9f))
        (utf-marker #xc3) ; first byte of the two-byte UTF-8 umlauts
        (iso 0) (utf 0) (ascii 0) (unknown 0) ; character counters
        (bytes 0) (current-byte 0) (previous-byte 0))
    ;; binary pattern matcher
    (with-open-file (input-stream file :direction :input
                                  :element-type '(unsigned-byte 8))
      (setq bytes (min (file-length input-stream) 16384))
      (dotimes (i bytes)
        (declare (ignorable i))
        (setq current-byte (read-byte input-stream))
        (cond ((member current-byte iso-umlauts)     ; iso umlaut
               (incf iso))
              ((and (eql previous-byte utf-marker)   ; utf umlaut
                    (member current-byte utf-umlauts))
               (incf utf))
              ((< current-byte 128)                  ; 7-bit ascii
               (incf ascii))
              ((not (eql current-byte utf-marker))   ; unknown byte
               (incf unknown)))
        (setq previous-byte current-byte)))
    ;; print the numbers
    (when verbose
      (format t "~&;; iso:~a utf:~a ascii:~a unknown:~a bytes:~a~%"
              iso utf ascii unknown bytes))
    ;; the highest match determines the charset
    #+unicode
    (let* ((charset (cond ((and (eql iso 0) (eql utf 0)) charset:ascii)
                          ((> iso utf) charset:iso-8859-1)
                          ((> utf iso) charset:utf-8)
                          (t :default))))
      (setq encoding (ext:make-encoding :charset charset)))
    #-unicode
    (format t "~&;; get-file-encoding: no :UNICODE support found.~%")
    ;; print the chosen charset
    (when verbose
      (let ((charset #+unicode (ext:encoding-charset encoding)
                     #-unicode "ISO-8859-1"))
        (format t "~&;; charset -> ~a~%" charset)))
    ;; return the encoding
    encoding))

;; examples:

(defmacro xload (file &rest args)
  "Load a Lisp file in its native encoding."
  ;; the gensym makes sure FILE is evaluated only once and that no
  ;; variable name leaks into ARGS
  (let ((f (gensym "FILE")))
    `(let ((,f ,file))
       (load ,f :external-format (get-file-encoding ,f) ,@args))))

(defun cat (file)
  "Read a text file in its native encoding and print it to the screen."
  (let ((encoding (get-file-encoding file)))
    ;; WITH-OPEN-FILE takes the file name directly, no nested OPEN call
    (with-open-file (in-stream file :direction :input)
      ;; (stream-external-format ...) must be set to the encoding
      ;; because READ-LINE has no :external-format keyword
      (setf (stream-external-format in-stream) encoding)
      (loop for line = (read-line in-stream nil)
            while line
            do (format t "~&~a~%" line)))))

;; ---------------------------- end of code ----------------------------

I hereby declare this code public domain.

Have fun,

- edgar