From: Andreas Y. <an...@ya...> - 2006-08-28 10:58:34
|
I'm working on a program that calls exiftool, a command-line script, to pull metadata from image files. I'm using SBCL 0.9.15 installed via Fink on OS X and running Emacs in Apple's X11 environment. The metadata text contains extended characters that I believe are UTF-8; I launch my xterms with -en utf-8 and everything displays fine when I run exiftool from the command line.

The code calling exiftool looks like:

  (let ((output (with-output-to-string (stream)
                  (sb-ext:run-program "exiftool" (list filename)
                                      :output stream))))
    ...)

Unfortunately, the text I get back in output looks nothing like what I'd expect when printed to the screen. I've tried specifying an :external-format with format and using various combinations of sb-ext:octets-to-string and sb-ext:string-to-octets on output. I also tried different settings of :element-type for with-output-to-string.

I've also tried Carbon Emacs (v 22.0.50.1) and the Fink version of Emacs (v 21.2.1) in a terminal window. Both produce the same character errors that I see in X11. The output when I run exiftool from the command line is correct in both a terminal window and an xterm.

Any suggestions would be much appreciated.

Cheers,

Andreas
|
From: Nikodemus S. <nik...@ra...> - 2006-08-28 11:16:42
|
Andreas Yankopolus <an...@ya...> writes:

> Unfortunately, the text I get back in output looks nothing like what
> I'd expect when printed to the screen. I've tried specifying an
> :external-format with format and using various combinations of
> sb-ext:octets-to-string and sb-ext:string-to-octets on output. I
> also tried different settings of :element-type for
> with-output-to-string.

Unfortunately the external-format support of RUN-PROGRAM is currently quite lacking, though it will hopefully get fixed sooner than later.

As a hack, I'd try something like this:

  (let ((string (with-output-to-string (s)
                  (run-program ... :output s))))
    (octets-to-string (string-to-octets string :external-format :latin-1)
                      :utf-8))

Hope this helps. If it does -- or doesn't -- let us know.

Cheers,

  -- Nikodemus
  Schemer: "Buddha is small, clean, and serious."
  Lispnik: "Buddha is big, has hairy armpits, and laughs."
|
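The round trip suggested in this message can be sketched as a small function. This is a minimal sketch rather than anything posted in the thread: the name recode-latin-1-as-utf-8 is hypothetical, the :external-format keyword is spelled out in both calls, and it assumes that every character in the input string stands for exactly one raw octet (character code below 256), which is the premise of the workaround.

```lisp
;; Illustrative helper; the function name is an assumption, not from the thread.
(defun recode-latin-1-as-utf-8 (string)
  "Treat each character of STRING as one raw octet and decode those octets as UTF-8."
  (sb-ext:octets-to-string
   ;; Latin-1 maps character codes 0-255 straight to octets 0-255,
   ;; so this step recovers the bytes the child process wrote.
   (sb-ext:string-to-octets string :external-format :latin-1)
   :external-format :utf-8))

;; Example input: the characters with codes #xC3 and #xA9,
;; which are the two UTF-8 octets of U+00E9 (e with acute accent).
(recode-latin-1-as-utf-8
 (coerce (list (code-char #xC3) (code-char #xA9)) 'string))
```

If the octets are not well-formed UTF-8, the second step is expected to signal a decoding error instead of returning a string.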
From: Andreas Y. <an...@ya...> - 2006-08-28 11:57:42
|
Nikodemus,

> Unfortunately the external-format support of RUN-PROGRAM is
> currently quite lacking, though it will hopefully get fixed
> sooner than later.
>
> As a hack, I'd try something like this:
>
>   (let ((string (with-output-to-string (s)
>                   (run-program ... :output s))))
>     (octets-to-string (string-to-octets string :external-format :latin-1)
>                       :utf-8))
>
> Hope this helps. If it does -- or doesn't -- let us know.

Thanks for the quick reply. This was similar to one of my stabs at solving the problem (I also added an :external-format key prior to the :utf-8) and results in an error I'm starting to see frequently:

  Illegal :UTF-8 character starting at byte position 622.
     [Condition of type SB-IMPL::INVALID-UTF8-STARTER-BYTE]

The bizarre part is that Adobe XMP information is supposed to be UTF-8 encoded. The file I'm trying to read also contains the tag:

  <?xml version="1.0" encoding="UTF-8"?>

Is it possible that the data isn't coming back from run-program in latin-1?

Cheers,

Andreas
|
From: Andreas Y. <an...@ya...> - 2006-08-28 13:30:27
|
Nikodemus,

> No, not the way you are doing it.
>
> However, perhaps this is the way forward:
>
>   (with-open-file (f "/tmp/exif.data" :direction :output
>                      :element-type '(unsigned-byte 8))
>     (run-program ... :output f))

Very clever!

> Now:
>
> 1. Confirm that the file contains the same data you see in xterm
>    when running exiftool from the terminal. If so, then

That's the case.

> 2a. Read it in as :UTF-8. If it breaks, then the output isn't UTF-8
>     really, but magic binary soup, in which case you need to figure out
>     which parts are UTF-8.

Just to verify, I'm doing this as follows:

  (with-open-file (f "/tmp/exif.data" :direction :input
                     :external-format :utf-8)
    (do ((line (read-line f nil 'eof)
               (read-line f nil 'eof)))
        ((eql line 'eof))
      (format t "~A~%" line)))

I'm once again getting an error that an octet sequence cannot be decoded.

  decoding error on stream
  #<SB-SYS:FD-STREAM for "file /tmp/exif.data" {119CF479}>
  (:EXTERNAL-FORMAT :UTF-8):
    the octet sequence (142) cannot be decoded.
     [Condition of type SB-INT:STREAM-DECODING-ERROR]

Any idea as to how the terminal interprets these octets correctly?

Cheers,

Andreas
|
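An alternative to reading the dump through a character stream is to read it back as raw octets and attempt the decoding in one place, where a failure can be caught and inspected. The following is a minimal sketch under stated assumptions: read-file-octets and try-decode-utf-8 are hypothetical names, and the handler relies on sb-ext:octets-to-string signaling an error for malformed input, which matches the behaviour reported in this thread but may differ between SBCL versions.

```lisp
;; Both function names are illustrative assumptions, not from the thread.
(defun read-file-octets (path)
  "Return the contents of the file at PATH as a vector of (unsigned-byte 8)."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun try-decode-utf-8 (octets)
  "Return OCTETS decoded as UTF-8. If decoding signals an error,
return NIL and the condition object as a second value."
  (handler-case
      (sb-ext:octets-to-string octets :external-format :utf-8)
    (error (condition)
      (values nil condition))))
```

For the file discussed here, a call such as (try-decode-utf-8 (read-file-octets "/tmp/exif.data")) would be the one-shot equivalent of the read-line loop in this message.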
From: Andreas Y. <an...@ya...> - 2006-08-28 16:59:38
Attachments:
exif.data
|
Harald,

> A bare 142 (#x8e) is certainly not allowed in UTF-8: That is a
> continuation octet (high bits 10). How about if you send us a copy
> of exif.data? It's probably not huge. Make sure your mail program
> treats it as binary data, not text. That way it is more likely to
> survive intact.

I've posted the file to

  http://www.yank.to/exif.data

I'm not sure what OS X mail will do to the file, but it's only a few KB, so here goes.

Cheers,

Andreas
|
From: Harald Hanche-O. <ha...@ma...> - 2006-08-28 16:19:10
|
+ Andreas Yankopolus <an...@ya...>:

| > (with-open-file (f "/tmp/exif.data" :direction :output
| >                    :element-type '(unsigned-byte 8))
| >   (run-program ... :output f))
|
| Very clever!
|
| (with-open-file (f "/tmp/exif.data" :direction :input
|                    :external-format :utf-8)
|   (do ((line (read-line f nil 'eof)
|              (read-line f nil 'eof)))
|       ((eql line 'eof))
|     (format t "~A~%" line)))
|
| I'm once again getting an error that an octet sequence cannot be
| decoded.
|
| decoding error on stream
| #<SB-SYS:FD-STREAM for "file /tmp/exif.data" {119CF479}>
| (:EXTERNAL-FORMAT :UTF-8):
|   the octet sequence (142) cannot be decoded.
|    [Condition of type SB-INT:STREAM-DECODING-ERROR]

A bare 142 (#x8e) is certainly not allowed in UTF-8: That is a continuation octet (high bits 10).

How about if you send us a copy of exif.data? It's probably not huge. Make sure your mail program treats it as binary data, not text. That way it is more likely to survive intact.

- Harald
|
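The point about octet 142 follows from the bit patterns UTF-8 assigns to each position in a sequence. A minimal sketch in portable Common Lisp, with a hypothetical function name and hypothetical keyword results, classifies a single octet by the role it can play:

```lisp
;; Illustrative helper; the name and the keywords it returns are assumptions.
(defun utf-8-octet-role (octet)
  "Classify OCTET by the role it can play in a UTF-8 sequence."
  (cond ((< octet #x80) :ascii)          ; 0xxxxxxx, a complete character
        ((< octet #xC0) :continuation)   ; 10xxxxxx, never starts a character
        ((< octet #xC2) :invalid)        ; #xC0, #xC1 would only encode overlong forms
        ((< octet #xE0) :lead-2)         ; 110xxxxx, starts a 2-octet sequence
        ((< octet #xF0) :lead-3)         ; 1110xxxx, starts a 3-octet sequence
        ((< octet #xF5) :lead-4)         ; 11110xxx, starts a 4-octet sequence
        (t :invalid)))                   ; #xF5 and above are not used

(utf-8-octet-role 142)  ; => :CONTINUATION
```

A decoder that meets a :continuation octet where it expects the start of a character has no choice but to reject the input, which is the situation the error message describes.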
From: Andreas Y. <an...@ya...> - 2006-08-28 17:08:27
|
After posting the file, I hit on the idea of experimenting with Firefox's View->Character Encoding option. It appears that the file contains a combination of Western (MacRoman) and UTF-8 characters. Is this the case? If so, no wonder SBCL is getting confused!

Cheers,

Andreas
|
From: Harald Hanche-O. <ha...@ma...> - 2006-08-28 17:13:13
|
+ Andreas Yankopolus <an...@ya...>:

| I'm not sure what OS X mail will do to the file, but it's only a few
| KB, so here goes.

Same as what you put on the web.

It is not even consistent. I find the text "The Cité des Sciences et de l’Industrie (City of Sciences and Industry) at the Parc de la Villette." in several places: In the Caption-Abstract field it is in Mac Roman. (That initial é is what your program choked on.) But in the Description and Image Description fields, the same text is coded as UTF-8. There are similar problems with the copyright symbol, which appears in both encodings as well.

It seems that exiftags produces junk. Or maybe the contents of the exif tags in the image file are junk, and exiftags faithfully reproduces the junk. That is more likely I think: AFAIK, the Exif standard does not specify a character set.

- Harald
|
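One way to find out which fields are UTF-8 and which are not is to scan each field's octets for the first place where the UTF-8 structure breaks. The following is a minimal sketch in portable Common Lisp; both function names are hypothetical. It checks structure only (lead octets followed by the right number of continuation octets) and does not reject overlong forms or surrogates, so it can report a field as well formed even when a stricter decoder would not.

```lisp
;; Both names are illustrative assumptions, not from the thread.
(defun utf-8-sequence-length (lead)
  "Number of octets in the sequence announced by LEAD, or NIL if LEAD cannot start one."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun first-invalid-utf-8-position (octets)
  "Return the index of the first octet in OCTETS that breaks the UTF-8 structure,
or NIL if the whole vector is structurally well formed."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let ((len (utf-8-sequence-length (aref octets i))))
               (when (or (null len)          ; octet cannot start a sequence
                         (> (+ i len) n)     ; sequence runs past the end
                         ;; every following octet must look like 10xxxxxx
                         (loop for j from (1+ i) below (+ i len)
                               thereis (/= (ldb (byte 2 6) (aref octets j)) #b10)))
                 (return-from first-invalid-utf-8-position i))
               (incf i len)))
    nil))

;; "Cité" in UTF-8 (é is #xC3 #xA9) and in Mac Roman (é is #x8E).
(first-invalid-utf-8-position (vector #x43 #x69 #x74 #xC3 #xA9))  ; => NIL
(first-invalid-utf-8-position (vector #x43 #x69 #x74 #x8E))       ; => 3
```

Applied field by field, a non-NIL result marks a field that needs a different encoding, such as the Caption-Abstract field described in this message.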
From: Harald Hanche-O. <ha...@ma...> - 2006-08-28 17:15:09
|
+ Harald Hanche-Olsen <ha...@ma...>:

| It seems that exiftags produces junk.

Sorry, that was exiftool. And I see you got to the same conclusion I did.

- Harald
|
From: Andreas Y. <an...@ya...> - 2006-08-28 17:35:43
|
Harald,

Thanks for your help. I'll have to check on the Adobe forums and see if Photoshop and Bridge are supposed to save metadata using a variety of character encodings. In the meantime, I'll tell ExifTool to just grab the UTF-8 fields.

Cheers,

Andreas
|
From: Nikodemus S. <nik...@ra...> - 2006-08-28 11:25:40
|
Nikodemus Siivola <nik...@ra...> writes:

> Andreas Yankopolus <an...@ya...> writes:
>
>> Unfortunately, the text I get back in output looks nothing like what
>> I'd expect when printed to the screen. I've tried specifying an

Forgot to mention this, but if you are using Slime, be sure to do something like

  (setq slime-net-coding-system 'utf-8-unix)

in your .emacs, and run SBCL in a UTF-8 locale.

(Neither of these will fix the underlying RUN-PROGRAM issue, but either might be party to making a workable workaround seem like it doesn't work.)

Cheers,

  -- Nikodemus
  Schemer: "Buddha is small, clean, and serious."
  Lispnik: "Buddha is big, has hairy armpits, and laughs."
|
From: Nikodemus S. <nik...@ra...> - 2006-08-28 12:21:33
|
Andreas Yankopolus <an...@ya...> writes:

> Thanks for the quick reply. This was similar to one of my stabs at
> solving the problem (I also added an :external-format key prior to
> the :utf-8) and results in an error I'm starting to see frequently:

Sorry, that (missing :external-format) was my bad.

> Illegal :UTF-8 character starting at byte position 622.
>    [Condition of type SB-IMPL::INVALID-UTF8-STARTER-BYTE]
>
> The bizarre part is that Adobe XMP information is supposed to be
> UTF-8 encoded. The file I'm trying to read also contains the tag:
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> Is it possible that the data isn't coming back from run-program in
> latin-1?

No, not the way you are doing it.

However, perhaps this is the way forward:

  (with-open-file (f "/tmp/exif.data" :direction :output
                     :element-type '(unsigned-byte 8))
    (run-program ... :output f))

Now:

1. Confirm that the file contains the same data you see in xterm when running exiftool from the terminal. If so, then

2a. Read it in as :UTF-8. If it breaks, then the output isn't UTF-8 really, but magic binary soup, in which case you need to figure out which parts are UTF-8.

2b. If the file looks different, then exif probably notices that it isn't talking to a terminal and alters its output format somehow.

Cheers,

  -- Nikodemus
  Schemer: "Buddha is small, clean, and serious."
  Lispnik: "Buddha is big, has hairy armpits, and laughs."
|