From: Harald Hanche-O. <ha...@ma...> - 2007-07-27 22:22:13
|
My build today of 1.0.8.8 ended with

  WARNING! Some of the contrib modules did not build successfully or pass
  their self-tests. Failed contribs:"
  asdf-install
  sb-bsd-sockets
  sb-posix
  sb-simple-streams

So naturally, I went looking through the build output and found this
error associated with each of the above:

  Error during processing of --eval option (LOAD #P"../asdf-stub.lisp"):

  The value #\LATIN_CAPITAL_LETTER_A_WITH_TILDE is not of type BASE-CHAR.

In the backtrace I find these (abstracted for your reading pleasure):

  14: (SB-KERNEL:VECTOR-TO-VECTOR*
       "foo=some string containing a UTF-8 character whose first byte
        looks like Ã when read as latin-1"
       SIMPLE-BASE-STRING)
  15: (SB-IMPL::STRING-LIST-TO-C-STRVEC
       (... the entire environment as a list of strings ...)

- Harald
|
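The mix-up behind that error can be reproduced outside SBCL. The following is only a sketch in Python, not SBCL code, and the value "café" is a made-up example: it shows how the lead byte of a UTF-8 sequence turns into "Ã" when the octets are read as latin-1, giving a string that is no longer pure ASCII (which, as far as I know, is what BASE-CHAR requires on Unicode builds).

```python
# Sketch only: models the byte-level mix-up in Python, not SBCL's code.
value = "café"                      # hypothetical environment value
octets = value.encode("utf-8")      # b'caf\xc3\xa9'
misread = octets.decode("latin-1")  # each octet becomes one character

print(misread)                      # the lead byte 0xC3 shows up as 'Ã'
print(misread.isascii())            # False: not representable as ASCII
```

Any environment value containing such a character would trip a conversion that insists on ASCII.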
From: Marijn S. (hkBst) <hk...@ge...> - 2007-07-28 10:46:02
|
Harald Hanche-Olsen wrote:
> My build today of 1.0.8.8 ended with
>
> WARNING! Some of the contrib modules did not build successfully or pass
> their self-tests. Failed contribs:"
> asdf-install
> sb-bsd-sockets
> sb-posix
> sb-simple-streams
>
> So naturally, I went looking through the build output and found this
> error associated with each of the above:
>
> Error during processing of --eval option (LOAD #P"../asdf-stub.lisp"):
>
> The value #\LATIN_CAPITAL_LETTER_A_WITH_TILDE is not of type BASE-CHAR.
>
> In the backtrace I find these (abstracted for your reading pleasure):
>
> 14: (SB-KERNEL:VECTOR-TO-VECTOR*
>      "foo=some string containing a UTF-8 character whose first byte
>       looks like Ã when read as latin-1"
>      SIMPLE-BASE-STRING)
> 15: (SB-IMPL::STRING-LIST-TO-C-STRVEC
>      (... the entire environment as a list of strings ...)
>
> - Harald

This problem seems to occur whenever there is a non-ASCII string in the
environment. In Gentoo we currently prefix the make.sh command with
"env - " to clear the environment. Nasty, but it works.

Marijn

PS This bug has been present since at least 1.0.6
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-02 08:07:59
|
A bit more on this problem, after I rooted around in the sources a
bit: It appears that the environment is imported using the default
encoding, which is utf-8 on macosx. So far so good [1]. But
exporting the environment assumes that everything is ASCII, which
clearly it is not necessarily. There is even a FIXME comment on line
488 of run-program.lisp that shows somebody is aware of the problem.

So I am planning to write some replacement code to export the
environment using the default encoding, that is unless someone wants
to tell me it's the wrong thing to do (and unless someone else is
working on it). It seems a tiny enough project for me to wrap my puny
brain around.

[1] Experiment: Put a non-utf-8 string of octets "foo=x\270x" in the
environment and fire up sbcl: Then running (posix-environ) or
(posix-getenv "foo") lands me in the debugger.

- Harald
|
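Both halves of that asymmetry can be illustrated with strict codecs. This is a sketch under the assumption that Python's "utf-8" and "ascii" codecs behave roughly like the import and export paths described above; it does not call into SBCL, and the variable names are made up.

```python
# Sketch only: Python's strict codecs stand in for SBCL's external formats.
raw = b"foo=x\270x"          # octal \270 is the single octet 0xB8

# Import side: a strict utf-8 decoder rejects the stray octet.
try:
    raw.decode("utf-8")
    import_result = "decoded"
except UnicodeDecodeError as err:
    import_result = f"decode error at byte offset {err.start}"

# Export side: an ASCII-only encoder rejects any non-ASCII character.
try:
    "foo=\u0101".encode("ascii")
    export_result = "encoded"
except UnicodeEncodeError as err:
    export_result = f"encode error at character offset {err.start}"

print(import_result)
print(export_result)
```

The first failure corresponds to the experiment in footnote [1], the second to the build failure reported at the start of the thread.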
From: Nikodemus S. <nik...@ra...> - 2007-08-02 12:36:56
|
On 8/2/07, Harald Hanche-Olsen <ha...@ma...> wrote:

> A bit more on this problem, after I rooted around in the sources a
> bit: It appears that the environment is imported using the default
> encoding, which is utf-8 on macosx. So far so good [1]. But
> exporting the environment assumes that everything is ASCII, which
> clearly it is not necessarily. There is even a FIXME comment on line
> 488 of run-program.lisp that shows somebody is aware of the problem.
>
> So I am planning to write some replacement code to export the
> environment using the default encoding, that is unless someone wants
> to tell me it's the wrong thing to do (and unless someone else is
> working on it). It seems a tiny enough project for me to wrap my puny
> brain around.

Patches are always welcome! :)

Cheers,

 -- Nikodemus
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-02 17:40:55
Attachments:
run-program.lisp.diff
|
+ "Nikodemus Siivola" <nik...@ra...>:

| > So I am planning to write some replacement code to export the
| > environment using the default encoding, [...]
|
| Patches are always welcome! :)

That turned out to be easy enough. This replacement function in
run-program.lisp exports the environment according to the default
encoding. The attached diff patches in this definition and also fixes
the docstring for run-program.

It works fine as far as I can tell, but I am not very used to messing
around with these sap-thingies so a bit of critical review may be a
good thing.

So here is the new function, for your reading pleasure:

(defun string-list-to-c-strvec (string-list)
  (let* ((i #1=#.(/ sb-vm:n-machine-word-bits sb-vm:n-byte-bits))
         ;; We need an extra for the null, and an extra 'cause exect
         ;; clobbers argv[-1].
         (vec-bytes (* #1# (+ (length string-list) 2)))
         ;; Encode each string in the default external format; the
         ;; resulting octet vectors include the terminating null.
         (octet-vector-list
          (mapcar (lambda (s) (string-to-octets s :null-terminate t))
                  string-list))
         (string-bytes
          (reduce #'+ octet-vector-list
                  :key (lambda (s) (round-bytes-to-words (length s)))))
         (total-bytes (+ string-bytes vec-bytes))
         ;; Now allocate the memory and fill it in.
         (vec-sap (sb-sys:allocate-system-memory total-bytes))
         (string-sap (sap+ vec-sap vec-bytes)))
    (declare (fixnum string-bytes vec-bytes)
             (type (and unsigned-byte fixnum) total-bytes i)
             (type sb-sys:system-area-pointer vec-sap string-sap))
    (dolist (s octet-vector-list)
      (declare (type (simple-array (unsigned-byte 8) 1) s))
      (let ((n (length s)))
        ;; Blast the string into place.
        (sb-kernel:copy-byte-vector-to-system-area s string-sap 0)
        ;; Blast the pointer to the string into place.
        (setf (sap-ref-sap vec-sap i) string-sap)
        ;; N already counts the terminating null, so advance by the
        ;; same amount that STRING-BYTES reserved for this string.
        (setf string-sap (sap+ string-sap (round-bytes-to-words n)))
        (incf i #1#)))
    ;; Blast in the last null pointer.
    (setf (sap-ref-sap vec-sap i) (int-sap 0))
    (values vec-sap (sap+ vec-sap #1#) total-bytes)))

- Harald
|
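The memory layout that function builds (a table of pointers followed by word-aligned, null-terminated strings) can be modelled without any foreign memory. What follows is a rough Python sketch, not the SBCL implementation: WORD, round_to_words and build_strvec are hypothetical names, an 8-byte word is assumed, and byte offsets into one buffer stand in for the real pointers (the pointer table itself is left zeroed).

```python
WORD = 8  # assumed machine word size in bytes (64-bit)

def round_to_words(n):
    """Round a byte count up to a multiple of WORD."""
    return (n + WORD - 1) // WORD * WORD

def build_strvec(strings, encoding="utf-8"):
    """Return (buffer, offsets): a pointer-table area followed by packed strings."""
    octets = [s.encode(encoding) + b"\0" for s in strings]
    vec_bytes = WORD * (len(strings) + 2)      # argv[-1] slot + entries + NULL
    total = vec_bytes + sum(round_to_words(len(o)) for o in octets)
    buf = bytearray(total)
    offsets = []
    pos = vec_bytes
    for o in octets:
        buf[pos:pos + len(o)] = o              # copy the encoded string
        offsets.append(pos)                    # stands in for the pointer
        pos += round_to_words(len(o))          # len(o) already counts the NUL
    return bytes(buf), offsets

buf, offsets = build_strvec(["foo=ā", "PATH=/bin"])
print(len(buf), offsets)
```

The point of the sketch is that the space reserved per string and the amount the write position advances must be computed from the same length, otherwise later strings can run past the end of the allocation.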
From: Nikodemus S. <nik...@ra...> - 2007-08-02 20:28:31
|
On 8/2/07, Harald Hanche-Olsen <ha...@ma...> wrote:

I think this looks sane, I think.

Just a few things for completeness' sake:

 * Document that the default external format is used for environment.

 * Test-case.

 * (Not absolutely required, but would be good to have): use same
   encoding rules for command-line arguments.

Cheers,

 -- Nikodemus
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-03 13:24:32
Attachments:
run-program-env-and-args.diff
|
+ "Nikodemus Siivola" <nik...@ra...>:

| I think this looks sane, I think.

Oh, goodie.

| Just a few things for completeness' sake:
|
| * Document that the default external format is used for environment.

Sure, can do that.

| * (Not absolutely required, but would be good to have): use same
|   encoding rules for command-line arguments.

Already happens, since command-line args and environment are both
encoded using the function I changed.

| * Test-case.

Done. Patch attached.

- Harald
|
From: Marijn S. (hkBst) <hk...@ge...> - 2007-08-27 09:23:17
|
Harald Hanche-Olsen wrote:
> + "Nikodemus Siivola" <nik...@ra...>:
>
> | I think this looks sane, I think.
>
> Oh, goodie.
>
> | Just a few things for completeness' sake:
> |
> | * Document that the default external format is used for environment.
>
> Sure, can do that.
>
> | * (Not absolutely required, but would be good to have): use same
> |   encoding rules for command-line arguments.
>
> Already happens, since command-line args and environment are both
> encoded using the function I changed.
>
> | * Test-case.
>
> Done. Patch attached.

I am still seeing the same failure for 1.0.9.
Wasn't this patch applied?

-- 
Marijn Schouten (hkBst), Gentoo Lisp project
<http://www.gentoo.org/proj/en/lisp/>, #gentoo-lisp
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-27 09:39:58
|
+ "Marijn Schouten (hkBst)" <hk...@ge...>:

| I am still seeing the same failure for 1.0.9.
| Wasn't this patch applied?

No. I've been too busy to whine about it, or to ask why not. Perhaps
because someone raised the question of using the utf-8b encoding, but
I think that is a somewhat orthogonal issue.

- Harald
|
From: Nikodemus S. <nik...@ra...> - 2007-12-09 18:12:05
|
On Aug 27, 2007 9:39 AM, Harald Hanche-Olsen <ha...@ma...> wrote:

> + "Marijn Schouten (hkBst)" <hk...@ge...>:
>
> | I am still seeing the same failure for 1.0.9.
> | Wasn't this patch applied?
>
> No. I've been too busy to whine about it, or to ask why not. Perhaps
> because someone raised the question of using the utf-8b encoding, but
> I think that is a somewhat orthogonal issue.

Finally merged as 1.0.12.21, thank you! -- it's better than what we had
before. I suspect we need to rethink this somehow, but let's see how
this does in the wild first.

Cheers,

 -- Nikodemus
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-03 11:06:52
|
While working out a test case for the environment stuff,
I came across this behavior:

The file /tmp/aa contains the single letter "å" in UTF-8 encoding.

Good:

(let* ((sb-impl::*default-external-format* :utf-8)
       (process (run-program "/bin/cat" '("/tmp/aa")
                             :output :stream
                             :wait nil)))
  (prog1 (read-line (process-output process))
    (process-wait process)
    (process-close process)))
==> "å"

Not so good:

(let ((sb-impl::*default-external-format* :utf-8))
  (with-output-to-string (s)
    (run-program "/bin/cat" '("/tmp/aa")
                 :environment '("foo=ā")
                 :output s
                 :wait t)))
==> "Ã¥"

Not at all sure what is happening under the hood here.
Except, of course, for the trivial observation that "Ã¥" is the utf-8
encoding of "å", interpreted as latin-1.

- Harald
|
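That closing observation is easy to check by hand. A small Python sketch (illustrative only, it does not involve RUN-PROGRAM) reproduces the garbled result and shows that it can be undone, which suggests the octets arrive intact and are merely decoded with the wrong format:

```python
# Sketch only: reproduce the "Ã¥" result and reverse it.
expected = "å"
octets = expected.encode("utf-8")        # b'\xc3\xa5'
garbled = octets.decode("latin-1")       # one character per octet

# Reversing the wrong decoding recovers the original character.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired == expected)
```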
From: Nikodemus S. <nik...@ra...> - 2007-12-09 18:28:21
|
On Aug 3, 2007 11:06 AM, Harald Hanche-Olsen <ha...@ma...> wrote:

> While working out a test case for the environment stuff,
> I came across this behavior:

Very short explanation: output to existing streams doesn't deal with
external formats.

Cheers,

 -- Nikodemus
|
From: Richard M K. <kr...@pr...> - 2007-12-10 22:12:04
Attachments:
sbcl-run-program-external-format.patch
|
"Nikodemus Siivola" writes: > On Aug 3, 2007 11:06 AM, Harald Hanche-Olsen <ha...@ma...> wrote: >=20 > > While working out a test case for the environment stuff, > > I came across this behavior: >=20 > Very short explanation: output to existing streams doens't deal with > external formats. A while back I wrote some code to do this, but Juho Snellman was uncomfortable with adding more complexity to RUN-PROGRAM's interface. People seem to keep asking for external-format support in RUN-PROGRAM, however, so I've updated the patch to do transcoding, but only according to the default external format, and without adding any new options to the interface. So in the presumably-common case where SBCL's default external format agrees with the character encoding an external program uses, we'll successfully encode data going to the process and decode data coming from the process. For example: (with-output-to-string (o) (with-input-from-string (i "=C3=80=C3=81=C3=82=C3=83=C3=84=C3=85") (run-program "gawk" '("{print tolower($0)}") :search t :input i :output o))) "=C3=A0=C3=A1=C3=A2=C3=A3=C3=A4=C3=A5 " Most of the pathological cases I can think of involving an external program that doesn't obey system locale settings would seem to be doable with something like iconv(1) in a pipeline, so it might be that having RUN-PROGRAM only do transcoding according to the default external format will be adequate (for Unicode-enabled builds, anyway). Is there any opposition offering this kind of support for external-formats in RUN-PROGRAM? If not, I'll install these changes. (If you look at the attached patch, you'll see I rewrote the temporary file generating code and de-forked the #+win32 and #-win32 versions of the RUN-PROGRAM function. Output to string-streams continues not to work under Windows, but that's not new with this change.) -- RmK |
From: Nikodemus S. <nik...@ra...> - 2007-12-11 11:55:19
|
On Dec 10, 2007 10:10 PM, Richard M Kreuter <kr...@pr...> wrote:

> Is there any opposition offering this kind of support for
> external-formats in RUN-PROGRAM? If not, I'll install these changes.

Not from me, but I've always been fairly feature-happy, so...

> (If you look at the attached patch, you'll see I rewrote the temporary
> file generating code and de-forked the #+win32 and #-win32 versions of
> the RUN-PROGRAM function. Output to string-streams continues not to
> work under Windows, but that's not new with this change.)

Hooray for refactoring!

Apropos, one RUN-PROGRAM related thing that I tried at one point to
address was dealing sanely with job control (i.e. children that suspend
themselves, and making SBCL a good shell citizen so that it knows
about job control signals re. going to the background, etc).

It was educational, but horrible. Intrepid would-be sbcl-hackers are
encouraged to look there... (Both the horror and the educational value
come from the unix way of doing things, not so much the SBCL bits, and
the SBCL bits one has to deal with are reasonably simple.)

Cheers,

 -- Nikodemus
|
From: Juho S. <js...@ik...> - 2007-12-11 12:14:37
|
Richard M Kreuter <kr...@pr...> writes:

> Is there any opposition offering this kind of support for
> external-formats in RUN-PROGRAM? If not, I'll install these changes.

Ok by me.

--
Juho Snellman
|
From: James Y K. <fo...@fu...> - 2007-08-02 20:55:25
|
On Aug 2, 2007, at 4:07 AM, Harald Hanche-Olsen wrote:

> [1] Experiment: Put a non-utf-8 string of octets "foo=x\270x" in the
> environment and fire up sbcl: Then running (posix-environ) or
> (posix-getenv "foo") lands me in the debugger.

Yeah -- it's pretty clear the environment isn't _actually_ in the
default encoding. It's just binary junk which often but not always
contains some text encoded in some arbitrary superset of ASCII. Just
like command line arguments (and filenames on linux).

The hard part is that users expect command line arguments, filenames,
and environment values to be strings (because they normally do
contain text-like things), when strictly they cannot be because there
is no reliable encoding.

James
|
From: Harald Hanche-O. <ha...@ma...> - 2007-08-03 10:01:57
|
+ James Y Knight <fo...@fu...>:

| The hard part is that users expect command line arguments,
| filenames, and environment values to be strings (because they
| normally do contain text-like things), when strictly they cannot be
| because there is no reliable encoding.

Indeed. So one could argue that the program arguments and environment
should really be kept as byte strings throughout, and one should leave
the task of converting to and from strings using whatever encoding
they wish to the user. I think users who mostly wish to deal with
strings will not be happy with that.

As an alternative, one could let RUN-PROGRAM accept a mixture of
(character) strings and byte strings in the args and environment, use
the latter as delivered, and convert the former. That way we can have
our cake and eat it too. But then, for reasons of symmetry, we should
also provide low level access to the environment, so users can grab
any binary junk in there and do useful things with it.

While all this might be useful, I am not sure I feel inclined to do
it, though. After all, the environment and program args are
traditionally text, not binary junk, and I believe that posix doesn't
even admit non-ASCII data in the environment. Not that we would want
to enforce that; users may wish to have their full name in an
environment variable for example.

- Harald
|
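The mixed-input alternative floated above is simple to model. A minimal Python sketch, assuming a hypothetical helper named to_octets and a made-up environment list (nothing here is part of RUN-PROGRAM's interface): byte strings are passed through as delivered, character strings are encoded.

```python
def to_octets(item, encoding="utf-8"):
    """Pass byte strings through untouched; encode character strings."""
    if isinstance(item, (bytes, bytearray)):
        return bytes(item)
    return item.encode(encoding)

# Hypothetical mixed list: one ordinary string, one entry of "binary junk".
environment = ["name=Harald", b"foo=x\xb8x"]
print([to_octets(entry) for entry in environment])
```

The design cost mentioned in the message remains: reading the environment back would need a matching way to get at the raw octets.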
From: James Y K. <fo...@fu...> - 2007-08-08 00:12:30
|
On Aug 2, 2007, at 4:55 PM, James Y Knight wrote:

> On Aug 2, 2007, at 4:07 AM, Harald Hanche-Olsen wrote:
>> [1] Experiment: Put a non-utf-8 string of octets "foo=x\270x" in the
>> environment and fire up sbcl: Then running (posix-environ) or
>> (posix-getenv "foo") lands me in the debugger.
>
> Yeah -- it's pretty clear the environment isn't _actually_ in the
> default encoding. It's just binary junk which often but not always
> contains some text encoded in some arbitrary superset of ASCII. Just
> like command line arguments (and filenames on linux).
>
> The hard part is that users expect command line arguments, filenames,
> and environment values to be strings (because they normally do
> contain text-like things), when strictly they cannot be because there
> is no reliable encoding.

A good alternative to this is for SBCL to use the UTF8b encoding to
decode unix environment gunk (filenames, env vars, command line args)
which are *probably* in utf8, but might not be.

utf8b has the nice property that any arbitrary bytestring can be
decoded into unicode, and then round-tripped back to the same bytes.
Valid utf8 sequences turn into the same unicode characters as with
the utf8 codec. Invalid utf8 sequences turn into invalid surrogate
pair sequences in the unicode string.

Thus, SBCL can return strings, and never throw an error. If you
actually wanted the random binary, you can losslessly convert the
unicode string back to binary. Win win.

Some references:
Original mail: http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
Blog entry: http://bsittler.livejournal.com/10381.html
Python implementation: http://hyperreal.org/~est/libutf8b/

James
|
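The round-trip property can be tried out with Python's "surrogateescape" error handler, which, to my knowledge, follows the same idea as utf8b (each undecodable octet becomes a lone surrogate code point). This is a sketch of the property only, using the octets from the earlier experiment; it is not one of the implementations linked above and says nothing about how SBCL would do it.

```python
# Sketch only: Python's surrogateescape handler as a stand-in for utf8b.
raw = b"foo=x\xb8x"                       # the octets from the experiment

# Decoding never fails: the stray octet 0xB8 becomes the lone surrogate U+DCB8.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original octets exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# Valid utf-8 decodes to the same characters as with the plain codec.
print("å".encode("utf-8").decode("utf-8", errors="surrogateescape") == "å")
```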