From: dman <ds...@ri...> - 2001-11-25 01:57:55
|
On one of my Debian woody boxes jythonc stopped working a while ago.
Jython still worked, but running 'jythonc' would give no output. I
have now solved the problem, but I think it involves a bug in jython.

I traced through how jythonc was supposed to be run -- it is pretty
straightforward: jython is run with
/usr/share/jython/Tools/jythonc/jythonc.py as the first argument (and
any other arguments are passed to the script). I added a print to the
top of jythonc.py, but it wouldn't get printed. It was really strange
because I could create a "hello world" program and it would work. As
I took a deeper look, looking at main.py I noticed that there were
several Form Feed characters in it. I removed those (from the other
source files as well) but those had no bearing on my problem. (I
don't think there is a reason to have form feeds anyways, unless
perhaps one intends to "cat <source> > /dev/lp0" with an old printer)
The solution, as it turned out, was to open each of the source files,
convert them to utf-8 and save them again.

What difference does it make to jython whether a (python) source file
is saved in latin1 or utf-8? In any case, I think it is a gross error
to simply terminate with no message when encountering a file that it
doesn't like.

I started the conversion to utf-8 from main.py, and tried running
jythonc after each file was changed. It would give me "ImportError"
or "AttributeError" when an import of a non-converted file was
encountered. Once I had converted all files jythonc worked properly.

The interesting thing about jythonc's source files is that they all
have the copyright symbol in a comment at the top of the file. In
'latin1' this is character 0xa9.

I use (g)vim 6.0 as my editor. As you may already know it has two
variables, 'enc' and 'fenc'. 'enc' is the global encoding specifier.
I can set it to "latin1" or "utf-8" (and probably others, but I
haven't tried them). 'fenc' is a setting that is local to the current
buffer and specifies what encoding the file should be written as. I
can set that to "latin1" or "utf-8" also.

I created 4 files containing only the copyright symbol, each file
with a different combination of 'enc' and 'fenc' settings.
Interestingly enough, both files with 'fenc' set to "latin1"
contained only 0xa9 0xa0 (when viewed with a hex editor). The file
with enc=latin1, fenc=utf-8 contained 0xc2 0xa9 0xa0. The file with
enc=utf-8, fenc=utf-8 contained 0x00 0x70 0xa0.

I think this copyright character and its encoding may be the source
of the whole problem. I'll check with the vim folks too regarding the
differences in the two utf-8 files. Hmm, when I open them again, the
utf8-utf8 file is messed up (shows ^@p) but the latin1-utf8 file is
correct. I used latin1-utf8 as the settings when I converted the
jythonc sources.

-D
|
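For reference, the byte values reported above for the copyright sign can be checked against what the two encodings are expected to produce. This is a minimal sketch in present-day Python 3 (not the Jython 2.1 discussed in this thread); the variable names are illustrative only.

```python
# The copyright sign is code point U+00A9.
COPYRIGHT = "\u00a9"

# Latin-1 maps code points 0..255 to a single byte of the same value.
latin1_bytes = COPYRIGHT.encode("latin-1")

# UTF-8 uses two bytes for every code point in U+0080..U+07FF.
utf8_bytes = COPYRIGHT.encode("utf-8")

print(latin1_bytes.hex())  # a9
print(utf8_bytes.hex())    # c2a9
```

The 0xc2 0xa9 pair matches the enc=latin1, fenc=utf-8 file, so that file looks like well-formed UTF-8. The 0x00 0x70 sequence is not an encoding of U+00A9 in either scheme, which fits the "^@p" seen when that file was reopened.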
From: <bc...@wo...> - 2001-11-26 15:55:11
|
[dman]

>On one of my Debian woody boxes jythonc stopped working a while ago.
>Jython still worked, but running 'jythonc' would give no output. I
>have now solved the problem, but I think it involves a bug in jython.
>
>I traced through how jythonc was supposed to be run -- it is pretty
>straightforward: jython is run with
>/usr/share/jython/Tools/jythonc/jythonc.py as the first argument (and
>any other arguments are passed to the script). I added a print to the
>top of jythonc.py, but it wouldn't get printed. It was really strange
>because I could create a "hello world" program and it would work. As
>I took a deeper look, looking at main.py I noticed that there were
>several Form Feed characters in it. I removed those (from the other
>source files as well) but those had no bearing on my problem. (I
>don't think there is a reason to have form feeds anyways, unless
>perhaps one intends to "cat <source> > /dev/lp0" with an old printer)
>The solution, as it turned out, was to open each of the source files,
>convert them to utf-8 and save them again.
>
>What difference does it make to jython whether a (python) source file
>is saved in latin1 or utf-8? In any case, I think it is a gross error
>to simply terminate with no message when encountering a file that it
>doesn't like.

Sure. Normally jython doesn't. So what is special about woody?

>I started the conversion to utf-8 from main.py,

I have now removed the latin-1 copyright character in the CVS version.

>...
>The interesting thing about jythonc's source files is that they all
>have the copyright symbol in a comment at the top of the file. In
>'latin1' this is character 0xa9.

The python source files are read as text files with an
InputStreamReader using the default encoding for the platform.
Normally that is a good way to read text files, but a side effect is
that python source programs with non-ascii characters aren't portable
to other platforms with a different encoding.

I don't know what the cause is, but these experiments might help shed
light on it.

What file encoding is used in your setup of woody?

>>> import java
>>> java.lang.System.getProperty("file.encoding")
'Cp1252'
>>>

Whatever the encoding used is, it may be unable to handle 0xA9
correctly:

>>> from java import io
>>> s = io.FileOutputStream("foo")
>>> s.write("\xA9")
>>> s.close()
>>> s = io.FileReader("foo")
>>> print hex(s.read())
0xa9
>>> s.close()
>>>

>I use (g)vim 6.0 as my editor. As
>you may already know it has two variables, 'enc' and 'fenc'.

You could change the file encoding of the source files. You would then
have to change the encoding used by java as well. But I strongly doubt
that you want to go there. If latin1 is suitable for your country and
language, stick with that.

regards,
finn
|
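The portability point above (one set of bytes, read through whatever encoding the platform defaults to) can be illustrated with a rough Python 3 analogue. This is only a sketch: `read_source` and the sample bytes are made up for illustration, and `io.TextIOWrapper` stands in for Java's InputStreamReader rather than reproducing it.

```python
import io

# A two-line "source file" whose comment holds a Latin-1 copyright sign (0xA9).
SOURCE_BYTES = b"# \xa9 2001\nprint('hello world')\n"


def read_source(raw: bytes, platform_encoding: str) -> str:
    """Decode raw file bytes the way a reader bound to one encoding would."""
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding=platform_encoding)
    return reader.read()


for encoding in ("latin-1", "cp1252", "utf-8"):
    try:
        text = read_source(SOURCE_BYTES, encoding)
        print(encoding, "->", ascii(text.splitlines()[0]))
    except UnicodeDecodeError as exc:
        print(encoding, "-> cannot decode:", exc.reason)
```

Under latin-1 and cp1252 the byte 0xA9 decodes to the same character, so the file reads identically on both; under utf-8 the same byte cannot be decoded at all.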
From: dman <ds...@ri...> - 2001-11-26 17:11:56
Attachments:
hello_latin1.py
|
On Mon, Nov 26, 2001 at 03:58:39PM +0000, Finn Bock wrote:
| [dman]

(
 Short version :
 jython gives no result when running scripts encoded in latin1 with
 non-ASCII chars in them.
)

| >What difference does it make to jython whether a (python) source file
| >is saved in latin1 or utf-8? In any case, I think it is a gross error
| >to simply terminate with no message when encountering a file that it
| >doesn't like.
|
| Sure. Normally jython doesn't. So what is special about woody?

See below. I have now figured out the source of this problem.

| >I started the conversion to utf-8 from main.py,
|
| I have now removed the latin-1 copyright character in the CVS version.

Cool. That will certainly fix all portability problems since ASCII is
a common subset of all encodings AFAIK (latin1 and utf-8 for sure).

| >...
| >The interesting thing about jythonc's source files is that they all
| >have the copyright symbol in a comment at the top of the file. In
| >'latin1' this is character 0xa9.
|
| The python source files are read as text files with an InputStreamReader
| using the default encoding for the platform. Normally that is a good way
| to read text files, but a side effect is that python source programs with
| non-ascii characters aren't portable to other platforms with a different
| encoding.
|
| I don't know what the cause is, but these experiments might help shed
| light on it.
|
| What file encoding is used in your setup of woody?
|
| >>> import java
| >>> java.lang.System.getProperty("file.encoding")
| 'Cp1252'
| >>>

The woody machine I have at work had no problems running jythonc, just
my machine at home. I remembered late last night that I had set $LANG
to en_US.UTF-8 at home. Now that I am at work, I checked with that
machine and it has $LANG set to the default of "C". If I tried
"LANG=en_US.UTF-8 jythonc --help" it failed the same as it was doing
at home. With LANG=C, the encoding used by java is "ISO-8859-1"; with
LANG=en_US.UTF-8 the encoding is "UTF-8".

| Whatever the encoding used is, it may be unable to handle 0xA9
| correctly:

Perhaps, and perhaps java is broken? I created "hello world" with the
copyright symbol in a comment. I did this with both latin1 and utf-8.

$ LANG=en_US python2.2 hello_latin1.py
hello world
$ LANG=en_US python2.2 hello_utf-8.py
hello world
$ LANG=en_US.UTF-8 python2.2 hello_latin1.py
hello world
$ LANG=en_US.UTF-8 python2.2 hello_utf-8.py
hello world
$ LANG=en_US jython hello_latin1.py
hello world
$ LANG=en_US jython hello_utf-8.py
hello world
$ LANG=en_US.UTF-8 jython hello_latin1.py
$ LANG=en_US.UTF-8 jython hello_utf-8.py
hello world
$

As you can see, CPython (2.2b1) has no problems with the script
regardless of environment and file encoding; however, Java can't
handle a latin1 file with the environment set to UTF-8. I should do
some experiments at the Java level and see what it does in that
situation. Maybe it causes a problem in Jython's parsing (i.e. the
comment ends up extending to the end of the file) or maybe there is
some error that is silently ignored.

| >>> from java import io
| >>> s = io.FileOutputStream("foo")
| >>> s.write("\xA9")
| >>> s.close()
| >>> s = io.FileReader("foo")
| >>> print hex(s.read())
| 0xa9
| >>> s.close()
| >>>

I just did a quick test using jython (interactive coding is very
cool!) :

$ LANG=en_US.UTF-8 jython
Jython 2.1a1 on java1.3.1 (JIT: null)
>>> from java.io import *
>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
>>> while 1 : print f.read()
...
10
35
Traceback (innermost last):
  File "<console>", line 1, in ?
sun.io.MalformedInputException
    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
    at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
    at java.io.InputStreamReader.fill(InputStreamReader.java:166)
    at java.io.InputStreamReader.read(InputStreamReader.java:249)
    at java.io.InputStreamReader.read(InputStreamReader.java:222)
    at java.lang.reflect.Method.invoke(Native Method)
    at org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:160)
    at org.python.core.PyMethod.__call__(PyMethod.java:96)
    at org.python.core.PyObject.__call__(PyObject.java:262)
    at org.python.core.PyInstance.invoke(PyInstance.java:244)
    at org.python.pycode._pyx3.f$0(<console>:1)
    at org.python.pycode._pyx3.call_function(<console>)
    at org.python.core.PyTableCode.call(PyTableCode.java:198)
    at org.python.core.PyCode.call(PyCode.java:13)
    at org.python.core.Py.runCode(Py.java:1075)
    at org.python.core.Py.exec(Py.java:1096)
    at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:145)
    at org.python.util.InteractiveInterpreter.runcode(InteractiveInterpreter.java:87)
    at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:68)
    at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:42)
    at org.python.util.InteractiveConsole.push(InteractiveConsole.java:83)
    at org.python.util.InteractiveConsole.interact(InteractiveConsole.java:62)
    at org.python.util.jython.main(jython.java:183)

sun.io.MalformedInputException: sun.io.MalformedInputException
>>>

I'll attach the file so you can see it for yourself. It looks like
jython catches this exception, but silently ignores it. Perhaps it
would be a good idea to try and fall back to latin1, then display an
error message if that fails too.

| >I use (g)vim 6.0 as my editor. As
| >you may already know it has two variables, 'enc' and 'fenc'.
|
| You could change the file encoding of the source files. You would then

I did.

| have to change the encoding used by java as well. But I strongly doubt

It was already changed -- changing the encoding of the files caused
them to match the encoding java was using.

| that you want to go there. If latin1 is suitable for your country and
| language, stick with that.

I suppose maybe I should. At least I know what to look for now if it
happens again :-).

-D

--
Even youths grow tired and weary,
and young men stumble and fall;
but those who hope in the Lord will renew their strength.
They will soar on wings like eagles;
they will run and not grow weary,
they will walk and not be faint.
Isaiah 40:31
|
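The pattern in the eight test runs above (only the latin1 file under a UTF-8 environment fails) can be reproduced at the byte level. The following is a sketch in present-day Python 3, assuming the comment line is the only non-ASCII content; the helper `decodes_cleanly` and the sample comment are illustrative, and the file names are reused from the message only as dictionary labels.

```python
# One comment line containing the copyright sign, saved two ways.
COMMENT = "# \u00a9 hello\n"
files = {
    "hello_latin1.py": COMMENT.encode("latin-1"),
    "hello_utf-8.py": COMMENT.encode("utf-8"),
}


def decodes_cleanly(raw: bytes, reader_encoding: str) -> bool:
    """Return True if a strict decoder for reader_encoding accepts raw."""
    try:
        raw.decode(reader_encoding)
        return True
    except UnicodeDecodeError:
        return False


# Try every file against every reader encoding, like the LANG matrix above.
results = {
    (name, enc): decodes_cleanly(raw, enc)
    for name, raw in files.items()
    for enc in ("iso-8859-1", "utf-8")
}

for (name, enc), ok in sorted(results.items()):
    print(f"{name:16} read as {enc:10} -> {'ok' if ok else 'decode error'}")
```

ISO-8859-1 assigns a character to every byte value, so it never rejects input; the UTF-8 file read as ISO-8859-1 merely shows two characters where one was meant, which is harmless inside a comment. Only a strict UTF-8 reader can refuse a file, and it does so for the lone 0xA9 byte.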
From: <bc...@wo...> - 2001-11-26 19:34:06
|
[dman]

>(
> Short version :
> jython gives no result when running scripts encoded in latin1 with
> non-ASCII chars in them.
>)

>| Whatever the encoding used is, it may be unable to handle 0xA9
>| correctly:
>
>Perhaps, and perhaps java is broken?

Don't think so. The first byte of a multibyte sequence must be in the
range 0xC0 to 0xFD. So a file with a latin copyright character is not a
valid UTF-8 text file.

As an additional information point, my JDK1.2 and JDK1.3 also throw
exceptions, but JDK1.4 silently transforms the character into the
unicode-undefined character.

>As you can see, CPython (2.2b1) has no problems with the script
>regardless of environment and file encoding,

That simplicity will not last. Eventually even CPython will have ways to
deal with the encoding of python source files.

>$ LANG=en_US.UTF-8 jython
>Jython 2.1a1 on java1.3.1 (JIT: null)
>>>> from java.io import *
>>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
>>>> while 1 : print f.read()
>...
>10
>35
>Traceback (innermost last):
>  File "<console>", line 1, in ?
>sun.io.MalformedInputException
>    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
>
>I'll attach the file so you can see it for yourself. It looks like
>jython catches this exception, but silently ignores it.

Yes. The generated tokenmanager catches all IOExceptions
(MalformedInputException is a subclass of IOException) and interprets
that as eof.

>Perhaps it would be a good idea to try and fall back to latin1,

Nah, no guessing IMO.

>then display an error message if that fails too.

That doesn't seem to be as easy as it rightly should have been.

regards,
finn
|
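The lead-byte rule above can be checked by looking at the high bits of the byte in question. This is a rough sketch in present-day Python 3; `utf8_byte_role` is a hypothetical helper, and it classifies by bit pattern only (the current UTF-8 definition, RFC 3629, narrows valid lead bytes further, to 0xC2..0xF4).

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte by its high bits, following the UTF-8 bit layout."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: only valid after a lead byte
    return "lead"              # 11xxxxxx: opens a multibyte sequence


print(utf8_byte_role(0xA9))  # continuation
print(utf8_byte_role(0xC2))  # lead

# A lenient decoder substitutes U+FFFD for the stray byte instead of
# raising, similar in spirit to the JDK1.4 behaviour described above.
print(ascii(b"# \xa9".decode("utf-8", errors="replace")))  # '# \ufffd'
```

0xA9 is 10101001 in binary, so a UTF-8 decoder sees a continuation byte with no lead byte in front of it, which is why the latin1 file is rejected.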
From: dman <ds...@ri...> - 2001-11-26 19:57:37
|
On Mon, Nov 26, 2001 at 07:37:33PM +0000, Finn Bock wrote:
| [dman]
|
| >(
| > Short version :
| > jython gives no result when running scripts encoded in latin1 with
| > non-ASCII chars in them.
| >)
|
| >| Whatever the encoding used is, it may be unable to handle 0xA9
| >| correctly:
| >
| >Perhaps, and perhaps java is broken?
|
| Don't think so. The first byte of a multibyte sequence must be in the
| range 0xC0 to 0xFD. So a file with a latin copyright character is not a
| valid UTF-8 text file.

At least someone here has read the spec :-).

| As an additional information point, my JDK1.2 and JDK1.3 also throw
| exceptions, but JDK1.4 silently transforms the character into the
| unicode-undefined character.

I'm not sure that is a good thing (jdk1.4), but maybe you don't have
to deal with it. Consider someone who has some source in latin1 (or
something else) and has

aöc = "foo"
aüc = "bar"

If java uses UTF-8 as the encoding, then those two names will end up
being the same if jython will treat the unicode-undefined character as
a regular character. This would be an additional condition that
should raise an exception.

| >As you can see, CPython (2.2b1) has no problems with the script
| >regardless of environment and file encoding,
|
| That simplicity will not last. Eventually even CPython will have ways to
| deal with the encoding of python source files.

Ok.

| >$ LANG=en_US.UTF-8 jython
| >Jython 2.1a1 on java1.3.1 (JIT: null)
| >>>> from java.io import *
| >>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
| >>>> while 1 : print f.read()
| >...
| >10
| >35
| >Traceback (innermost last):
| >  File "<console>", line 1, in ?
| >sun.io.MalformedInputException
| >    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
| >
| >I'll attach the file so you can see it for yourself. It looks like
| >jython catches this exception, but silently ignores it.
|
| Yes. The generated tokenmanager catches all IOExceptions
| (MalformedInputException is a subclass of IOException) and interprets
| that as eof.

EOF would certainly explain why I didn't get any output or error
message. Jython successfully executed nothing :-).

| >Perhaps it would be a good idea to try and fall back to latin1,
|
| Nah, no guessing IMO.

Ok.

| >then display an error message if that fails too.
|
| That doesn't seem to be as easy as it rightly should have been.

Couldn't you just catch that exception and print out a message then
exit right before catching IOException? It might be better to convert
the exception into a different (python) exception. Yeah, for
execfile() the interpreter shouldn't exit because the file is encoded
wrong.

-D

--
"...the word HACK is used as a verb to indicate a massive amount
of nerd-like effort." -Harley Hahn, A Student's Guide to Unix
|
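The name-collision concern raised in this message can be reproduced with a lenient decoder. A small sketch in present-day Python 3, using `errors="replace"` as a stand-in for the JDK1.4 substitution behaviour; the variable names are illustrative.

```python
# The two identifiers as Latin-1 bytes: 0xF6 is o-umlaut, 0xFC is u-umlaut.
name_one = b"a\xf6c"
name_two = b"a\xfcc"

# Lenient UTF-8 decoding replaces each undecodable byte with U+FFFD.
lenient_one = name_one.decode("utf-8", errors="replace")
lenient_two = name_two.decode("utf-8", errors="replace")

print(ascii(lenient_one))          # 'a\ufffdc'
print(lenient_one == lenient_two)  # True

# Strict decoding refuses the input instead of merging the two names.
try:
    name_one.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoder rejects it:", exc.reason)
```

Once both bytes have collapsed to the same replacement character, nothing downstream can tell the two spellings apart, whether they occur in identifiers or inside string literals.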
From: <bc...@wo...> - 2001-11-26 20:21:45
|
[dman]

>| As an additional information point, my JDK1.2 and JDK1.3 also throw
>| exceptions, but JDK1.4 silently transforms the character into the
>| unicode-undefined character.
>
>I'm not sure that is a good thing (jdk1.4), but maybe you don't have
>to deal with it. Consider someone who has some source in latin1 (or
>something else) and has
>
>a=F6c =3D "foo"
>a=FCc =3D "bar"

[I find it a little ironic that my mail agent can't deal with any of
the newer mail encodings]

>If java uses UTF-8 as the encoding, then those two names

Non-ascii chars in identifiers? I know CPython sometimes allows that,
but that is not a feature I plan on adding.

>will end up
>being the same if jython will treat the unicode-undefined character as
>a regular character. This would be an additional condition that
>should raise an exception.

If you put the non-ascii chars inside the quotes then I agree with your
example and with your conclusion.

>| Yes. The generated tokenmanager catches all IOExceptions
>| (MalformedInputException is a subclass of IOException) and interprets
>| that as eof.
> [...]
>
>Couldn't you just catch that exception and print out a message then
>exit right before catching IOException?

There are 43 instances of caught IOException in
PythonGrammarTokenManager such as:

    try { curChar = input_stream.readChar(); }
    catch(java.io.IOException e) {
        jjStopStringLiteralDfa_10(0, 0L, active1);
        return 1;
    }

We probably have to catch the MalformedInputException in the
ReaderCharStream and throw something that will get past most of the
catch clauses in the parser.

regards,
finn
|
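The idea in the last paragraph (convert the decoding failure, inside the character stream, into an exception type that the token manager's catch clauses do not match) can be sketched as a toy model in present-day Python 3. Everything here is hypothetical and only mirrors the shape of the problem: `MalformedInput`, `SourceEncodingError`, `RawStream`, `CheckedStream` and `tokenize` are invented names, not Jython classes, and the toy stream rejects every byte above 0x7F rather than doing real decoding.

```python
class MalformedInput(OSError):
    """Stand-in for MalformedInputException (an IOException subclass)."""


class SourceEncodingError(Exception):
    """Hypothetical error type that is deliberately NOT an OSError."""


class RawStream:
    """Toy character stream; end of input is signalled by OSError."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def read_char(self) -> str:
        if self._pos >= len(self._data):
            raise OSError("end of input")
        byte = self._data[self._pos]
        self._pos += 1
        if byte >= 0x80:  # toy rule: treat any non-ASCII byte as malformed
            raise MalformedInput(f"bad byte 0x{byte:02x}")
        return chr(byte)


class CheckedStream:
    """Wrapper that converts MalformedInput into SourceEncodingError."""

    def __init__(self, inner: RawStream):
        self._inner = inner

    def read_char(self) -> str:
        try:
            return self._inner.read_char()
        except MalformedInput as exc:
            raise SourceEncodingError(str(exc)) from exc


def tokenize(stream) -> str:
    """Toy token manager: any OSError from read_char means end of input."""
    chars = []
    while True:
        try:
            chars.append(stream.read_char())
        except OSError:
            return "".join(chars)


DATA = b"# \xa9\nprint 'hi'\n"
print(repr(tokenize(RawStream(DATA))))  # '# '  (silently truncated)
try:
    tokenize(CheckedStream(RawStream(DATA)))
except SourceEncodingError as exc:
    print("reported:", exc)  # reported: bad byte 0xa9
```

With the plain stream, the malformed byte is indistinguishable from end of file and the rest of the script vanishes. With the wrapper, the token manager's existing handlers are left untouched, yet the error still reaches the caller.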
From: dman <ds...@ri...> - 2001-11-28 17:24:49
|
On Mon, Nov 26, 2001 at 08:25:12PM +0000, Finn Bock wrote:
| [dman]
|
| >| As an additional information point, my JDK1.2 and JDK1.3 also throw
| >| exceptions, but JDK1.4 silently transforms the character into the
| >| unicode-undefined character.
| >
| >I'm not sure that is a good thing (jdk1.4), but maybe you don't have
| >to deal with it. Consider someone who has some source in latin1 (or
| >something else) and has
| >
| >a=F6c =3D "foo"
| >a=FCc =3D "bar"
|
| [I find it a little ironic that my mail agent can't deal with any of
| the newer mail encodings]

I didn't do anything special with my mailer (mutt), but it shows the
message as "ISO-8859-1" encoded. I simply picked two characters near
the end of the latin1 encoding. They are vowels with some funny
decorations (I think they're called umlauts, but I'm really not sure).
I use vim 6 (with the less.vim macro) as my pager, and it showed it
correctly. Interestingly enough, that copy of vim was built without
multibyte support, so the 'enc' and 'fenc' settings weren't available.

| >If java uses UTF-8 as the encoding, then those two names
|
| Non-ascii chars in identifiers? I know CPython sometimes allows that,
| but that is not a feature I plan on adding.

I thought that would be nice to have for non-english developers, but
someone has already said otherwise.

| >will end up
| >being the same if jython will treat the unicode-undefined character as
| >a regular character. This would be an additional condition that
| >should raise an exception.
|
| If you put the non-ascii chars inside the quotes then I agree with your
| example and with your conclusion.

Yeah, that would do it too.

| >| Yes. The generated tokenmanager catches all IOExceptions
| >| (MalformedInputException is a subclass of IOException) and interprets
| >| that as eof.
| > [...]
| >
| >Couldn't you just catch that exception and print out a message then
| >exit right before catching IOException?
|
| There are 43 instances of caught IOException in
| PythonGrammarTokenManager such as:
|
|     try { curChar = input_stream.readChar(); }
|     catch(java.io.IOException e) {
|         jjStopStringLiteralDfa_10(0, 0L, active1);
|         return 1;
|     }
|
| We probably have to catch the MalformedInputException in the
| ReaderCharStream and throw something that will get past most of the
| catch clauses in the parser.

What if the exception gets turned into IOError (the python exception)?

I just noticed that you said "generated" parser. That may make it
easier or harder to add the proper catches.

I should probably file a bug report, right?

-D

--
(E)ighteen (M)egs (A)nd (C)onstantly (S)wapping
|