Thread: [Jython-users] jythonc not working -- solved, but strange

[Jython-users] jythonc not working -- solved, but strange

From: dman <ds...@ri...> - 2001-11-25 01:57:55

On one of my Debian woody boxes jythonc stopped working a while ago.
Jython still worked, but running 'jythonc' would give no output.  I
have now solved the problem, but I think it involves a bug in jython.

I traced through how jythonc was supposed to be run -- it is pretty
straightforward : jython is run with
/usr/share/jython/Tools/jythonc/jythonc.py as the first argument (and
any other arguments are passed to the script).  I added a print to the
top of jythonc.py, but it wouldn't get printed.  It was really strange
because I could create a "hello world" program and it would work.  As
I took a deeper look, looking at main.py I noticed that there were
several Form Feed characters in it.  I removed those (from the other
source files as well) but those had no bearing on my problem.  (I
don't think there is a reason to have form feeds anyways, unless
perhaps one intends to "cat <source> > /dev/lp0" with an old printer)
The solution, as it turned out, was to open each of the source files,
convert them to utf-8 and save them again.

What difference does it make to jython whether a (python) source file
is saved in latin1 or utf-8?  In any case, I think it is a gross error
to simply terminate with no message when encountering a file that it
doesn't like.  I started the conversion to utf-8 from main.py, and
tried running jythonc after each file was changed.  It would give me
"ImportError" or "AttributeError" when an import of a non-converted
file was encountered.  Once I had converted all files jythonc worked
properly.
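
A minimal sketch of that conversion step, written in present-day
Python 3 rather than the Jython 2.1 used in this thread.  The function
name reencode and the throwaway demo file are hypothetical; only the
latin1 -> utf-8 direction comes from the description above.

```python
import os
import tempfile


def reencode(path, src="latin-1", dst="utf-8"):
    """Rewrite the file at `path`, converting its bytes from `src` to `dst`."""
    with open(path, "rb") as f:
        raw = f.read()
    # latin-1 maps every byte value to a character, so this decode cannot fail.
    text = raw.decode(src)
    with open(path, "wb") as f:
        f.write(text.encode(dst))


# Demonstration on a temporary file holding a latin-1 copyright comment.
fd, demo_path = tempfile.mkstemp(suffix=".py")
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"# Copyright \xa9 example\nprint('hello world')\n")

reencode(demo_path)
with open(demo_path, "rb") as f:
    converted = f.read()
os.remove(demo_path)

print(converted)
```

After the call, the single byte 0xa9 has become the two bytes 0xc2 0xa9
and every ASCII byte is unchanged.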

The interesting thing about jythonc's source files is that they all
have the copyright symbol in a comment at the top of the file.  In
'latin1' this is character 0xa9.  I use (g)vim 6.0 as my editor.  As
you may already know it has two variables, 'enc' and 'fenc'.  'enc' is
the global encoding specifier.  I can set it to "latin1" or "utf-8"
(and probably others, but I haven't tried them).  'fenc' is a setting
that is local to the current buffer and specifies what encoding the
file should be written as.  I can set that to "latin1" or "utf-8"
also.  I created 4 files containing only the copyright symbol, each
file with a different combination of 'enc' and 'fenc' settings.
Interestingly enough, both files with 'fenc' set to "latin1" contained
only  0xa9 0xa0 (when viewed with a hex editor).  The file with
enc=latin1, fenc=utf-8 contained 0xc2 0xa9 0xa0.  The file with
enc=utf-8, fenc=utf-8 contained 0x00 0x70 0xa0.  I think this
copyright character and its encoding may be the source of the whole
problem.  I'll check with the vim folks too regarding the differences
in the two utf-8 files.  Hmm, when I open them again, the utf8-utf8
file is messed up (shows ^@p) but the latin1-utf8 file is correct.  I
used latin1-utf8 as the settings when I converted the jythonc sources.
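
For reference, the byte values of the copyright symbol can be checked
directly.  This is a small Python 3 sketch (not Jython 2.1 syntax); it
agrees with the 0xa9 seen in the latin1 files and with the 0xc2 0xa9
seen in the enc=latin1, fenc=utf-8 file.

```python
copyright_sign = "\u00a9"  # the copyright symbol as a Unicode code point

latin1_bytes = copyright_sign.encode("latin-1")
utf8_bytes = copyright_sign.encode("utf-8")

print(latin1_bytes.hex())  # a9
print(utf8_bytes.hex())    # c2a9
```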

-D

Re: [Jython-users] jythonc not working -- solved, but strange

From: <bc...@wo...> - 2001-11-26 15:55:11

[dman]

>On one of my Debian woody boxes jythonc stopped working a while ago.
>Jython still worked, but running 'jythonc' would give no output.  I
>have now solved the problem, but I think it involves a bug in jython.
>
>I traced through how jythonc was supposed to be run -- it is pretty
>straightforward : jython is run with
>/usr/share/jython/Tools/jythonc/jythonc.py as the first argument (and
>any other arguments are passed to the script).  I added a print to the
>top of jythonc.py, but it wouldn't get printed.  It was really strange
>because I could create a "hello world" program and it would work.  As
>I took a deeper look, looking at main.py I noticed that there were
>several Form Feed characters in it.  I removed those (from the other
>source files as well) but those had no bearing on my problem.  (I
>don't think there is a reason to have form feeds anyways, unless
>perhaps one intends to "cat <source> > /dev/lp0" with an old printer)
>The solution, as it turned out, was to open each of the source files,
>convert them to utf-8 and save them again.
>
>What difference does it make to jython whether a (python) source file
>is saved in latin1 or utf-8?  In any case, I think it is a gross error
>to simply terminate with no message when encountering a file that it
>doesn't like.

Sure. Normally jython doesn't. So what is special about woody?

>I started the conversion to utf-8 from main.py,

I have now removed the latin-1 copyright character in the CVS version.

>...
>The interesting thing about jythonc's source files is that they all
>have the copyright symbol in a comment at the top of the file.  In
>'latin1' this is character 0xa9.  

The Python source files are read as text files with an InputStreamReader
using the default encoding for the platform. Normally that is a good way
to read text files, but a side effect is that Python source programs with
non-ASCII characters aren't portable to other platforms with a different
encoding.

I don't know what the cause is, but these experiments might help shed
light on it.

What file encoding is used in your setup of woody?

>>> import java
>>> java.lang.System.getProperty("file.encoding")
'Cp1252'
>>>


Whatever the encoding used is, it may be unable to handle 0xA9
correctly:

>>> from java import io
>>> s = io.FileOutputStream("foo")
>>> s.write("\xA9")
>>> s.close()
>>> s = io.FileReader("foo")
>>> print hex(s.read())
0xa9
>>> s.close()
>>>


>I use (g)vim 6.0 as my editor.  As
>you may already know it has two variables, 'enc' and 'fenc'.

You could change the file encoding of the source files. You would then
have to change the encoding used by java as well. But I strongly doubt
that you want to go there. If latin1 is suitable for your country and
language, stick with that.

regards,
finn

Re: [Jython-users] jythonc not working -- solved, but strange

From: dman <ds...@ri...> - 2001-11-26 17:11:56

Attachments: hello_latin1.py

On Mon, Nov 26, 2001 at 03:58:39PM +0000, Finn Bock wrote:
| [dman]

(
  Short version :
    jython gives no result when running scripts encoded in latin1 with
    non-ASCII chars in them.
)

| >What difference does it make to jython whether a (python) source file
| >is saved in latin1 or utf-8?  In any case, I think it is a gross error
| >to simply terminate with no message when encountering a file that it
| >doesn't like.
| 
| Sure. Normally jython doesn't. So what is special about woody?

See below.  I have now figured out the source of this problem.

| >I started the conversion to utf-8 from main.py,
| 
| I have now removed the latin-1 copyright character in the CVS version.

Cool.  That will certainly fix all portability problems since ASCII is
a common subset of all encodings AFAIK (latin1 and utf-8 for sure).

| >...
| >The interesting thing about jythonc's source files is that they all
| >have the copyright symbol in a comment at the top of the file.  In
| >'latin1' this is character 0xa9.  
| 
| The Python source files are read as text files with an InputStreamReader
| using the default encoding for the platform. Normally that is a good way
| to read text files, but a side effect is that Python source programs with
| non-ASCII characters aren't portable to other platforms with a different
| encoding.
| 
| I don't know what the cause is, but these experiments might help shed
| light on it.
| 
| What file encoding is used in your setup of woody?
| 
| >>> import java
| >>> java.lang.System.getProperty("file.encoding")
| 'Cp1252'
| >>>

The woody machine I have at work had no problems running jythonc, just
my machine at home.  I remembered late last night that I had set $LANG
to en_US.UTF-8 at home.  Now that I am at work, I checked with that
machine and it has $LANG set to the default of "C".  If I tried
"LANG=en_US.UTF-8 jythonc --help" it failed the same as it was doing
at home.

With LANG=C, the encoding used by java is "ISO-8859-1", with
LANG=en_US.UTF-8 the encoding is "UTF-8".

| Whatever the encoding used is, it may be unable to handle 0xA9
| correctly:

Perhaps, and perhaps java is broken?

I created "hello world" with the copyright symbol in a comment.  I did
this with both latin1 and utf-8.

$ LANG=en_US python2.2 hello_latin1.py 
hello world

$ LANG=en_US python2.2 hello_utf-8.py 
hello world

$ LANG=en_US.UTF-8 python2.2 hello_latin1.py 
hello world

$ LANG=en_US.UTF-8 python2.2 hello_utf-8.py 
hello world

$ LANG=en_US jython hello_latin1.py 
hello world

$ LANG=en_US jython hello_utf-8.py 
hello world

$ LANG=en_US.UTF-8 jython hello_latin1.py

$ LANG=en_US.UTF-8 jython hello_utf-8.py 
hello world

$

As you can see, CPython (2.2b1) has no problems with the script
regardless of environment and file encoding, however Java can't handle
a latin1 file with the environment set to UTF-8.

I should do some experiments at the Java level and see what it does in
that situation.  Maybe it causes a problem in Jython's parsing (i.e. the
comment ends up extending to the end of the file) or maybe there is
some error that is silently ignored.

| >>> from java import io
| >>> s = io.FileOutputStream("foo")
| >>> s.write("\xA9")
| >>> s.close()
| >>> s = io.FileReader("foo")
| >>> print hex(s.read())
| 0xa9
| >>> s.close()
| >>>

I just did a quick test using jython (interactive coding is very
cool!) :

$ LANG=en_US.UTF-8 jython 
Jython 2.1a1 on java1.3.1 (JIT: null)
>>> from java.io import *
>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
>>> while 1 : print f.read()
... 
10
35
Traceback (innermost last):
  File "<console>", line 1, in ?
sun.io.MalformedInputException
        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
        at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
        at java.io.InputStreamReader.fill(InputStreamReader.java:166)
        at java.io.InputStreamReader.read(InputStreamReader.java:249)
        at java.io.InputStreamReader.read(InputStreamReader.java:222)
        at java.lang.reflect.Method.invoke(Native Method)
        at org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:160)
        at org.python.core.PyMethod.__call__(PyMethod.java:96)
        at org.python.core.PyObject.__call__(PyObject.java:262)
        at org.python.core.PyInstance.invoke(PyInstance.java:244)
        at org.python.pycode._pyx3.f$0(<console>:1)
        at org.python.pycode._pyx3.call_function(<console>)
        at org.python.core.PyTableCode.call(PyTableCode.java:198)
        at org.python.core.PyCode.call(PyCode.java:13)
        at org.python.core.Py.runCode(Py.java:1075)
        at org.python.core.Py.exec(Py.java:1096)
        at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:145)
        at org.python.util.InteractiveInterpreter.runcode(InteractiveInterpreter.java:87)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:68)
        at org.python.util.InteractiveInterpreter.runsource(InteractiveInterpreter.java:42)
        at org.python.util.InteractiveConsole.push(InteractiveConsole.java:83)
        at org.python.util.InteractiveConsole.interact(InteractiveConsole.java:62)
        at org.python.util.jython.main(jython.java:183)

sun.io.MalformedInputException: sun.io.MalformedInputException
>>> 

I'll attach the file so you can see it for yourself.  It looks like
jython catches this exception, but silently ignores it.  Perhaps it
would be a good idea to try and fall back to latin1, then display an
error message if that fails too.
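
The fallback idea can be sketched as follows, in Python 3 and with a
hypothetical helper name decode_with_fallback.  One property worth
noting: latin1 assigns a character to every byte value, so the latin1
attempt cannot fail, and the final error message would never be
reached with this particular fallback.

```python
def decode_with_fallback(raw, primary="utf-8", fallback="latin-1"):
    """Try `primary` first; on failure decode with `fallback` instead.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return raw.decode(primary), primary
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# Valid UTF-8 is decoded as UTF-8.
text_a, used_a = decode_with_fallback(b"# \xc2\xa9\n")

# A lone 0xa9 is not valid UTF-8, so the latin-1 fallback is used.
text_b, used_b = decode_with_fallback(b"# \xa9\n")

# Every possible byte value decodes under latin-1 without error.
text_c, used_c = decode_with_fallback(bytes(range(256)))

print(used_a, used_b, used_c)  # utf-8 latin-1 latin-1
```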

| >I use (g)vim 6.0 as my editor.  As
| >you may already know it has two variables, 'enc' and 'fenc'.
| 
| You could change the file encoding of the source files. You would then

I did.

| have to change the encoding used by java as well. But I strongly doubt

It was already changed -- changing the encoding of the files caused
them to match the encoding java was using.

| that you want to go there. If latin1 is suitable for your country and
| language, stick with that.

I suppose maybe I should.  At least I know what to look for now if it
happens again :-).

-D

-- 

Even youths grow tired and weary,
    and young men stumble and fall;
but those who hope in the Lord 
    will renew their strength.
They will soar on wings like eagles;
    they will run and not grow weary,
    they will walk and not be faint.

        Isaiah 40:31

Re: [Jython-users] jythonc not working -- solved, but strange

From: <bc...@wo...> - 2001-11-26 19:34:06

[dman]

>(
>  Short version :
>    jython gives no result when running scripts encoded in latin1 with
>    non-ASCII chars in them.
>)

>| Whatever the encoding used is, it may be unable to handle 0xA9
>| correctly:
>
>Perhaps, and perhaps java is broken?

Don't think so. The first byte of a multibyte sequence must be in the
range 0xC0 to 0xFD. So a file with a latin copyright character is not a
valid UTF-8 text file.
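
The point about 0xA9 can be checked with a few lines of Python 3 (a
sketch, not Jython code).  A byte of the form 10xxxxxx, i.e. 0x80 to
0xBF, can only continue a multibyte sequence, so a lone 0xA9 cannot
start a character and a strict UTF-8 decoder rejects it.  The helper
name is_continuation_byte and the sample bytes are made up for
illustration.

```python
def is_continuation_byte(b):
    """True if `b` has the bit pattern 10xxxxxx (0x80..0xBF)."""
    return (b & 0xC0) == 0x80


raw = b"# \xa9\nprint('hello world')\n"  # latin-1 copyright sign in a comment

bad_offset = None
try:
    raw.decode("utf-8")  # strict decoding
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    bad_offset = exc.start  # index of the first offending byte

print(is_continuation_byte(0xA9))  # True
print(decoded_ok, bad_offset)      # False 2
```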

As an additional data point, my JDK1.2 and JDK1.3 also throw
exceptions, but JDK1.4 silently transforms the character into the
unicode-undefined character.

>As you can see, CPython (2.2b1) has no problems with the script
>regardless of environment and file encoding, 

That simplicity will not last. Eventually even CPython will have ways to
deal with the encoding of python source files.

>$ LANG=en_US.UTF-8 jython 
>Jython 2.1a1 on java1.3.1 (JIT: null)
>>>> from java.io import *
>>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
>>>> while 1 : print f.read()
>... 
>10
>35
>Traceback (innermost last):
>  File "<console>", line 1, in ?
>sun.io.MalformedInputException
>        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
>
>I'll attach the file so you can see it for yourself.  It looks like
>jython catches this exception, but silently ignores it.

Yes. The generated tokenmanager catches all IOExceptions
(MalformedInputException is a subclass of IOException) and interprets
that as eof.
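
A rough Python 3 analogue of that behaviour, not the actual generated
tokenmanager code: the reader below swallows the decoding error and
simply stops, so the caller sees a short but apparently complete
input.  The function name and the sample source are hypothetical.

```python
import codecs


def read_treating_errors_as_eof(raw, encoding):
    """Decode `raw` byte by byte; stop quietly at the first undecodable byte."""
    decoder = codecs.getincrementaldecoder(encoding)()  # strict error handling
    chars = []
    for i in range(len(raw)):
        try:
            chars.append(decoder.decode(raw[i:i + 1]))
        except UnicodeDecodeError:
            break  # the error is swallowed and looks like end of file
    return "".join(chars)


source = b"# \xa9 copyright comment\nprint('hello world')\n"

seen_as_utf8 = read_treating_errors_as_eof(source, "utf-8")
seen_as_latin1 = read_treating_errors_as_eof(source, "latin-1")

print(repr(seen_as_utf8))                          # '# '
print(seen_as_latin1 == source.decode("latin-1"))  # True
```

Read as UTF-8, the "program" ends inside the first comment, which
matches a script that runs without output and without an error.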

>Perhaps it would be a good idea to try and fall back to latin1, 

Nah, no guessing IMO.

>then display an error message if that fails too.

That doesn't seem to be as easy as it rightly should have been.

regards,
finn

Re: [Jython-users] jythonc not working -- solved, but strange

From: dman <ds...@ri...> - 2001-11-26 19:57:37

On Mon, Nov 26, 2001 at 07:37:33PM +0000, Finn Bock wrote:
| [dman]
|
| >(
| >  Short version :
| >    jython gives no result when running scripts encoded in latin1 with
| >    non-ASCII chars in them.
| >)
|
| >| Whatever the encoding used is, it may be unable to handle 0xA9
| >| correctly:
| >
| >Perhaps, and perhaps java is broken?
|
| Don't think so. The first byte of a multibyte sequence must be in the
| range 0xC0 to 0xFD. So a file with a latin copyright character is not a
| valid UTF-8 text file.

At least someone here has read the spec :-).

| As an additional data point, my JDK1.2 and JDK1.3 also throw
| exceptions, but JDK1.4 silently transforms the character into the
| unicode-undefined character.

I'm not sure that is a good thing (jdk1.4), but maybe you don't have
to deal with it.  Consider someone who has some source in latin1 (or
something else) and has

aöc = "foo"
aüc = "bar"


If java uses UTF-8 as the encoding, then those two names will end up
being the same if jython will treat the unicode-undefined character as
a regular character.  This would be an additional condition that
should raise an exception.
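
The collision can be reproduced in Python 3 (a sketch; Jython's own
handling may differ).  The two names are given as the latin1 byte
sequences 0x61 0xF6 0x63 and 0x61 0xFC 0x63, and errors="replace"
stands in for a decoder that substitutes the undefined/replacement
character U+FFFD instead of raising.

```python
name_1 = b"a\xf6c"  # "a", o-umlaut, "c" in latin1
name_2 = b"a\xfcc"  # "a", u-umlaut, "c" in latin1

# Decoded as latin1, the two names stay distinct.
distinct_in_latin1 = name_1.decode("latin-1") != name_2.decode("latin-1")

# Decoded as UTF-8 with replacement, each bad byte becomes U+FFFD.
lossy_1 = name_1.decode("utf-8", errors="replace")
lossy_2 = name_2.decode("utf-8", errors="replace")

print(distinct_in_latin1)  # True
print(lossy_1 == lossy_2)  # True
```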

| >As you can see, CPython (2.2b1) has no problems with the script
| >regardless of environment and file encoding,
|
| That simplicity will not last. Eventually even CPython will have ways to
| deal with the encoding of python source files.

Ok.

| >$ LANG=en_US.UTF-8 jython
| >Jython 2.1a1 on java1.3.1 (JIT: null)
| >>>> from java.io import *
| >>>> f = InputStreamReader( FileInputStream( "hello_latin1.py" ) )
| >>>> while 1 : print f.read()
| >...
| >10
| >35
| >Traceback (innermost last):
| >  File "<console>", line 1, in ?
| >sun.io.MalformedInputException
| >        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
| >
| >I'll attach the file so you can see it for yourself.  It looks like
| >jython catches this exception, but silently ignores it.
|
| Yes. The generated tokenmanager catches all IOExceptions
| (MalformedInputException is a subclass of IOException) and interprets
| that as eof.

EOF would certainly explain why I didn't get any output or error
message.  Jython successfully executed nothing :-).

| >Perhaps it would be a good idea to try and fall back to latin1,
|
| Nah, no guessing IMO.

Ok.

| >then display an error message if that fails too.
|
| That doesn't seem to be as easy as it rightly should have been.

Couldn't you just catch that exception and print out a message then
exit right before catching IOException?  It might be better to convert
the exception into a different (python) exception.  Yeah, for
execfile() the interpreter shouldn't exit because the file is encoded
wrong.

-D


-- 

"...the word HACK is used as a verb to indicate a massive amount
of nerd-like effort."  -Harley Hahn, A Student's Guide to Unix

Re: [Jython-users] jythonc not working -- solved, but strange

From: <bc...@wo...> - 2001-11-26 20:21:45

[dman]

>| As an additional data point, my JDK1.2 and JDK1.3 also throw
>| exceptions, but JDK1.4 silently transforms the character into the
>| unicode-undefined character.
>
>I'm not sure that is a good thing (jdk1.4), but maybe you don't have
>to deal with it.  Consider someone who has some source in latin1 (or
>something else) and has
>
>a=F6c =3D "foo"
>a=FCc =3D "bar"

[I find it a little ironic that my mail agent can't deal with any of the
newer mail encodings]

>If java uses UTF-8 as the encoding, then those two names 

Non-ascii chars in identifiers? I know CPython sometimes allows that, but
that is not a feature I plan on adding.

>will end up
>being the same if jython will treat the unicode-undefined character as
>a regular character. This would be an additional condition that
>should raise an exception.

If you put the non-ascii chars inside the quotes then I agree with your
example and with your conclusion.

>| Yes. The generated tokenmanager catches all IOExceptions
>| (MalformedInputException is a subclass of IOException) and interprets
>| that as eof.
> [...]
>
>Couldn't you just catch that exception and print out a message then
>exit right before catching IOException?  

There are 43 instances of caught IOException in
PythonGrammarTokenManager such as:

   try { curChar = input_stream.readChar(); }
   catch(java.io.IOException e) {
      jjStopStringLiteralDfa_10(0, 0L, active1);
      return 1;
   }

We probably have to catch the MalformedInputException in the
ReaderCharStream and throw something that will get past most of the
catch clauses in the parser.
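
The shape of that idea, sketched in Python 3 with made-up names
(CharStream, SourceEncodingError, tokenize); this is not the Jython
parser code.  The stream converts the decoding failure into an error
type that the tokenizer's catch-all for I/O errors does not match, so
it reaches the caller instead of being read as end of input.  In
Python itself UnicodeDecodeError is already unrelated to IOError, so
the conversion is shown only to mirror the Java structure.

```python
import codecs


class SourceEncodingError(Exception):
    """Hypothetical error type; deliberately NOT a subclass of IOError."""


class CharStream:
    """Hypothetical stand-in for a character stream feeding a tokenizer."""

    def __init__(self, raw, encoding):
        self._raw = raw
        self._pos = 0
        self._decoder = codecs.getincrementaldecoder(encoding)()  # strict

    def read_char(self):
        try:
            while self._pos < len(self._raw):
                byte = self._raw[self._pos:self._pos + 1]
                self._pos += 1
                ch = self._decoder.decode(byte)
                if ch:  # an empty string means "need more bytes"
                    return ch
            # Flush: raises if the input ended in the middle of a character.
            self._decoder.decode(b"", final=True)
        except UnicodeDecodeError as exc:
            raise SourceEncodingError(
                "undecodable input near byte offset %d" % (self._pos - 1)
            ) from exc
        raise IOError("end of input")


def tokenize(stream):
    """Collect characters until the stream signals end of input."""
    chars = []
    while True:
        try:
            chars.append(stream.read_char())
        except IOError:  # plays the role of the catch(IOException) clauses
            return "".join(chars)


print(repr(tokenize(CharStream(b"x = 1\n", "utf-8"))))  # 'x = 1\n'

try:
    tokenize(CharStream(b"# \xa9\n", "utf-8"))
    encoding_error_reported = False
except SourceEncodingError:
    encoding_error_reported = True

print(encoding_error_reported)  # True
```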

regards,
finn

Re: [Jython-users] jythonc not working -- solved, but strange

From: dman <ds...@ri...> - 2001-11-28 17:24:49

On Mon, Nov 26, 2001 at 08:25:12PM +0000, Finn Bock wrote:
| [dman]
| 
| >| As an additional data point, my JDK1.2 and JDK1.3 also throw
| >| exceptions, but JDK1.4 silently transforms the character into the
| >| unicode-undefined character.
| >
| >I'm not sure that is a good thing (jdk1.4), but maybe you don't have
| >to deal with it.  Consider someone who has some source in latin1 (or
| >something else) and has
| >
| >a=F6c =3D "foo"
| >a=FCc =3D "bar"
| 
| [I find it a little ironic that my mail agent can't deal with any of the
| newer mail encodings]

I didn't do anything special with my mailer (mutt), but it shows the
message as "ISO-8859-1" encoded.  I simply picked two characters near
the end of the latin1 encoding.  They are vowels with some funny
decorations (I think they're called umlauts, but I'm really not sure).

I use vim 6 (with the less.vim macro) as my pager, and it showed it
correctly.  Interestingly enough, that copy of vim was built without
multibyte support, so the 'enc' and 'fenc' settings weren't available.

| >If java uses UTF-8 as the encoding, then those two names 
| 
| Non-ascii chars in identifiers? I know CPython sometimes allows that, but
| that is not a feature I plan on adding.

I thought that would be nice to have for non-english developers, but
someone has already said otherwise.
 
| >will end up
| >being the same if jython will treat the unicode-undefined character as
| >a regular character. This would be an additional condition that
| >should raise an exception.
| 
| If you put the non-ascii chars inside the quotes then I agree with your
| example and with your conclusion.

Yeah, that would do it too.

| >| Yes. The generated tokenmanager catches all IOExceptions
| >| (MalformedInputException is a subclass of IOException) and interprets
| >| that as eof.
| > [...]
| >
| >Couldn't you just catch that exception and print out a message then
| >exit right before catching IOException?  
| 
| There are 43 instances of caught IOException in
| PythonGrammarTokenManager such as:
| 
|    try { curChar = input_stream.readChar(); }
|    catch(java.io.IOException e) {
|       jjStopStringLiteralDfa_10(0, 0L, active1);
|       return 1;
|    }
| 
| We probably have to catch the MalformedInputException in the
| ReaderCharStream and throw something that will get past most of the
| catch clauses in the parser.

What if the exception gets turned into IOError (the python exception)?
I just noticed that you said "generated" parser.  That may make it
easier or harder to add the proper catches.

I should probably file a bug report, right?

-D

-- 

(E)ighteen (M)egs (A)nd (C)onstantly (S)wapping