[Jython-bugs] [ jython-Bugs-1840479 ] coding: utf-8 and PEP 0263?


Bugs item #1840479, was opened at 2007-11-28 19:47
Message generated for change (Comment added) made by otmarhumbel
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=112867&aid=1840479&group_id=12867

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jörg Höhle (hoehle)
Assigned to: Nobody/Anonymous (nobody)
Summary: coding: utf-8 and PEP 0263?

Initial Comment:
Hi,

My understanding of PEP 0263 is that the "coding: utf-8" declaration in
the first line should determine how .py files are read.
However, the PEP says "Python-Version: 2.3",
whereas jython-2.2 is documented as corresponding to Python 2.2.
http://www.python.org/dev/peps/pep-0263/
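
For reference, a minimal sketch of the declaration lookup that PEP 0263
describes: the first or second line of the file is searched for a comment
matching the pattern coding[:=]\s*([-\w.]+). The function name
declared_encoding and its default argument are hypothetical, the comment
check is simplified, and the code targets a current CPython rather than
Jython 2.2; it does not show how Jython actually reads files.

```python
import re

# Pattern quoted in PEP 0263 for the magic comment.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_bytes, default="ascii"):
    """Return the encoding named in a PEP 0263 comment, or `default`.

    Hypothetical helper for illustration only. Only the first two
    lines are inspected, and only if they are comment lines.
    """
    for line in source_bytes.split(b"\n")[:2]:
        # latin-1 maps every byte to a character, so this never fails.
        text = line.decode("latin-1")
        if not text.lstrip().startswith("#"):
            continue
        match = CODING_RE.search(text)
        if match:
            return match.group(1)
    return default


if __name__ == "__main__":
    header = b"# foo.py -*- coding: utf-8 -*-\n"
    print(declared_encoding(header))
```

A reader that honoured the declaration would then decode the remaining
bytes of the file with the returned encoding instead of the platform default.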

So this may be a feature request rather than a bug.

How can I use UTF-8 umlauts in my .py files with Jython?

# foo.py -*- coding: utf-8 -*- http://www.python.org/peps/pep-0263.html
inlineds =  "zäöü!"
inlinedu = u"zäöü!"
explicits=  "z\u00e4\u00f6\u00fc!"
explicitu= u"z\u00e4\u00f6\u00fc!"
all4=[inlineds,inlinedu,explicits,explicitu]
print all4, [len(s) for s in all4]

On a RedHat 5 system this produces:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
Jython 2.2 on java1.6.0_05-ea
uname -a
Linux foo.xy 2.6.9-55.0.9.ELsmp #1 SMP Tue Sep 25 02:16:15 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
LANG=de_DE@UTF-8

Debian produces expected results:
['z\xE4\xF6\xFC!', u'z\xE4\xF6\xFC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [5,5,20,5]
Jython 2.2 on java1.6.0_02
uname -a
Linux debianbasic 2.6.18-5-686 #1 ... i686 GNU/Linux
LANG=de_DE.UTF-8

However, even on the Debian system changing $LANG gives
LANG=C ./jython.sh foo.py
[u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', u'z\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]

Everything happens as if Jython reads the .py file using Java's default
encoding (which is influenced by $LANG but cannot directly be set AFAIK).

java.nio.charset.Charset.defaultCharset()
java.io.OutputStreamWriter(java.io.ByteArrayOutputStream()).getEncoding()
yields Java's default encoding.
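
The string lengths reported above are consistent with the UTF-8 bytes of
the file being decoded with the wrong charset. The following sketch
reproduces the three cases in plain Python (written for a current CPython,
not Jython 2.2). The variable names are illustrative only, and the choice
of ISO-8859-1 and ASCII as the "wrong" charsets is an assumption based on
the output shown, not something taken from the Jython source.

```python
# -*- coding: utf-8 -*-
# The test string from foo.py, written with escapes so that this
# example does not depend on how the file itself is decoded.
text = u"z\u00e4\u00f6\u00fc!"

# What is stored on disk when foo.py is saved as UTF-8:
# each umlaut becomes two bytes, so 5 characters become 8 bytes.
raw = text.encode("utf-8")

# Decoded with the declared encoding: 5 characters, as expected.
as_utf8 = raw.decode("utf-8")

# Decoded byte-for-byte as ISO-8859-1: 8 characters (C3 A4 C3 B6 C3 BC).
as_latin1 = raw.decode("iso-8859-1")

# Decoded as ASCII with replacement: every non-ASCII byte
# becomes U+FFFD, giving 8 characters again.
as_ascii = raw.decode("ascii", "replace")

for label, s in [("utf-8", as_utf8),
                 ("iso-8859-1", as_latin1),
                 ("ascii", as_ascii)]:
    print("%-10s %r %d" % (label, s, len(s)))
```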

I've now installed 2.2.1 and the results change, although they are still
not satisfactory. The Debian system now always yields:
['z\xC3\xA4\xC3\xB6\xC3\xBC!', u'z\xC3\xA4\xC3\xB6\xC3\xBC!', 'z\\u00e4\\u00f6\\u00fc!', u'z\xE4\xF6\xFC!'] [8, 8, 20, 5]
like RedHat before, regardless of $LANG.

Thus jython-2.2.1 seems to strictly assume ISO-8859-1 in .py files.
At least the 2.2.1 behaviour is consistent between the RedHat and
Debian systems I tested.

Regards,
 Jörg Höhle

----------------------------------------------------------------------

>Comment By: Otmar Humbel (otmarhumbel)
Date: 2007-11-28 22:08

Message:
Logged In: YES 
user_id=105844
Originator: NO

I am pretty sure it is a missing feature, since I've been missing it too.
Standalone mode should not make any difference here.

----------------------------------------------------------------------

Comment By: Jörg Höhle (hoehle)
Date: 2007-11-28 19:51

Message:
Logged In: YES 
user_id=377168
Originator: YES

I should mention that I'm using standalone-mode (for ease of use for my
Java colleagues).

----------------------------------------------------------------------
