Re: [Jython-users] character encoding issues

Hello all,

I've been planning to mail about my own weird unicode issues as well, but
since I haven't been able to reproduce the actual problem with a simple
example, I haven't bothered yet. In our case the problem appears when we
read user-written content from standard output and convert everything to
unicode, assuming the content is utf-8. We don't have any control over
what is actually given to us, and some of our tests, which verify that
nothing breaks even with random binary data, fail pretty mysteriously on
Jython but pass on Python.

It seems that there are some issues when converting bytes into unicode,
at least in some corner cases. As I said, I haven't been able to
reproduce this with a simple test, but I have found at least one
pretty weird behavior in Jython which may or may not have something to
do with it. The example below illustrates it. I have no idea what the
'uu' prefix on the last two results means.

Jython 2.2rc1 on java1.5.0_11
Type "copyright", "credits" or "license" for more information.
>>> for c in [1, 80, 127, 128, 255]:
...   chr(c).decode('utf-8', 'replace')
...
u'\x01'
u'P'
u'\x7F'
uu'\uFFFD'
uu'\uFFFD'

In CPython (tested with 2.5 on Windows) you get otherwise the same output,
but those mysterious 'uu' prefixes are normal 'u' prefixes.
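
For comparison, here is a minimal sketch of the same experiment in Python 3 syntax (an assumption; the session above is Jython 2.2, where chr() returns a byte string). The helper name decode_single_bytes is hypothetical. Each invalid lone byte should come back as the single replacement character U+FFFD:

```python
def decode_single_bytes(values):
    """Decode each integer as a one-byte UTF-8 input, replacing invalid bytes."""
    return [bytes([v]).decode("utf-8", "replace") for v in values]

results = decode_single_bytes([1, 80, 127, 128, 255])
for ch in results:
    # Show the repr and the code point so any extra character would be visible.
    print(repr(ch), hex(ord(ch)))
```

If an implementation returned anything other than one character per input byte here, that would point at the decoder rather than at repr().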

Cheers,
    .peke

2007/7/2, Arye <ar...@bi...>:
> Dear Alan Kennedy,
>
> Thanks for your help. Indeed, reading a file with codecs.open in jython
> works fine.
>
> However I am still wondering why codecs.open works and
> the regular file open does not as demonstrated in the little
> program below:
>
> I saved the string "®™°" (three characters: Registered, Trade Mark and
> Degrees) in the file "myfunnychars.txt" in utf-8 encoding, using CPython:
>
> #!/usr/bin/env python
> # -*- mode: pymode; coding: latin1; -*-
>
> myfunnychars = u"®™°"
> my_utf8 = myfunnychars.encode("utf-8")
> output_utf8 = open("myfunnychars.txt", "wb")
> output_utf8.write(my_utf8)
> output_utf8.close()
>
>
> ***************************************
> import codecs
>
> input1 = open("myfunnychars.txt", "r")
> str1 = input1.read()
> print "type(str1)=", type(str1)
>
> input1.close()
> for c in str1:
>     print ord(c),
>
>
>
> input2 = codecs.open("myfunnychars.txt", "r", 'utf-8')
> str2 = input2.read()
> print "\ntype(str2)=", type(str2)
> input2.close()
>
> for c in str2:
>     print ord(c),
> ***************************************
>
> output:
> C:\AH\WORK\UTIL>python compare_read_codecs.py
> type(str1)= <type 'str'>
> 194 174 194 153 194 176
> type(str2)= <type 'unicode'>
> 174 153 176
>
> C:\AH\WORK\UTIL>jython compare_read_codecs.py
> type(str1)= <type 'str'>
> 194 174 194 8482 194 176
> type(str2)= <type 'unicode'>
> 174 153 176
>
> This "8482" seems mysterious to me.
>
> All the best,
> Arye.
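
One observation about the 8482: it is 0x2122, the code point of the TRADE MARK SIGN, and Windows code page 1252 maps the byte 0x99 to exactly that character. The sketch below (Python 3 syntax, an assumption, since the thread uses Python 2 and Jython 2.2) decodes the six bytes CPython reported as cp1252 and compares them with the numbers Jython printed. That Jython's plain open() routes the bytes through the platform charset is only a guess consistent with the output, not something confirmed in this thread.

```python
# The six byte values CPython printed for the plain open() read.
raw = bytes([194, 174, 194, 153, 194, 176])

# Assumption: the bytes pass through Windows code page 1252, one character per byte.
as_cp1252 = [ord(ch) for ch in raw.decode("cp1252")]

# Decoding the same bytes as UTF-8 gives the three code points codecs.open returned.
as_utf8 = [ord(ch) for ch in raw.decode("utf-8")]

print(as_cp1252)
print(as_utf8)
```

Under that assumption the first list matches the Jython line "194 174 194 8482 194 176" and the second matches "174 153 176".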
>
>
>
>
> > [Arye]
> > > The little program below behaves differently when run by jython and
> > > Python:
> > > I am trying to encode in utf-8 a unicode string with 3 characters in it:
> > > u"®™°" The "Registered", "Trade Mark", and "Degrees" characters.
> >
> > The fundamental problem here is that jython does not support PEP 263,
> > which permits you to declare the encoding of your source module.
> >
> > Because you have embedded your funny characters directly in your source,
> > jython does not interpret them correctly.
> >
> > CPython will only interpret them correctly if you have an encoding
> > declaration at the top of your source file, like so:
> >
> > # -*- coding: utf-8 -*-
> >
> > (Change the encoding to whatever your text editor produces)
> >
> > Defining Python Source Code Encodings
> > http://www.python.org/dev/peps/pep-0263/
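
To illustrate why the declaration matters: the bytes an editor writes only become the intended characters when they are decoded with the encoding that was used to save them. A minimal sketch in Python 3 syntax (an assumption; the thread itself uses Python 2), taking the degree sign as the example character:

```python
# Bytes an editor would write for the degree sign when saving as UTF-8.
saved = "\u00b0".encode("utf-8")

# Decoded with the matching encoding: one character, U+00B0.
right = saved.decode("utf-8")

# Decoded as Latin-1 instead: two characters, U+00C2 followed by U+00B0.
wrong = saved.decode("latin-1")

print([hex(ord(ch)) for ch in right])
print([hex(ord(ch)) for ch in wrong])
```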
> >
> > There are two solutions to your problem
> >
> > 1. Do not embed the raw characters directly in your source file.
> > Instead, declare them with unicode escapes, like so
> >
> > myfunnychars = u"\u00b0\u00ae\u2122"
> >
> > 2. Keep all unicode strings in files separate from your source, and read
> > them with codecs.open.
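
Both options can be sketched together as below. The file name funnychars.txt and the temporary directory are placeholders chosen for illustration, and the code is written to run on Python 3 (an assumption; the thread targets Python 2 and Jython 2.2):

```python
import codecs
import os
import tempfile

# Option 1: unicode escapes, so the source file itself stays pure ASCII.
funny = u"\u00b0\u00ae\u2122"

# Option 2: keep the text in a separate UTF-8 file (placeholder path).
path = os.path.join(tempfile.mkdtemp(), "funnychars.txt")

out = codecs.open(path, "w", "utf-8")
out.write(funny)
out.close()

# Reading through codecs.open decodes the bytes back into the same text.
inp = codecs.open(path, "r", "utf-8")
text = inp.read()
inp.close()

# Reading in binary mode shows the raw UTF-8 bytes instead.
raw = open(path, "rb")
data = raw.read()
raw.close()

print(text == funny)
print(list(data))
```

The binary read is included only to show that the file holds seven bytes for three characters; the decoded text is what the program should work with.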
> >
> > Try running this code in both cpython and jython; you should get
> > identical results from both
> >
> > # -=-=-=-=-=-=-=-=-=
> > import sys
> > print "sys.getdefaultencoding()=", sys.getdefaultencoding()
> >
> > myfunnychars = u"\u00b0\u00ae\u2122"
> >
> > my_utf8 = myfunnychars.encode("utf-8")
> > print "len(my_utf8)=", len(my_utf8)
> > print "my_utf8=", my_utf8
> >
> > for c in my_utf8:
> >     print ord(c)
> > # -=-=-=-=-=-=-=-=-=
> >
> > Regards,
> >
> > Alan.
>