Dear Alan Kennedy,

Thanks for your help. Indeed, reading a file with codecs.open in jython
works fine.

However, I am still wondering why codecs.open works and
the regular file open does not, as demonstrated in the little
program below.

I saved the string "®™°" (three characters: (R), TM and degrees)
in the file "myfunnychars.txt", encoded as utf-8, using this CPython script:

#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-

myfunnychars = u"®™°"
my_utf8 = myfunnychars.encode("utf-8")
output_utf8 = open("myfunnychars.txt", "wb")
output_utf8.write(my_utf8)
output_utf8.close()


***************************************
import codecs

input1 = open("myfunnychars.txt", "r")
str1 = input1.read()
print "type(str1)=",type(str1)

input1.close()
for c in str1:
    print ord(c),



input2 = codecs.open("myfunnychars.txt", "r", 'utf-8')
str2 = input2.read()
print "\ntype(str2)=",type(str2)
input2.close()

for c in str2:
    print ord(c),
***************************************

output:
C:\AH\WORK\UTIL>python compare_read_codecs.py
type(str1)= <type 'str'>
194 174 194 153 194 176
type(str2)= <type 'unicode'>
174 153 176

C:\AH\WORK\UTIL>jython compare_read_codecs.py
type(str1)= <type 'str'>
194 174 194 8482 194 176
type(str2)= <type 'unicode'>
174 153 176

This "8482" seems mysterious to me.
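
One clue, offered as a guess and not as a confirmed account of what
jython does internally: 8482 is 0x2122, the code point of the trade mark
sign, and Windows code page 1252 maps the single byte 153 (0x99) to
exactly that character, whereas latin-1 maps it to code point 153. The
sketch below only checks that arithmetic. It assumes a Python 3
interpreter (the programs above are Python 2), and the variable names
are illustrative.

```python
# Sketch for Python 3; it checks the byte/code-point arithmetic only and
# does not reproduce jython's file-reading behaviour.
raw = bytes([194, 174, 194, 153, 194, 176])  # bytes CPython printed for str1

# latin-1 maps every byte to the code point with the same number.
as_latin1 = [ord(ch) for ch in raw.decode("latin-1")]

# cp1252 differs from latin-1 in the 128-159 range; byte 153 becomes U+2122.
as_cp1252 = [ord(ch) for ch in raw.decode("cp1252")]

# Decoding the same bytes as utf-8 gives the three code points seen in str2.
as_utf8 = [ord(ch) for ch in raw.decode("utf-8")]

print(as_latin1)
print(as_cp1252)
print(as_utf8)
```

The cp1252 list matches the jython line above number for number, which
would be consistent with the bytes having been decoded with a Windows
code page on the way in.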

All the best,
Arye.



[Arye]
> The little program below behaves differently when run by jython and Python:
> I am trying to encode in utf-8 a unicode string with 3 characters in it:
> u"®™°": the "Registered", "Trade Mark", and "Degrees" characters.

The fundamental problem here is that jython does not support PEP 263,
which permits you to declare the encoding of your source module.

Because you have embedded your funny characters directly in your source,
jython does not interpret them correctly.

CPython will only interpret them correctly if you have an encoding
declaration at the top of your source file, like so:

# -*- coding: utf-8 -*-

(Change the encoding to whatever your text editor produces)

Defining Python Source Code Encodings
http://www.python.org/dev/peps/pep-0263/

There are two solutions to your problem:

1. Do not embed the raw characters directly in your source file.
Instead, declare them with unicode escapes, like so:

myfunnychars = u"\u00b0\u00ae\u2122"

2. Keep all unicode strings in files separate from your source, and read
them with codecs.open.
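
The two options can be combined in a small round trip: build the string
from escapes, write it to a separate utf-8 file, and read it back with
codecs.open. This is only a sketch. It assumes a Python 3 interpreter
(the rest of this thread is Python 2), and the temporary directory and
variable names are hypothetical.

```python
# Sketch for Python 3 (an assumption; the thread itself uses Python 2).
import codecs
import os
import tempfile

# Solution 1: unicode escapes keep the source file pure ASCII.
funny = u"\u00b0\u00ae\u2122"  # degree, registered, trade mark

# Solution 2: keep the text in a separate utf-8 file.
# The temporary directory is a hypothetical location for the example.
path = os.path.join(tempfile.mkdtemp(), "myfunnychars.txt")

out = codecs.open(path, "w", "utf-8")
out.write(funny)
out.close()

# codecs.open decodes while reading, so read() returns text, not raw bytes.
inp = codecs.open(path, "r", "utf-8")
text = inp.read()
inp.close()

code_points = [ord(ch) for ch in text]
print(code_points)
```

Because the decoding is requested explicitly, the result does not depend
on the platform's default encoding.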

Try running this code in both cpython and jython; you should get
identical results from both:

# -=-=-=-=-=-=-=-=-=
import sys
print "sys.getdefaultencoding()=", sys.getdefaultencoding()

myfunnychars = u"\u00b0\u00ae\u2122"

my_utf8 = myfunnychars.encode ("utf-8")
print "len(my_utf8)=",len(my_utf8)
print "my_utf8=",my_utf8

for c in my_utf8:
    print ord(c)
# -=-=-=-=-=-=-=-=-=
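
For comparison, the utf-8 byte values of those three characters can be
worked out independently of what either interpreter prints to the
console. The sketch below is an illustration and not part of the test
program above; the names funny, encoded and byte_values are made up for
it. It uses bytearray so that the same lines run on Python 2 and
Python 3.

```python
# Illustrative check of the expected utf-8 bytes; names are hypothetical.
funny = u"\u00b0\u00ae\u2122"  # degree, registered, trade mark
encoded = funny.encode("utf-8")

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(encoded))

print(len(encoded))
print(byte_values)
```

The degree and registered signs take two bytes each and the trade mark
sign takes three, so the encoded string is seven bytes long.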

Regards,

Alan.