Re: [Jython-users] encoded (UTF) strings in Jython

Иван Чернявский wrote:
> Hi,
>
>> How about using unicode()?  Or the previous poster's use of
>> codecs.open() seems to eliminate the need to have both the byte array
>> and character array in memory.
>
> Yeah these seem more reasonable.
>
>>> By the way, if the file is fully read with f.read() instead of
>>> f.readlines(), everything is u''-encoded correctly. Does anyone know
>>> why? For Python, there's no difference between read() and readlines()
>>> results (I've tried).
>
>> Not for me:
>
> Do you have 'UTF-8' as your system encoding? I do. Maybe it matters?
> By the way, as I said, on my system Python returns unicode strings from
> both read() and readlines(), but you have Jython reading into raw UTF-8
> strings... strange.

I ran that on a Windows XP system, and the default encoding there (at 
least as far as Java is concerned, via 
java.nio.charset.Charset.defaultCharset()) is cp1252 (called 
"windows-1252").  So yeah, that probably explains it.  I'd forgotten you 
have to use "rb" if you want binary mode, so in text mode it must be 
automatically decoding to text, most likely using that encoding.  
Behavior is identical with Python 2.5.2.
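
To make the "rb" point concrete, here is a rough sketch (not the script 
from earlier in the thread; the sample text and the temp file are made up 
for illustration). It writes UTF-8 bytes, reads them back in binary mode, 
and decodes them once as UTF-8 and once as cp1252. It avoids "with" so it 
should suit 2.5-era interpreters too, but treat it as untested on Jython.

```python
import os
import tempfile

# Hypothetical sample text: the Cyrillic name "Ivan" as unicode escapes.
text = u"\u0418\u0432\u0430\u043d"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write the raw UTF-8 bytes; "wb" avoids any text-mode translation.
    f = open(path, "wb")
    try:
        f.write(text.encode("utf-8"))
    finally:
        f.close()

    # Read the same bytes back untouched; "rb" means no implicit decoding.
    f = open(path, "rb")
    try:
        raw = f.read()
    finally:
        f.close()
finally:
    os.remove(path)

# Decoding with the encoding the file was written in recovers the text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as cp1252 yields one character per byte,
# which is the kind of garbling a wrong default charset would produce.
as_cp1252 = raw.decode("cp1252")
```

Reading in binary mode and decoding explicitly takes the platform default 
out of the picture, which is why the result no longer depends on whether 
the machine is set to UTF-8 or cp1252.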

So yeah, you'd think if your default encoding is UTF-8, then read() and 
readlines() should both result in unicode strings.
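
If you don't want to depend on the default at all, the codecs.open() 
approach mentioned earlier in the thread names the encoding explicitly. A 
minimal sketch, assuming a UTF-8 file (the two-line sample text and the 
temp file are hypothetical):

```python
import codecs
import os
import tempfile

# Hypothetical two-line sample text in Cyrillic.
text = u"\u043f\u0435\u0440\u0432\u0430\u044f\n\u0432\u0442\u043e\u0440\u0430\u044f\n"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # codecs.open encodes on write using the encoding given here.
    f = codecs.open(path, "w", encoding="utf-8")
    try:
        f.write(text)
    finally:
        f.close()

    # read() decodes the whole file into one unicode string.
    f = codecs.open(path, "r", encoding="utf-8")
    try:
        whole = f.read()
    finally:
        f.close()

    # readlines() decodes too, returning a list of unicode lines.
    f = codecs.open(path, "r", encoding="utf-8")
    try:
        lines = f.readlines()
    finally:
        f.close()
finally:
    os.remove(path)

unicode_type = type(u"")
all_unicode = isinstance(whole, unicode_type) and all(
    isinstance(line, unicode_type) for line in lines
)
```

With the encoding given explicitly, read() and readlines() should agree 
with each other regardless of the system setting.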

Andy