From: Michael C. <chi...@mi...> - 2008-03-21 15:31:17
Иван Чернявский wrote:
> Hi,
>
>> How about using unicode()? Or the previous poster's use of
>> codecs.open() seems to eliminate the need to have both the byte array
>> and character array in memory.
>
> Yeah, these seem more reasonable.
>
>>> By the way, if the file is fully read with f.read() instead of
>>> f.readlines(), everything is u''-encoded correctly. Does anyone know
>>> why? For Python, there's no difference between read() and
>>> readlines() results (I've tried).
>>
>> Not for me:
>
> Do you have 'UTF-8' as your system encoding? I do. Maybe it matters?
> By the way, as I said, on my system Python reads both read() and
> readlines() results into unicode strings, but you have Jython reading
> into raw utf-8 strings... strange.

I ran that on a Windows XP system, and the default encoding there (at
least as far as Java is concerned, via
java.nio.charset.Charset.defaultCharset()) is cp1252 (called
"windows-1252"). So yeah, that probably explains it.

I'd forgotten you have to use "rb" if you want binary mode, so it must
be automatically decoding to text, most likely using that encoding.
Behavior is identical with Python 2.5.2.

So yeah, you'd think that if your default encoding is UTF-8, then read()
and readlines() should both result in unicode strings.

Andy
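
For reference, here is a minimal sketch of the "rb" approach mentioned above: read the raw bytes in binary mode and decode them explicitly, so the result does not depend on the platform default encoding. The helper name read_lines_utf8 and the sample text are made up for illustration, and the assumption that the file is UTF-8 is mine. It is written for a current CPython; on Python 2.5 or Jython 2.5 the with statement would additionally need "from __future__ import with_statement".

```python
import os
import tempfile


def read_lines_utf8(path):
    """Hypothetical helper: read bytes in binary mode, decode as UTF-8."""
    with open(path, "rb") as f:      # "rb": no implicit text decoding
        raw = f.read()
    return raw.decode("utf-8").splitlines()


# Throwaway demo file; the content is illustrative only.
sample = u"\u0418\u0432\u0430\u043d\nAndy\n"
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(sample.encode("utf-8"))

lines = read_lines_utf8(demo_path)

# Decoding the same UTF-8 bytes as cp1252 (the Windows default in this
# thread) turns each Cyrillic character into two unrelated characters.
misread = sample.encode("utf-8").decode("cp1252", "replace")

os.remove(demo_path)
```

Decoding explicitly gives the same unicode result for a whole-file read and a line-by-line split, whatever the system encoding happens to be.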