From: Иван Ч. <cam...@ya...> - 2008-03-19 06:14:31
Hi,

for me the solution was as follows ('myfile' is a file with a single UTF-8 encoded line containing the 6-character Russian word 'Привет' (== 'Hello')):

    f = open('myfile', 'r')
    ll = f.readlines()
    f.close()
    print ll

    ['\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n']

(so you can see it isn't a Unicode string; each Cyrillic character is still encoded as two raw bytes). Now:

    import codecs

    for l in ll:
        uni_l = codecs.utf_8_decode(l)
        print uni_l

    (u'\u041F\u0440\u0438\u0432\u0435\u0442\n', 13)

Now it's decoded correctly. It looks like a kludge IMO, but I hope this helps.

By the way, if the file is fully read with f.read() instead of f.readlines(), everything comes back as correctly decoded u'' strings. Does anyone know why? As far as Python is concerned, there should be no difference between the read() and readlines() results (I've tried).

-- Ivan
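For comparison, here is a minimal sketch of the same round trip in modern Python 3 terms, where the bytes/str split makes the decode step explicit (the file path is created in a temp directory just for the demo; 'myfile' follows the example above):

```python
import os
import tempfile

# Write one UTF-8 encoded line to a scratch file (path is for the demo only).
path = os.path.join(tempfile.mkdtemp(), 'myfile')
with open(path, 'wb') as f:
    f.write('Привет\n'.encode('utf-8'))

# Reading in binary mode yields raw bytes, like the 2.x byte string above.
with open(path, 'rb') as f:
    raw_lines = f.readlines()
print(raw_lines)

# Decoding each line turns the raw bytes back into text.
decoded = [line.decode('utf-8') for line in raw_lines]
print(decoded)
```

Here the decode is always explicit, so read() and readlines() cannot disagree about whether the result is text or bytes.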
From: Christof H. <cs...@t-...> - 2008-03-19 19:43:57
hi,

like in CPython, use the codecs module to actually open the file:

    >>> import codecs
    >>> f = codecs.open('filename', 'r', 'utf-16')  # or any encoding you need
    >>> data = f.read()
    >>> f.close()

see the Python stdlib documentation for details.

Chris
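A runnable sketch of this approach, using a throwaway UTF-8 file (the path and contents are made up for the demo; substitute whatever encoding your file actually uses):

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file to read back (hypothetical path for the demo).
path = os.path.join(tempfile.mkdtemp(), 'filename')
with open(path, 'wb') as f:
    f.write('Привет\n'.encode('utf-8'))

# codecs.open() wraps the file so read() returns decoded text, not bytes.
f = codecs.open(path, 'r', 'utf-8')
data = f.read()
f.close()
print(repr(data))
```

Because the decoding happens inside the file object, only the decoded text ever needs to be held, not a byte copy plus a character copy.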
From: Michael C. <chi...@mi...> - 2008-03-20 20:34:01
Иван Чернявский wrote:
> Hi,
>
> for me the solution was as follows ('myfile' is a file with a single
> utf-8 encoded line containing the 6-character Russian word 'Привет'
> (== 'Hello')):
>
>     f = open('myfile', 'r')
>     ll = f.readlines()
>     f.close()
>     print ll
>
>     ['\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n']
>
> (so you can see it isn't a Unicode string). Now:
>
>     import codecs
>
>     for l in ll:
>         uni_l = codecs.utf_8_decode(l)
>         print uni_l
>
>     (u'\u041F\u0440\u0438\u0432\u0435\u0442\n', 13)
>
> Now it's decoded correctly. Looks like a kludge IMO. But hope this helps.

How about using unicode()? Or the previous poster's use of codecs.open(), which seems to eliminate the need to have both the byte array and the character array in memory:

    >>> b
    '\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n'
    >>> u = unicode(b, 'UTF-8')
    >>> u
    u'\u041F\u0440\u0438\u0432\u0435\u0442\n'

> By the way, if the file is fully read with f.read() instead of
> f.readlines(), everything comes back as correctly decoded u'' strings.
> Does anyone know why? As far as Python is concerned, there should be
> no difference between the read() and readlines() results (I've tried).

Not for me:

    Jython 2.2.1 on java1.6.0_04
    Type "copyright", "credits" or "license" for more information.
    >>> f = open(r'\helloru.txt', 'r')
    >>> b = f.read()
    >>> f.seek(0,0)
    >>> ll = f.readlines()
    >>> f.close()
    >>> b
    '\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n'
    >>> ll
    ['\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n']

Andy
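In modern Python 3 terms, unicode(b, 'UTF-8') is spelled bytes.decode(); a minimal sketch of the same conversion on the same byte sequence:

```python
# The same byte sequence shown in the interpreter session above.
b = b'\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n'

# Decode the raw UTF-8 bytes into a text string.
u = b.decode('utf-8')
print(u)

# The decoded string is the six Cyrillic code points plus the newline.
assert u == '\u041F\u0440\u0438\u0432\u0435\u0442\n'
```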
From: Иван Ч. <cam...@ya...> - 2008-03-20 20:56:44
Hi,

> How about using unicode()? Or the previous poster's use of
> codecs.open(), which seems to eliminate the need to have both the byte
> array and the character array in memory.

Yeah, these seem more reasonable.

>> By the way, if the file is fully read with f.read() instead of
>> f.readlines(), everything comes back as correctly decoded u'' strings.
>
> Not for me:

Do you have 'UTF-8' as your system encoding? I do. Maybe it matters? By the way, as I said, on my system Python decodes both read() and readlines() results into unicode strings, but your Jython reads them into raw UTF-8 byte strings... strange.

-- Ivan
From: Michael C. <chi...@mi...> - 2008-03-21 15:31:17
Иван Чернявский wrote:
> Hi,
>
>> How about using unicode()? Or the previous poster's use of
>> codecs.open(), which seems to eliminate the need to have both the
>> byte array and the character array in memory.
>
> Yeah, these seem more reasonable.
>
>>> By the way, if the file is fully read with f.read() instead of
>>> f.readlines(), everything comes back as correctly decoded u'' strings.
>>
>> Not for me:
>
> Do you have 'UTF-8' as your system encoding? I do. Maybe it matters?
> By the way, as I said, on my system Python decodes both read() and
> readlines() results into unicode strings, but your Jython reads them
> into raw UTF-8 byte strings... strange.

I ran that on a Windows XP system, and the default encoding there (at least as far as Java is concerned, via java.nio.charset.Charset.defaultCharset()) is cp1252 (called "windows-1252"). So yeah, that probably explains it. I'd forgotten you have to use "rb" if you want binary mode, so it must be automatically decoding to text, most likely using that encoding. The behavior is identical with Python 2.5.2.

So yeah, you'd think that if your default encoding is UTF-8, then read() and readlines() should both result in unicode strings.

Andy
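The platform dependence described here can be inspected directly. A small modern-Python sketch (on Jython, java.nio.charset.Charset.defaultCharset() gives the Java-side answer; the exact output of the second call depends on your OS and locale):

```python
import locale
import sys

# The codec used for implicit bytes<->text coercion.
# On Python 3 this is always 'utf-8'.
print(sys.getdefaultencoding())

# The locale encoding that open() falls back to when no encoding is given;
# this is where the cp1252-vs-UTF-8 difference above comes from.
print(locale.getpreferredencoding(False))
```

Passing the encoding explicitly, e.g. open(path, 'r', encoding='utf-8') on Python 3 or codecs.open(path, 'r', 'utf-8') on 2.x/Jython, sidesteps the platform default entirely.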