From: Martin B. <m....@gm...> - 2011-12-03 22:13:31
Hello Günter,

I think I have untangled the encoding puzzle concerning Windows and the
reported error. Here is a summary of my findings:

Interactive Session 1 (Dosbox)
==============================

E:\kannweg>python
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
01 >>> import locale
02
03 >>> locale.getlocale()
04 (None, None)
05
06 >>> locale.getdefaultlocale()
07 ('de_DE', 'cp1252')
08
09 >>> locale.setlocale(locale.LC_ALL,'')
10 'German_Germany.1252'
11
12 >>> locale.getlocale()
13 ('de_DE', 'cp1252')

[01] Importing locale works.
[03] Initially no locale is set.
[09] http://docs.python.org/library/locale.html
     """ According to POSIX, a program which has not called
     setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling
     setlocale(LC_ALL, '') lets it use the default locale as defined by
     the LANG variable..."""
[12] Now locale.getlocale() returns something other than None.

Worth noting: the default locale encoding here is the Windows code page,
'cp1252'.

Interactive Session 2 (Dosbox)
==============================

01 E:\kannweg>echo "" >empty.py
02 E:\kannweg>python -i empty.py grüße
03 >>> import sys
04 >>> import locale
05 >>> locale_encoding = locale.getdefaultlocale()[1]
06 >>> locale_encoding
07 'cp1252'
08
09 >>> sys.argv[1]
10 'gr\xfc\xdfe'
11
12 >>> sys.argv[1].decode(locale_encoding)
13 u'gr\xfc\xdfe'
14
15 >>> print sys.argv[1].decode(locale_encoding)
16 grüße
17
18 >>> u'grüße'
19 u'gr\xfc\xdfe'
20
21 >>> 'grüße'
22 'gr\x81\xe1e'
23
24 >>> print 'grüße'
25 grüße
26
27 >>> print u'grüße'.encode('cp850')
28 grüße
29
30 >>> 'grüße'.decode('cp850')
31 u'gr\xfc\xdfe'
32
33 >>> print 'grüße'.decode('cp850')
34 grüße
35
36 >>> print sys.argv[1]
37 gr³¯e
38
39 >>> sys.argv[1].decode('cp850')
40 u'gr\xb3\u2580e'
41
42 >>> print sys.argv[1].decode('cp850')
43 gr³¯e
44
45 >>> sys.stdin.encoding
46 'cp850'
47
48 >>> sys.stdout.encoding
49 'cp850'

Conclusions:

[15] The commandline argument 'grüße' = sys.argv[1] is of type 'str'
     and has 'cp1252' encoding (!!!)
[24,33] 'grüße' entered at the interactive prompt has
     'cp850' = sys.stdin.encoding.

u'grüße' will be correct.
str('grüße') will print correctly if the encoding of the string
is == sys.stdout.encoding

Interactive Session 3 (PythonWin)
=================================

# cyrillic filename
>>> fname = u'\u043a\u0430\u0440\u0442\u0438\u043d\u0430.jpg'
# create file works
>>> file(fname,'w').close()
# open file works
>>> f1 = file(fname)
>>> f1.close()

Note: File operations (and os.listdir(upath)!) handle unicode filenames
if the filename parameter 'fname' or the path name 'upath' is of type
unicode.

Suspected error in test suite
=============================

# $Id: test_dependencies.py 7220 2011-11-11 10:31:48Z milde $
lines 38..39:

record = docutils.io.FileInput(source_path=recordfile,
                               encoding=sys.getfilesystemencoding())

I suspect 'sys.getfilesystemencoding()' is completely wrong here. It has
the value "mbcs", which is Python's way of naming the method that
Windows uses to encode FILENAMES. It seems to me that the 'encoding'
parameter is going to be applied to the file CONTENT instead. I haven't
really understood the test suite yet, so I can't come up with a patch.

My summary at the end is:
=========================

- decode commandline parameters this way:
  sys.argv[1].decode(locale.getlocale()[1] or locale.getdefaultlocale()[1])
- working with unicode strings is great
- non-unicode strings you want to print have to be encoded in
  'sys.stdout.encoding'
- to work with unicode fileNAMES, the filename and path parameters need
  to be of type unicode

HTH - hoping this helps ...

Martin

--
http://mbless.de
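
For reference, the mismatch seen in session 2 can be reproduced without a
Windows console. The following is only a minimal sketch, not code from
docutils: 'decode_bytes' is a hypothetical helper, and its fallback to
locale.getpreferredencoding(False) is an assumption standing in for the
getlocale()/getdefaultlocale() lookup in the summary. The byte values are
the ones shown in session 2.

```python
# -*- coding: utf-8 -*-
import locale


def decode_bytes(raw, encoding=None):
    """Decode a byte string (hypothetical helper, not part of docutils)."""
    if encoding is None:
        # Assumption: use the locale's preferred encoding when none is given.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


text = u'gr\xfc\xdfe'                  # u'grüße'

# Bytes as they arrived in sys.argv[1] (Windows code page, session 2 line 10).
argv_bytes = text.encode('cp1252')
# Bytes as typed at the interactive prompt (console code page, line 22).
console_bytes = text.encode('cp850')

# Decoding with the matching code page recovers the text ...
right = decode_bytes(argv_bytes, 'cp1252')
# ... while decoding the argv bytes as cp850 gives the garbled value of line 40.
wrong = decode_bytes(argv_bytes, 'cp850')
```

The point of keeping the encoding an explicit parameter is that the two
byte strings come from different sources (command line vs. console) and
need different code pages, even within one session.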