From: David G. <go...@us...> - 2002-06-28 03:55:18
I'm implementing support for Unicode and encodings in Docutils, and
have some questions about locales and determining encodings.

I want Docutils to be able to handle files using any encoding. Having
read Skip Montanaro's "Using Unicode in Python"
(http://manatee.mojam.com/~skip/unicode/unicode/) and "Introduction to
i18n" by Tomohiro KUBOTA
(http://www.debian.org/doc/manuals/intro-i18n/), I came up with the
following heuristics:

- Try the encoding specified by a command-line option, if any.
- Try the locale's encoding.
- Try UTF-8.
- Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
  MacOS, perhaps Latin-9 (iso-8859-15) otherwise.

Does this look right, or am I missing something?

My questions:

- Does the application have to call
  ``locale.setlocale(locale.LC_ALL, '')``, and if so, where? Is it OK
  to call setlocale from within the decoding function, or should it be
  left up to the client application?

- Should I use the result of ``locale.getlocale()``? On
  Win2K/Python2.2.1, I get this::

      >>> import locale
      >>> locale.getlocale()
      (None, None)
      >>> locale.getdefaultlocale()
      ('en_US', 'cp1252')

  Looks good so far.

      >>> locale.setlocale(locale.LC_ALL, '')
      'English_United States.1252'
      >>> locale.getlocale()
      ['English_United States', '1252']

  "1252"? What happened to the "cp"?

      >>> s='abcd'
      >>> s.decode('1252')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      LookupError: unknown encoding

  How can I use ``locale.getlocale()`` when it doesn't return a known
  encoding? Or put another way, how can I get a known encoding out of
  ``locale.getlocale()``?

- Does ``locale.getdefaultlocale()[1]`` reliably produce the
  platform-specific encoding?

Here's the decoding code I've written::

    def decode(self, data):
        """
        Decode a string, `data`, heuristically into Unicode.
        Raise UnicodeError if unsuccessful.
        """
        encodings = [self.options.input_encoding, # command-line option
                     locale.getlocale()[1],
                     'utf-8',
                     locale.getdefaultlocale()[1],]
        # is locale.getdefaultlocale() platform-specific?
        for enc in encodings:
            if not enc:
                continue
            try:
                decoded = unicode(data, enc)
                return decoded
            except UnicodeError:
                pass
        raise UnicodeError(
            'Unable to decode input data. Tried the following encodings: '
            '%s.' % ', '.join([repr(enc) for enc in encodings if enc]))

Suggestions for improvement and/or pointers to other resources would
be most appreciated. Thank you.

--
David Goodger <go...@us...>
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/