From: David G. <go...@us...> - 2002-06-28 03:55:18

I'm implementing support for Unicode and encodings in Docutils, and
have some questions about locales and determining encodings.  I want
Docutils to be able to handle files using any encoding.  Having read
Skip Montanaro's "Using Unicode in Python"
(http://manatee.mojam.com/~skip/unicode/unicode/) and "Introduction
to i18n" by Tomohiro KUBOTA
(http://www.debian.org/doc/manuals/intro-i18n/), I came up with the
following heuristics:

- Try the encoding specified by a command-line option, if any.

- Try the locale's encoding.

- Try UTF-8.

- Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
  MacOS, perhaps Latin-9 (iso-8859-15) otherwise.

Does this look right, or am I missing something?

My questions:

- Does the application have to call
  ``locale.setlocale(locale.LC_ALL, '')``, and if so, where?  Is it
  OK to call setlocale from within the decoding function, or should
  it be left up to the client application?

- Should I use the result of ``locale.getlocale()``?  On
  Win2K/Python2.2.1, I get this::

      >>> import locale
      >>> locale.getlocale()
      (None, None)
      >>> locale.getdefaultlocale()
      ('en_US', 'cp1252')

  Looks good so far.

      >>> locale.setlocale(locale.LC_ALL, '')
      'English_United States.1252'
      >>> locale.getlocale()
      ['English_United States', '1252']

  "1252"?  What happened to the "cp"?

      >>> s = 'abcd'
      >>> s.decode('1252')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      LookupError: unknown encoding

  How can I use ``locale.getlocale()`` when it doesn't return a known
  encoding?  Or put another way, how can I get a known encoding out
  of ``locale.getlocale()``?

- Does ``locale.getdefaultlocale()[1]`` reliably produce the
  platform-specific encoding?

Here's the decoding code I've written::

    def decode(self, data):
        """
        Decode a string, `data`, heuristically into Unicode.
        Raise UnicodeError if unsuccessful.
        """
        encodings = [self.options.input_encoding,  # command-line option
                     locale.getlocale()[1],
                     'utf-8',
                     # is locale.getdefaultlocale() platform-specific?
                     locale.getdefaultlocale()[1]]
        for enc in encodings:
            if not enc:
                continue
            try:
                return unicode(data, enc)
            except (UnicodeError, LookupError):
                # LookupError: `enc` may not name a known codec
                # (e.g. the bare '1252' above).
                pass
        raise UnicodeError(
            'Unable to decode input data.  Tried the following '
            'encodings: %s.'
            % ', '.join([repr(enc) for enc in encodings if enc]))

Suggestions for improvement and/or pointers to other resources would
be most appreciated.  Thank you.

--
David Goodger  <go...@us...>    Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
From: martin@v.loewis.de (M. v. Loewis) - 2002-06-29 19:43:14
David Goodger <go...@us...> writes:
> - Try the encoding specified by a command-line option, if any.
>
> - Try the locale's encoding.
>
> - Try UTF-8.
>
> - Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
> MacOS, perhaps Latin-9 (iso-8859-15) otherwise.
>
> Does this look right, or am I missing something?
I'd reorder this: first the command-line option, then ASCII, then
UTF-8.  If ASCII passes, it most likely is ASCII.  If ASCII fails and
UTF-8 passes, it most likely is UTF-8.  Then try the locale's
encoding.
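As a sketch (reusing the names from the ``decode`` method quoted
above; the ordering is the point here, not the names)::

    encodings = [self.options.input_encoding,  # command-line option, if any
                 'ascii',   # cheap first test; a pass almost surely means ASCII
                 'utf-8',   # strict multi-byte structure makes false hits rare
                 locale.getlocale()[1]]        # the locale's encoding, last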
> - Does the application have to call
> ``locale.setlocale(locale.LC_ALL, '')``, and if so, where? Is it OK
> to call setlocale from within the decoding function, or should it be
> left up to the client application?
At least on Solaris, you need this to get nl_langinfo to work
correctly.
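For example (a minimal sketch; ``nl_langinfo`` and ``CODESET`` only
exist on some platforms, hence the ``hasattr`` guard)::

    import locale
    locale.setlocale(locale.LC_ALL, '')  # adopt the user's locale settings
    if hasattr(locale, 'nl_langinfo') and hasattr(locale, 'CODESET'):
        # Reports e.g. 'ISO8859-1'; without the setlocale() call above,
        # Solaris reports the default "C" locale's codeset instead.
        print locale.nl_langinfo(locale.CODESET)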
> - Should I use the result of ``locale.getlocale()``? On
> Win2K/Python2.2.1, I get this::
>
> >>> import locale
> >>> locale.getlocale()
> (None, None)
> >>> locale.getdefaultlocale()
> ('en_US', 'cp1252')
>
> Looks good so far.
No; this is broken beyond repair.  On Unix, try nl_langinfo(CODESET)
(requires Python 2.2).  On Windows, try _getdefaultlocale.  If either
fails, you may then fall back to getlocale, but expect it to fail
with exceptions, and to give wrong answers.
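That probing order might look something like this
(``get_locale_encoding`` is only an illustrative name, not an
existing API; on Windows, ``locale.getdefaultlocale()`` is the public
wrapper around ``_getdefaultlocale``)::

    import locale

    def get_locale_encoding():
        """Best-effort guess at the locale's encoding; may return None."""
        if hasattr(locale, 'nl_langinfo') and hasattr(locale, 'CODESET'):
            return locale.nl_langinfo(locale.CODESET)  # Unix, Python 2.2+
        try:
            # Wraps _getdefaultlocale on Windows.
            return locale.getdefaultlocale()[1]
        except ValueError:
            pass
        try:
            return locale.getlocale()[1]  # last resort; may be wrong or None
        except ValueError:
            return None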
> How can I use ``locale.getlocale()`` when it doesn't return a
> known encoding? Or put another way, how can I get a known
> encoding out of ``locale.getlocale()``?
[Don't use getlocale.]  If nl_langinfo gives an unknown codeset,
produce a warning message asking the user to report it as a bug.
Keep a list of additional aliases for codesets that occur in the wild
and map to known codecs; also keep a list of known unsupported
codesets (again, restrict yourself to those occurring in the wild).
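A minimal sketch of such tables (the entries below are examples only;
a real list would be grown from user bug reports)::

    import codecs, sys

    # Codeset names seen in the wild, mapped to codecs Python knows about.
    ENCODING_ALIASES = {'1252': 'cp1252',           # Windows getlocale() drops 'cp'
                        'ansi_x3.4-1968': 'ascii'}  # glibc's name for US-ASCII
    KNOWN_UNSUPPORTED = ('tactis',)                 # example: no Python codec

    def resolve_codeset(name):
        name = ENCODING_ALIASES.get(name.lower(), name.lower())
        if name in KNOWN_UNSUPPORTED:
            raise LookupError('codeset %r is known to be unsupported' % name)
        try:
            codecs.lookup(name)  # verify that a codec actually exists
        except LookupError:
            sys.stderr.write('unknown codeset %r; please report this as a '
                             'bug\n' % name)
            raise
        return name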
> - Does ``locale.getdefaultlocale()[1]`` reliably produce the
> platform-specific encoding?
No.
Regards,
Martin
From: David G. <go...@us...> - 2002-07-01 18:22:36
Thanks for your reply, Martin.

> I'd reorder this: first the command-line option, then ASCII, then
> UTF-8.  If ASCII passes, it most likely is ASCII.  If ASCII fails
> and UTF-8 passes, it most likely is UTF-8.  Then try the locale's
> encoding.

Out of curiosity, is there any point in trying both ASCII and UTF-8?
UTF-8 is a strict superset of ASCII, so shouldn't checking UTF-8
alone be enough for both?  If we don't care what the original
encoding was (we just want Unicode text to process), does explicitly
checking for ASCII buy us anything?

--
David Goodger  <go...@us...>    Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
From: martin@v.loewis.de (M. v. Loewis) - 2002-07-01 20:26:55
David Goodger <go...@us...> writes:

> Out of curiosity, is there any point in trying both ASCII and
> UTF-8?  UTF-8 is a strict superset of ASCII, so shouldn't checking
> UTF-8 alone be enough for both?  If we don't care what the original
> encoding was (we just want Unicode text to process), does
> explicitly checking for ASCII buy us anything?

The answer to the last question is "no".  The point in checking ASCII
specifically is that you then know that it is strictly ASCII (unless
it is iso-2022-jp, that is); if that is not interesting to know,
there is no point.

Regards,
Martin
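The iso-2022-jp caveat deserves a concrete illustration: it is a
7-bit encoding that switches character sets via escape sequences, so
its output passes an ASCII check while meaning something else
entirely.  A quick session (this assumes an iso-2022-jp codec is
installed; for Python 2.2 that means the external JapaneseCodecs
package)::

    >>> s = u'\u3042'.encode('iso-2022-jp')  # HIRAGANA LETTER A
    >>> s
    '\x1b$B$"\x1b(B'
    >>> s.decode('ascii')        # every byte is < 0x80, so ASCII "passes"
    u'\x1b$B$"\x1b(B'
    >>> s.decode('iso-2022-jp')  # but the real content is Japanese
    u'\u3042'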
From: Simon B. <Sim...@un...> - 2002-07-01 20:29:37
David Goodger (go...@us...) wrote:
> > I'd reorder this: first the command-line option, then ASCII, then
> > UTF-8.  If ASCII passes, it most likely is ASCII.  If ASCII fails
> > and UTF-8 passes, it most likely is UTF-8.  Then try the locale's
> > encoding.
>
> Out of curiosity, is there any point in trying both ASCII and UTF-8? UTF-8
> is a strict superset of ASCII, so shouldn't checking UTF-8 alone be enough
> for both? If we don't care what the original encoding was (we just want
> Unicode text to process), does explicitly checking for ASCII buy us
> anything?
Hmm - I think checking ASCII first would give us the explicit
knowledge that the input actually is ASCII, so we could label the
output as ASCII.  That might make it more compatible with older
software that doesn't know about UTF-8 and might spew out weird
errors on plain ASCII labelled as UTF-8.
Bye,
Simon
--
Sim...@un... http://www.home.unix-ag.org/simon/