From: Guenter M. <mi...@us...> - 2022-06-15 15:32:48
Dear Adam,

thank you for the update patches.
Parts of the patch-set that (IMO) do not require further discussion are now
committed to master.

Unify naming of the "utf-8" codec
---------------------------------

> I propose using UTF-8 (uppercase) in documentation and prose text and
> utf-8 (lowercase) in code

I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it refers
to the Python codec and UTF-8 for the abstract encoding algorithm.

r9068

Add encoding arguments
----------------------

Changes:

* Don't add encoding when the locale encoding is OK.
  (We may switch to "locale" after implementing it in `docutils.io`.)
* Document changes that may affect users.
* Use 'ascii' in "tools/dev/unicode2rstsubs.py". It's a developer tool.
  The generated files should be usable with any ASCII-compatible encoding.
* Break too long lines.

r9072

Ensure locale_encoding is lower case
------------------------------------

Some simplifications:

* We can use locale.getpreferredencoding() after dropping Python versions
  where this was problematic.
* We can append ``.lower()`` as there is a catchall ``except`` later.

TODO: check whether io.locale_encoding is set correctly with every OS and
Python version or whether front-end tools would need to call
`locale.setlocale()` before importing this module.

Handle encoding='locale' for docutils.io.Output
-----------------------------------------------

Is uppercase ``encoding='LOCALE'`` supported in the standard function open()
in Python >= 3.10?

IMO, we need ``encoding='locale'`` support in both input and output.

Should ``encoding='locale'`` be supported in all Input/Output classes or only
in FileInput/FileOutput?

Deprecations
------------

Why do you want to deprecate ``io.locale_encoding``?

Why do you want to deprecate auto-detection of the input encoding?

* ``encoding='locale'`` does not help if my input files are a mix of
  UTF-8 and latin-1.
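
To illustrate the point about mixed inputs, here is a minimal sketch of a
per-file fallback. It is not the detection code in `docutils.io`; the helper
name ``decode_with_fallback`` and the candidate list are assumptions made up
for this example. It tries 'utf-8' first and falls back to 'latin-1':

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Decode bytes with the first candidate codec that raises no error.

    Hypothetical helper for illustration only. Returns a tuple
    (text, encoding_used). Note that 'latin-1' accepts every byte
    sequence, so it only makes sense as the last candidate.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings fits")


# The same text, stored in two different encodings:
utf8_bytes = "Grüße\n".encode("utf-8")
latin1_bytes = "Grüße\n".encode("latin-1")

print(decode_with_fallback(utf8_bytes))    # ('Grüße\n', 'utf-8')
print(decode_with_fallback(latin1_bytes))  # ('Grüße\n', 'latin-1')
```

A single fixed encoding, whether taken from the locale or set explicitly,
handles only one of these two inputs correctly.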

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

Also on POSIX, the locale encoding is kept unless the locale is "C".

Test: After setting up locales de_DE-UTF-8 and de_DE-ISO-8859-1 on my
Debian/stable system, I get::

  milde@heinz:~ > export LC_ALL=de_DE
  milde@heinz:~ > python3
  Python 3.9.2 (default, Feb 28 2021, 17:03:44)
  [GCC 10.2.1 20210110] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import locale
  >>> locale.getpreferredencoding()
  'ISO-8859-1'

Reading a latin-1 encoded file works::

  >>> f = open('/tmp/moff.txt')
  >>> f.read()
  'Grüße\n'

while reading the same file with utf-8 fails::

  >>> f = open('/tmp/moff.txt', encoding='utf-8')
  >>> f.read()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte

Günter