Re: [Docutils-develop] I/O uses default encoding argument

Dear Adam,

thank you for the updated patches.

Parts of the patch-set that (IMO) do not require further discussion are now
committed to master.

Unify naming of the "utf-8" codec
---------------------------------

> I propose using UTF-8 (uppercase) in documentation and prose text and
> utf-8 (lowercase) in code

I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
refers to the Python codec and UTF-8 for the abstract encoding
algorithm.

r9068

Add encoding arguments
----------------------

Changes:

* Don't add encoding when the locale encoding is OK.
  (We may switch to "locale" after implementing it in `docutils.io`.)

* Document changes that may affect users.

* Use 'ascii' in "tools/dev/unicode2rstsubs.py".
  It's a developer tool. The generated files should be usable with any
  ASCII-compatible encoding.

* Break too long lines.

r9072

Ensure locale_encoding is lower case
------------------------------------

Some simplifications:

* We can use locale.getpreferredencoding() after dropping Python versions
  where this was problematic.

* We can append ``.lower()`` as there is a catchall ``except`` later.
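
A minimal sketch of these two simplifications (``get_locale_encoding`` is a
hypothetical name, and the extra ``codecs.lookup()`` check is an assumption;
this is not the code in `docutils.io`):

```python
import codecs
import locale


def get_locale_encoding():
    """Return the lower-cased locale encoding, or None if unusable.

    Hypothetical helper for illustration only.
    """
    try:
        encoding = locale.getpreferredencoding().lower()
        # Reject names that Python has no codec for.
        codecs.lookup(encoding)
    except Exception:  # catch-all: any failure means "unknown"
        return None
    return encoding


locale_encoding = get_locale_encoding()
```

The catch-all also covers the case where the call returns something without
a ``.lower()`` method, so no separate check is needed.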

TODO: check whether io.locale_encoding is set correctly with every OS and
      Python version or whether front-end tools would need to call
      `locale.setlocale()` before importing this module.

Handle encoding='locale' for docutils.io.Output
-----------------------------------------------

Is uppercase ``encoding='LOCALE'`` supported in the standard
function open() in Python >= 3.10?

IMO, we need ``encoding='locale'`` support in both input and output.

Should ``encoding='locale'`` be supported in all Input/Output classes or
only in FileInput/FileOutput?
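
One possible way to support the value independently of the Python version
and in one place for input and output, as a sketch only
(``resolve_encoding`` is a made-up name, and the case-sensitive comparison is
an assumption pending the answer to the question above):

```python
import locale


def resolve_encoding(encoding):
    """Translate the special value 'locale' to a concrete codec name.

    Hypothetical helper; other values (including None) pass through.
    """
    if encoding == 'locale':  # exact, case-sensitive match (assumption)
        return locale.getpreferredencoding(False)
    return encoding
```

A helper like this could be called from a common base class or only from
the file-based classes, depending on the outcome of the question above.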

Deprecations
------------

Why do you want to deprecate ``io.locale_encoding``?

Why do you want to deprecate auto-detection of the input encoding?

* ``encoding='locale'`` does not help if my input files are a mix of
  UTF-8 and latin-1.
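
To illustrate the use case, a simplified sketch of per-file detection
(``decode_with_fallback`` is a hypothetical helper, not the actual heuristic
in `docutils.io`):

```python
def decode_with_fallback(data, encodings=('utf-8', 'latin-1')):
    """Decode bytes with the first codec that succeeds.

    Returns (text, encoding). Hypothetical helper, not Docutils API.
    'latin-1' accepts every byte sequence, so it must come last.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the encodings %r matched' % (encodings,))


utf8_result = decode_with_fallback('Grüße\n'.encode('utf-8'))
latin1_result = decode_with_fallback('Grüße\n'.encode('latin-1'))
```

Both calls return the same text, each with the codec that matched. A single
fixed encoding (locale or otherwise) cannot do this for a mixed set of files.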

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

Also on POSIX, the locale encoding is kept unless the locale is "C".

Test:

After setting up locales de_DE-UTF-8 and de_DE-ISO-8859-1 on my
Debian/stable system, I get::

  milde@heinz:~ > export LC_ALL=de_DE
  milde@heinz:~ > python3
  Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
  [GCC 10.2.1 20210110] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import locale
  >>> locale.getpreferredencoding()
  'ISO-8859-1'

Reading a latin-1 encoded file works::

  >>> f = open('/tmp/moff.txt')
  >>> f.read()
  'Grüße\n'

while reading the same file with the 'utf-8' codec fails::

  >>> f = open('/tmp/moff.txt', encoding='utf-8')
  >>> f.read()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte
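
The same behaviour can be reproduced without changing the locale by passing
the encodings explicitly. A sketch, using a temporary file instead of
"/tmp/moff.txt":

```python
import os
import tempfile

# Write the sample text as ISO-8859-1 (latin-1) bytes to a temporary file.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write('Grüße\n'.encode('iso-8859-1'))

try:
    with open(path, encoding='iso-8859-1') as f:
        text = f.read()        # decodes cleanly
    try:
        with open(path, encoding='utf-8') as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True     # byte 0xfc is not a valid UTF-8 start byte
finally:
    os.remove(path)
```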

Günter