Thread: [Docutils-develop] I/O uses default encoding argument

[Docutils-develop] I/O uses default encoding argument

From: Adam T. <aat...@ou...> - 2022-06-09 11:58:20

Attachments: 0001-Add-encoding-arguments.patch

(Re-sending to the correct list)

Using Python 3.10's ``-X warn_default_encoding`` argument to Python, we can see a large number of places where the default encoding is used. On posix systems this is now UTF-8 following PEP 538 [1], but on Windows a non-unicode codepage can be used.

The attached patch fixes the majority of these instances.

A

[1]: https://peps.python.org/pep-0538/

Re: [Docutils-develop] I/O uses default encoding argument

From: Guenter M. <mi...@us...> - 2022-06-09 22:46:00

On 2022-06-09, Adam Turner wrote:

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python, we
> can see a large number of places where the default encoding is used. On
> posix systems this is now UTF-8 following PEP 538 [1], but on Windows a
> non-unicode codepage can be used.

> The attached patch fixes the majority of these instances.

Thank you for the patch. After reading PEP 597, I agree that we should
specify the intended encoding where appropriate.

This means for every instance of open() without explicit encoding, we have
to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
(the latter is equivalent to the value "locale" introduced in Py 3.10).
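
The three choices can be compared side by side. This is only a sketch: `roundtrip` is a hypothetical helper, not part of Docutils, and the locale encoding is resolved with `locale.getpreferredencoding(False)` so that the example also runs on Python versions before 3.10, where `open()` does not accept `encoding="locale"`.

```python
import locale
import os
import tempfile


def roundtrip(text, encoding):
    """Write `text` to a temporary file and read it back using `encoding`."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" disables newline translation so the text comes back unchanged.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


# Encoding of the current locale, looked up without calling setlocale().
locale_encoding = locale.getpreferredencoding(False)

for enc in ("ascii", "utf-8", locale_encoding):
    print(enc, "->", repr(roundtrip("plain ASCII text\n", enc)))
```

Pure ASCII text survives all three; non-ASCII text is rejected by "ascii" and depends on the user's environment with the locale encoding.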

Unfortunately, the patch mixes added "encoding" arguments with the
change of "utf8" to "utf-8" in many cases.

* Is there a reason to prefer 'utf-8'?

  We have currently 36 instances of 'utf8' vs. 19 instances of 'utf-8'
  in the library code and tests.

  The "codecs" documentation names "utf8" and "utf-8" as aliases for "utf_8".

* Separating the encoding name normalization from new arguments would
  make it easier to check whether the new-specified encoding is correct.

Günter

Re: [Docutils-develop] I/O uses default encoding argument

From: Adam T. <aat...@ou...> - 2022-06-09 22:55:12

Attachments: 0001-Add-encoding-arguments.patch

> This means for every instance of open() without explicit encoding, we have
> to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
> (the latter is equivalent to the value "locale" introduced in Py 3.10).

My strong suggestion would be that Docutils moves towards defaulting to UTF-8 for all encodings (of course keeping the option to supply explicit other encodings) -- it is compatible with US-ASCII and is the safest sane default. (PEP 686's motivation section [1]_ has some colour on this).

> Unfortunately, the patch mixes added "encoding" arguments with the
> change of "utf8" to "utf-8" in many cases.

An updated patch is attached. (The only 'utf8' -> 'utf-8' changes were in the LaTeX2e writer, but you're right that it is better to keep the changes distinct.)

A

_[1]: https://peps.python.org/pep-0686/#motivation

Re: [Docutils-develop] I/O uses default encoding argument

From: Guenter M. <mi...@us...> - 2022-06-10 08:44:45

On 2022-06-09, Adam Turner wrote:

>> This means for every instance of open() without explicit encoding, we have
>> to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
>> (the latter is equivalent to the value "locale" introduced in Py 3.10).

> My strong suggestion would be that Docutils moves towards defaulting to
> UTF-8 for all encodings (of course keeping the option to supply
> explicit other encodings) -- it is compatible with US-ASCII and is the
> safest sane default. (PEP 686's motivation section [1]_ has some colour
> on this).

However, in cases of user-supplied input, this is an API change.
We can fix the cases in the tests now but need due process for cases where
changes may lead to different behaviour for users.

Suggestion:

* backport Python 3.11 behaviour to docutils.io:
  “use locale encoding when encoding="locale" is passed”.

* announce change of default encoding to UTF-8

* keep encoding attribute unspecified for now when reading input
  specified by users or 3rd-party code.
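
The first point could look roughly like the following. This is a minimal sketch under stated assumptions: `resolve_encoding` is a hypothetical name, not an existing `docutils.io` function, and `locale.getpreferredencoding(False)` is used as the stand-in for "the locale encoding" (it reports UTF-8 when Python runs in UTF-8 mode).

```python
import locale


def resolve_encoding(encoding):
    """Map the placeholder "locale" to a real codec name; pass others through."""
    if encoding == "locale":
        # False: do not call setlocale(), just report the current setting.
        return locale.getpreferredencoding(False)
    return encoding


print(resolve_encoding("utf-8"))
print(resolve_encoding("locale"))
```

The resolved name can then be handed to `open()`, `bytes.decode()` or `str.encode()` on any supported Python version.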

>> Unfortunately, the patch mixes added "encoding" arguments with the
>> change of "utf8" to "utf-8" in many cases.

> An updated patch is attached. (The only 'utf8' -> 'utf-8' changes were
> in the LaTeX2e writer, but you're right that it is better to keep the
> changes distinct.)

Consistent naming in Docutils code (not only latex2e.py) and
documentation is good.

What is the motivation for 'utf-8'?

* Python's codecs module uses "utf_8"
  (with aliases U8, UTF, utf8, cp65001 and normalizing case and "-/_").

* In LaTeX, it's named "utf8",

* `locale` reports "UTF-8"

* PEP 8 uses uppercase:
  "Code in the core Python distribution should always use UTF-8".

* The `codecs documentation`__ uses ``encoding='utf-8'`` when documenting
  default arguments for encode() and decode().

__ https://docs.python.org/3/library/codecs.html
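
The name normalization mentioned above can be checked with `codecs.lookup()`, which returns a `CodecInfo` object whose `name` attribute is the codec's canonical name. A short sketch (the list of spellings is just a sample):

```python
import codecs

spellings = ("utf8", "utf-8", "UTF-8", "utf_8", "U8", "UTF")

# All spellings are aliases of the same codec, so the names printed are identical.
for spelling in spellings:
    print(spelling, "->", codecs.lookup(spelling).name)
```

Whichever spelling is chosen for the Docutils sources is therefore a matter of style only; the codec machinery treats them alike.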

Thanks,

Günter

Re: [Docutils-develop] I/O uses default encoding argument

From: Adam T. <aat...@ou...> - 2022-06-11 00:01:09

Attachments: 0008-Update-HISTORY-and-RELEASE-NOTES.patch 0001-Add-encoding-arguments.patch 0002-Canonicalise-UTF-8-references.patch 0003-Additional-utf-8-tests.patch 0004-Ensure-locale_encoding-is-lower-case.patch 0005-Deprecate-docutils.io.locale_encoding.patch 0006-Add-_get_default_encoding-helper.patch 0007-Handle-encoding-locale-for-docutils.io.Output.patch

> However, in cases of user-supplied input, this is an API change.
> We can fix the cases in the tests now but need due process for cases where
> changes may lead to different behaviour for users.

> Suggestion:

> * backport Python 3.11 behaviour to docutils.io:
>   “use locale encoding when encoding="locale" is passed”.

> * announce change of default encoding to UTF-8

> * keep encoding attribute unspecified for now when reading input
>   specified by users or 3rd-party code.

This seems a sensible way forwards. The updated patch set does (1) and (2) and warns on unspecified encoding input in the ``docutils.io.(Input|Output)`` classes.
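
A warning on unspecified encoding might be sketched as follows. The helper name `check_encoding`, the message text and the choice of `FutureWarning` are assumptions made for illustration; they are not taken from the patch set.

```python
import warnings


def check_encoding(encoding):
    """Return `encoding` unchanged, warning if the caller left it unspecified."""
    if encoding is None:
        warnings.warn(
            "No encoding specified; the default will change to UTF-8. "
            'Pass encoding="locale" to keep using the locale encoding.',
            FutureWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return encoding


check_encoding("utf-8")  # explicit encoding: no warning
```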

> Consistent naming in Docutils code (not only latex2e.py) and
> documentation is good.

The updated patch set renames everything according to my recommendation below. (It is a larger change than originally envisaged, so it is now 8 patches, also available formatted on the web [1]_.)

> What is the motivation for 'utf-8'?

The name of the encoding is UTF-8 [2]_ [3]_. I propose using UTF-8 (uppercase) in documentation and prose text and utf-8 (lowercase) in code (If you'd prefer consistency in case I would pick lowercase everywhere). 

A

_[1]: https://github.com/AA-Turner/docutils/pull/15 and https://github.com/AA-Turner/docutils/pull/15.patch
_[2]: https://www.ietf.org/rfc/rfc3629.html
_[3]: https://encoding.spec.whatwg.org/#names-and-labels

Re: [Docutils-develop] I/O uses default encoding argument

From: Guenter M. <mi...@us...> - 2022-06-15 15:32:48

Dear Adam,

thank you for the updated patches.

Parts of the patch-set that (IMO) do not require further discussion are now
committed to master.

Unify naming of the "utf-8" codec
---------------------------------

> I propose using UTF-8 (uppercase) in documentation and prose text and
> utf-8 (lowercase) in code

I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
refers to the Python codec and UTF-8 for the abstract encoding
algorithm.

r9068


Add encoding arguments
----------------------

Changes:

* Don't add encoding when the locale encoding is OK.
  (We may switch to "locale" after implementing it in `docutils.io`.)

* Document changes that may affect users.

* Use 'ascii' in "tools/dev/unicode2rstsubs.py". 
  It's a developer tool. The generated files should be usable with any
  ASCII-compatible encoding.

* Break too long lines.

r9072


Ensure locale_encoding is lower case
------------------------------------

Some simplifications:

* We can use locale.getpreferredencoding() after dropping Python versions
  where this was problematic.

* We can append ``.lower()`` as there is a catchall ``except`` later.

TODO: check whether io.locale_encoding is set correctly with every OS and
      Python version or whether front-end tools would need to call
      `locale.setlocale()` before importing this module.
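
The pattern described here (normalize the name, validate it, and let a catch-all handler absorb any failure) might be sketched like this. It is an illustration only, not the code committed to `docutils.io`; `detect_locale_encoding` is a hypothetical name.

```python
import codecs
import locale


def detect_locale_encoding():
    """Return the lower-cased locale encoding, or None if none is usable."""
    try:
        encoding = locale.getpreferredencoding(False).lower()
        codecs.lookup(encoding)  # raises LookupError for unknown names
    except Exception:
        # Catch-all: any problem means "no usable locale encoding".
        return None
    return encoding


print(detect_locale_encoding())
```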


Handle encoding='locale' for docutils.io.Output 
-----------------------------------------------

Is uppercase ``encoding='LOCALE'`` supported in the standard
function open() in Python >= 3.10?

IMO, we need ``encoding='locale'`` support in both, input and output.

Should ``encoding='locale'`` be supported in all Input/Output classes or
only in FileInput/FileOutput?



Deprecations 
------------

Why do you want to deprecate ``io.locale_encoding``?

Why do you want to deprecate auto-detection of the input encoding?

* ``encoding='locale'`` does not help if my input files are a mix of
  UTF-8 and latin-1.


> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

Also on POSIX, the locale encoding is kept unless the locale is "C".

Test:

After setting up locales de_DE-UTF-8 and de_DE-ISO-8859-1 on my
Debian/stable system, I get::

  milde@heinz:~ > export LC_ALL=de_DE
  milde@heinz:~ > python3
  Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
  [GCC 10.2.1 20210110] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import locale
  >>> locale.getpreferredencoding()
  'ISO-8859-1'

Reading a latin-1 encoded file works::

  >>> f = open('/tmp/moff.txt')
  >>> f.read()
  'Grüße\n'

while reading the same file with utf-8 fails::

  >>> f = open('/tmp/moff.txt', encoding='utf-8')
  >>> f.read()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte
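
The same failure can be reproduced without changing the locale by working on bytes instead of a file. A minimal sketch:

```python
# "Grüße\n" encoded as latin-1; ü is the single byte 0xfc at index 2.
data = "Gr\u00fc\u00dfe\n".encode("latin-1")

# Decoding with the matching codec restores the text.
print(data.decode("latin-1") == "Gr\u00fc\u00dfe\n")  # True

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xfc cannot start a UTF-8 sequence, so decoding stops at index 2.
    print(err.start, hex(data[err.start]))  # 2 0xfc
```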


Günter

Re: [Docutils-develop] I/O uses default encoding argument

From: Adam T. <aat...@ou...> - 2022-06-15 22:57:54

> Parts of the patch-set that (IMO) do not require further discussion are now
> committed to master.

Thank you.


Unify naming of the "utf-8" codec
---------------------------------

> I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
> refers to the Python codec and UTF-8 for the abstract encoding
> algorithm.

This makes sense, although for specific references to the stdlib implementation of UTF-8 as in the ``encodings.utf_8`` module we could be explicit. I couldn't find anywhere in my patch set that I would change, but I may have missed something -- were there any specific instances you were thinking of?


Add encoding arguments
----------------------

Changes:

> Don't add encoding when the locale encoding is OK.
>  (We may switch to "locale" after implementing it in `docutils.io`.)

Outwith ``FileInput``, where would you want to use 'locale' for the encoding?

This diverges from the custom and practice in the general Python ecosystem (and, as far as I can tell, encodings in general) -- I would strongly suggest using UTF-8, as it eliminates an entire class of locale/encoding related bugs.

> Document changes that may affect users.
> Use 'ascii' in "tools/dev/unicode2rstsubs.py". 

Makes sense, thanks.

> Break too long lines.

Sorry, I thought I'd done a formatting pass but seemingly not.


Ensure locale_encoding is lower case
------------------------------------

> We can use locale.getpreferredencoding() after dropping Python versions
> where this was problematic.

Great, thanks.


Handle encoding='locale' for docutils.io.Output
-----------------------------------------------

> Is uppercase ``encoding='LOCALE'`` supported in the standard
> function open() in Python >= 3.10?

Good question, I tested and only the exact literal ``locale`` is accepted, so we can drop the ``.lower()`` call.
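
A check of this kind can be scripted. The sketch below uses a hypothetical helper, `open_accepts`, that reports whether the built-in `open()` accepts a given encoding name; the result for "locale" depends on the Python version (the name is only known to `open()` from 3.10 on).

```python
import os
import sys
import tempfile


def open_accepts(encoding):
    """Return True if the built-in open() accepts `encoding` for a text file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, encoding=encoding):
            return True
    except LookupError:  # raised for unknown encoding names
        return False
    finally:
        os.remove(path)


print(sys.version_info[:2])
for name in ("utf-8", "locale", "LOCALE"):
    print(name, open_accepts(name))
```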

> IMO, we need ``encoding='locale'`` support in both, input and output.
> Should ``encoding='locale'`` be supported in all Input/Output classes or
> only in FileInput/FileOutput?

The patch set I sent last time does, via the default encoding helper method I added.

I don't mind about putting support for ``encoding='locale'`` on just FileInput/FileOutput -- what would your preference be here?


Deprecations
------------

> Why do you want to deprecate ``io.locale_encoding``?

Because after introducing ``encoding='locale'`` there's no use for ``io.locale_encoding`` in Docutils anymore, and to reduce API surface.

> Why do you want to deprecate auto-detection of the input encoding?
> * ``encoding='locale'`` does not help if my input files are a mix of
>   UTF-8 and latin-1.

"auto-guessing" is a poor term -- basically I meant deprecating using the locale encoding as default (as it will change to UTF-8).
I'm not sure I understand the example you gave as Docutils works on a single file basis. Could you add more context please?

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

> Also on POSIX, the locale encoding is kept unless the locale is "C".

Yes, sorry, I wasn't precise enough.

Thanks,
Adam

Re: [Docutils-develop] I/O uses default encoding argument

From: Adam T. <aat...@ou...> - 2022-06-16 14:39:29

Attachments: 0005-Update-HISTORY-and-RELEASE-NOTES.patch 0001-Canonicalise-UTF-8-references.patch 0002-Ignore-UTF-8-mode-when-detecting-locale-encoding.patch 0003-Deprecate-docutils.io.locale_encoding.patch 0004-Support-encoding-locale.patch

Attached is a set of five patches rebased on current master -- I have updated the language in the deprecation warnings, used the encoding='locale' backport only for 3.7-3.9 (as 3.10 ``builtins.open`` knows about encoding='locale' natively), and updated the ``io.locale_encoding`` detection mechanism to ignore ``-X utf8``, as the system locale encoding doesn't change for the Python UTF-8 mode.
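
One way to look up the system locale encoding independently of UTF-8 mode is sketched below. It is not the code from the attached patches: `system_locale_encoding` is a hypothetical helper, `locale.getencoding()` only exists in Python 3.11 and later, and the `locale.getlocale()` fallback is an assumption about what is adequate on older versions.

```python
import locale
import sys


def system_locale_encoding():
    """Best-effort name of the locale's encoding, or None if undetermined."""
    if hasattr(locale, "getencoding"):  # Python >= 3.11
        # Like getpreferredencoding(False), but ignores the UTF-8 mode.
        return locale.getencoding()
    try:
        # Encoding part of the current LC_CTYPE setting.
        return locale.getlocale()[1]  # may be None, e.g. in the "C" locale
    except ValueError:  # locale name could not be parsed
        return None


print("UTF-8 mode:", bool(sys.flags.utf8_mode))
print("getpreferredencoding():", locale.getpreferredencoding(False))
print("system locale encoding:", system_locale_encoding())
```

Running the script once normally and once with `python -X utf8` shows whether the two lookups differ on a given system.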

A

Re: [Docutils-develop] I/O uses default encoding argument

From: Guenter M. <mi...@us...> - 2022-06-17 12:28:34

Dear Adam,

On 2022-06-15, Adam Turner wrote:

> Unify naming of the "utf-8" codec
> ---------------------------------

>> I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
>> refers to the Python codec and UTF-8 for the abstract encoding
>> algorithm.

> [...] I couldn't find anywhere in my patch set that I would change [...]

Sorry, this was replying to an earlier statement ("UTF-8 in documentation"). 
Patch https://github.com/AA-Turner/docutils/pull/15/commits/f7f45addbd8cc728ef03c28d62b6ea981d0fc8ac
states it very well:

  - Use UTF-8 in prose text, error messages, and documentation
  - Use utf-8 in code or when referring to code
  - Use utf8 for LaTeX

I did not apply the changes in the sample SVG images
(generated with Inkscape), though.

> Add encoding arguments
> ----------------------

>> Don't add encoding when the locale encoding is OK.
>>  (We may switch to "locale" after implementing it in `docutils.io`.)

> Outwith ``FileInput``, where would you want to use 'locale' for the encoding?

"quicktest.py" is an old developer diagnostics tool without an option to
select the input/output encodings.  
I suggest keeping the encoding unspecified here, so Python's default is
used and the user can change the encoding via either a locale setting or
starting Python in UTF-8 mode.

...

> Handle encoding='locale' for docutils.io.Output
> -----------------------------------------------

Which encoding is used with ``open('foo', encoding='locale')``
if Python is in UTF-8 mode?

> I don't mind about putting support for ``encoding='locale'`` on just
> FileInput/FileOutput -- what would your preference be here?

We want to drop our 'locale' support when dropping support for Py<3.10.
Does Python support 'locale' also with str.encode()?

Maybe we don't even need backporting "locale" (see below).
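
Whether "locale" is usable with `str.encode()` comes down to whether it is a registered codec name, which can be checked with `codecs.lookup()`. A quick sketch (the outcome depends on the codecs registered in the running interpreter; in a stock CPython, "locale" appears to be a placeholder handled by `open()` and `io.TextIOWrapper` rather than a codec):

```python
import codecs

try:
    codecs.lookup("locale")
    is_codec = True
except LookupError:
    is_codec = False

# str.encode() and bytes.decode() resolve names through the same registry.
print("'locale' is a registered codec:", is_codec)
```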

> Deprecations
> ------------

>> Why do you want to deprecate ``io.locale_encoding``?

> Because after introducing ``encoding='locale'`` there's no use for
> ``io.locale_encoding`` in Docutils anymore, and to reduce API surface.

OK. We do not need special deprecation, as `io.locale_encoding` is new in
Docutils 0.19.dev (moved from `utils.error_reporting`).

>> Why do you want to deprecate auto-detection of the input encoding?
>> * ``encoding='locale'`` does not help if my input files are a mix of
>>   UTF-8 and latin-1.

> "auto-guessing" is a poor term -- basically I meant deprecating using
> the locale encoding as default (as it will change to UTF-8). 

> I'm not sure I understand the example you gave as Docutils works on a
> single file basis. Could you add more context please?

What I want to keep/restore is the "auto-detect" default behaviour for
reading/decoding input on Python 2 (when opening files under Python 3,
this only kicks in when the first try raises a UnicodeError):

With unspecified `input_encoding` setting, `io.Input.decode` does:

a) Check the BOM mark and top 2 lines of data for an encoding specification
   and use it, else

b) try UTF-8.

c) If this fails, try the locale encoding (if valid).

d) Try latin-1.

e) Give up, report the error.

This allows decoding most input without the need to configure an encoding.
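
The fallback order a) to e) can be sketched as follows. This is an illustration of the steps listed above, not the actual `io.Input.decode` implementation; `declared_encoding`, `decode_auto`, the BOM table and the regular expression are assumptions made for the example.

```python
import codecs
import re

# Byte order marks and a codec that decodes (and strips) each of them.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
)
# Matches declarations such as "-*- coding: latin-1 -*-".
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(data):
    """Step a): return the encoding named by a BOM or in the first two lines."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def decode_auto(data, locale_encoding=None):
    """Return (text, encoding) following the fallback order a) to e)."""
    declared = declared_encoding(data)
    if declared:
        candidates = [declared]  # a) use the declared encoding only
    else:
        candidates = ["utf-8", locale_encoding, "latin-1"]  # b), c), d)
    tried = []
    for encoding in candidates:
        if not encoding:
            continue  # no locale encoding available
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            tried.append(encoding)
    # e) give up and report the error
    raise UnicodeError("unable to decode input; tried: " + ", ".join(tried))


print(decode_auto("Gr\u00fc\u00dfe\n".encode("latin-1"))[1])  # latin-1
```

Since every byte sequence is valid latin-1, step d) never fails in this sketch; step e) is only reached when a declared encoding turns out to be wrong or unknown.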

Whether the future default "input-encoding" should be "auto-detect" or
"utf-8" may be decided later. 

In any case I would keep "auto-detect" as an option.

Future (incompatible) changes:

* use `locale.getpreferredencoding()` in c):
  If a user starts Python in UTF-8 mode, we should report decoding errors
  instead of trying a locale encoding.

* maybe drop d)

* warn/info when input encoding is not UTF-8.

Günter

Re: [Docutils-develop] I/O uses default encoding argument

From: Guenter M. <mi...@us...> - 2022-06-17 12:35:11

On 2022-06-16, Adam Turner wrote:

> Attached is a set of five patches rebased on current master

Thanks. I had a look at the first 4 and took them into account in commits
[r9075] to [r9078].

Günter