Re: [Docutils-develop] I/O uses default encoding argument

Dear Adam,

On 2022-06-15, Adam Turner wrote:

> Unify naming of the "utf-8" codec
> ---------------------------------

>> I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
>> refers to the Python codec and UTF-8 for the abstract encoding
>> algorithm.

> [...] I couldn't find anywhere in my patch set that I would change [...]

Sorry, this was replying to an earlier statement ("UTF-8 in documentation"). 
Patch https://github.com/AA-Turner/docutils/pull/15/commits/f7f45addbd8cc728ef03c28d62b6ea981d0fc8ac
states it very well:

  - Use UTF-8 in prose text, error messages, and documentation
  - Use utf-8 in code or when referring to code
  - Use utf8 for LaTeX

I did not apply the changes in the sample SVG images
(generated with Inkscape), though.

> Add encoding arguments
> ----------------------

>> Don't add encoding when the locale encoding is OK.
>>  (We may switch to "locale" after implementing it in `docutils.io`.)

> Outwith ``FileInput``, where would you want to use 'locale' for the encoding?

"quicktest.py" is an old developer diagnostics tool without an option to
select the input/output encodings.  
I suggest keeping the encoding unspecified here, so Python's default is
used and the user can change the encoding via either a locale setting or
starting Python in UTF-8 mode.

...

> Handle encoding='locale' for docutils.io.Output
> -----------------------------------------------

Which encoding is used with ``open('foo', encoding='locale')``
if Python is in UTF-8 mode?
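
One way to find out empirically is a small probe like the one below. It is
only a sketch: `probe_locale_encoding` is a hypothetical helper name, and it
assumes Python 3.10 or later, where ``encoding='locale'`` is accepted by
``open()``. On older versions it just returns None.

```python
import sys
import tempfile


def probe_locale_encoding():
    """Return the encoding that open(..., encoding='locale') resolves to.

    Hypothetical helper for diagnostics; returns None on Python < 3.10,
    where the 'locale' value is not supported.
    """
    if sys.version_info < (3, 10):
        return None
    # TemporaryFile passes `encoding` through to the underlying open() call.
    with tempfile.TemporaryFile(mode="w+", encoding="locale") as stream:
        return stream.encoding


if __name__ == "__main__":
    print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
    print("encoding='locale' resolves to:", probe_locale_encoding())
```

Running it once as ``python probe.py`` and once as ``python -X utf8 probe.py``
(the file name is arbitrary) shows whether UTF-8 mode changes the result on a
given Python version.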

> I don't mind about putting support for ``encoding='locale'`` on just
> FileInput/FileOutput -- what would your preference be here?

We want to drop our 'locale' support when dropping support for Py<3.10.
Does Python support 'locale' also with str.encode()?
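
A quick check, sketched under the assumption that ``str.encode()`` only
accepts names known to the codec registry (`codec_exists` is a hypothetical
helper, not part of Docutils):

```python
import codecs


def codec_exists(name):
    """Return True if `name` is a codec known to Python's codec registry."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "locale"):
        print(name, "->", codec_exists(name))
```

If "locale" is reported as unknown, ``'text'.encode('locale')`` fails with a
LookupError as well, so the value would be special to ``open()`` and the
text I/O layer.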

Maybe we don't even need to backport "locale" (see below).

> Deprecations
> ------------

>> Why do you want to deprecate ``io.locale_encoding``?

> Because after introducing ``encoding='locale'`` there's no use for
> ``io.locale_encoding`` in Docutils anymore, and to reduce API surface.

OK. We do not need special deprecation, as `io.locale_encoding` is new in
Docutils 0.19.dev (moved from `utils.error_reporting`).

>> Why do you want to deprecate auto-detection of the input encoding?
>> * ``encoding='locale'`` does not help if my input files are a mix of
>>   UTF-8 and latin-1.

> "auto-guessing" is a poor term -- basically I meant deprecating using
> the locale encoding as default (as it will change to UTF-8). 

> I'm not sure I understand the example you gave as Docutils works on a
> single file basis. Could you add more context please?

What I want to keep/restore is the "auto-detect" default behaviour for
reading/decoding input on Python 2 (when opening files under Python 3,
this only kicks in when the first try raises a UnicodeError):

With unspecified `input_encoding` setting, `io.Input.decode` does:

a) Check the BOM and the top 2 lines of data for an encoding specification
   and use it, else

b) try UTF-8.

c) If this fails, try the locale encoding (if valid).

d) Try latin-1.

e) Give up, report the error.

This allows decoding most input without the need to configure an encoding.
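
The steps a) to e) can be sketched roughly as follows. This is an
illustration only, not the actual `io.Input.decode` code: the names
`decode_auto`, `declared_encoding`, `CODING_RE` and `BOMS`, the simplified
"coding:" pattern, and passing the locale encoding in as an argument are all
assumptions made for the example.

```python
import codecs
import re

# Simplified, hypothetical pattern for a "coding: <name>" declaration.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

# Byte order marks and a codec that consumes them while decoding.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def declared_encoding(data):
    """Step a): look for a BOM or a declaration in the first two lines."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def is_valid_codec(name):
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def decode_auto(data, locale_encoding=None):
    """Decode bytes, returning (text, encoding_used)."""
    declared = declared_encoding(data)
    if declared:
        candidates = [declared]                     # a) use the declaration
    else:
        candidates = ["utf-8"]                      # b) try UTF-8
        if locale_encoding and is_valid_codec(locale_encoding):
            candidates.append(locale_encoding)      # c) locale, if valid
        candidates.append("latin-1")                # d) maps every byte
    errors = []
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeError, LookupError) as err:
            errors.append(f"{name}: {err}")
    # e) give up and report what was tried
    raise UnicodeError("unable to decode input; tried " + "; ".join(errors))
```

For example, ``decode_auto('Grüße'.encode('latin-1'))`` fails at b) and
falls through to d). Because latin-1 can decode any byte sequence, e) is only
reached in this sketch when a declared encoding from a) turns out to be
wrong or unknown.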

Whether the future default "input-encoding" should be "auto-detect" or
"utf-8" may be decided later. 

In any case I would keep "auto-detect" as an option.

Future (incompatible) changes:

* use `locale.getpreferredencoding()` in c):
  If a user starts Python in UTF-8 mode, we should report decoding errors
  instead of trying a locale encoding.

* maybe drop d)

* warn/info when input encoding is not UTF-8.

Günter