From: Adam T. <aat...@ou...> - 2022-06-09 11:58:20
Attachments:
0001-Add-encoding-arguments.patch
|
(Re-sending to the correct list)

Using Python 3.10's ``-X warn_default_encoding`` argument to Python, we can
see a large number of places where the default encoding is used. On posix
systems this is now UTF-8 following PEP 538 [1], but on Windows a
non-unicode codepage can be used.

The attached patch fixes the majority of these instances.

A

[1]: https://peps.python.org/pep-0538/
|
From: Guenter M. <mi...@us...> - 2022-06-09 22:46:00
|
On 2022-06-09, Adam Turner wrote:

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python, we
> can see a large number of places where the default encoding is used. On
> posix systems this is now UTF-8 following PEP 538 [1], but on Windows a
> non-unicode codepage can be used.

> The attached patch fixes the majority of these instances.

Thank you for the patch.

After reading PEP 597, I agree that we should specify the intended encoding
where appropriate.

This means for every instance of open() without explicit encoding, we have
to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
(the latter is equivalent to the value "locale" introduced in Py 3.10).

Unfortunately, the patch mixes added "encoding" arguments with the
change of "utf8" to "utf-8" in many cases.

* Is there a reason to prefer 'utf-8'?
  We currently have 36 instances of 'utf8' vs. 19 instances of 'utf-8'
  in the library code and tests.
  The "codecs" documentation names "utf8" and "utf-8" as aliases for "utf_8".

* Separating the encoding name normalization from new arguments would make
  it easier to check whether the newly specified encoding is correct.

Günter
|
From: Adam T. <aat...@ou...> - 2022-06-09 22:55:12
Attachments:
0001-Add-encoding-arguments.patch
|
> This means for every instance of open() without explicit encoding, we have
> to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
> (the latter is equivalent to the value "locale" introduced in Py 3.10).

My strong suggestion would be that Docutils moves towards defaulting to
UTF-8 for all encodings (of course keeping the option to supply explicit
other encodings) -- it is compatible with US-ASCII and is the safest sane
default. (PEP 686's motivation section [1]_ has some colour on this).

> Unfortunately, the patch mixes added "encoding" arguments with the
> change of "utf8" to "utf-8" in many cases.

An updated patch is attached. (The only 'utf8' -> 'utf-8' changes were in
the LaTeX2e writer, but you're right it is better to keep the changes
distinct.)

A

_[1]: https://peps.python.org/pep-0686/#motivation
|
From: Guenter M. <mi...@us...> - 2022-06-10 08:44:45
|
On 2022-06-09, Adam Turner wrote:

>> This means for every instance of open() without explicit encoding, we have
>> to decide whether to use "ascii", "utf-8", or `io.locale_encoding`
>> (the latter is equivalent to the value "locale" introduced in Py 3.10).

> My strong suggestion would be that Docutils moves towards defaulting to
> UTF-8 for all encodings (of course keeping the option to supply
> explicit other encodings) -- it is compatible with US-ASCII and is the
> safest sane default. (PEP 686's motivation section [1]_ has some colour
> on this).

However, in cases of user-supplied input, this is an API change.
We can fix the cases in the tests now but need due process for cases where
changes may lead to different behaviour for users.

Suggestion:

* backport Python 3.11 behaviour to docutils.io:
  “use locale encoding when encoding="locale" is passed”.
* announce change of default encoding to UTF-8
* keep encoding attribute unspecified for now when reading input
  specified by users or 3rd-party code.

>> Unfortunately, the patch mixes added "encoding" arguments with the
>> change of "utf8" to "utf-8" in many cases.

> An updated patch attached (The only 'utf8' -> 'utf-8' were in the
> LaTeX2e writer, but you're right it is better to keep the changes
> distinct.)

Consistent naming in Docutils code (not only latex2e.py) and
documentation is good.

What is the motivation for 'utf-8'?

* Python's codecs module uses "utf_8" (with aliases U8, UTF, utf8, cp65001
  and normalizing case and "-/_").
* In LaTeX, it's named "utf8".
* `locale` reports "UTF-8".
* PEP 8 uses uppercase: "Code in the core Python distribution should always
  use UTF-8".
* The `codecs documentation`__ uses ``encoding='utf-8'`` when documenting
  default arguments for encode() and decode().

__ https://docs.python.org/3/library/codecs.html

Thanks,
Günter
|
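The alias handling described above for the codecs module can be checked with a few lines of Python. This is only an illustrative sketch: the helper name ``canonical_name`` and the list of spellings are chosen here for the example and are not taken from Docutils or from the patches.

```python
import codecs

# Spellings mentioned in the thread (the list itself is just an example).
SPELLINGS = ["utf_8", "utf-8", "utf8", "UTF-8", "U8"]


def canonical_name(label):
    """Return the name that Python's codec registry reports for *label*."""
    return codecs.lookup(label).name


# Map every spelling to the name of the codec it resolves to.
names = {label: canonical_name(label) for label in SPELLINGS}

for label, name in names.items():
    print(f"{label!r:10} -> {name!r}")
```

All spellings resolve to the same codec object, so the choice between them in code is a matter of consistency rather than behaviour.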
From: Adam T. <aat...@ou...> - 2022-06-11 00:01:09
Attachments:
0008-Update-HISTORY-and-RELEASE-NOTES.patch
0001-Add-encoding-arguments.patch
0002-Canonicalise-UTF-8-references.patch
0003-Additional-utf-8-tests.patch
0004-Ensure-locale_encoding-is-lower-case.patch
0005-Deprecate-docutils.io.locale_encoding.patch
0006-Add-_get_default_encoding-helper.patch
0007-Handle-encoding-locale-for-docutils.io.Output.patch
|
> However, in cases of user-supplied input, this is an API change.
> We can fix the cases in the tests now but need due process for cases where
> changes may lead to different behaviour for users.

> Suggestion:
> * backport Python 3.11 behaviour to docutils.io:
>   “use locale encoding when encoding="locale" is passed”.
> * announce change of default encoding to UTF-8
> * keep encoding attribute unspecified for now when reading input
>   specified by users or 3rd-party code.

This seems a sensible way forwards. The updated patch set does (1) and (2)
and warns on unspecified encoding input in the
``docutils.io.(Input|Output)`` classes.

> Consistent naming in Docutils code (not only latex2e.py) and
> documentation is good.

The updated patch set renames everything to my recommendation below. (It is
a larger change than originally envisaged, so it is 8 patches --
alternatively formatted on the web [1]_.)

> What is the motivation for 'utf-8'?

The name of the encoding is UTF-8 [2]_ [3]_. I propose using UTF-8
(uppercase) in documentation and prose text and utf-8 (lowercase) in code
(If you'd prefer consistency in case I would pick lowercase everywhere).

A

_[1]: https://github.com/AA-Turner/docutils/pull/15 and https://github.com/AA-Turner/docutils/pull/15.patch
_[2]: https://www.ietf.org/rfc/rfc3629.html
_[3]: https://encoding.spec.whatwg.org/#names-and-labels
|
From: Guenter M. <mi...@us...> - 2022-06-15 15:32:48
|
Dear Adam,

thank you for the updated patches.

Parts of the patch-set that (IMO) do not require further discussion are now
committed to master.

Unify naming of the "utf-8" codec
---------------------------------

> I propose using UTF-8 (uppercase) in documentation and prose text and
> utf-8 (lowercase) in code

I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
refers to the Python codec and UTF-8 for the abstract encoding
algorithm.

r9068

Add encoding arguments
----------------------

Changes:

* Don't add encoding when the locale encoding is OK.
  (We may switch to "locale" after implementing it in `docutils.io`.)
* Document changes that may affect users.
* Use 'ascii' in "tools/dev/unicode2rstsubs.py".
  It's a developer tool. The generated files should be usable with any
  ASCII-compatible encoding.
* Break too long lines.

r9072

Ensure locale_encoding is lower case
------------------------------------

Some simplifications:

* We can use locale.getpreferredencoding() after dropping Python versions
  where this was problematic.
* We can append ``.lower()`` as there is a catchall ``except`` later.

TODO: check whether io.locale_encoding is set correctly with every OS and
Python version or whether front-end tools would need to call
`locale.setlocale()` before importing this module.

Handle encoding='locale' for docutils.io.Output
-----------------------------------------------

Is uppercase ``encoding='LOCALE'`` supported in the standard
function open() in Python >= 3.10?

IMO, we need ``encoding='locale'`` support in both input and output.
Should ``encoding='locale'`` be supported in all Input/Output classes or
only in FileInput/FileOutput?

Deprecations
------------

Why do you want to deprecate ``io.locale_encoding``?

Why do you want to deprecate auto-detection of the input encoding?

* ``encoding='locale'`` does not help if my input files are a mix of
  UTF-8 and latin-1.
> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

Also on POSIX, the locale encoding is kept unless the locale is "C".

Test: After setting up locales de_DE-UTF-8 and de_DE-ISO-8859-1 on my
Debian/stable system, I get::

  milde@heinz:~ > export LC_ALL=de_DE
  milde@heinz:~ > python3
  Python 3.9.2 (default, Feb 28 2021, 17:03:44)
  [GCC 10.2.1 20210110] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import locale
  >>> locale.getpreferredencoding()
  'ISO-8859-1'

Reading a latin-1 encoded file works::

  >>> f = open('/tmp/moff.txt')
  >>> f.read()
  'Grüße\n'

while reading the same file with utf-8 fails::

  >>> f = open('/tmp/moff.txt', encoding='utf-8')
  >>> f.read()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte

Günter
|
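The same failure can be reproduced without changing the system locale by passing the encodings explicitly. The following is a sketch under that assumption: the temporary file stands in for ``/tmp/moff.txt``, and the variable names are illustrative only.

```python
import os
import tempfile

# 'Grüße\n' encoded as latin-1: the 'ü' becomes the single byte 0xfc.
data = "Grüße\n".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moff.txt")
    with open(path, "wb") as f:
        f.write(data)

    # Reading with the matching encoding works.
    with open(path, encoding="latin-1") as f:
        text = f.read()

    # Reading the same bytes as UTF-8 fails on the 0xfc byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError as err:
        error = err
    else:
        error = None

print(repr(text))
print(error)
```

Because both encodings are spelled out, the result does not depend on ``LC_ALL`` or on whether Python runs in UTF-8 mode.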
From: Adam T. <aat...@ou...> - 2022-06-15 22:57:54
|
> Parts of the patch-set that (IMO) do not require further discussion are now
> committed to master.

Thank you.

Unify naming of the "utf-8" codec
---------------------------------

> I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
> refers to the Python codec and UTF-8 for the abstract encoding
> algorithm.

This makes sense, although for specific references to the stdlib
implementation of UTF-8 as in the ``encodings.utf_8`` module we could be
explicit. I couldn't find anywhere in my patch set that I would change, but
I may have missed something -- were there any specific instances you were
thinking of?

Add encoding arguments
----------------------

Changes:

> Don't add encoding when the locale encoding is OK.
> (We may switch to "locale" after implementing it in `docutils.io`.)

Outwith ``FileInput``, where would you want to use 'locale' for the
encoding? This diverges from the custom and practice in the general Python
ecosystem (and as far as I can tell encodings in general) -- I would
strongly suggest using UTF-8, as it eliminates an entire class of
locale/encoding related bugs.

> Document changes that may affect users.
> Use 'ascii' in "tools/dev/unicode2rstsubs.py".

Makes sense, thanks.

> Break too long lines.

Sorry, I thought I'd done a formatting pass but seemingly not.

Ensure locale_encoding is lower case
------------------------------------

> We can use locale.getpreferredencoding() after dropping Python versions where this was problematic.

Great, thanks.

Handle encoding='locale' for docutils.io.Output
-----------------------------------------------

> Is uppercase ``encoding='LOCALE'`` supported in the standard
> function open() in Python >= 3.10?

Good question, I tested and only the exact literal ``locale`` is accepted,
so we can drop the ``.lower()`` call.

> IMO, we need ``encoding='locale'`` support in both, input and output.
> Should ``encoding='locale'`` be supported in all Input/Output classes or
> only in FileInput/FileOutput?
The patch set I sent last time does, via the default encoding helper method
I added. I don't mind about putting support for ``encoding='locale'`` on
just FileInput/FileOutput -- what would your preference be here?

Deprecations
------------

> Why do you want to deprecate ``io.locale_encoding``?

Because after introducing ``encoding='locale'`` there's no use for
``io.locale_encoding`` in Docutils anymore, and to reduce API surface.

> Why do you want to deprecate auto-detection of the input encoding?
> * ``encoding='locale'`` does not help if my input files are a mix of
>   UTF-8 and latin-1.

"auto-guessing" is a poor term -- basically I meant deprecating using the
locale encoding as default (as it will change to UTF-8).

I'm not sure I understand the example you gave as Docutils works on a single
file basis. Could you add more context please?

> Using Python 3.10's ``-X warn_default_encoding`` argument to Python,
> we can see a large number of places where the default encoding is
> used. On posix systems this is now UTF-8 following PEP 538 [1], but on
> Windows a non-unicode codepage can be used.

> Also on POSIX, the locale encoding is kept unless the locale is "C".

Yes, sorry, I wasn't precise enough.

Thanks,
Adam
|
From: Adam T. <aat...@ou...> - 2022-06-16 14:39:29
|
Attached is a set of five patches rebased on current master -- I have updated the language in the deprecation warnings, used the encoding='locale' backport only for 3.7-3.9 (as 3.10 ``builtins.open`` knows about encoding='locale' natively), and updated the ``io.locale_encoding`` detection mechanism to ignore ``-X utf8``, as the system locale encoding doesn't change for the Python UTF-8 mode. A |
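A backport of this kind could look roughly like the sketch below. It is not the code from the attached patches: ``resolve_encoding`` is a hypothetical helper name, it matches only the exact literal ``'locale'`` (as found in the testing mentioned earlier in the thread), and it does not attempt the ``-X utf8`` handling described above.

```python
import locale
import sys


def resolve_encoding(encoding):
    """Hypothetical helper: make ``encoding='locale'`` usable on Python 3.7-3.9.

    On Python >= 3.10 the value is passed through unchanged, because
    ``open()`` understands ``encoding='locale'`` natively there.
    """
    if encoding == "locale" and sys.version_info < (3, 10):
        # Older Pythons reject "locale" as a codec name, so substitute the
        # encoding name reported by the locale module.
        return locale.getpreferredencoding(False)
    return encoding


resolved = resolve_encoding("locale")
print(resolved)
```

A caller would then write ``open(path, encoding=resolve_encoding(enc))``; once Python versions before 3.10 are dropped, the helper can be removed without touching the call sites' semantics.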
From: Guenter M. <mi...@us...> - 2022-06-17 12:28:34
|
Dear Adam,

On 2022-06-15, Adam Turner wrote:

> Unify naming of the "utf-8" codec
> ---------------------------------

>> I'd prefer 'utf-8' (lowercase, in quotes) also in documentation, if it
>> refers to the Python codec and UTF-8 for the abstract encoding
>> algorithm.

> [...] I couldn't find anywhere in my patch set that I would change [...]

Sorry, this was replying to an earlier statement ("UTF-8 in documentation").
Patch
https://github.com/AA-Turner/docutils/pull/15/commits/f7f45addbd8cc728ef03c28d62b6ea981d0fc8ac
states it very well:

- Use UTF-8 in prose text, error messages, and documentation
- Use utf-8 in code or when referring to code
- Use utf8 for LaTeX

I did not apply the changes in the sample SVG images (generated with
Inkscape), though.

> Add encoding arguments
> ----------------------

>> Don't add encoding when the locale encoding is OK.
>> (We may switch to "locale" after implementing it in `docutils.io`.)

> Outwith ``FileInput``, where would you want to use 'locale' for the encoding?

"quicktest.py" is an old developer diagnostics tool without an option to
select the input/output encodings. I suggest keeping the encoding
unspecified here, so Python's default is used and the user can change the
encoding via either a locale setting or starting Python in UTF-8 mode.

...

> Handle encoding='locale' for docutils.io.Output
> -----------------------------------------------

Which encoding is used with ``open('foo', encoding='locale')`` if Python is
in UTF-8 mode?

> I don't mind about putting support for ``encoding='locale'`` on just
> FileInput/FileOutput -- what would your preference be here?

We want to drop our 'locale' support when dropping support for Py<3.10.
Does Python support 'locale' also with str.encode()?
Maybe we don't even need backporting "locale" (see below).

> Deprecations
> ------------

>> Why do you want to deprecate ``io.locale_encoding``?
> Because after introducing ``encoding='locale'`` there's no use for ``io.locale_encoding`` in Docutils anymore, and to reduce API surface.

OK. We do not need special deprecation, as `io.locale_encoding` is new in
Docutils 0.19.dev (moved from `utils.error_reporting`).

>> Why do you want to deprecate auto-detection of the input encoding?
>> * ``encoding='locale'`` does not help if my input files are a mix of
>>   UTF-8 and latin-1.

> "auto-guessing" is a poor term -- basically I meant deprecating using
> the locale encoding as default (as it will change to UTF-8).

> I'm not sure I understand the example you gave as Docutils works on a
> single file basis. Could you add more context please?

What I want to keep/restore is the "auto-detect" default behaviour for
reading/decoding input on Python2 (when opening files under Python 3, this
only kicks in when the first try raises a UnicodeError):

With unspecified `input_encoding` setting, `io.Input.decode` does:

a) Check the BOM mark and top 2 lines of data for an encoding specification
   and use it, else
b) try UTF-8.
c) If this fails, try the locale encoding (if valid).
d) Try latin-1.
e) Give up, report the error.

This allows decoding most input without the need to configure an encoding.

Whether the future default "input-encoding" should be "auto-detect" or
"utf-8" may be decided later. In any case I would keep "auto-detect" as an
option.

Future (incompatible) changes:

* use `locale.getpreferredencoding()` in c):
  If a user starts Python in UTF-8 mode, we should report decoding errors
  instead of trying a locale encoding.
* maybe drop d)
* warn/info when input encoding is not UTF-8.

Günter
|
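Steps a) to e) above can be sketched in a few lines of Python. This is an illustration of the described order of attempts only, not the Docutils implementation: the names ``declared_encoding`` and ``decode_with_fallback``, the BOM table, and the regular expression are assumptions made for this example.

```python
import codecs
import locale
import re

# Hypothetical tables and names; this is not the code in docutils.io.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
)
CODING_SLUG = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(data):
    """Step a): look for a BOM, then for a declaration in the top 2 lines."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for line in data.splitlines()[:2]:
        match = CODING_SLUG.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def decode_with_fallback(data):
    """Return ``(text, encoding_used)`` following steps a) to e)."""
    declared = declared_encoding(data)
    if declared:
        return data.decode(declared), declared  # a) use the declaration
    candidates = ["utf-8"]  # b)
    try:
        # c) the locale encoding, if it names a codec other than UTF-8
        name = codecs.lookup(locale.getpreferredencoding(False)).name
        if name not in candidates:
            candidates.append(name)
    except LookupError:
        pass
    # d) latin-1 maps every byte to a character, so it never fails;
    # step e) is only reached if this candidate is dropped.
    if "latin-1" not in candidates:
        candidates.append("latin-1")
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeError:
            continue
    raise UnicodeError("unable to determine input encoding")  # e)


print(decode_with_fallback("Grüße\n".encode("utf-8")))
```

Returning the encoding together with the text would also make the proposed "warn/info when input encoding is not UTF-8" straightforward to add at the call site.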
From: Guenter M. <mi...@us...> - 2022-06-17 12:35:11
|
On 2022-06-16, Adam Turner wrote: > Attached is a set of five patches rebased on current master Thanks. I had a look at the first 4 and took them into account in commits [r9075] to [r9078]. Günter |