When running the distutils tests with PYTHONWARNDEFAULTENCODING=1, two warnings are emitted:
distutils/tests/test_check.py::TestCheck::test_check_restructuredtext
/Users/jaraco/code/pypa/distutils/.tox/py/lib/python3.12/site-packages/docutils/io.py:381: EncodingWarning: 'encoding' argument not specified
self.source = open(source_path, mode,
distutils/tests/test_check.py::TestCheck::test_check_restructuredtext
/Users/jaraco/code/pypa/distutils/.tox/py/lib/python3.12/site-packages/docutils/io.py:151: EncodingWarning: UTF-8 Mode affects locale.getpreferredencoding(). Consider locale.getencoding() instead.
fallback = locale.getpreferredencoding(do_setlocale=False)
Docutils should honor PEP 597 and address these warnings (and possibly others). In my experience, adding encoding='utf-8' to any I/O operation is the best approach: it's straight-up compatible with the default on non-Windows systems, and honoring the Unix convention is usually suitable, if not preferable, on Windows. Not only that, but that behavior will become the default in Python 3.15 or so.
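As a minimal sketch of that suggestion (the helper name read_source and the sample file are hypothetical and not Docutils code), passing the encoding explicitly makes the result independent of the locale:

```python
import tempfile
from pathlib import Path


def read_source(source_path):
    """Read a text file as UTF-8, independent of the locale encoding.

    Hypothetical helper for illustration; not part of Docutils.
    """
    # An explicit encoding avoids the PEP 597 EncodingWarning that
    # open() emits when PYTHONWARNDEFAULTENCODING=1 is set.
    with open(source_path, mode="r", encoding="utf-8") as source:
        return source.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.rst"
    # Write a small file containing a non-ASCII character (dagger, U+2020).
    sample.write_text("dagger: \u2020\n", encoding="utf-8")
    text = read_source(sample)
```

The same file would decode differently, or fail to decode, under a cp1252 or ISO 8859-1 locale if the encoding argument were left out.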
Thank you for the feedback. The problem is being worked on:
[r9772] changes the default encoding from None (auto-detect) to "utf-8" in docutils.io.Input and docutils.io.FileInput.
Related
Commit: [r9772]
[r9772] breaks tests for non-UTF locales on both Linux and Windows (e.g. ISO 8859-1), when not using Python's UTF-8 mode.
See the following failures (from GitHub Actions, scroll up to the first section "Run test suite (pytest ./test)"):
I'm not sure what the right behaviour here should be.
There's also a problem on the same non-UTF-8 locales when not in UTF-8 mode:
(On Windows it says "ValueError: Encoding of <file> (cp1252) differs" instead.)
This failure only happens with alltests.py. Now that both pytest and unittest work with our test suite, we could consider removing alltests.py.
I tested reverting and re-applying [r9772] in this PR. Note that the 'alltest' failure occurs both before and after, but pytest and unittest fail once [r9772] is reapplied.
In my opinion, the project should stop honoring the "preferred encoding" and instead expect UTF-8 unless otherwise specified, as that's going to become the default behavior in Python 3.15 for most IO operations. I'm unsure of compatibility implications. It does appear as if this test (test_fallback_no_utf8) would no longer be relevant in that regime, so I'd just delete it.
Regarding the non-UTF8 mode, that does sound more complicated, although maybe that functionality too should be deprecated/removed. That is, IMHO, the user should be offered UTF-8 mode by default and an option to specify an encoding, maybe with "locale" as one option, but otherwise remove the implied "locale" behavior.
I have a very weak understanding of docutils, however, so take my advice with a grain of salt.
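To make the proposal concrete, here is a rough sketch of what "UTF-8 by default, explicit encoding on request, locale only when asked for" could look like. The function name resolve_input_encoding and the "locale" keyword are assumptions for illustration only; this is not how Docutils implements its settings.

```python
import locale


def resolve_input_encoding(setting=None):
    """Map a user-supplied encoding setting to a codec name.

    Hypothetical helper: UTF-8 unless the user asks for something else.
    """
    if setting is None:
        # No implied locale lookup: the default is always UTF-8.
        return "utf-8"
    if setting == "locale":
        # locale.getencoding() exists only in Python 3.11+; it is not
        # affected by UTF-8 mode, unlike locale.getpreferredencoding().
        get_encoding = getattr(locale, "getencoding", None)
        if get_encoding is not None:
            return get_encoding()
        return locale.getpreferredencoding(False)
    # Any other value is taken as an explicit codec name.
    return setting
```

With this shape, resolve_input_encoding() returns "utf-8" on every platform, and the locale is consulted only for an explicit "locale" request.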
I agree, however...
There was fairly extensive discussion of this in April last year. The core issue is that Docutils serialises to formats that have internal charset/encoding declarations (e.g. TeX, HTML, XML). If everything is UTF-8 then all is fine and simple, but if the user wants e.g. latin1 encoding then Docutils 'should' encode that in the relevant places in the output documents. Docutils also chooses whether to embed a Unicode character directly vs using an escape or macro (e.g. the dagger † footnote symbol) based on the chosen encoding.
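The "embed directly vs. use a macro" decision can be sketched as follows. This is only an illustration of the idea; the function symbol_for_encoding is hypothetical and is not the code Docutils' writers use.

```python
def symbol_for_encoding(char, output_encoding, macro):
    """Return char if the output encoding can represent it, else macro.

    Hypothetical helper for illustration; not Docutils' implementation.
    """
    try:
        # Encoding succeeds only if the codec has a mapping for char.
        char.encode(output_encoding)
    except UnicodeEncodeError:
        return macro
    return char


DAGGER = "\u2020"  # the dagger footnote symbol

# UTF-8 can represent the dagger, so it is embedded directly.
direct = symbol_for_encoding(DAGGER, "utf-8", r"\dag")

# ISO 8859-1 (latin1) cannot, so the LaTeX macro is used instead.
escaped = symbol_for_encoding(DAGGER, "latin1", r"\dag")
```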
I am of the opinion that Docutils should remove support for encodings other than Unicode (UTF-8) in text mode for both input and output. UTF-8 is so ubiquitous that anyone running a modern enough Python to use this version of Docutils will either support UTF-8 or know how to work around any problems.
The only writer that makes runtime use of the output encoding setting is LaTeX. LuaTeX and XeTeX have always supported UTF-8 in source files, and LaTeX has since 2018.
To your original point, running with PYTHONWARNDEFAULTENCODING=1 should now produce no warnings. If you still get warnings, please let us know, as it would mean we are missing test coverage (Docutils' tests pass with -Werror and -Xwarn_default_encoding).
Output encoding has defaulted to "utf-8" for all writers for several years.
Most writers honour the "output-encoding" setting and encode the output file accordingly. So, you may use "rst2html5 --output-encoding=ASCII:xmlcharrefreplace" to have a pure ASCII file.
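What the "ASCII:xmlcharrefreplace" value amounts to can be shown with plain Python codecs. The helper encode_output below is a hypothetical sketch that assumes the setting is a codec name optionally followed by ":" and an error handler; it is not Docutils' own parsing code.

```python
def encode_output(text, output_encoding):
    """Encode text according to an "encoding[:error_handler]" spec.

    Hypothetical helper for illustration; not Docutils' implementation.
    """
    # Split "ASCII:xmlcharrefreplace" into codec name and error handler;
    # without a ":" the handler part is empty and "strict" is used.
    encoding, _, error_handler = output_encoding.partition(":")
    return text.encode(encoding, error_handler or "strict")


# The dagger (U+2020) is not ASCII, so the xmlcharrefreplace handler
# turns it into a numeric character reference.
encoded = encode_output("footnote \u2020", "ASCII:xmlcharrefreplace")
# encoded == b"footnote &#8224;"
```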
The HTML, XML, and LaTeX writers also specify the encoding used in the file; the LaTeX writer also provides replacements for Unicode characters that are not encodable if a legacy output encoding is selected. There should be no cases of output_encoding == None.
Input encoding was "auto-select" with fallback to utf-8 and "locale encoding" until 0.21. After the discussion last year, the transition to utf-8 started: 0.22 uses "utf-8" as the input encoding default, and we will remove the input encoding auto-detection code in Docutils 1.0.
The offending test case, "test_fallback_no_utf8()", was more trouble than help and has been removed in [r9864].
Related
Commit: [r9864]