Python 3 uses UTF-8 as the default encoding for Python source files, so there is no longer a compelling use case for this support, which adds complexity to the I/O implementation.
I propose deprecating the support now, with removal in 1.0 (though 2.0 might be a better option).
Support was added in [r4506].
I still see a reason to keep (and properly document) a way to specify the
encoding of an rST source in the document itself.
Use cases:

* A collection of files, where one file for whatever reason must be in a
  different encoding. Compilation with "buildhtml.py".
* Documents in an 8-bit or 16-bit encoding intended for compilation
  anywhere. Avoids shipping a separate configuration file.
The "coding slug" might become obsoleted by a more generic "in-document configuration"
(cf. TODO item misc.settings directive
but this is still a long way off.
The underlying thrust of my argument is that this mechanism is very fragile: for any encoding that is not ASCII-compatible (e.g. UTF-16), the current coding-slug test fails.
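A minimal sketch of the failure mode (the regex here is illustrative, not Docutils' exact pattern):

```python
import re

# The same document, once as UTF-8 and once as UTF-16.
text = ".. -*- coding: utf-16 -*-\n\nSome document text.\n"
slug = re.compile(rb"coding[:=]\s*([-\w.]+)")  # byte-oriented slug search

print(slug.search(text.encode("utf-8")))   # finds the declaration
print(slug.search(text.encode("utf-16")))  # None: every ASCII character is
                                           # interleaved with NUL bytes
```

So a UTF-16 file declaring its own encoding is never recognized by a byte-level ASCII search.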
Similarly, any in-document metadata would suffer the same fate: Unicode code points (which make up ``str`` objects) cannot be assumed to correspond to bytes on disk. Better to fail loudly than to suffer silent data corruption.

If the need arises, we would accept a feature request for ``buildhtml.py`` to accept some enumeration of files and their input encodings.

Not sure I understand the second use case fully, but such a file would likely come with compilation instructions that include the input encoding.
Anecdotally, I looked through the ~70 results for the following search [1]_ on "grep.app"::

    coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)

(lookaheads/lookbehinds aren't supported) and no file had a coding slug occurring in the first two lines. While obviously only a fraction of extant reST files are indexed by that provider, if this were a pattern in common use I would expect to see more than zero matches.
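For reference, the same pattern (which matches any coding slug whose value is not utf-8/utf8) can be exercised locally with Python's ``re``; the sample strings below are my own:

```python
import re

# The grep.app search pattern: a coding slug with a non-UTF-8 value.
pat = re.compile(r"coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)")

print(bool(pat.search(".. -*- coding:latin-1 -*-")))  # True: non-UTF-8 slug
print(bool(pat.search(".. -*- coding:utf-8 -*-")))    # False: utf-8 excluded
```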
One of my longer-term goals is to simplify ``docutils.io`` quite a lot, as I think it contains a lot of duplicated code that the current (Python 3) stdlib now provides for us. Making our file parsing more vanilla/standard is a step towards this larger goal, although I do believe this change stands on its own merits.
.. [1] https://grep.app/search?current=7&q=coding%5B%3A%3D%5D%28%20%5Ct%29%2A%28%28%5B%5Eu%5CW%5D%7Cu%5B%5Et%5CW%5D%7Cut%5B%5Ef%5CW%5D%7Cutf-%3F%5B%5E8%5CW%5D%29%5B-%5Cw.%5D%2B%29&regexp=true&filter[lang][0]=reStructuredText
True, this method only works with ASCII-compatible encodings.
(This is one of the reasons why Docutils, like PEP 263, complements it
with BOM recognition.)
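BOM recognition itself is straightforward with the stdlib; a minimal sketch (the function name and encoding labels are my own, not Docutils code):

```python
import codecs

def detect_bom(raw: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    # UTF-32 BOMs must be tested before UTF-16: the UTF-32-LE BOM
    # starts with the same two bytes as the UTF-16-LE BOM.
    for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            return name
    return None

print(detect_bom(codecs.BOM_UTF16_BE + "text".encode("utf-16-be")))  # utf-16-be
```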
...
IMO, it is safer to keep "source code encoding both visible and
changeable on a per-source file basis" [PEP 263].
Python 3 still supports the encoding slug.
I vote to keep this option as well.
Fair enough; I will put this on hold for now. [bugs:#450] is more important to resolve before the 0.19.0b1 release.
Related

Bugs: #450

Ticket moved from /p/docutils/bugs/451/
Last edit: Günter Milde 2022-06-14
OTOH, this feature does not need to be implemented in docutils.io.
The attached "inspecting_codecs" package is a first try to implement the current default behaviour as a codec -- allowing Docutils to use standard io tools.
BTW: It also recognizes PEP-263-like encoding declarations in UTF-16 and UTF-32.
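As a much-simplified sketch of the idea (my own code, not the actual inspecting_codecs implementation): register a codec whose decoder first inspects the raw bytes, so that ordinary ``bytes.decode()`` picks the declared encoding. A real implementation would also need incremental decoders for use with ``open()`` and would handle BOMs and UTF-16/32 declarations.

```python
import codecs
import re

CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def _inspect(raw: bytes) -> str:
    """Guess the encoding from a PEP-263-style slug; fall back to UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    match = CODING_RE.search(raw, 0, 200)  # only look near the top
    return match.group(1).decode("ascii") if match else "utf-8"

def _search(name):
    if name != "inspect":
        return None
    def decode(data, errors="strict"):
        encoding = _inspect(bytes(data))
        text, _ = codecs.getdecoder(encoding)(bytes(data), errors)
        return text, len(data)
    return codecs.CodecInfo(encode=codecs.getencoder("utf-8"),
                            decode=decode, name="inspect")

codecs.register(_search)

# Standard tools can now use the inspecting codec by name:
raw = b"# -*- coding: latin-1 -*-\nCaf\xe9\n"
print(raw.decode("inspect"))  # decoded as Latin-1, yielding 'Caf\xe9' as text
```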
With a replacement that is sufficiently stable and available either inside Docutils or as a separate package,
deprecating the current encoding handling is OK.
The (still provisional) "inspecting_codecs" package is now available on
https://codeberg.org/milde/inspecting-codecs.
Last edit: Günter Milde 2023-05-19
See also the Docutils Enhancement Proposal at https://docutils.sourceforge.io/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt
The attached patch set implements the changes announced in the RELEASE-NOTES.
The way forward is now specified in the RELEASE-NOTES.