[Docutils-develop] [docutils:bugs] Re: #451 Deprecate PEP 263 coding slugs support


> I see still a reason to keep (and properly document) a way to specify the
> encoding of an rST source in the document itself.

The underlying thrust of my argument is that this is very fragile -- for any encoding that is not compatible with ASCII (e.g. UTF-16) the current coding slug test fails:

```pycon
>>> import re
>>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: utf-16".encode("utf-16-le"))
>>> m is None
True
>>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: latin-1".encode("latin-1"))
>>> m.group(1).decode("ascii")
'latin-1'
```

Similarly, any in-document metadata would suffer the same fate -- Unicode code points (which make up `str` objects) cannot be assumed to have a correspondence to bytes on disk. Better to fail loudly than have silent data corruption.
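
A minimal sketch of the difference between the two outcomes, assuming a Latin-1 file that is read with the wrong codec (the sample text and the choice of `cp437` as the "wrong" codec are made up for illustration):

```python
# Illustrative only: the sample text and codec choices are assumptions.
text = "Grüße"
data = text.encode("latin-1")  # the bytes as they would sit on disk

# A permissive 8-bit codec accepts any byte, so decoding with the wrong
# one succeeds without complaint and yields different text.
wrong = data.decode("cp437")
silently_corrupted = wrong != text

# Strict UTF-8 decoding rejects the same bytes, so the mismatch is
# reported instead of propagating into the output.
try:
    data.decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True
```

Both flags end up `True` here: the first decode quietly produces the wrong characters, while the second raises.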

> A collection of files, where one file for whatever reason must be in a different encoding. Compilation with "buildhtml.py".

If the need arises for this, we would accept a feature request for `buildhtml.py` to have some enumeration of files and their input encodings.
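
As a rough sketch of what such an enumeration might look like: `INPUT_ENCODINGS`, `DEFAULT_ENCODING` and `read_source` are hypothetical names for illustration, not an existing `buildhtml.py` option or API.

```python
from pathlib import Path

# Hypothetical mapping of file names to input encodings; nothing like
# this exists in buildhtml.py today.
INPUT_ENCODINGS = {"legacy.txt": "latin-1"}
DEFAULT_ENCODING = "utf-8"


def read_source(path):
    """Read *path* with its listed encoding, else the default."""
    path = Path(path)
    encoding = INPUT_ENCODINGS.get(path.name, DEFAULT_ENCODING)
    # read_text() decodes strictly by default, so a file that does not
    # match its declared encoding raises UnicodeDecodeError.
    return path.read_text(encoding=encoding)
```

The encoding is chosen from information outside the file, so no bytes need to be inspected before decoding.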

> Documents in an 8-bit or 16-bit encoding intended for compilation anywhere. Avoids shipping a separate configuration file.

Not sure I understand this one fully, but such a file would likely come with compilation instructions that included the input encoding.

Anecdotally, I looked through the ~70 results for the following search [1]_ on "grep.app"
`coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)`
(lookaheads/lookbehinds aren't supported) and no file had a coding slug that occurred in the first two lines. Whilst obviously only a fraction of extant reST files are indexed by that provider, if it was a pattern in common usage I would expect to see more than 0.

One of my longer-term goals is to simplify `docutils.io` quite a lot, as I think there is a lot of duplicated code that the current (Python 3) stdlib provides automatically for us. Making our file parsing more vanilla/standard is a step towards this larger goal, although I do believe this change stands alone on its merits.

A

_[1]: `https://grep.app/search?current=7&q=coding%5B%3A%3D%5D%28%20%5Ct%29%2A%28%28%5B%5Eu%5CW%5D%7Cu%5B%5Et%5CW%5D%7Cut%5B%5Ef%5CW%5D%7Cutf-%3F%5B%5E8%5CW%5D%29%5B-%5Cw.%5D%2B%29&regexp=true&filter[lang][0]=reStructuredText`

---

**[bugs:#451] Deprecate PEP 263 coding slugs support**

**Status:** open
**Created:** Thu Jun 09, 2022 10:48 PM UTC by Adam Turner
**Last Updated:** Thu Jun 09, 2022 10:48 PM UTC
**Owner:** nobody
**Attachments:**

- [0001-Deprecate-PEP-263-coding-slugs.patch](https://sourceforge.net/p/docutils/bugs/451/attachment/0001-Deprecate-PEP-263-coding-slugs.patch) (5.5 kB; application/octet-stream)

Python 3 uses UTF-8 as the default encoding for Python source files, so there is no longer a compelling use case for this support, which adds complexity to the IO implementation.

I propose deprecating support for removal in 1.0, but 2.0 might be a better option.

Support was added in [r4506].

A
