From: Adam T. <aa-...@us...> - 2022-06-11 01:13:58
> I see still a reason to keep (and properly document) a way to specify the
> encoding of an rST source in the document itself.

The underlying thrust of my argument is that this is very fragile -- for any encoding that is not compatible with ASCII (e.g. UTF-16) the current coding slug test fails::

```pycon
>>> import re
>>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: utf-16".encode("utf-16-le"))
>>> m is None
True
>>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: latin-1".encode("latin-1"))
>>> m.group(1).decode("ascii")
'latin-1'
```

Similarly, any in-document metadata would suffer the same fate -- Unicode codepoints (which make up `str` objects) cannot be assumed to have a correspondence to bytes on disk. Better to fail loudly than have silent data corruption.

> A collection of files, where one file for whatever reason must be in a different encoding. Compilation with "buildhtml.py".

If the need arises for this, we would accept a feature request for `buildhtml.py` to have some enumeration of files and their input encodings.

> Documents in an 8-bit or 16-bit encoding intended for compilation anywhere. Avoids shipping a separate configuration file.

Not sure I understand this one fully, but such a file would likely come with compilation instructions that included the input encoding.

Anecdotally, I looked through the ~70 results for the following search [1]_ on "grep.app":

`coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)`

(lookaheads/lookbehinds aren't supported), and no file had a coding slug that occurred in the first two lines. Whilst obviously only a fraction of extant reST files are indexed by that provider, if it were a pattern in common usage I would expect to see more than 0.

One of my longer-term goals is to simplify `docutils.io` quite a lot, as I think it contains a lot of duplicated code for things that the current (Python 3) stdlib provides automatically for us.
Making our file parsing more vanilla/standard is a step towards this larger goal, although I do believe this change stands alone on its merits.

A

_[1]: `https://grep.app/search?current=7&q=coding%5B%3A%3D%5D%28%20%5Ct%29%2A%28%28%5B%5Eu%5CW%5D%7Cu%5B%5Et%5CW%5D%7Cut%5B%5Ef%5CW%5D%7Cutf-%3F%5B%5E8%5CW%5D%29%5B-%5Cw.%5D%2B%29&regexp=true&filter[lang][0]=reStructuredText`

---

** [bugs:#451] Deprecate PEP 263 coding slugs support**

**Status:** open
**Created:** Thu Jun 09, 2022 10:48 PM UTC by Adam Turner
**Last Updated:** Thu Jun 09, 2022 10:48 PM UTC
**Owner:** nobody
**Attachments:**

- [0001-Deprecate-PEP-263-coding-slugs.patch](https://sourceforge.net/p/docutils/bugs/451/attachment/0001-Deprecate-PEP-263-coding-slugs.patch) (5.5 kB; application/octet-stream)

Python 3 uses UTF-8 as the default encoding for Python source files, so there is no longer a compelling use case for this support, which adds complexity to the IO implementation.

I propose deprecating support for removal in 1.0, but 2.0 might be a better option.

Support was added in [r4506].

A
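
As a rough illustration of the fragility described in the reply above, the sketch below applies the quoted byte pattern to the first two lines of the same document saved in several encodings. `find_coding_slug` and the sample text are hypothetical names made up for this example; this is not the docutils implementation, only an assumption about how such a check might look.

```python
import re

# Byte-level pattern quoted in the message above.
CODING_SLUG = re.compile(br"coding[:=]\s*([-\w.]+)")


def find_coding_slug(data):
    """Return the encoding named in the first two lines of `data`, or None.

    Hypothetical helper for illustration only.
    """
    for line in data.splitlines()[:2]:
        match = CODING_SLUG.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


# Sample document with a slug on its first line (made up for this sketch).
SAMPLE = ".. -*- coding: {name} -*-\n\nTitle\n=====\n"

for name in ("utf-8", "latin-1", "utf-16-le", "utf-16-be"):
    data = SAMPLE.format(name=name).encode(name)
    # In UTF-16 every ASCII character is padded with a NUL byte, so the
    # contiguous byte string b"coding" never appears and the search fails.
    print(name, "->", find_coding_slug(data))
```

The two ASCII-compatible encodings report their slug, while both UTF-16 variants yield `None`: the declaration is present in the text but invisible to a test that runs on raw bytes.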