Docutils: Documentation Utilities / Patches / #194 Deprecate PEP 263 coding slugs support

Günter Milde - 2022-06-10

Python 3 uses UTF-8 as the default encoding for Python source files, so there is
no longer a compelling use case for coding slug support, which adds complexity
to the IO implementation.

I still see a reason to keep (and properly document) a way to specify the
encoding of an rST source in the document itself.

Use cases:

A collection of files, where one file for whatever reason must be in a
different encoding. Compilation with "buildhtml.py".

Documents in an 8-bit or 16-bit encoding intended for compilation
anywhere. Avoids shipping a separate configuration file.

The "coding slug" might be made obsolete by a more generic "in-document configuration"
(cf. the TODO item for a misc.settings directive),
but this is still a long way off.

- Adam Turner - 2022-06-11
  
  I see still a reason to keep (and properly document) a way to specify the
  encoding of an rST source in the document itself.
  
  The underlying thrust of my argument is that this is very fragile -- for any encoding that is not compatible with ASCII (e.g. UTF-16) the current coding slug test fails::
  
    >>> import re
    >>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: utf-16".encode("utf-16-le"))
    >>> m is None
    True
    >>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: latin-1".encode("latin-1"))
    >>> m.group(1).decode("ascii")
    'latin-1'
  
  Similarly any in-document metadata would suffer the same fate -- Unicode codepoints (which make up str objects) cannot be assumed to have a correspondence to bytes on disk. Better to fail loudly than have silent data corruption.
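
The difference between failing loudly and corrupting silently can be shown with a minimal sketch. The sample string is made up for illustration and is not taken from Docutils: decoding UTF-8 bytes with a wrong 8-bit codec succeeds and yields mojibake, whereas strict UTF-8 decoding of Latin-1 bytes raises an error.

```python
# Illustrative sample only; not taken from Docutils.
text = "Grüße"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Wrong 8-bit codec: every byte maps to some character, so no error is raised
# and the result is silently corrupted.
garbled = utf8_bytes.decode("latin-1")

# Strict UTF-8 decoding of Latin-1 bytes fails loudly instead.
try:
    latin1_bytes.decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True
```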
  
  A collection of files, where one file for whatever reason must be in a different encoding. Compilation with "buildhtml.py".
  
  If the need arises for this, we would accept a feature request for buildhtml.py to have some enumeration of files and their input encodings.
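
One way such an enumeration could look is sketched below. This is purely hypothetical: `ENCODINGS`, `encoding_for`, and `read_source` are illustrative names, not an existing buildhtml.py feature, and the glob patterns are made-up examples.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical table of glob patterns and input encodings; the first match wins.
ENCODINGS = [
    ("legacy/*.txt", "latin-1"),
    ("*.txt", "utf-8"),
]


def encoding_for(path, table=ENCODINGS, default="utf-8"):
    """Return the input encoding configured for `path` (a POSIX-style string)."""
    for pattern, encoding in table:
        if fnmatch(path, pattern):
            return encoding
    return default


def read_source(path, table=ENCODINGS):
    """Read a source file with its configured encoding, raising on invalid bytes."""
    return Path(path).read_text(encoding=encoding_for(path, table), errors="strict")
```

Keeping the table outside the documents means the encoding is known before any byte of the file is interpreted.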
  
  Documents in an 8-bit or 16-bit encoding intended for compilation anywhere. Avoids shipping a separate configuration file.
  
  Not sure I understand this one fully, but such a file would likely come with compilation instructions that included the input encoding.
  
  Anecdotally, I looked through the ~70 results for the following search [1]_ on "grep.app"
  coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)
  (lookaheads/lookbehinds aren't supported), and no file had a coding slug that occurred in the first two lines. Whilst obviously only a fraction of extant reST files are indexed by that provider, if it were a pattern in common usage I would expect to see more than 0.
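
For a local check, Python's `re` module does support lookaheads, so a roughly equivalent test can be written more compactly. The sketch below rests on two assumptions: that the blank-skipping part of the search was meant as `[ \t]*`, and that only the first two lines matter (as in PEP 263). The name `non_utf8_slug` is illustrative.

```python
import re

# Assumption: optional blanks after the separator are meant as `[ \t]*`.
# The negative lookahead rejects names that start with "utf8" or "utf-8".
NON_UTF8_SLUG = re.compile(rb"coding[:=][ \t]*(?!utf-?8)([-\w.]+)")


def non_utf8_slug(data):
    """Return a non-UTF-8 encoding name declared in the first two lines, else None."""
    for line in data.splitlines()[:2]:
        match = NON_UTF8_SLUG.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```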
  
  One of my longer-term goals is to simplify docutils.io quite a lot, as I think there is a lot of duplicated code that the current (Python 3) stdlib provides automatically for us. Making our file parsing more vanilla/standard is a step towards this larger goal, although I do believe this change stands alone on its merits.
  
  A
  
  _[1]: https://grep.app/search?current=7&q=coding%5B%3A%3D%5D%28%20%5Ct%29%2A%28%28%5B%5Eu%5CW%5D%7Cu%5B%5Et%5CW%5D%7Cut%5B%5Ef%5CW%5D%7Cutf-%3F%5B%5E8%5CW%5D%29%5B-%5Cw.%5D%2B%29&regexp=true&filter[lang][0]=reStructuredText
  
  - Günter Milde - 2022-06-12
    
    The underlying thrust of my argument is that this is very fragile --
    
    True, this method only works with ASCII-compatible encodings.
    (This is one of the reasons why Docutils as well as PEP 263 complement it
    with BOM recognition.)
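
BOM recognition can be sketched with the constants from the standard `codecs` module. This is a minimal illustration, not the Docutils implementation; `BOMS` and `encoding_from_bom` are illustrative names.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
# The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM themselves.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(data):
    """Return a codec name for the BOM at the start of `data`, or None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None
```

A caller would fall back to the coding slug or a default encoding when the function returns `None`.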
    
    ...
    
    If the need arises [...], we would accept a feature request for
    buildhtml.py to have some enumeration of files and their input
    encodings.
    
    IMO, it is safer to keep "source code encoding both visible and
    changeable on a per-source file basis". [PEP 263]
    
    Python 3 still supports the encoding slug.
    I vote to keep this option as well.
    
    - Adam Turner - 2022-06-12
      
      Fair enough, I will put this on hold for now. [bugs:#450] is more important to resolve at the moment, before the 0.19.0b1 release.
      
      A
      
      Related
      
      Bugs: ~~#450~~
      
Günter Milde - 2022-06-14

Ticket moved from /p/docutils/bugs/451/

Last edit: Günter Milde 2022-06-14

Günter Milde - 2022-07-15

IMO, it is safer to keep "source code encoding both visible and changeable on a per-source file basis". [PEP 263]

OTOH, this feature does not need to be implemented in docutils.io.
The attached "inspecting_codecs" package is a first attempt at implementing the current default behaviour as a codec, allowing Docutils to use standard io tools.
BTW: It also recognizes PEP-263-like encoding declarations in UTF-16 and UTF-32.
Once a replacement is sufficiently stable and available, either inside Docutils or as a separate package,
deprecating the current encoding handling is OK.
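
Recognizing a declaration in UTF-16 or UTF-32 can be sketched by trial-decoding the start of the file. This is only an illustration of the idea under stated assumptions, not the code of the "inspecting_codecs" package: `SLUG`, `CANDIDATES`, and `declared_encoding` are made-up names, and the declared name is returned without validating it against the codec that made it readable.

```python
import re

# re.ASCII keeps \w and \s restricted to ASCII, as in the bytes pattern used by PEP 263.
SLUG = re.compile(r"coding[:=]\s*([-\w.]+)", re.ASCII)
# Candidate codecs for reading the declaration;
# latin-1 stands in for any ASCII-compatible encoding.
CANDIDATES = ("latin-1", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def declared_encoding(data):
    """Return the encoding named in the first two lines of `data`, or None."""
    head = data[:512]  # a multiple of 4, so UTF-16/32 code units stay aligned
    for candidate in CANDIDATES:
        text = head.decode(candidate, errors="replace")
        for line in text.splitlines()[:2]:
            match = SLUG.search(line)
            if match:
                return match.group(1)
    return None
```

Decoding with the wrong candidate interleaves NUL or unrelated characters, so the pattern only matches under a candidate whose code unit width and byte order fit the data.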

inspecting_codecs-0.1.0.tar.gz

Günter Milde - 2022-07-24

The (still provisional) "inspecting_codecs" package is now available on
https://codeberg.org/milde/inspecting-codecs.

Last edit: Günter Milde 2023-05-19

Günter Milde - 2022-12-02

See also the Docutils Enhancement Proposal at https://docutils.sourceforge.io/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt

The attached patch set implements the changes announced in the RELEASE-NOTES.

0001-Read-binary-data-and-decode-with-heuristics-if-input.patch

0002-After-detecting-a-BOM-leave-handling-it-to-Python-s-.patch

0003-Document-input-encoding-auto-detection.patch

Günter Milde - 2023-04-18

status: open --> closed-accepted

Group: --> None

Günter Milde - 2023-04-18

The way forward is now specified in the RELEASE-NOTES.
