#194 Deprecate PEP 263 coding slugs support

Milestone: None
Status: closed-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-04-18
Created: 2022-06-09
Creator: Adam Turner
Private: No

Python 3 uses UTF-8 as the default encoding for Python source files, so there is no longer a compelling use case for this support, which adds complexity to the IO implementation.

I propose deprecating support for removal in 1.0, but 2.0 might be a better option.

Support was added in [r4506].

A

1 Attachments

Related

Commit: [r4506]

Discussion

  • Günter Milde

    Günter Milde - 2022-06-10

    > Python 3 uses UTF-8 as the default encoding for Python source files, so
    > there is no longer a compelling use case for this support, which adds
    > complexity to the IO implementation.

    I still see a reason to keep (and properly document) a way to specify the
    encoding of an rST source in the document itself.

    Use cases:

    • A collection of files, where one file for whatever reason must be in a
      different encoding. Compilation with "buildhtml.py".

    • Documents in an 8-bit or 16-bit encoding intended for compilation
      anywhere. Avoids shipping a separate configuration file.

    The "coding slug" might be made obsolete by a more generic "in-document configuration"
    (cf. TODO item misc.settings directive),
    but this is still a long way off.

     
    • Adam Turner

      Adam Turner - 2022-06-11

      > I still see a reason to keep (and properly document) a way to specify the
      > encoding of an rST source in the document itself.

      The underlying thrust of my argument is that this is very fragile -- for any encoding that is not compatible with ASCII (e.g. UTF-16) the current coding slug test fails::

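      >>> import re
      >>> # UTF-16 pads each ASCII character with a NUL byte, so the byte pattern cannot match.
      >>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: utf-16".encode("utf-16-le"))
      >>> m is None
      True
      >>> # An ASCII-compatible encoding such as latin-1 is detected as expected.
      >>> m = re.search(br"coding[:=]\s*([-\w.]+)", "coding: latin-1".encode("latin-1"))
      >>> m.group(1).decode("ascii")
      'latin-1'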
      

      Similarly, any in-document metadata would suffer the same fate -- Unicode code points (which make up str objects) cannot be assumed to correspond to the bytes on disk. Better to fail loudly than to have silent data corruption.
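As an illustration of "fail loudly", here is a minimal sketch that decodes with an explicitly given encoding and strict error handling. The function name `decode_strict` is made up for this example and is not part of Docutils.

```python
def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Decode source bytes, raising UnicodeDecodeError on any invalid byte."""
    # errors="strict" is already the default; it is spelled out to show intent.
    return data.decode(encoding, errors="strict")


latin1_bytes = "café".encode("latin-1")  # b'caf\xe9' is not valid UTF-8

try:
    decode_strict(latin1_bytes)
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True

# A lenient error handler hides the problem and silently alters the text.
silently_corrupted = latin1_bytes.decode("utf-8", errors="replace")  # 'caf\ufffd'
```

With the strict handler the wrong encoding is reported at once; with `errors="replace"` the document is processed with a replacement character where the "é" used to be.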

      > A collection of files, where one file for whatever reason must be in a different encoding. Compilation with "buildhtml.py".

      If the need arises for this, we would accept a feature request for buildhtml.py to have some enumeration of files and their input encodings.
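To make the idea concrete, such an enumeration could be as simple as a list of filename patterns and encodings. The sketch below is purely hypothetical: buildhtml.py has no such option, and the names `INPUT_ENCODINGS` and `input_encoding_for` are invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical table of (filename pattern, input encoding); the first match wins.
INPUT_ENCODINGS = [
    ("legacy/*.txt", "latin-1"),
    ("*.txt", "utf-8"),
]


def input_encoding_for(path, table=INPUT_ENCODINGS, default="utf-8"):
    """Return the input encoding configured for `path`, or `default`."""
    for pattern, encoding in table:
        if fnmatchcase(path, pattern):
            return encoding
    return default
```

Here `input_encoding_for("legacy/old.txt")` gives "latin-1", while any other file falls through to UTF-8.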

      > Documents in an 8-bit or 16-bit encoding intended for compilation anywhere. Avoids shipping a separate configuration file.

      Not sure I understand this one fully, but such a file would likely come with compilation instructions that included the input encoding.

      Anecdotally, I looked through the ~70 results for the following search [1]_ on "grep.app"
      coding[:=]( \t)*(([^u\W]|u[^t\W]|ut[^f\W]|utf-?[^8\W])[-\w.]+)
      (lookaheads/lookbehinds aren't supported) and no file had a coding slug that occurred in the first two lines. Whilst obviously only a fraction of extant reST files are indexed by that provider, if it were a pattern in common usage I would expect to see more than zero.
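With a regex engine that does support lookahead, such as Python's `re`, the same kind of check (a coding slug in the first two lines that names something other than UTF-8) can be sketched as below. The pattern and the function name `find_non_utf8_slug` are assumptions made for this example; this is not the pattern used for the search.

```python
import re

# "coding:" or "coding=", optional blanks, then an encoding name that is not utf-8/utf8.
NON_UTF8_SLUG = re.compile(r"coding[:=][ \t]*(?!utf-?8\b)([-\w.]+)", re.IGNORECASE)


def find_non_utf8_slug(text):
    """Return the declared non-UTF-8 encoding from the first two lines, or None."""
    for line in text.splitlines()[:2]:
        match = NON_UTF8_SLUG.search(line)
        if match:
            return match.group(1)
    return None
```

For example, a file starting with `.. -*- coding: latin-1 -*-` is reported, while a UTF-8 declaration, or a declaration on the third line or later, is ignored.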

      One of my longer-term goals is to simplify docutils.io quite a lot, as I think there is a lot of duplicated code that the current (Python 3) stdlib provides automatically for us. Making our file parsing more vanilla/standard is a step towards this larger goal, although I do believe this change stands alone on its merits.

      A

      _[1]: https://grep.app/search?current=7&q=coding%5B%3A%3D%5D%28%20%5Ct%29%2A%28%28%5B%5Eu%5CW%5D%7Cu%5B%5Et%5CW%5D%7Cut%5B%5Ef%5CW%5D%7Cutf-%3F%5B%5E8%5CW%5D%29%5B-%5Cw.%5D%2B%29&regexp=true&filter[lang][0]=reStructuredText

       
      • Günter Milde

        Günter Milde - 2022-06-12

        > The underlying thrust of my argument is that this is very fragile --

        True, this method only works with ASCII-compatible encodings.
        (This is one of the reasons why Docutils as well as PEP 263 complement it
        with BOM recognition.)
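For illustration, BOM recognition can be sketched with the constants from the standard `codecs` module. This is a simplified example, not the Docutils implementation; the function name `encoding_from_bom` is invented here.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(data: bytes):
    """Return a codec name matching the BOM at the start of `data`, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

The returned codec names ("utf-16", "utf-32", "utf-8-sig") consume the BOM when decoding, so the decoded text does not start with U+FEFF.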

        ...

        > If the need arises [...], we would accept a feature request for
        > buildhtml.py to have some enumeration of files and their input
        > encodings.

        IMO, it is safer to keep "source code encoding both visible and
        changeable on a per-source file basis". [PEP 263]

        Python 3 still supports the encoding slug.
        I vote to keep this option as well.

         
        • Adam Turner

          Adam Turner - 2022-06-12

          Fair enough, I will put this on hold for now. [bugs:#450] is more important to resolve at the moment, before the 0.19.0b1 release.

          A

           

          Related

          Bugs: #450

  • Günter Milde

    Günter Milde - 2022-06-14

    Ticket moved from /p/docutils/bugs/451/

     

    Last edit: Günter Milde 2022-06-14
  • Günter Milde

    Günter Milde - 2022-07-15

    > IMO, it is safer to keep "source code encoding both visible and changeable on a per-source file basis". [PEP 263]

    OTOH, this feature does not need to be implemented in docutils.io.
    The attached "inspecting_codecs" package is a first attempt at implementing the current default behaviour as a codec -- allowing Docutils to use standard io tools.
    BTW: It also recognizes PEP-263-like encoding declarations in UTF-16 and UTF-32.
    Once a replacement is sufficiently stable and available, either inside Docutils or as a separate package,
    deprecating the current encoding handling is OK.
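One possible way to find a PEP-263-like declaration in UTF-16 or UTF-32 data is sketched below. It does not show how "inspecting_codecs" works; the approach (dropping NUL bytes before matching) and the name `declared_encoding` are assumptions made for this example.

```python
import re

CODING_SLUG = re.compile(br"coding[:=]\s*([-\w.]+)")


def declared_encoding(data: bytes):
    """Return the encoding named in the first two lines of `data`, or None."""
    # UTF-16 and UTF-32 pad ASCII characters with NUL bytes; removing the
    # padding from the head of the file leaves plain ASCII to match against.
    head = data[:1024].replace(b"\x00", b"")
    for line in head.split(b"\n")[:2]:
        match = CODING_SLUG.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

The result is only the declared name; a caller would still have to check that it is consistent with the actual byte layout (for example with a BOM) before decoding.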

     
  • Günter Milde

    Günter Milde - 2022-07-24

    The (still provisional) "inspecting_codecs" package is now available on
    https://codeberg.org/milde/inspecting-codecs.

     

    Last edit: Günter Milde 2023-05-19
  • Günter Milde

    Günter Milde - 2023-04-18
    • status: open --> closed-accepted
    • Group: --> None
     
  • Günter Milde

    Günter Milde - 2023-04-18

    The way forward is now specified in the RELEASE-NOTES.

     
