
#490 EncodingWarnings in io module

open-fixed
nobody
None
5
2024-08-07
2024-06-28
No

When running the distutils tests with PYTHONWARNDEFAULTENCODING=1, two warnings are emitted:

distutils/tests/test_check.py::TestCheck::test_check_restructuredtext
  /Users/jaraco/code/pypa/distutils/.tox/py/lib/python3.12/site-packages/docutils/io.py:381: EncodingWarning: 'encoding' argument not specified
    self.source = open(source_path, mode,

distutils/tests/test_check.py::TestCheck::test_check_restructuredtext
  /Users/jaraco/code/pypa/distutils/.tox/py/lib/python3.12/site-packages/docutils/io.py:151: EncodingWarning: UTF-8 Mode affects locale.getpreferredencoding(). Consider locale.getencoding() instead.
    fallback = locale.getpreferredencoding(do_setlocale=False)

Docutils should honor PEP 597 and address these warnings (and possibly others). In my experience, adding encoding='utf-8' to every I/O operation is the best approach: it is fully compatible with the default on non-Windows systems, and following the Unix convention is usually suitable, if not preferable, on Windows. That behavior is also expected to become the default in Python 3.15 (PEP 686).
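
A minimal sketch of the two changes the warnings point at, using only the standard library. The helper names `read_text_utf8` and `locale_encoding` are hypothetical and are not part of Docutils; `locale.getencoding()` is assumed to be available only on Python 3.11 and later.

```python
import locale
import tempfile
from pathlib import Path


def read_text_utf8(path):
    """Read a text file with an explicit encoding (no implicit locale default)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def locale_encoding():
    """Return the locale encoding without being affected by UTF-8 mode."""
    # locale.getencoding() exists on Python 3.11+; fall back on older versions.
    getencoding = getattr(locale, "getencoding", None)
    if getencoding is not None:
        return getencoding()
    return locale.getpreferredencoding(False)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("Grüße\n", encoding="utf-8")
    text = read_text_utf8(path)
```

Passing the encoding explicitly is what silences the first warning; the second warning is about which locale query is used, not about file access.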

Discussion

  • Günter Milde

    Günter Milde - 2024-06-29

    Thank you for the feedback. The problem is being worked on:

    • In the repository version, the default encoding is "utf-8".
    • In Docutils 1.0, the input encoding detection will be removed. This removal includes the fallback "getpreferredencoding()" in line 151.
     
  • Günter Milde

    Günter Milde - 2024-07-22
    • status: open --> open-fixed
     
  • Günter Milde

    Günter Milde - 2024-07-22

    [r9772] changes the default encoding from None (auto-detect) to "utf-8" in docutils.io.Input and docutils.io.FileInput.

     

    Related

    Commit: [r9772]

  • Adam Turner

    Adam Turner - 2024-08-01

    [r9772] breaks tests for non-UTF-8 locales on both Linux and Windows (e.g. ISO 8859-1), when not using Python's UTF-8 mode.

    See the following failures (from GitHub Actions, scroll up to the first section Run test suite (pytest ./test)):

    _____________________ FileInputTests.test_fallback_no_utf8 _____________________
    
    self = <test.test_io.FileInputTests testMethod=test_fallback_no_utf8>
    
        @unittest.skipIf(preferredencoding in (None, 'ascii', 'utf-8'),
                         'locale encoding not set or UTF-8')
        def test_fallback_no_utf8(self):
            # If  no encoding is given and decoding with 'utf-8' fails,
            # use the locale's preferred encoding (if not None).
            # Provisional: the default will become 'utf-8'
            # (without auto-detection and fallback) in Docutils 0.22.
            source = du_io.FileInput(
                source_path=os.path.join(DATA_ROOT, 'latin1.txt'))
    >       data = source.read()
    
    test/test_io.py:321: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    docutils/io.py:412: in read
        data = self.source.read()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <encodings.utf_8.IncrementalDecoder object at 0x7f023f9906f0>
    input = b'Gr\xfc\xdfe\n', final = True
    
    >   ???
    E   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte
    
    <frozen codecs>:322: UnicodeDecodeError
    

    I'm not sure what the right behaviour here should be.
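
A minimal sketch of why the read fails, using the bytes shown in the traceback above. It only demonstrates the codec behaviour and makes no claim about how Docutils should resolve it.

```python
# Bytes from the traceback: "Grüße\n" encoded as Latin-1 (ISO 8859-1).
data = b"Gr\xfc\xdfe\n"

try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xfc is not a valid UTF-8 start byte, so strict decoding fails.
    utf8_ok = False

# The same bytes decode cleanly when the real encoding is given explicitly.
latin1_text = data.decode("latin-1")
```

With a fixed "utf-8" default and no fallback, a Latin-1 file can only be read by naming its encoding.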


    There's also a problem on the same non-UTF-8 locales when not in UTF-8 mode:

    ======================================================================
    ERROR: test_publish_cmdline (test_publisher.ConvenienceFunctionTests.test_publish_cmdline)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 525, in write
        self.destination.write(data)
      File "/home/runner/work/docutils/docutils/docutils/test/alltests.py", line 63, in write
        self.stream.write(string)
    TypeError: write() argument must be str, not bytes
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 529, in write
        self.destination.buffer.write(data)
        ^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'Tee' object has no attribute 'buffer'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/runner/work/docutils/docutils/docutils/test/test_publisher.py", line 160, in test_publish_cmdline
        core.publish_cmdline(writer_name='null',
      File "/home/runner/work/docutils/docutils/docutils/docutils/core.py", line 431, in publish_cmdline
        output = publisher.publish(
                 ^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/docutils/docutils/docutils/docutils/core.py", line 261, in publish
        output = self.writer.write(self.document, self.destination)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/docutils/docutils/docutils/docutils/writers/__init__.py", line 81, in write
        return self.destination.write(self.output)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 533, in write
        raise ValueError(
    ValueError: Encoding of <file> (iso8859-1) differs 
      from specified encoding (utf-8)
    
    ----------------------------------------------------------------------
    Ran 1872 tests in 4.888s
    
    FAILED (errors=1, skipped=2)
    Elapsed time: 5.079 seconds
    

    (On Windows it says ValueError: Encoding of <file> (cp1252) differs instead).

    This failure only happens with alltests.py. Now that both pytest and unittest work with our test suite, we could consider removing alltests.py.
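
The chain of exceptions in the traceback can be modelled with a simplified sketch. The `Tee` class and `write_encoded` function below are hypothetical stand-ins written for illustration; they are not the code in alltests.py or docutils/io.py.

```python
import io


class Tee:
    """Hypothetical stand-in: forwards writes to a text stream, has no .buffer."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, string):
        self.stream.write(string)


def write_encoded(destination, data):
    """Try a plain write first, then the binary buffer, then give up."""
    try:
        destination.write(data)
    except TypeError:
        # A text stream rejects bytes; a real file object would expose .buffer.
        try:
            destination.buffer.write(data)
        except AttributeError:
            raise ValueError(
                "destination accepts only str and has no binary buffer")


tee = Tee(io.StringIO())
write_encoded(tee, "text")  # str goes straight through
```

Calling `write_encoded(tee, b"bytes")` on this stand-in raises ValueError, because the wrapper accepts only str and offers no binary buffer to fall back on.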

    A

     

    Related

    Commit: [r9772]

  • Adam Turner

    Adam Turner - 2024-08-01

    I tested reverting and re-applying [r9772] in this PR. Note that the alltests.py failure occurs both before and after, but pytest and unittest fail only when [r9772] is re-applied.

    A

     

    Related

    Commit: [r9772]

  • Jason R. Coombs

    Jason R. Coombs - 2024-08-01

    > I'm not sure what the right behaviour here should be.

    In my opinion, the project should stop honoring the "preferred encoding" and instead expect UTF-8 unless otherwise specified, as that's going to become the default behavior in Python 3.15 for most IO operations. I'm unsure of compatibility implications. It does appear as if this test (test_fallback_no_utf8) would no longer be relevant in that regime, so I'd just delete it.

    Regarding the non-UTF8 mode, that does sound more complicated, although maybe that functionality too should be deprecated/removed. That is, IMHO, the user should be offered UTF-8 mode by default and an option to specify an encoding, maybe with "locale" as one option, but otherwise remove the implied "locale" behavior.

    I have a very weak understanding of docutils, however, so take my advice with a grain of salt.

     
  • Adam Turner

    Adam Turner - 2024-08-01

    > In my opinion, the project should stop honoring the "preferred encoding" and instead expect UTF-8 unless otherwise specified, as that's going to become the default behavior in Python 3.15 for most IO operations.

    I agree, however...

    > I'm unsure of compatibility implications.

    There was fairly extensive discussion of this in April last year. The core issue is that Docutils serialises to formats that have internal charset/encoding declarations (e.g. TeX, HTML, XML). If everything is UTF-8 then all is fine and simple, but if the user wants e.g. latin1 encoding then Docutils 'should' encode that in the relevant places in the output documents. Docutils also chooses whether to embed a Unicode character directly vs using an escape or macro (e.g. the dagger † footnote symbol) based on the chosen encoding.

    I am of the opinion that Docutils should remove support for encodings other than Unicode (UTF-8) in text mode for both input and output. UTF-8 is so ubiquitous that anyone running a modern enough Python to use this version of Docutils will either support UTF-8 or know how to work around any problems.

    The only writer that makes runtime use of the output encoding setting is LaTeX. LuaTeX and XeTeX have always supported UTF-8 in source files, and LaTeX has since 2018.


    To your original point, running with PYTHONWARNDEFAULTENCODING=1 should now produce no warnings. If you still get warnings please let us know as it would mean we are missing test coverage (Docutils' tests pass with -Werror and -Xwarn_default_encoding).

    A

     
  • Günter Milde

    Günter Milde - 2024-08-07

    > The only writer that makes runtime use of the output encoding setting is LaTeX.

    Output encoding has defaulted to "utf-8" for all writers for several years.
    Most writers honour the "output-encoding" setting and encode the output file accordingly. So, you may use "rst2html5 --output-encoding=ASCII:xmlcharrefreplace" to get a pure ASCII file.
    The HTML, XML, and LaTeX writers also declare the encoding used in the file. The LaTeX writer additionally provides replacements for Unicode characters that cannot be encoded if a legacy output encoding is selected. There should be no cases of output_encoding == None.
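
As a rough illustration of what an "ASCII:xmlcharrefreplace" setting implies, the sketch below uses Python's built-in codec error handlers on the dagger footnote symbol. Splitting the setting on ":" is an assumption made for this example, not a description of Docutils internals.

```python
setting = "ASCII:xmlcharrefreplace"
# Assumed format for illustration: "<encoding>:<error handler>".
encoding, _, errors = setting.partition(":")

dagger = "\u2020"  # the dagger footnote symbol

# Unencodable characters become numeric character references.
ascii_bytes = dagger.encode(encoding, errors=errors)

try:
    dagger.encode("latin-1")  # strict mode: the dagger is not in Latin-1
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

This is why a writer targeting a legacy encoding needs some replacement strategy (character references in HTML, macros in LaTeX) for characters outside that encoding.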

    Input encoding was "auto-select" with fallback to utf-8 and "locale encoding" until 0.21. After the discussion last year, the transition to utf-8 started: 0.22 uses "utf-8" as the default input encoding, and we will remove the input encoding auto-detection code in Docutils 1.0.

    The offending test case, "test_fallback_no_utf8()", was more trouble than help and has been removed in [r9864].

     

    Related

    Commit: [r9864]

