Docutils: Documentation Utilities / Bugs / #490 EncodingWarnings in io module

Günter Milde - 2024-06-29

Thank you for the feedback. The problem is being worked on:

In the repository version, the default encoding is "utf-8".

In Docutils 1.0, the input encoding detection will be removed. This removal includes the fallback "getpreferredencoding()" in line 151.

Günter Milde - 2024-07-22

status: open --> open-fixed

Günter Milde - 2024-07-22

[r9772] changes the default encoding from None (auto-detect) to "utf-8" in docutils.io.Input and docutils.io.FileInput.

Related

Commit: [r9772]


[r9772] breaks tests for non-UTF-8 locales (e.g. ISO 8859-1) on both Linux and Windows when not using Python's UTF-8 mode.

See the following failures (from GitHub Actions, scroll up to the first section Run test suite (pytest ./test)):

_____________________ FileInputTests.test_fallback_no_utf8 _____________________

self = <test.test_io.FileInputTests testMethod=test_fallback_no_utf8>

    @unittest.skipIf(preferredencoding in (None, 'ascii', 'utf-8'),
                     'locale encoding not set or UTF-8')
    def test_fallback_no_utf8(self):
        # If  no encoding is given and decoding with 'utf-8' fails,
        # use the locale's preferred encoding (if not None).
        # Provisional: the default will become 'utf-8'
        # (without auto-detection and fallback) in Docutils 0.22.
        source = du_io.FileInput(
            source_path=os.path.join(DATA_ROOT, 'latin1.txt'))
>       data = source.read()

test/test_io.py:321: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
docutils/io.py:412: in read
    data = self.source.read()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <encodings.utf_8.IncrementalDecoder object at 0x7f023f9906f0>
input = b'Gr\xfc\xdfe\n', final = True

>   ???
E   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 2: invalid start byte

<frozen codecs>:322: UnicodeDecodeError

I'm not sure what the right behaviour here should be.
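The decoding failure can be reproduced with the standard library alone. This is a minimal sketch that uses the bytes shown in the traceback above as a stand-in for the contents of latin1.txt; the variable names are illustrative and nothing here calls Docutils itself.

```python
# Bytes as reported in the traceback ("input = b'Gr\xfc\xdfe\n'").
raw = b"Gr\xfc\xdfe\n"

# With a fixed "utf-8" default and no fallback, decoding fails:
# 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_error = err

# Decoding succeeds once the legacy encoding is named explicitly.
latin1_text = raw.decode("latin-1")

print(utf8_error)
print(repr(latin1_text))
```

With a UTF-8 default, a caller reading such a file would need to pass the encoding explicitly rather than rely on the locale.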

There's also a problem on the same non-UTF-8 locales when not in UTF-8 mode:

======================================================================
ERROR: test_publish_cmdline (test_publisher.ConvenienceFunctionTests.test_publish_cmdline)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 525, in write
    self.destination.write(data)
  File "/home/runner/work/docutils/docutils/docutils/test/alltests.py", line 63, in write
    self.stream.write(string)
TypeError: write() argument must be str, not bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 529, in write
    self.destination.buffer.write(data)
    ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Tee' object has no attribute 'buffer'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/docutils/docutils/docutils/test/test_publisher.py", line 160, in test_publish_cmdline
    core.publish_cmdline(writer_name='null',
  File "/home/runner/work/docutils/docutils/docutils/docutils/core.py", line 431, in publish_cmdline
    output = publisher.publish(
             ^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/docutils/docutils/docutils/docutils/core.py", line 261, in publish
    output = self.writer.write(self.document, self.destination)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/docutils/docutils/docutils/docutils/writers/__init__.py", line 81, in write
    return self.destination.write(self.output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/docutils/docutils/docutils/docutils/io.py", line 533, in write
    raise ValueError(
ValueError: Encoding of <file> (iso8859-1) differs 
  from specified encoding (utf-8)

----------------------------------------------------------------------
Ran 1872 tests in 4.888s

FAILED (errors=1, skipped=2)
Elapsed time: 5.079 seconds

(On Windows it says ValueError: Encoding of <file> (cp1252) differs instead).

This failure only happens with alltests.py. Now that both pytest and unittest work with our test suite, we could consider removing alltests.py.
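The chain of exceptions in the traceback above can be imitated with a simplified stand-in. This sketch is not the Docutils implementation: the `Tee` class and `write_encoded()` are hypothetical and only mirror the order of attempts visible in the traceback (write bytes, fall back to `.buffer`, then give up).

```python
import io


class Tee:
    """Hypothetical stand-in for a text-only stream wrapper without ``buffer``."""

    def __init__(self, stream, encoding="iso8859-1"):
        self.stream = stream
        self.encoding = encoding

    def write(self, string):
        self.stream.write(string)


def write_encoded(destination, data, encoding="utf-8"):
    """Try to write already-encoded bytes, as the traceback suggests."""
    try:
        destination.write(data)             # text streams reject bytes
    except TypeError:
        try:
            destination.buffer.write(data)  # needs a binary ``buffer``
        except AttributeError:
            raise ValueError(
                f"Encoding of destination ({destination.encoding}) differs "
                f"from specified encoding ({encoding})")


# A text wrapper with a ``buffer`` attribute accepts the bytes.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
write_encoded(wrapped, b"ok")

# A Tee-like object without ``buffer`` ends in ValueError.
try:
    write_encoded(Tee(io.StringIO()), "Gr\u00fc\u00dfe".encode("utf-8"))
    tee_error = None
except ValueError as err:
    tee_error = err
```

Under this reading, the error comes from the stream wrapper lacking a binary `buffer`, not from the document content.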

Commit: [r9772]

Adam Turner - 2024-08-01

I tested reverting and re-applying [r9772] in this PR -- note that the 'alltest' failure occurs in the before and after, but pytest and unittest fail when [r9772] is reapplied.

A

Related

Commit: [r9772]


Jason R. Coombs - 2024-08-01

I'm not sure what the right behaviour here should be.

In my opinion, the project should stop honoring the "preferred encoding" and instead expect UTF-8 unless otherwise specified, as that's going to become the default behavior in Python 3.14 for most IO operations. I'm unsure of compatibility implications. It does appear as if this test (test_fallback_no_utf8) would no longer be relevant in that regime, so I'd just delete it.

Regarding the non-UTF8 mode, that does sound more complicated, although maybe that functionality too should be deprecated/removed. That is, IMHO, the user should be offered UTF-8 mode by default and an option to specify an encoding, maybe with "locale" as one option, but otherwise remove the implied "locale" behavior.

I have a very weak understanding of docutils, however, so take my advice with a grain of salt.


Adam Turner - 2024-08-01

In my opinion, the project should stop honoring the "preferred encoding" and instead expect UTF-8 unless otherwise specified, as that's going to become the default behavior in Python 3.14 for most IO operations.

I agree, however...

I'm unsure of compatibility implications.

There was fairly extensive discussion of this in April last year. The core issue is that Docutils serialises to formats that have internal charset/encoding declarations (e.g. TeX, HTML, XML). If everything is UTF-8 then all is fine and simple, but if the user wants e.g. latin1 encoding then Docutils 'should' encode that in the relevant places in the output documents. Docutils also chooses whether to embed a Unicode character directly vs using an escape or macro (e.g. the dagger † footnote symbol) based on the chosen encoding.

I am of the opinion that Docutils should remove support for encodings other than Unicode (UTF-8) in text mode for both input and output. UTF-8 is so ubiquitous that anyone running a modern enough Python to use this version of Docutils will either support UTF-8 or know how to work around any problems.

The only writer that makes runtime use of the output encoding setting is LaTeX. LuaTeX and XeTeX have always supported UTF-8 in source files, and LaTeX has since 2018.

To your original point, running with PYTHONWARNDEFAULTENCODING=1 should now produce no warnings. If you still get warnings please let us know as it would mean we are missing test coverage (Docutils' tests pass with -Werror and -Xwarn_default_encoding).
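For context, Python (3.10 and later) emits an EncodingWarning only when a text file is opened without an explicit `encoding` argument and the opt-in flag (`PYTHONWARNDEFAULTENCODING=1` or `-X warn_default_encoding`) is set. A minimal sketch of the warning-free pattern follows; `read_utf8()` and the file name are hypothetical and not part of Docutils.

```python
import tempfile
from pathlib import Path


def read_utf8(path):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    # Naming the encoding avoids the locale-dependent default, so no
    # EncodingWarning is raised even when the opt-in flag is set.
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.rst"
    sample.write_text("Gr\u00fc\u00dfe \u2020\n", encoding="utf-8")
    text = read_utf8(sample)
```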

A


Günter Milde - 2024-08-07

The only writer that makes runtime use of the output encoding setting is LaTeX.

Output encoding has defaulted to "utf-8" for all writers for several years.
Most writers honour the "output-encoding" setting and encode the output file accordingly. So, you may use "rst2html5 --output-encoding=ASCII:xmlcharrefreplace" to get a pure ASCII file.
The HTML, XML, and LaTeX writers also declare the encoding used in the file. The LaTeX writer additionally provides replacements for Unicode characters that cannot be encoded when a legacy output encoding is selected. There should be no cases of output_encoding == None.
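The "ASCII:xmlcharrefreplace" value combines an encoding with one of Python's standard codec error handlers. The effect of that handler can be seen with plain `str.encode()`; this is only an illustration of the handler itself, with a made-up sample string, not of how the writers apply it.

```python
sample = "Gr\u00fc\u00dfe \u2020"  # "Grüße †", an illustrative string

# Characters outside ASCII become numeric character references.
ascii_bytes = sample.encode("ascii", errors="xmlcharrefreplace")

print(ascii_bytes)
```

The result contains only ASCII bytes, and an HTML or XML consumer resolves the references back to the original characters.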

Until 0.21, the input encoding was "auto-select" with fallbacks to utf-8 and the "locale encoding". After the discussion last year, the transition to utf-8 started: 0.22 uses "utf-8" as the input encoding default, and we will remove the input encoding auto-detection code in Docutils 1.0.

The offending test case, "test_fallback_no_utf8()", was more trouble than it was worth and has been removed in [r9864].

Related

Commit: [r9864]

