Re: [Docutils-develop] Recent commit activity

Dear Adam,

a follow-up with some more discoveries and thoughts...

On 2022-11-03, Guenter Milde via Docutils-develop wrote:
> On 2022-11-02, Adam Turner wrote:

>>>>> r9167
>> Partially reverted in [r9202]
...

> Regarding the "core.publish_string()" function ...

Alternatives forward:

> 1. Just keep as is.

>    This is core API behaviour, so we need a very strong indication
>    that the advantages outweigh the hassle.

> 2. Add a new boolean argument: "encode". 

>    With ``publish_string(encode=False)``, the "output_encoding" setting
>    would be ignored and the return value would be a `str` object.

>    With ``publish_string(encode=True)``, the behaviour would be as it is
>    now: the output is encoded with "output_encoding" and returned as
>    `bytes`, or returned as `str` for ``output_encoding == "unicode"``.

>    The default value could change from True to False in Docutils 2.0.

> 3. Deprecate¹ "publish_string()" in favour of new, separate
>    "publish_unicode_str()" and "publish_bytes()" functions.

>    ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0.

  4. Keep "publish_string()" as-is.
     New function "publish_str_instance()", say.

     The "output_encoding" setting will be used for encoding declarations
     and to determine required character replacements in the LaTeX writer
     but the return value is guaranteed to be a `str` instance.
     
     
My current favourites are 1 or 4.
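
For comparison, the return-type rules of alternative 2 can be modelled in a
few lines. This is only a sketch: `finish_output()` is a hypothetical
stand-in, not Docutils code, and it assumes that ``encode=False`` simply
skips the encoding step.

```python
def finish_output(text, output_encoding="utf-8", encode=True):
    """Model the return type under the proposed "encode" argument.

    Hypothetical helper for illustration only, not part of Docutils.
    """
    if not encode:
        # Proposal: ignore "output_encoding", always return `str`.
        return text
    if output_encoding.lower() == "unicode":
        # Current behaviour: the pseudo-encoding skips the encoding step.
        return text
    # Current behaviour for real encodings: return `bytes`.
    return text.encode(output_encoding)


for kwargs in ({}, {"output_encoding": "unicode"}, {"encode": False}):
    print(kwargs, type(finish_output("Grüße", **kwargs)).__name__)
```

With alternative 4, the first case would instead go through a separate
function, so the existing callers of "publish_string()" see no change.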


Newly discovered bug with ``output_encoding == "unicode"``.

r9202 contains a "quick and dirty fix" for a hitherto unrecognized problem
with the pseudo-encoding name "unicode": 

This non-standard name is used in encoding declarations in XML and HTML and
in a package call in LaTeX.

The problem is that we don't know which encoding an application calling
`publish_string()` will finally use for storing/transferring the output
outside the realms of Python.
Given this, an encoding declaration with hard-coded "utf-8" is worse than no
encoding declaration.
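
A small sketch of the mismatch, using only the standard library. The
document string is made up for illustration (it is not real Docutils
output), and `declaration_matches()` is a hypothetical helper:

```python
# A `str` result carrying a hard-coded "utf-8" declaration
# (illustrative document, not actual Docutils output).
document = '<?xml version="1.0" encoding="utf-8"?>\n<p>Grüße</p>\n'


def declaration_matches(data, expected, declared="utf-8"):
    """Return True if `data` decodes to `expected` with the declared encoding."""
    try:
        return data.decode(declared) == expected
    except UnicodeDecodeError:
        return False


# The calling application, not Docutils, chooses the storage encoding.
stored_utf8 = document.encode("utf-8")
stored_latin1 = document.encode("latin-1")

print(declaration_matches(stored_utf8, document))
print(declaration_matches(stored_latin1, document))
```

The declaration is only right when the application happens to store the
text as UTF-8; with any other choice it actively misleads the reader of
the file.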

A fix for LaTeX would be

diff --git a/docutils/docutils/writers/latex2e/__init__.py b/docutils/docutils/writers/latex2e/__init__.py
index 45086ea09..a306fef9a 100644
--- a/docutils/docutils/writers/latex2e/__init__.py
+++ b/docutils/docutils/writers/latex2e/__init__.py
@@ -1304,7 +1304,8 @@ class LaTeXTranslator(nodes.NodeVisitor):
         # ~~~~~~~~~~~~~~~~
         # Encodings:
         # Docutils' output-encoding => TeX input encoding
-        if self.latex_encoding != 'ascii':
+        if self.latex_encoding not in ('ascii', 'unicode'):
+            # TODO: also don't insert for 'utf8' (cf. RELEASE-NOTES)
             self.requirements['_inputenc'] = (r'\usepackage[%s]{inputenc}'
                                               % self.latex_encoding)
         # TeX font encoding
@@ -1459,8 +1460,7 @@
               # 'iso-8859-7': ''   # greek
               # 'iso-8859-8': ''   # hebrew
               # 'iso-8859-10': ''  # latin6, more complete iso-8859-4
-              'unicode': 'utf8',  # TEMPORARY, remove in Docutils 0.21
               }
         encoding = docutils_encoding.lower()


In the unit tests, I suggest decoding the `bytes` returned by
`publish_string()` instead of using the "unicode" pseudo-encoding, e.g.

diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py
index 496e68dd7..2df660d85 100644
--- a/docutils/test/DocutilsTestSupport.py
+++ b/docutils/test/DocutilsTestSupport.py
@@ -493,7 +493,7 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec):
 
     settings_default_overrides = {'_disable_config': True,
                                   'strict_visitor': True,
-                                  'output_encoding': 'unicode'}
+                                  }
     writer_name = ''  # set in subclasses or constructor
 
     def __init__(self, *args, writer_name='', **kwargs):
@@ -509,7 +509,11 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec):
               writer_name=self.writer_name,
               settings_spec=self,
               settings_overrides=self.suite_settings)
-        self.assertEqual(str(output), str(self.expected))
+        try:
+            output = output.decode()
+        except AttributeError:
+            pass
+        self.assertEqual(output, self.expected)
 
 
 class PublishTestSuite(CustomTestSuite):
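
The try/except in the patch relies on `str` having no `decode()` method.
The same normalisation can be written as an explicit type check; this is a
sketch only, with `as_text()` as a hypothetical helper name and UTF-8
assumed as the encoding of the returned `bytes`:

```python
def as_text(output, encoding="utf-8"):
    """Return `output` as `str`, decoding only if it is `bytes`.

    Hypothetical test helper, not part of DocutilsTestSupport.
    """
    if isinstance(output, bytes):
        return output.decode(encoding)
    return output


# Both return types compare equal to the same expected `str`.
expected = "Grüße"
print(as_text(expected.encode("utf-8")) == expected)
print(as_text(expected) == expected)
```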



As an aside:

The `docutils.io.StringInput` class can transparently handle input from
`str` and `bytes` instances. 
The following patch makes this more explicit:

diff --git a/docutils/docutils/io.py b/docutils/docutils/io.py
index 6714ca22b..53e93886a 100644
--- a/docutils/docutils/io.py
+++ b/docutils/docutils/io.py
@@ -576,15 +576,15 @@ class BytesOutput(Output):
 
 
 class StringInput(Input):
-
-    """
-    Direct string input.
-    """
+    """Input from a `str` or `bytes` instance."""
 
     default_source_path = '<string>'
 
     def read(self):
-        """Decode and return the source string."""
+        """Return the source as `str` instance.
+
+        Decode, if required (see `Input.decode`).
+        """
         return self.decode(self.source)
 

The only effect of setting ``input_encoding = 'unicode'`` is a warning if
the input object is not a `str` instance. Given that 'unicode' is a
non-standard encoding name that may be deprecated later, it is, IMV, not
worth recommending this in the docstring to `publish_string`:

diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py
index 4d50e7fc7..798ffc5a1 100644
--- a/docutils/docutils/core.py
+++ b/docutils/docutils/core.py
@@ -437,10 +437,6 @@ def publish_string(source, source_path=None, destination_path=None,
 
         publish_string(..., settings_overrides={'output_encoding': 'unicode'})
 
-    Similarly for Unicode string input (`source`)::
-
-        publish_string(..., settings_overrides={'input_encoding': 'unicode'})
-
     Parameters: see `publish_programmatically`.
     """
     warnings.warn('The return type of publish_string will change to '


hope to hear from you soon,

Günter