From: Guenter M. <mi...@us...> - 2022-11-04 15:54:37
Dear Adam,

a follow-up with some more discoveries and thoughts...

On 2022-11-03, Guenter Milde via Docutils-develop wrote:
> On 2022-11-02, Adam Turner wrote:

>>>>> r9167
>> Partially reverted in [r9202]

...

> Regarding the "core.publish_string()" function

...

Alternatives forward:

> 1. Just keep as is.

>    This is core API behaviour, so we need a very strong indication
>    that the advantages outweigh the hassle.

> 2. Add a new boolean argument: "encode".

>    With ``publish_string(encode=False)``, the "output_encoding" setting
>    would be ignored and the return value is a `str` object.

>    With ``publish_string(encode=True)``, the behaviour is as currently,
>    encoding with "output_encoding" and returning a `bytes` or `str` object
>    (the latter for ``output_encoding == "unicode"``).

>    The default value could change from True to False in Docutils 2.0.

> 3. Deprecate¹ "publish_string()" in favour of new, separate
>    "publish_unicode_str()" and "publish_bytes()" functions.

>    ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0.

4. Keep "publish_string()" as-is.
   New function "publish_str_instance()", say.

   The "output_encoding" setting will be used for encoding declarations
   and to determine required character replacements in the LaTeX writer,
   but the return value is guaranteed to be a `str` instance.

My current favourites are 1 or 4.


Newly discovered bug with ``output_encoding == "unicode"``.

r9202 contains a "quick and dirty fix" for a hitherto unrecognized problem
with the pseudo-encoding name "unicode":
This non-standard name is used in encoding declarations in XML and HTML
and in a package call in LaTeX.

The problem is that we don't know which encoding an application calling
`publish_string()` will finally use for storing/transferring the output
outside the realms of Python.
Given this, an encoding declaration with hard-coded "utf-8" is worse than
no encoding declaration.
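To illustrate why a hard-coded declaration can do harm, here is a minimal sketch. The document string is a hypothetical stand-in (not actual Docutils output), and the choice of Latin-1 as the application's storage encoding as well as the helper name `decode_as_declared` are assumptions made only for this example:

```python
# Hypothetical stand-in for a `str` result whose XML declaration
# hard-codes "utf-8"; this is not actual Docutils output.
document = '<?xml version="1.0" encoding="utf-8"?>\n<p>Grüße</p>\n'

# The calling application picks the storage encoding later,
# here Latin-1 (an arbitrary choice for illustration).
stored = document.encode("latin-1")


def decode_as_declared(data: bytes, declared: str = "utf-8") -> str:
    """Decode `data` the way a consumer trusting the declaration would."""
    return data.decode(declared)


try:
    decode_as_declared(stored)
    declaration_matches = True
except UnicodeDecodeError:
    # The stored bytes are not valid UTF-8, so the declaration is wrong.
    declaration_matches = False

print(declaration_matches)
```

Without any declaration, a consumer at least knows that it has to find out the encoding by other means instead of trusting a wrong one.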
A fix for LaTeX would be

diff --git a/docutils/docutils/writers/latex2e/__init__.py b/docutils/docutils/writers/latex2e/__init__.py
index 45086ea09..a306fef9a 100644
--- a/docutils/docutils/writers/latex2e/__init__.py
+++ b/docutils/docutils/writers/latex2e/__init__.py
@@ -1304,7 +1304,8 @@ class LaTeXTranslator(nodes.NodeVisitor):
         # ~~~~~~~~~~~~~~~~
         # Encodings:
         # Docutils' output-encoding => TeX input encoding
-        if self.latex_encoding != 'ascii':
+        if self.latex_encoding not in ('ascii', 'unicode'):
+            # TODO: also don't insert for 'utf8' (cf. RELEASE-NOTES)
             self.requirements['_inputenc'] = (r'\usepackage[%s]{inputenc}'
                                               % self.latex_encoding)
         # TeX font encoding
@@ -1459,8 +1460,7 @@
             # 'iso-8859-7': '' # greek
             # 'iso-8859-8': '' # hebrew
             # 'iso-8859-10': '' # latin6, more complete iso-8859-4
-            'unicode': 'utf8', # TEMPORARY, remove in Docutils 0.21
             }
         encoding = docutils_encoding.lower()

In the unit tests, I suggest decoding the `bytes` returned by
`publish_string()` instead of relying on the "unicode" pseudo-encoding, e.g.
diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py
index 496e68dd7..2df660d85 100644
--- a/docutils/test/DocutilsTestSupport.py
+++ b/docutils/test/DocutilsTestSupport.py
@@ -493,7 +493,7 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec):
 
     settings_default_overrides = {'_disable_config': True,
                                   'strict_visitor': True,
-                                  'output_encoding': 'unicode'}
+                                  }
     writer_name = '' # set in subclasses or constructor
 
     def __init__(self, *args, writer_name='', **kwargs):
@@ -509,7 +509,11 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec):
             writer_name=self.writer_name,
             settings_spec=self,
             settings_overrides=self.suite_settings)
-        self.assertEqual(str(output), str(self.expected))
+        try:
+            output = output.decode()
+        except AttributeError:
+            pass
+        self.assertEqual(output, self.expected)
 
 
 class PublishTestSuite(CustomTestSuite):

As an aside: The `docutils.io.StringInput` class can transparently handle
input from `str` and `bytes` instances. The following patch makes this more
explicit:

diff --git a/docutils/docutils/io.py b/docutils/docutils/io.py
index 6714ca22b..53e93886a 100644
--- a/docutils/docutils/io.py
+++ b/docutils/docutils/io.py
@@ -576,15 +576,15 @@ class BytesOutput(Output):
 
 
 class StringInput(Input):
-
-    """
-    Direct string input.
-    """
+    """Input from a `str` or `bytes` instance."""
 
     default_source_path = '<string>'
 
     def read(self):
-        """Decode and return the source string."""
+        """Return the source as `str` instance.
+
+        Decode, if required (see `Input.decode`).
+        """
         return self.decode(self.source)

The only effect of setting ``input_encoding = 'unicode'`` is a warning if
the input object is not a `str` instance.
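The transparent `str`/`bytes` handling described above can be pictured roughly as follows. This is a simplified, hypothetical stand-in and not the Docutils implementation: the function name `read_source` is invented for illustration, and what happens to `bytes` input after the warning (a fall-back to UTF-8) is an assumption of this sketch only.

```python
import warnings


def read_source(source, input_encoding="utf-8"):
    """Return `source` as `str`, decoding `bytes` if required.

    Simplified, hypothetical stand-in; not the Docutils implementation.
    """
    if isinstance(source, str):
        # `str` input needs no decoding, whatever `input_encoding` says.
        return source
    if input_encoding == "unicode":
        warnings.warn("input_encoding is 'unicode' "
                      "but the source is no `str` instance")
        # Assumption for this sketch only: fall back to UTF-8.
        input_encoding = "utf-8"
    return source.decode(input_encoding)


# Both calls return the same `str` instance content.
print(read_source("Grüße"))
print(read_source("Grüße".encode("utf-8")))
```

In this picture, a `str` source passes through unchanged with any `input_encoding` value, so callers would gain nothing from setting it to 'unicode'.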
Given that 'unicode' is a non-standard encoding name that may be deprecated
later, it is, IMV, not worth recommending this in the docstring to
`publish_string`:

diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py
index 4d50e7fc7..798ffc5a1 100644
--- a/docutils/docutils/core.py
+++ b/docutils/docutils/core.py
@@ -437,10 +437,6 @@ def publish_string(source, source_path=None, destination_path=None,
 
         publish_string(..., settings_overrides={'output_encoding': 'unicode'})
 
-    Similarly for Unicode string input (`source`)::
-
-        publish_string(..., settings_overrides={'input_encoding': 'unicode'})
-
     Parameters: see `publish_programmatically`.
     """
     warnings.warn('The return type of publish_string will change to '

Hope to hear from you soon,

Günter