[Docutils-develop] the `core.publish_string()` API function (was: Recent commit activity)

Dear Adam, dear Docutils developers,

On 2022-11-13, Adam Turner wrote:

>> * `output-encoding`__ is a *general* setting defined as
>>   "The text encoding for output". This raises the expectation that all
>>   Docutils output "has" the specified encoding
...
>> * The behaviour of `publish_string()` has been stable for many years.
>>   Long-time users are familiar with it and expect it to remain stable.

> There are 1283 projects published on PyPI that depend on Docutils. I
> have gone through each of these projects; 35 of them (2.7%) use
> the ``docutils.core.publish_string`` function.

> Of those:

> * 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the
> eventual change to return strings. (traitsui, odoo-tools, CodeChat,
> benchmarkstt, pyfda, pyretis, anna, resplendent)
> * 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so
> would have to refactor but want to use Python strings.
> (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork,
> ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib,
> orange-canvas-core-ml, meditor, jarn.viewdoc, formiko)
> * 4 are agnostic to output type, as they pass the output of
> ``publish_string`` straight into ``BeautifulSoup()`` or
> ``xml.etree.ElementTree.fromstring``, both of which accept either bytes
> or str. (doc-warden, testimony, pyLanguagetool, turq)

Good news for Docutils: we are not alone. 
"xml.etree" also uses:

"string" or "string constant"
  as a superordinate term for a sequence of characters, either
  encoded (`bytes`) or as Unicode code points (`str`).

"unicode" 
   as a pseudo-encoding name for "no encoding" (i.e. "return as
   `str` instance").

Maybe we can agree with the etree team on a compatible terminology and way
forward.
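
For reference, a minimal sketch of the etree behaviour described above,
using only the standard library (the sample element is made up for
illustration):

```python
import xml.etree.ElementTree as ET

# Parsing accepts a `str` ...
root = ET.fromstring("<p>Grüße</p>")
# ... as well as encoded `bytes`.
same = ET.fromstring("<p>Grüße</p>".encode("utf-8"))

# A real encoding name yields `bytes`.
as_bytes = ET.tostring(root, encoding="utf-8")
# The pseudo-encoding "unicode" yields a `str` instance.
as_str = ET.tostring(root, encoding="unicode")

print(type(as_bytes).__name__, type(as_str).__name__)
```

So "unicode" plays the same role there as in our `output_encoding` setting.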

> * 3 use custom writers or the document tree, and don't use the returned
> output (pydoctor, restview, fairy-slipper)
> * 2 are broken by calling ``str()`` on a bytes instance without an
> ``encoding`` argument. (prettyqt, cornice_sphinx)
> * 1 ignores output and just uses the call to check it doesn't raise any
> exceptions (rstcheck-core)
> * 3 expect ``bytes`` and could use the proposed ``publish_bytes()``
>   function (awscli, bugrest, quorachallenge)

One more, `pyreport`_, is currently unmaintained and Python 2 only.

.. _pyreport: https://github.com/joblib/pyreport

There may be more use cases in unpublished packages/modules/scripts
or helper scripts in non-Python projects.
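
As an aside on the two projects reported as broken above: the following
sketch shows why calling ``str()`` on a `bytes` instance without an
``encoding`` argument goes wrong. The sample value is a made-up stand-in
for encoded publisher output, not taken from either project.

```python
# Stand-in for the `bytes` a publisher returns with a "utf-8" output encoding.
html = "<p>café</p>".encode("utf-8")

# Without an encoding argument, str() returns the repr of the bytes object.
wrong = str(html)
# With an encoding argument, str() decodes (same as html.decode("utf-8")).
right = str(html, "utf-8")

print(wrong)
print(right)
```

The first result starts with ``b'`` and contains escaped bytes instead of
the text, which is easy to miss as long as the output is pure ASCII.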

> I am happy to work with the ~17 (14 + 3) that would be affected to help
> them to refactor, should we agree on a way forwards.

I am quite confident that we will find a consensus.

I would still want to revert the FutureWarnings until there is a stable
alternative in place.

> Out of interest, none used an ``output_encoding`` setting other than
> "unicode" or "utf-8".

Did you also check the configuration files?
Docutils also defaults to "output_encoding: utf-8" but allows users to
change this to any valid encoding via settings_spec or settings_overrides
or in a configuration file.
(Just checked: publish_string() respects the "output_encoding" set in a
docutils.conf file.)

> I would be content to delay the switch-over of return type from
> ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed,
> but I suppose I see the other scenarios as sub-optimal for the
> long-term in one way or another -- e.g. ``publish_str_instance`` is
> unwieldy to use regularly, and using e.g. ``publish_str`` instead would
> be confusing when ``publish_string`` still exists.

OTOH, `publish_string()` and `publish_bytes()` would be a mismatched pair:
the first uses an overloaded general term while the second uses a
Python 3 datatype name.

>>>> Regarding the "core.publish_string()" function, I see three possibilities:

>>>> Alternatives forward:

>>>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation].

>>>> 2. Add a new boolean argument: "encode".

>>>> 3. Deprecate "publish_string()" in favour of new, separate
>>>>    "publish_str()" and "publish_bytes()" functions.

During the transition period, editors with name completion will show both
`publish_str()` and `publish_string()` in the expansion list and coders
will likely look up the docstring for the difference.

>> If the documentation is clear about possible return values (and even
>> more after adding type hints) users should be able to live with the
>> unfortunate naming.

> Unfortunately, as far as I am aware, type hints are unable to code for a
> setting within a dictionary affecting the return type of the function.
> I agree the documentation should be made clearer.

Type hints should support documenting the fact that a function accepts or
returns any of a set of data types (e.g. "`int` or `str`", "`int` or
`float`", or "`str` or `bytes`"). The details/conditions should be given
in the docstring.
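
A minimal sketch of what I mean. The function `publish()` below is a
hypothetical stand-in written for this mail, not the Docutils
implementation; it only mimics the dependency of the return type on the
"output_encoding" setting:

```python
from typing import Optional, Union


def publish(source: str,
            settings_overrides: Optional[dict] = None) -> Union[str, bytes]:
    """Hypothetical stand-in, not the Docutils implementation.

    Return `source` as `str` if the "output_encoding" setting is
    "unicode", else as `bytes` in that encoding (default: "utf-8").
    """
    settings = {"output_encoding": "utf-8", **(settings_overrides or {})}
    encoding = settings["output_encoding"]
    if encoding.lower() == "unicode":
        return source  # no encoding: hand back the `str` unchanged
    return source.encode(encoding)
```

The annotation tells the reader (and a type checker) that both types are
possible; the docstring says when to expect which one.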

>> OTOH, I see a use-case for a convenience function returning a `str`
>> instance also in cases where an "intended encoding" of the output is
>> given in the "output-encoding" setting. This way, a program using this
>> function can export a HTML, XML or LaTeX with an encoding declaration
>> as `str` instance, post-process and finally encode it before handing
>> it to storage or a non-Python processor.

> Yes, this is my general view too -- I see ``publish_string`` as a
> function to be called from other Python programmes.
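
To illustrate the use case, here is a rough sketch with the standard
library only. The `document` string is a made-up stand-in for publisher
output returned as `str`; the regular expression is a simplification and
just one way to pick up the declared encoding.

```python
import re

# Stand-in for publisher output as `str`. The declaration names the
# *intended* encoding of the final file, not the current representation.
document = '<?xml version="1.0" encoding="latin-1"?>\n<p>Grüße, Welt</p>\n'

# Post-process on the text level, free of encoding concerns.
document = document.replace("Welt", "Docutils")

# Finally encode with the declared encoding (fall back to UTF-8).
match = re.search(r'encoding="([^"]+)"', document)
encoding = match.group(1) if match else "utf-8"
data = document.encode(encoding)  # ready for storage or an external tool
```

Here the encoding declaration and the bytes finally written agree, although
the program never handled encoded data in between.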

>> If we are going to change the core API functionality regarding the
>> convenience function(s) to publish the output as `str` or `bytes`
>> instance, then we should:

>> * not start this in the middle of a major refactoring of the test suite
>>   (where it is hard to spot the changes in expected output from
>>   "cosmetic" changes in the test code).

> I agree with this, in retrospect it was a poor choice.

>> * do it in a "quasi static" manner: both the old and the new behaviour must
>>   be accessible over a sequence of two or more stable releases.

> This of course makes sense.

>> This means that if we want to introduce an explicit `publish_bytes()`
>> convenience function, a corresponding `BytesOutput` class is
>> appropriate.

> OK, though it seems this is dependent on the outcome of the
> ``publish_string`` decision.

Maybe we can also directly support `str` as `destination_class` value in
core.Publisher and publish_programmatically().

The whole problem looks like a candidate for one more enhancement proposal ;)

Günter