Re: [Docutils-develop] Recent commit activity

Dear Günter, 

>> -----

>>>>> The naming of the `core.publish_string()` API function

> I see your point. OTOH:

> * `output-encoding`__ is a *general* setting defined as
>   "The text encoding for output". 
>   This raises the expectation that all Docutils output "has" the
>   specified encoding and makes the "most obvious" return value a bit less
>   obvious.
>     
>   __  https://docutils.sourceforge.io/docs/user/config.html#output-encoding 

> * The behaviour of `publish_string()` has been stable for many years.
>   Long-time users are familiar with it and expect it to remain stable.

There are 1283 projects published on PyPI that depend on Docutils. I have gone through each of these projects; 35 (2.7%) use the ``docutils.core.publish_string`` function.

Of those: 

* 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the eventual change to return strings. (traitsui, odoo-tools, CodeChat, benchmarkstt, pyfda, pyretis, anna, resplendent)
* 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so would have to refactor, but evidently want Python strings. (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork, ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib, orange-canvas-core-ml, meditor, jarn.viewdoc, formiko)
* 4 are agnostic to output type, as they pass the output of ``publish_string`` straight into ``BeautifulSoup()`` or ``xml.etree.ElementTree.fromstring``, both of which accept either bytes or str. (doc-warden, testimony, pyLanguagetool, turq)
* 3 use custom writers or the document tree, and don't use the returned output (pydoctor, restview, fairy-slipper)
* 2 are broken by calling ``str()`` on a bytes instance without an ``encoding`` argument. (prettyqt, cornice_sphinx)
* 1 ignores output and just uses the call to check it doesn't raise any exceptions (rstcheck-core)
* 3 expect ``bytes`` and could use the proposed ``publish_bytes()`` function (awscli, bugrest, quorachallenge)
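
To illustrate the ``str()`` pitfall and the ``.decode()`` pattern from the list above, here is a minimal sketch. The bytes literal is only a stand-in for what ``publish_string`` returns with a UTF-8 output encoding; it is not real Docutils output.

```python
# Stand-in for the bytes returned when output_encoding is "utf-8".
# This is an assumption for illustration, not real Docutils output.
output = b"<p>hello</p>"

# Broken pattern: str() without an encoding argument returns the repr
# of the bytes object, including the b'...' wrapper.
broken = str(output)

# Working pattern: decode explicitly to get a Python string.
decoded = output.decode("utf-8")

print(broken)   # b'<p>hello</p>'
print(decoded)  # <p>hello</p>
```

The two projects in the broken group end up with the ``b'...'`` wrapper embedded in their text, which is why I count them as already broken rather than affected by the change.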

I am happy to work with the ~17 (14 + 3) that would be affected to help them to refactor, should we agree on a way forwards.

Out of interest, none used an ``output_encoding`` setting other than "unicode" or "utf-8".

I would be content to delay the switch-over of return type from ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed, but I suppose I see the other scenarios as sub-optimal in the long term in one way or another -- e.g. ``publish_str_instance`` is unwieldy to use regularly, and using e.g. ``publish_str`` instead would be confusing when ``publish_string`` still exists.

>>> Regarding the "core.publish_string()" function, I see three possibilities:

>>> Alternatives forward:

>>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation].
>>> 2. Add a new boolean argument: "encode". 
>>> 3. Deprecate "publish_string()" in favour of new, separate
>>>    "publish_unicode_str()" and "publish_bytes()" functions.
>>> 4. [Revert] "publish_string()" [to Docutils 0.19 behaviour].
>>>     New function "publish_str_instance()", say.

> If the documentation is clear about possible return values (and even more after adding type hints) users should be able to live with the unfortunate naming.

Unfortunately, as far as I am aware, type hints cannot express a setting within a dictionary affecting the return type of the function. I agree the documentation should be made clearer.
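
As a rough sketch of the difference: overloads can key the return type on a literal argument, such as the boolean ``encode`` from option 2, which they cannot do for a value stored inside a settings dictionary. The ``publish`` function below is hypothetical and only stands in for the real API; it does no actual publishing.

```python
from typing import Literal, Union, overload


@overload
def publish(source: str, encode: Literal[True] = ...) -> bytes: ...
@overload
def publish(source: str, encode: Literal[False]) -> str: ...
def publish(source: str, encode: bool = True) -> Union[bytes, str]:
    """Hypothetical stand-in: returns the source unchanged, as bytes or str."""
    # A type checker can pick the overload from the literal value of `encode`.
    # No equivalent exists for {'output_encoding': 'unicode'} inside a dict.
    return source.encode("utf-8") if encode else source


as_bytes = publish("text")                # checkers infer bytes
as_str = publish("text", encode=False)    # checkers infer str
```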

> OTOH, I see a use-case for a convenience function returning a `str` instance also in cases where an "intended encoding" of the output is given in the "output-encoding" setting. This way, a program using this function can export a HTML, XML or LaTeX with an encoding declaration as `str` instance, post-process and finally encode it before handing it to storage or a non-Python processor.

Yes, this is my general view too -- I see ``publish_string`` as a function to be called from other Python programmes.
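
A minimal sketch of that workflow, assuming the publishing call has already returned a ``str``. The document text below is made up for illustration and is not real Docutils output.

```python
# Stand-in for a str returned by a publishing function, carrying an
# encoding declaration for the intended output encoding.
document = '<?xml version="1.0" encoding="utf-8"?>\n<p>draft</p>\n'

# Post-process as text while it is still a Python string...
document = document.replace("draft", "final")

# ...and encode once at the end, matching the declared encoding,
# before handing the result to storage or a non-Python processor.
payload = document.encode("utf-8")
```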

> If we are going to change the core API functionality regarding the convenience function(s) to publish the output as `str` or `bytes` instance, then we should:

> * do not start this in the middle of a major refactoring of the test suite
>   (where it is hard to spot the changes in expected output from
>   "cosmetic" changes in the test code).

I agree with this; in retrospect it was a poor choice.

> * do it in a "quasi static" manner: both, old and the new behaviour must
>   be accessible over a sequence of two or more stable releases.

This of course makes sense.

> This means that if we want to introduce an explicit `publish_bytes()` convenience function, a corresponding `BytesOutput` class is appropriate.

OK, though it seems this is dependent on the outcome of the ``publish_string`` decision.

-----

Thanks,
Adam